#general


@mannamra: @mannamra has joined the channel
@nrajendra434: @nrajendra434 has joined the channel
@fding: @fding has joined the channel
@fizza.abid: @fizza.abid has joined the channel
@diana.arnos: Hey there! I have a different type of question this time: if I had to give a presentation to my company advocating for us to start using Pinot as a go-to tool for user-facing real-time analytics, which arguments or points of view would you recommend I speak about?
  @diogo.baeder: For BrandIndex, in YouGov, we had a huge issue with performance when using PostgreSQL, so the biggest factor for us was performance for analytics. But the second factor I'd say is the support for multi-valued columns.
  @ken: The big wins for us were (a) using appropriate indices & star trees, we could satisfy performance requirements for ad hoc queries, (b) SQL interface made it easy for the UI layer to build dashboards, and (c) we could bulk build segments (using Flink).
  @mayanks: This might also be helpful:
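Regarding the star-tree point above, a rough sketch of what a star-tree definition looks like inside a table's `tableIndexConfig`; the dimension and metric column names here are hypothetical, so adapt them to your own schema:
```
"tableIndexConfig": {
  "starTreeIndexConfigs": [
    {
      "dimensionsSplitOrder": ["country", "browser", "deviceType"],
      "skipStarNodeCreationForDimensions": [],
      "functionColumnPairs": ["SUM__impressions", "COUNT__*"],
      "maxLeafRecords": 10000
    }
  ]
}
```
Pre-aggregating the function/column pairs along `dimensionsSplitOrder` is what lets ad hoc group-by queries on those dimensions stay fast without scanning raw rows.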
@arekchmura: @arekchmura has joined the channel
@arekchmura: Hi everyone! I was wondering whether the dataset used is available somewhere (Airline data from 1987-2008). I am currently working on my Master's thesis and I would like to run some experiments on that dataset. Thanks
  @mayanks: Should be part of the integration tests in the Pinot code base. But there might be better ones out there in the blogs.
  @mitchellh: has a link to the source of the dataset.
  @mitchellh: also, might be interesting to you.
  @arekchmura: Thank you, that will be very helpful!
@abhinav.wagle1: Hi there, checking for community experience. We are in the process of setting up a Kubernetes-based deployment of a Pinot cluster. Has anyone seen significant performance gains from using SSDs with instance store instead of EBS for server pods?
  @mayanks: Afaik, most folks end up using EBS and it works well. Personally, I am unaware of a use case that had to move from EBS to local SSD for perf.
  @abhinav.wagle1: @bagi.priyank: FYI
  @abhinav.wagle1: Thanks @mayanks!
  @bagi.priyank: Right, I am not saying we must use instance store for our use case. I am asking to compare SSD on EBS vs. instance store for our query pattern. We saw a considerable improvement in performance for our ad-hoc queries with instance store during the PoC.
  @g.kishore: Yes, instance-local storage will always be faster than remote EBS. My suggestion is to have the Helm chart carry both profiles. Start with EBS, but if you need even better performance, you can choose to dynamically shift between the two modes.
  @g.kishore: For example, you can have some tables on local storage and other tables on EBS.
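A minimal sketch of the two-profiles idea, assuming the community Helm chart's `server.persistence` values (key names can differ across chart versions) and hypothetical storage class names:
```
# values-ebs.yaml - default profile backed by EBS
server:
  persistence:
    enabled: true
    storageClass: "gp3"          # hypothetical EBS-backed storage class
    size: 500G
---
# values-local.yaml - higher-performance profile backed by instance store / local NVMe,
# exposed through a local-volume provisioner as a storage class
server:
  persistence:
    enabled: true
    storageClass: "local-nvme"   # hypothetical local-SSD storage class
    size: 500G
```
Routing some tables to one pool and other tables to the other would then typically be done with Pinot tenants/server tags rather than in the chart itself.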
@noiarek: @noiarek has joined the channel

#random


@mannamra: @mannamra has joined the channel
@nrajendra434: @nrajendra434 has joined the channel
@fding: @fding has joined the channel
@fizza.abid: @fizza.abid has joined the channel
@arekchmura: @arekchmura has joined the channel
@noiarek: @noiarek has joined the channel

#feat-compound-types


@adam.hutson: @adam.hutson has joined the channel

#feat-text-search


@adam.hutson: @adam.hutson has joined the channel

#feat-rt-seg-complete


@adam.hutson: @adam.hutson has joined the channel

#feat-presto-connector


@adam.hutson: @adam.hutson has joined the channel

#feat-upsert


@adam.hutson: @adam.hutson has joined the channel

#pinot-helix


@adam.hutson: @adam.hutson has joined the channel

#group-by-refactor


@adam.hutson: @adam.hutson has joined the channel

#qps-metric


@adam.hutson: @adam.hutson has joined the channel

#order-by


@adam.hutson: @adam.hutson has joined the channel

#feat-better-schema-evolution


@adam.hutson: @adam.hutson has joined the channel

#fraud


@adam.hutson: @adam.hutson has joined the channel

#pinotadls


@adam.hutson: @adam.hutson has joined the channel

#inconsistent-segment


@adam.hutson: @adam.hutson has joined the channel

#pinot-power-bi


@adam.hutson: @adam.hutson has joined the channel

#twitter


@adam.hutson: @adam.hutson has joined the channel

#apa-16824


@adam.hutson: @adam.hutson has joined the channel

#pinot-website


@adam.hutson: @adam.hutson has joined the channel

#minion-star-tree


@adam.hutson: @adam.hutson has joined the channel

#troubleshooting


@mannamra: @mannamra has joined the channel
@alihaydar.atil: Hello everyone, is it normal for pinot-server to flood the log with this message? I have noticed it after upgrading to 0.10.0: `[Consumer clientId=consumer-null-808, groupId=null] Seeking to offset 59190962 for partition mytopic-0`
  @mayanks: Seems like this is coming from the kafka consumer
@nrajendra434: @nrajendra434 has joined the channel
@fding: @fding has joined the channel
@alihaydar.atil: Hello everyone, if I don't set the 'maxNumRecordsPerSegment' config for my 'RealtimeToOfflineSegmentsTask', would it truncate my data if I have more records than the default value (the docs say 5,000,000) for that time window?
  @npawar: If you have more than 5m (or whatever value is set in maxNumRecords), it will generate multiple segments in that run, with 5m records per segment. No truncation
  @alihaydar.atil: thank you for response :pray:
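For reference, a minimal sketch of where `maxNumRecordsPerSegment` sits in the table config's task settings; the time periods below are just example values:
```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "1d",
      "bufferTimePeriod": "2d",
      "maxNumRecordsPerSegment": "5000000"
    }
  }
}
```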
@fizza.abid: @fizza.abid has joined the channel
@fizza.abid: Hello everyone! I want to connect my S3 data to Apache Pinot. Can someone guide me on this? Is it possible through Helm, or will I have to create an ingestion job? Currently, we don't use Kafka.
  @mark.needham: There's a guide that shows how to import S3 files here --
  @fizza.abid: And can you tell me where we need to run this command? I have configured it using Helm and deployed it on Kubernetes.
  @tisantos: @fizza.abid you just need to create a Pinot table with a table config containing the S3 ingestion properties. You can schedule the ingestion via the `schedule` property, or you can trigger it manually via the controller REST API.
  @tisantos: Check the /task/schedule API in swagger
  @npawar: @tisantos I believe Mark’s steps point to the LaunchIngestionJob command and not the minion-based ingestion.
  @tisantos: Ah, I believe you're correct. In that case you should be able to ssh into your controller and execute the `pinot-admin.sh` script in the /bin directory.
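A rough sketch of that standalone-job route, assuming a purely hypothetical S3 bucket, paths, table name, and region; the spec mirrors the layout in the Pinot batch ingestion docs, so double-check it against your version:
```
# s3-job-spec.yaml (all bucket names and paths are hypothetical)
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/raw/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 's3://my-bucket/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```
It can then be launched from any host or pod that has the Pinot distribution and network access to the controller, e.g. `bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/s3-job-spec.yaml`.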
@arekchmura: @arekchmura has joined the channel
@luisfernandez: in the Pinot docs we have this, but how is this starting the Pinot infra via the IDE? I guess that ultimately my question is how I can attach a remote debugger to my local Pinot processes
  @mayanks: It is suggesting to start the `quickStart` program, which internally starts all Pinot components within the same JVM. You can run and debug Pinot in the IDE as you would any application.
  @luisfernandez: I got this exception, do you know why it may be happening?
  @luisfernandez: ```Instance 0.0.26.108_9000 is not leader of cluster QuickStartCluster due to exception happen when session check org.I0Itec.zkclient.exception.ZkInterruptedException: java.lang.InterruptedException```
  @luisfernandez: I was trying to run the empty quick start mode
  @mayanks: This is a newer feature. @kennybastani any idea on what might be going on?
  @kennybastani: @luisfernandez What command are you using to start Pinot?
  @luisfernandez: like this: `sh pinot-admin.sh QuickStart -type EMPTY -dataDir "usr/local/var/lib/pinot/data"`
  @kennybastani: Do you have ZK running externally?
  @luisfernandez: no
  @luisfernandez: but I also don't have ZooKeeper running locally. Do I have to run ZK manually first? I thought this would start ZK for me
  @kennybastani: Yes, it will
  @kennybastani: One sec
  @kennybastani: Please run this command
  @kennybastani: `netstat -vanp tcp | grep '*.2123\|9000\|8000\|7000'`
  @kennybastani: And let me know what the output is
  @kennybastani: Also, `ls /usr/local/var/lib/pinot/data/rawdata`
  @kennybastani: @luisfernandez Let me know if you got it solved. Happy to jump on a call if you need help with anything.
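On the original remote-debugger question: one common approach is to pass a JDWP agent to the JVM before launching the quick start, assuming your copy of `pinot-admin.sh` passes `JAVA_OPTS` through to the JVM (worth verifying in the script); the port is arbitrary:
```
# enable a remote debug agent on port 5005, then start the all-in-one quick start
export JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
sh pinot-admin.sh QuickStart -type EMPTY
```
Then attach the IDE's remote JVM debugger to localhost:5005; alternatively, run the quick start class directly from the IDE and use a regular debug session, as suggested above.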
@diogo.baeder: Hi folks! This could probably be a question more geared towards @ken, but I'll ask broadly anyway: is there any documentation available about how to implement ad-hoc segment replacement, in terms of what this flow would be? I'll follow up in this thread.
  @diogo.baeder: What I want to have is a single table that holds data for multiple regions and sectors within those regions, and I also want to be able to partition the data by region and sector. The problem is that with the daily ingestion I would do, I would end up with far too many segments, and they would be too small - most of them not even 1MB of data. So I thought about using merge rollups - which some here recommended to me - however that would probably just merge everything together for each bucket, thus defeating my partitioning per region and sector. Then I thought I could just implement the rolling up of these segments myself. The problem, though, is that I have no idea how this works: how do I "build a segment"? Do I just create a batch job for each rolled-up segment, and then delete the old tiny ones? What's the recommended way to approach this?
  @mayanks:
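One possible flow for the manual roll-up described above (a sketch, not the only way, and all names are hypothetical): group the source data for each region/sector/time bucket, run a standalone batch ingestion job per group to build and push the merged segment, then remove the old tiny segments via the controller REST API:
```
# 1. Build and push a merged segment for one region/sector/time bucket
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile rollup-emea-tech-2022-03.yaml

# 2. Delete an old tiny segment that the merged one replaces
curl -X DELETE "http://pinot-controller:9000/segments/myTable_OFFLINE/oldTinySegmentName"
```
If a new segment reuses the exact name of an existing one, the push overwrites it in place; otherwise the leftovers need to be deleted explicitly as in step 2.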
@noiarek: @noiarek has joined the channel
@ysuo: Hi team, my table segments show a bad status. Queries on this table return a 305 error and the segments are not available. I reset all segments and it didn't work. What should I do in this case? Thanks.

#pinot-s3


@adam.hutson: @adam.hutson has joined the channel

#pinot-k8s-operator


@adam.hutson: @adam.hutson has joined the channel

#onboarding


@adam.hutson: @adam.hutson has joined the channel

#feat-geo-spatial-index


@adam.hutson: @adam.hutson has joined the channel

#transform-functions


@adam.hutson: @adam.hutson has joined the channel

#custom-aggregators


@adam.hutson: @adam.hutson has joined the channel

#inconsistent-perf


@adam.hutson: @adam.hutson has joined the channel

#docs


@adam.hutson: @adam.hutson has joined the channel

#aggregators


@adam.hutson: @adam.hutson has joined the channel

#tmp


@adam.hutson: @adam.hutson has joined the channel

#query-latency


@adam.hutson: @adam.hutson has joined the channel

#dhill-date-seg


@adam.hutson: @adam.hutson has joined the channel

#enable-generic-offsets


@adam.hutson: @adam.hutson has joined the channel

#pinot-dev


@adam.hutson: @adam.hutson has joined the channel

#community


@adam.hutson: @adam.hutson has joined the channel

#feat-pravega-connector


@adam.hutson: @adam.hutson has joined the channel

#announcements


@adam.hutson: @adam.hutson has joined the channel

#s3-multiple-buckets


@adam.hutson: @adam.hutson has joined the channel

#release-certifier


@adam.hutson: @adam.hutson has joined the channel

#multiple_streams


@adam.hutson: @adam.hutson has joined the channel

#lp-pinot-poc


@adam.hutson: @adam.hutson has joined the channel

#roadmap


@adam.hutson: @adam.hutson has joined the channel

#presto-pinot-connector


@adam.hutson: @adam.hutson has joined the channel

#multi-region-setup


@adam.hutson: @adam.hutson has joined the channel

#metadata-push-api


@adam.hutson: @adam.hutson has joined the channel

#pql-sql-regression


@adam.hutson: @adam.hutson has joined the channel

#latency-during-segment-commit


@adam.hutson: @adam.hutson has joined the channel

#pinot-realtime-table-rebalance


@adam.hutson: @adam.hutson has joined the channel

#release060


@adam.hutson: @adam.hutson has joined the channel

#time-based-segment-pruner


@adam.hutson: @adam.hutson has joined the channel

#discuss-validation


@adam.hutson: @adam.hutson has joined the channel

#segment-cold-storage


@adam.hutson: @adam.hutson has joined the channel

#new-office-space


@adam.hutson: @adam.hutson has joined the channel

#config-tuner


@adam.hutson: @adam.hutson has joined the channel

#test-channel


@adam.hutson: @adam.hutson has joined the channel

#pinot-perf-tuning


@adam.hutson: @adam.hutson has joined the channel

#thirdeye-pinot


@adam.hutson: @adam.hutson has joined the channel

#getting-started


@mannamra: @mannamra has joined the channel
@nrajendra434: @nrajendra434 has joined the channel
@fding: @fding has joined the channel
@fizza.abid: @fizza.abid has joined the channel
@arekchmura: @arekchmura has joined the channel
@luisfernandez: I'm trying to import at least 2 years' worth of data and was hoping to get some guidance on how to go about this. I have been taking a look at the ingestion job framework; is this the way to go? What are some of the considerations we have to make when doing these backfills? I see that the data is divided into folders, one per day, and each of these days will be a segment in Pinot, is that right? How do we ensure that the data we are ingesting will still perform well? And what are some tips you could give when moving a lot of data?
  @xiangfu0: The general guideline is to pre-partition the data by date; then you will have multiple raw data files per day, and each data file will become one Pinot segment (1:1 mapping).
  @xiangfu0: For ingestion, the segment creation and push are external processes, or you can start a set of Pinot minion nodes to do the job
  @xiangfu0: That will not impact your runtime Pinot servers
  @xiangfu0: For the data push, set the push parallelism so that you won't exhaust the Pinot controller.
  @luisfernandez: Right, as explained here. In short, as you said, each of those files will be a segment. How do I know my segment size is okay?
  @luisfernandez: for each of the files
  @luisfernandez: Right now we have a hybrid model, and these are our configs for the current segments on the realtime side of it:
  @luisfernandez: ``` "realtime.segment.flush.threshold.rows": "0", "realtime.segment.flush.threshold.time": "24h", "realtime.segment.flush.segment.size": "250M"```
  @luisfernandez: Another question that I had is how these configs impact the offline table: ```"ingestionConfig": { "batchIngestionConfig": { "segmentIngestionType": "APPEND", "segmentIngestionFrequency": "HOURLY" } }```
  @mayanks: ```segmentIngestionType - Used for data retention
segmentIngestionFrequency - Used to compute the time boundary for hybrid tables```
  @luisfernandez: thank you mayank
  @luisfernandez: Also, to explain our current setup, we have this: ```realtime table with 7 days retention, offline table with 2 years retention (realtime data is eventually moved here)``` We want to backfill the offline table with data from the system we are moving away from. Is this the way people usually do it, or do we usually create another offline table that does backfilling only?
  @mayanks: You can backfill a hybrid table.
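To tie the earlier points together: the push parallelism mentioned above lives in the ingestion job spec's `pushJobSpec`; the values below are only examples:
```
pushJobSpec:
  pushParallelism: 2          # keep this modest so the controller isn't overwhelmed
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```
For the backfill itself, segments pushed to the OFFLINE side of the hybrid table are served for any time range older than the time boundary derived from `segmentIngestionFrequency`, so no separate backfill-only table is needed, as noted above.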
@noiarek: @noiarek has joined the channel

#feat-partial-upsert


@adam.hutson: @adam.hutson has joined the channel

#pinot_website_improvement_suggestions


@adam.hutson: @adam.hutson has joined the channel

#segment-write-api


@adam.hutson: @adam.hutson has joined the channel

#releases


@adam.hutson: @adam.hutson has joined the channel

#metrics-plugin-impl


@adam.hutson: @adam.hutson has joined the channel

#debug_upsert


@adam.hutson: @adam.hutson has joined the channel

#flink-pinot-connector


@adam.hutson: @adam.hutson has joined the channel

#pinot-rack-awareness


@adam.hutson: @adam.hutson has joined the channel

#minion-improvements


@adam.hutson: @adam.hutson has joined the channel

#fix-numerical-predicate


@adam.hutson: @adam.hutson has joined the channel

#complex-type-support


@adam.hutson: @adam.hutson has joined the channel

#fix_llc_segment_upload


@adam.hutson: @adam.hutson has joined the channel

#product-launch


@adam.hutson: @adam.hutson has joined the channel

#pinot-docsrus


@adam.hutson: @adam.hutson has joined the channel

#pinot-trino


@adam.hutson: @adam.hutson has joined the channel

#kinesis_help


@adam.hutson: @adam.hutson has joined the channel

#udf-type-matching


@adam.hutson: @adam.hutson has joined the channel

#jobs


@adam.hutson: @adam.hutson has joined the channel