#general
@punnoose19: @punnoose19 has joined the channel
@vikas.rajoria29: @vikas.rajoria29 has joined the channel
@taranrishit1234: @taranrishit1234 has joined the channel
@taranrishit1234: Hello, I'm unable to start Pinot due to the error below:
```
Administrator@EC2AMAZ-6IDI4LG /cygdrive/c/users/Administrator/documents/apache-inot-incubating-0.6.0-bin/apache-pinot-incubating-0.6.0-bin
$ bin/pinot-admin.sh StartController -zkAddress localhost:2191 -controlerPort 9000
Unrecognized VM option 'PrintGCDateStamps'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```
I'm stuck at the second step in this ->
@fx19880617: I think this is because your machine is using Java 11
@fx19880617: ```export JAVA_OPTS="-Xms4G -Xmx8G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-controller.log"
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2191 \
  -controllerPort 9000```
@fx19880617: You can try to remove those flags from the JAVA_OPTS
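For reference, the Java 8 GC-logging flags (`-XX:+PrintGCDetails`, `-XX:+PrintGCDateStamps`, etc.) were removed in Java 11 in favor of unified logging, so a Java 11-compatible variant of the above might look like this (a minimal sketch; heap sizes are placeholders to tune for your machine):
```
# Java 11 replaced the old GC-logging flags with unified logging (-Xlog).
# Heap sizes are placeholders; adjust them for your hardware.
export JAVA_OPTS="-Xms4G -Xmx8G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xlog:gc*:file=gc-pinot-controller.log"
bin/pinot-admin.sh StartController \
  -zkAddress localhost:2191 \
  -controllerPort 9000
```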
@amitchopra: Hi, I saw there is an open ticket for supporting Kinesis (
@g.kishore: we haven't prioritized it as we haven't seen users ask for it. Do you need kinesis?
@amitchopra: yes, in our case we are big consumers of Kinesis. And thus was checking if we can leverage it to ingest
@g.kishore: can you vote up that issue and add your requirement? We will review it and get back with an ETA. @kharekartik has already done some work
@amitchopra: ok, cool. Thanks. Let me do that
@whatatrip888: Is there any configuration to increase the LIMIT from 10 to a higher number?
@mayanks: You can specify it in the query (eg LIMIT 1000).
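For example, via the broker's SQL endpoint (a sketch assuming the default broker port 8099 and a hypothetical table `myTable`):
```
# POST a query with an explicit LIMIT to the broker's SQL endpoint.
# Broker host/port and table name are placeholders.
curl -s -X POST http://localhost:8099/query/sql \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT * FROM myTable LIMIT 1000"}'
```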
@gloetscher: @gloetscher has joined the channel
@whatatrip888: I was working on star-tree indexing. While loading the data, I got the following issue. The table and schemas are attached in this thread. I was trying to load 30 records. One of the star-tree index columns is MSISDN (cardinality: 10), along with TARIFF_PLAN (cardinality: 5)
@npawar: star tree doesn’t work for Multi Value columns. Sorry about the error not being more explicit, we’re adding it to our validations when creating the table. Could you try without the TARIFF_PLAN multi value column?
@g.kishore: also don't forget to add the time column to the split order
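Putting both suggestions together, the star-tree section of the table config might look like this (a sketch; `DATE_TIME` stands in for the actual time column, and `maxLeafRecords` is an illustrative placeholder):
```
# Sketch of tableIndexConfig.starTreeIndexConfigs with only single-value
# columns and the time column in the split order. DATE_TIME is hypothetical.
cat > star-tree-snippet.json <<'EOF'
{
  "tableIndexConfig": {
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": ["MSISDN", "DATE_TIME"],
        "skipStarNodeCreationForDimensions": [],
        "functionColumnPairs": ["COUNT__*"],
        "maxLeafRecords": 10000
      }
    ]
  }
}
EOF
```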
@karinwolok1: Just a reminder this Wednesday if it sounds like something you're interested in! :bell:
@graham: Hello all, I am encountering an error when trying to run the sample batch job located here
@fx19880617: can you check what’s `JAVA_OPTS` in your container ?
@graham: Sure, one moment
@graham: ```JAVA_OPTS=-Xms256M -Xmx1G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/opt/pinot/gc-pinot-controller.log -Dlog4j2.configurationFile=/opt/pinot/conf/pinot-controller-log4j2.xml -Dplugins.dir=/opt/pinot/plugins -Dplugins.include=pinot-gcs```
@fx19880617: ah
@fx19880617: can you try removing `-Dplugins.include=pinot-gcs`, or setting `-Dplugins.include=pinot-gcs,pinot-csv,pinot-batch-ingestion-standalone`?
@graham: Can do one moment
@fx19880617: ```JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins" bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile job.yaml```
@fx19880617: you can just run this in the container
@graham: ```root@pinot-controller-0:/opt/pinot# JAVA_OPTS="-Dplugins.dir=/opt/pinot/plugins" bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /var/pinot/controller/data/job.yml
2020/11/30 18:21:31.169 ERROR [LaunchDataIngestionJobCommand] [main] Got exception to kick off standalone data ingestion job -
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
  at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:144) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:123) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:156) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:168) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
Caused by: java.lang.NullPointerException
  at shaded.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at com.google.cloud.storage.BucketInfo$BuilderImpl.build(BucketInfo.java:1313) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at com.google.cloud.storage.BucketInfo.of(BucketInfo.java:1755) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:209) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.plugin.filesystem.GcsPinotFS.getBucket(GcsPinotFS.java:87) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.plugin.filesystem.GcsPinotFS.isDirectory(GcsPinotFS.java:358) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:154) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  ... 4 more
2020/11/30 18:21:31.177 ERROR [PinotAdministrator] [main] Exception caught:
java.lang.RuntimeException: Caught exception during running - org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner
  at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:144) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.runIngestionJob(IngestionJobLauncher.java:113) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand.execute(LaunchDataIngestionJobCommand.java:123) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.tools.admin.PinotAdministrator.execute(PinotAdministrator.java:156) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.tools.admin.PinotAdministrator.main(PinotAdministrator.java:168) [pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
Caused by: java.lang.NullPointerException
  at shaded.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at com.google.cloud.storage.BucketInfo$BuilderImpl.build(BucketInfo.java:1313) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at com.google.cloud.storage.BucketInfo.of(BucketInfo.java:1755) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at com.google.cloud.storage.StorageImpl.get(StorageImpl.java:209) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.plugin.filesystem.GcsPinotFS.getBucket(GcsPinotFS.java:87) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.plugin.filesystem.GcsPinotFS.isDirectory(GcsPinotFS.java:358) ~[pinot-gcs-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner.run(SegmentGenerationJobRunner.java:154) ~[pinot-batch-ingestion-standalone-0.7.0-SNAPSHOT-shaded.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  at org.apache.pinot.spi.ingestion.batch.IngestionJobLauncher.kickoffIngestionJob(IngestionJobLauncher.java:142) ~[pinot-all-0.7.0-SNAPSHOT-jar-with-dependencies.jar:0.7.0-SNAPSHOT-318c1077bb4a8aa74a03acad8f236aff8eb9fa0d]
  ... 4 more```
@fx19880617: I think you need to specify the bucket in the input/output dir
@fx19880617: same for controller config
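For reference, a job spec with the GCS bucket spelled out in the input/output URIs might look like this (a minimal sketch; bucket, paths, project, key file, table name, and controller URI are all hypothetical):
```
# Sketch of a GCS-based ingestion job spec. The NPE above comes from a
# missing bucket name, so inputDirURI/outputDirURI carry the full gs:// URI.
cat > job.yaml <<'EOF'
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 'gs://my-bucket/rawdata/'
outputDirURI: 'gs://my-bucket/segments/'
pinotFSSpecs:
  - scheme: gs
    className: org.apache.pinot.plugin.filesystem.GcsPinotFS
    configs:
      projectId: 'my-project'
      gcpKey: '/path/to/keyfile.json'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
EOF
```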
@graham: Yeah, I was doing some tinkering earlier while I was troubleshooting. Let me clean that up before re-running
@fx19880617: sure
@npawar: @fx19880617 @dlavoie ^^
@pb521: @pb521 has joined the channel
@pb521: Q: is it possible to query non-integer percentiles? I'm interested in calculating P99.9, but looking at the
@jackie.jxt: Yes, we can support that. Can you please file an issue on GitHub? This feature requires some change on the function name parsing, should be simple. Contributions are very welcome
@mayanks: @jackie.jxt with functions now taking multiple args, do we still need to rely on name parsing?
@jackie.jxt: Good point, we just need to parse double
@fx19880617: right now the percentile only takes the integer
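For context, the percentile is currently baked into the function name, which is why only integers work (a sketch; table and column names are hypothetical):
```
# PERCENTILE99 works because 99 is parsed out of the function name;
# there is no way to express 99.9 this way today.
curl -s -X POST http://localhost:8099/query/sql \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT PERCENTILE99(latencyMs) FROM myTable"}'
```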
@pb521: Ok, thanks, I filed
@mayanks: @amrish.k.lal in case you want to take up ^^
@amrish.k.lal: Yes, I can take a look as soon as my pinot-controller test cleanups are done (close to wrapping that up)
@myeole: @myeole has joined the channel
@ken: Question about batch import job. When running a LaunchDataIngestionJob, I see the S3-based file(s) being ingested are copied first to a temp directory on my local machine. Assuming I've set up a k8s-based cluster via EKS, is there a way to ingest directly from S3? I seem to recall some option to do this, which would be much more efficient.
@fx19880617: The main motivation for that is that Pinot reads the file twice and needs to rewind the input stream. That could cause issues when we have no knowledge about the upstream, so we always copy the file to local first. It's also questionable whether reading directly from remote storage is efficient.
@fx19880617: We can add an option to allow setting the input file path directly, and you can compare the results
@ken: OK, thanks. So in my situation, where I’m running this LaunchDataIngestionJob to pull in lots of big files from S3 to a k8s-based cluster running in AWS, what’s going to be most efficient currently? I guess I could spin up another beefy EC2 instance, and run the command from that server versus my (home office) laptop.
@fx19880617: hmm, how many segments do you have? One thing you can try is to start a pinot-ingestion job as a k8s batch job, so you can give resources to the container. Here is one example:
@fx19880617: Typically we want to avoid copying S3 data to your local laptop. It's a good idea to have an EC2 instance and run the command from there.
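A rough sketch of the k8s batch-job approach mentioned above (image tag, paths, and resources are placeholders; this assumes the Pinot image's entrypoint dispatches to `pinot-admin.sh`, and the job spec would need to be mounted into the pod, e.g. via a ConfigMap):
```
# Hypothetical k8s Job that runs the ingestion inside the cluster so
# nothing is copied through a laptop. All names/values are placeholders.
cat > pinot-ingestion-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: pinot-batch-ingestion
spec:
  template:
    spec:
      containers:
        - name: pinot-ingestion
          image: apachepinot/pinot:latest
          args: ["LaunchDataIngestionJob", "-jobSpecFile", "/var/pinot/jobs/job.yaml"]
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
      restartPolicy: Never
EOF
kubectl apply -f pinot-ingestion-job.yaml
```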
@ken: Yes, exactly.
@karinwolok1: Lets welcome the newest Pinot :wine_glass: slack members!! Hello :wave: @pb521 @myeole @gloetscher @punnoose19 @me1630 @me1189 @jose.roca @graham @dovydas @achraibi @1amb4a @rishbits1994 @rajkath @vikas.rajoria29 @taranrishit1234 @ami.oneworld @jakob.edding @mats.poerschke @farshad @amitchopra @divya.sudhakar429 :smiley:
#random
@punnoose19: @punnoose19 has joined the channel
@vikas.rajoria29: @vikas.rajoria29 has joined the channel
@taranrishit1234: @taranrishit1234 has joined the channel
@gloetscher: @gloetscher has joined the channel
@pb521: @pb521 has joined the channel
@myeole: @myeole has joined the channel
#feat-rt-seg-complete
@harrynet222: @harrynet222 has joined the channel
#troubleshooting
@harrynet222: @harrynet222 has joined the channel
@abfisher0417: @abfisher0417 has joined the channel
@joao.comini: @joao.comini has joined the channel
@joao.comini: Hello guys, how are you? I'm having some trouble understanding the results from the `RealtimeProvisioningHelper`; could you help me? These are my doubts:
• Why do we need a `numHours` parameter? What's the impact of having a consuming segment for a certain amount of time (pros/cons)?
• And what does `Mapped` mean in the `Memory used per host` result? Is it about the segments on disk?
These are the results that I got:
```
RealtimeProvisioningHelper -tableConfigFile /tmp/transaction-table.json -numPartitions 20 -pushFrequency null -numHosts 4,8,12,16,20 -numHours 24,48,72,96 -sampleCompletedSegmentDir /tmp/out/transaction_1606528528_1606614928_0 -ingestionRate 4 -maxUsableHostMemory 16G -retentionHours 768
Note:
* Table retention and push frequency ignored for determining retentionHours since it is specified in command
* See
```
@npawar: Hey @joao.comini, thanks for sharing so many details
@npawar: Have you already gone through this:
@joao.comini: Yes, a lot, and my doubts still persist haha
@npawar: numHours indicates the number of hours the realtime segment will be in CONSUMING state. In this state, all the ingested data is in memory. Periodically, based on thresholds, the data gets converted to a completed segment and flushed onto disk. Now this numHours should be set based on a few factors:
1. Retention of your kafka stream. If your kafka topic retains data for 24h, then you don't want to set numHours in Pinot to more than 24h. If the pinot-server gets restarted, it has to reconsume everything from the last checkpoint, and Pinot will rely on the kafka stream to have all that data.
2. You want to keep numHours reasonably low. The more the segment consumes, the bigger the segment it needs to create, and segment creation is memory-intensive. In case of pinot-server restarts, the server has to reconsume everything, so again, a reasonably low numHours is desired.
@npawar: Typically, we don’t recommend increasing this more than 24h
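In the table config, this maps to the realtime flush thresholds inside `streamConfigs`; a sketch with illustrative values (keys assumed from the standard stream config, not taken from this thread):
```
# Sketch of the flush-threshold entries inside streamConfigs.
# time: flush after at most 24h of consuming; size 0 defers to
# desired.size, which targets completed segments of roughly 200MB.
cat > stream-config-snippet.json <<'EOF'
{
  "streamConfigs": {
    "realtime.segment.flush.threshold.time": "24h",
    "realtime.segment.flush.threshold.size": "0",
    "realtime.segment.flush.desired.size": "200M"
  }
}
EOF
```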
@joao.comini: Oh, right, got it! What about the number of segments queried? If numHours is low, i will have a lot more segments right?
@npawar: btw, if you’re fairly new to the realtime segment concepts, this video might help in making sense of some of these terms
@joao.comini: Thanks! I'll take a look :)
@ssubrama: @joao.comini all very good questions. Rows are stored in uncompressed format while the segment is consuming, but are compressed after it is completed. So, if you consume for a longer time, you take up more volatile memory. Also, like Neha mentioned, if you need to restart the server, the rows are consumed from the start of the segment again.
@npawar: yes that is true. But looks like the max in your case is 320?
@ssubrama: On the other hand, if you set the numhours to be too low, then as you pointed out, you get too many segments. That can be bad for query processing, esp in high qps use cases.
@npawar: which is quite small
@ssubrama: All the segments still within the retention period are in memory (mapped), as are the consuming segments. That is the total mapped memory. The active memory is estimated as the most recent 768 hours of data (as specified by you on the command line).
@joao.comini: Nice! Thank you guys.
@joao.comini: Oh, i see. Right, i'll watch Neha's video and see if i get more doubts.
@npawar: that video is only for beginner concepts about realtime consumption and segments. Does not cover the provisioning helper, but please watch it regardless hah :slightly_smiling_face:
@npawar: Also unrelated, if your ingestion rate is only 4, do you really need 20 partitions?
@npawar: the partitioning factor can also help with concerns about too many segments
@joao.comini: Yes, this is one of my concerns. The company where I work is huge, and we kind of need this solution working quickly. The configuration of this topic is not in my team's control, so we would need a _middleman_ (Flink, Heron, etc.), but we don't have that much time hahaha
@joao.comini: We could create a simple Java application too, that does this work of moving from one topic to another
@joao.comini: Oh, one more thing: by `mapped` you mean that the segments aren't in memory right? The segments are in the server's disk and mapped in memory, am i missing something?
@joao.comini: I'm asking this because I want to know how much disk space i need to get for my servers.
@ssubrama: yes, by "mapped" it means that the files in disk are mapped using `mmap`
@joao.comini: Right, now i feel much more comfortable haha
@joao.comini: Last one (really, i promise): and what about resource requests and limits in kubernetes? If the active memory + consuming memory use is about 8G, how much extra memory would I need for the off-heap computations?
@joao.comini: Should I ask kubernetes for 20G and set -Xms and -Xmx to 16G for safety? (Just a hypothetical example)
@npawar: good question. @fx19880617 any recommendations based on this ^^
@fx19880617: I would put 4 CPU (if you have a high-qps use case, increase this) and 32GB RAM for request/limit, and set -Xms and -Xmx both to 16G
@npawar: curious on how you came up with these @fx19880617 ?
@joao.comini: Me too haha
@fx19880617: that’s the t3.2xlarge machine size
@fx19880617: 8 CPU, 32GB RAM
@fx19880617: in general I recommend containers with more ram
@fx19880617: so I put the ratio to 4cpu/32gb ram
@fx19880617: if you can pick a memory-optimized SKU like r5.xlarge, that's the exact fit
@joao.comini: hmmm, that's something that i'll need to talk about here, our nodes are all m5.xlarge
@fx19880617: hmm, then I would just do 4 CPU/16GB RAM for the container and set -Xms and -Xmx both to 8G.
@fx19880617: also, I would suggest using bigger machines like 2xlarge or 4xlarge
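Putting that sizing into pod terms for an m5.xlarge-class node, a sketch of the pinot-server container spec (4 CPU / 16GB, heap pinned to half so the rest is left for mmap'd segments; all names and values are placeholders):
```
# Hypothetical pinot-server container sizing per the advice above:
# request/limit the full 4 CPU / 16Gi, pin the heap to 8G, and leave
# the remaining memory for mmap'd segments and other off-heap use.
cat > server-resources-snippet.yaml <<'EOF'
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    cpu: "4"
    memory: "16Gi"
env:
  - name: JAVA_OPTS
    value: "-Xms8G -Xmx8G -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
EOF
```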
@g.kishore: @npawar ^^
#docs
@kennybastani: @kennybastani has left the channel
#pinot-dev
@harrynet222: @harrynet222 has joined the channel
#announcements
@harrynet222: @harrynet222 has joined the channel
#pinot-docs
@harrynet222: @harrynet222 has joined the channel
@gloetscher: @gloetscher has joined the channel
#config-tuner
@chinmay.cerebro: I've modified the PR based on this discussion. Here's the summary: ```1. Added a TableConfigTuner interface with explicit init and apply methods. 2. Added a new annotation type 'Tuner' to auto discover such classes. ```
@chinmay.cerebro: I haven't integrated it with RecommendedDriver yet. we can do that in the next PR or so
@chinmay.cerebro: please take a look when you get a chance. Thanks!
