Apache Pinot Daily Email Digest (2021-06-14)

Pinot Slack Email Digest Mon, 14 Jun 2021 19:00:38 -0700

#general

@hamza.senoussi: Hello, I'm running Pinot in K8s and I have a job that creates my table and schema, and another job that does the ingestion from GCS storage. This last job creates segments and stores them in a GCS bucket. Is there a way for later runs to load these segments directly from the folder without recreating them?
@mayanks: You can choose the job type that only does the push
@hamza.senoussi: that works well, thanks !
@mayanks: You probably want to use metadata+uri push since you are using deepstore. This way only the segment metadata is pushed to the controller, and the servers download the segments from deepstore.
@karinwolok1: Happy Monday everyone! We have an online meetup scheduled for tomorrow if anyone is interested. :slightly_smiling_face:
@jai.patel856: What’s the option called to disable upsert?
@jai.patel856: looks like it’s skipUpsert
@jai.patel856: FYI… sample image in the docs still shows disableUpsert
@g.kishore: @yupeng ^^
@chundong.wang: Is there any document on how theta-sketch columns should be generated? The docs of `DistinctCountThetaSketch` mention `thetaSketchColumn`. Is that column supposed to be the serialized binary (hex string, I suppose) of the Theta Sketch framework?
```
UpdateSketch sketch2 = UpdateSketch.builder().build();
for (int key = 50000; key < 150000; key++) {
  sketch2.update(key);
}
// try-with-resources closes the stream even if the write fails
try (FileOutputStream out2 = new FileOutputStream("ThetaSketch2.bin")) {
  out2.write(sketch2.compact().toByteArray()); // or hexString()
}
```
@mayanks: `sketch.compact().toByteArray()`
@mayanks: Of course, it needs to use the same datasketch library as Pinot uses.
@mayanks: Mind giving back to the community by adding it to the docs, for the next guy who has this question?
@mayanks: (Perhaps after you have verified it works - so you can add more info as needed)
@chundong.wang: For sure. For your context, I’m trying to figure out a correct way to do moving window DistinctCount outside Pinot.
@chundong.wang: Looks like `DISTINCTCOUNTRAWHLL` and `DistinctCountRawThetaSketch` both provide a hex string that an application could process further.
@mayanks: Thank you. You can join #pinot-docsrus for instructions on how to add it.
@mayanks: The hex string is on the retrieval side.
@chundong.wang: ^^ I know. My understanding is, 1. You’d need to build a binary string with `sketch.compact().toByteArray()` as a column; 2. You’d need to do `distinctCountThetaSketch` to get count, with `postAggregationExpressionToEvaluate` which in most cases would match `where` clause and would be evaluated on brokers; 3. You could get the raw data via `DistinctCountRawThetaSketch` in query for HexEncoded Serialized Sketch Bytes.
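The thread above distinguishes the ingestion side (raw bytes from `sketch.compact().toByteArray()`) from the retrieval side (a hex string). As a rough illustration of moving between the two forms, here is a minimal JDK-only sketch. The class name `SketchHexCodec` and the sample bytes are hypothetical placeholders, not a real serialized sketch, and the exact encoding Pinot returns should be checked against the version in use. The decoded bytes would then be handed to the same DataSketches version that Pinot uses.

```java
import java.util.Arrays;

// Hypothetical helper: converts between raw bytes and a lowercase hex string.
// Written without java.util.HexFormat so it also compiles on Java 8.
final class SketchHexCodec {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Encode each byte as two hex characters.
    static String toHex(byte[] bytes) {
        char[] out = new char[bytes.length * 2];
        for (int i = 0; i < bytes.length; i++) {
            int v = bytes[i] & 0xFF;
            out[2 * i] = HEX[v >>> 4];
            out[2 * i + 1] = HEX[v & 0x0F];
        }
        return new String(out);
    }

    // Decode a hex string (upper or lower case) back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Not a hex digit at index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for sketch.compact().toByteArray().
        byte[] serialized = {0x01, 0x7f, (byte) 0x80, (byte) 0xff};
        String hex = toHex(serialized);
        System.out.println(hex);
        System.out.println(Arrays.equals(serialized, fromHex(hex)));
    }
}
```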

#troubleshooting

@luanmorenomaciel: hi experts, i've a running realtime table that gets data from kafka running, today when I checked for new incoming data, i've got this error, any ideas what that could be? ```21/06/11 21:08:15.665 ERROR [LLRealtimeSegmentDataManager_realtime_enriched_music_data_users__7__0__20210611T2008Z] [realtime_enriched_music_data_users__7__0__20210611T2008Z] Could not build segment java.lang.IllegalArgumentException: Invalid format: "2021-06-10 10:42:25" is too short at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.writeMetadata(SegmentColumnarIndexCreator.java:552) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.seal(SegmentColumnarIndexCreator.java:512) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.handlePostCreation(SegmentIndexCreationDriverImpl.java:284) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.build(SegmentIndexCreationDriverImpl.java:257) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.realtime.converter.RealtimeSegmentConverter.build(RealtimeSegmentConverter.java:131) 
~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentInternal(LLRealtimeSegmentDataManager.java:794) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.buildSegmentForCommit(LLRealtimeSegmentDataManager.java:728) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:634) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292] 2021/06/11 21:08:15.665 ERROR [LLRealtimeSegmentDataManager_realtime_enriched_music_data_users__4__0__20210611T2008Z] [realtime_enriched_music_data_users__4__0__20210611T2008Z] Could not build segment java.lang.IllegalArgumentException: Invalid format: "2021-06-10 10:42:25" is too short at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.writeMetadata(SegmentColumnarIndexCreator.java:552) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentColumnarIndexCreator.seal(SegmentColumnarIndexCreator.java:512) 
~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435] at org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl.handlePostCreation(SegmentIndexCreationDriverImpl.java:284) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-2de40fde8051c2c0281416c2da11c179c2190435]```
@mayanks: The error states: ` Invalid format: "2021-06-10 10:42:25" is too short`
@luanmorenomaciel: What could that mean? The job was actually working with no issues.
@mayanks: this is while creating segment from realtime stream (which happens periodically). You may want to check if your datetime fieldspec is defined correctly and the incoming stream follows the format you specified
@luanmorenomaciel: hmm great tip i'll do that @mayanks thank you for the tip
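An "is too short" parse error typically means the input ended before the configured pattern was fully matched. The pattern in the user's schema is not shown in the thread, so the second pattern below is only an assumed example of such a mismatch. This sketch uses `java.time` as a stand-in for the Joda-Time parser in the stack trace, and the class name `DateTimePatternCheck` is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper: checks whether a sample value from the stream
// can be parsed with the pattern declared for the time column.
final class DateTimePatternCheck {
    static boolean matches(String value, String pattern) {
        try {
            LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "2021-06-10 10:42:25"; // value taken from the error message

        // The pattern covers exactly the fields present in the value.
        System.out.println(matches(value, "yyyy-MM-dd HH:mm:ss"));

        // Assumed mismatch: the pattern expects milliseconds the value lacks.
        System.out.println(matches(value, "yyyy-MM-dd HH:mm:ss.SSS"));
    }
}
```

Running a few sample values from the topic through a check like this against the declared format can show whether the producer changed its timestamp layout.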
@elon.azoulay: Hi, we just experienced an issue where a server restarted, and when downloading a segment from GCS it threw a `java.io.IOException: Input is not in the .gz format`. We saw that the segment was just being written to GCS at that moment. Once I deleted the segment on the server and restarted, it downloaded the segment without any issues. Has anyone ever experienced that before? I can create a GitHub issue with some ideas for fixes...
@mayanks: Was it an existing segment being overwritten? Typically, the .tar.gz should be ready before server attempts to read it
@elon.azoulay: yep, it was existing
@elon.azoulay: And right after the error occurred I manually downloaded from gcs, and it was valid tar.gz
@elon.azoulay: Could it have been in the middle of overwriting the old version with the new one? i.e. GCS is not like a POSIX filesystem, so maybe there were some concurrency issues?
@mayanks: Yeah i think that is what it is. Is a move atomic in GCS?
@elon.azoulay: There is no move, it's copy and delete
@elon.azoulay: Since it uses streaming to write it, could it have been in the middle of writing or deleting?
@mayanks: If there are no atomic constructs to do so, I am unsure how to handle this.
@mayanks: Not that I have put much thought to it
@elon.azoulay: Maybe write some other "signal" file? or updated it (gcs has generation # in metadata) - i.e. just a pointer to the real file?
@elon.azoulay: yeah, I just put 1 thought into it right now :rolling_on_the_floor_laughing:
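One defensive option, separate from the "signal file" idea above and not something the thread says Pinot does, is for a downloader to check that a fetched file is a complete gzip stream before using it, and to retry the download if it is not. The following JDK-only sketch uses the hypothetical class name `GzipIntegrityCheck` and works on in-memory bytes for brevity. It reads the stream to the end so that both a wrong header and a truncated body are detected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: validates that a byte array is a complete gzip stream.
final class GzipIntegrityCheck {
    // Returns true only if the header is valid and the stream can be read
    // through to its trailer; a partially written file fails this check.
    static boolean isCompleteGzip(byte[] data) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // Drain the stream; decompressed content is not needed here.
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Test utility: gzip-compresses a payload in memory.
    static byte[] gzip(byte[] payload) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] whole = gzip("segment bytes".getBytes(StandardCharsets.UTF_8));
        // Simulate a reader that saw the object while it was still being written.
        byte[] truncated = Arrays.copyOf(whole, whole.length / 2);

        System.out.println(isCompleteGzip(whole));
        System.out.println(isCompleteGzip(truncated));
    }
}
```

A check like this only detects the problem after the fact. It does not make the overwrite itself atomic, which is the open question in the thread.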

#pinot-dev

@fx19880617: I’ve merged the Java 11 upgrade PR() today, so upgrading to Java 11 is recommended for dev. Right now we still need to support Java 8 compilation so that existing users can upgrade, so don’t use Java 11 language features until further notice, when we drop Java 8 support. For devs still on Java 8, please use the command below to build your project. ```mvn clean install -DskipTests -Pbin-dist -T 4 -Djdk.version=8``` Thanks a lot to @elon.azoulay for making this happen!

#pinot-docsrus

@chundong.wang: @chundong.wang has joined the channel