Apache Pinot Daily Email Digest (2021-11-16)

Pinot Slack Email Digest Tue, 16 Nov 2021 18:00:31 -0800

#general

@mrpringle: I see version 0.9 is in rc, do we have a binary download link for this version? Some nice new features to try out.
@mayanks: It is not officially released yet, but should be shortly
@mayanks: Just curious, what features were you interested in trying out @mrpringle?
@mrpringle: looking at the last with timestamp aggregation function, we need this to do sums across pre aggregated totals
@ansi395958: @ansi395958 has joined the channel
@karinwolok1: :wave: Hello newbie Pinot community members! :wine_glass: :partying_face: We're happy to have you here! Curious on what you're working on and how you found Apache Pinot! Please introduce yourselves here in this thread! :smiley: @ansi395958 @shantanoo.sinha @julien.picard @aarti.gaddale187 @bowenzhu @brandon @gabriel.nau @waqasdilawardaha @maitreyi.kv @nicholas.nezis @dino.occhialini @scott.cohen @aaron.weiss @laabidi.raissi @nesrullayev.ali @akshay13jain @zaid.mohemmad @alisonjanedavey @dtong @raluca.lazar @andre578 @ayush.network @xinxinzhenbang @sumit.l @nsanthanam @cgregor @diogodssantos @mingfeng.tan @navi.trinity @stuartcoleman81 @stuart.coleman @ryan @shreya.chakraborty @joseph.roldan @folutade @jurio0 @priyam @randxiexyy29 @stavg @rohitdev.kulshrestha @hamsemxiao @vivek.bi @yeongjukang @mail9deep
@ashok.rex.2009: @ashok.rex.2009 has joined the channel
@troy: @troy has joined the channel
@sam: @sam has joined the channel
@cgregor: Thanks @karinwolok1! Hi Everyone :wave: I'm currently working an a set of automatic code transformations to help when migrating from Joda-Time to java.time, I noticed discussing migrating from Joda to java.time so I'm interested in whether I can be of any help during this process. I am currently just trying to get more familiar with pinot and it's components as I haven't used it before. I will demo pinot to our engineering team once I have a better grasp of it. If anyone is interested in discussing #7499 then i'm keen to understand if I can be of any use!

#random

@ansi395958: @ansi395958 has joined the channel
@ashok.rex.2009: @ashok.rex.2009 has joined the channel
@troy: @troy has joined the channel
@sam: @sam has joined the channel

#feat-presto-connector

@scott.cohen: @scott.cohen has joined the channel

#troubleshooting

@tony: Backfill question -- we have a large REALTIME table (~900GB/day). Due to a configuration error (ZK heap size too low) we lost some data because the Kafka retention was less than the time to fix the bug. This has me thinking of way to fill in missing data in the future for disaster recovery. We have all the raw data sitting in Parquet files in our data lake. My initial thought was to regenerate the segments with missing data (they are east to identify). Is it possible to upload (refresh) REALTIME segments, assuming the event time range is correct (there would be more events in the replacement segment)? Or do I have to use a HYBRID table and either populate the OFFLINE segments myself or use ?
@mayanks: Right now, data push to realtime table is disabled, and needs a managed offline flow. But afaik, Uber team is working on backfill support for RT tables. Is this still the case @yupeng?
@yupeng: right. we are working on such backfill pipeline in flink
@nair.a: Hi team, This is regarding batch ingestion from HDFS to Offline_Table. After running the following command. *bin/pinot-ingestion-job.sh -jobSpecFile /root/hdfsBatchIngestionSpec1.yaml* Getting the following logs, segments are not getting created. ```Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner Initializing PinotFS for scheme hdfs, classname org.apache.pinot.plugin.filesystem.HadoopPinotFS Unable to load native-hadoop library for your platform... using builtin-java classes where applicable log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer). log4j:WARN Please initialize the log4j system properly. log4j:WARN See for more info. No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. successfully initialized HadoopPinotFS Creating an executor service with 1 threads(Job parallelism: 0, available cores: 24.) Submitting one Segment Generation Task for Using class: org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader to read segment, ignoring configured file format: AVRO Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner Initializing PinotFS for scheme hdfs, classname org.apache.pinot.plugin.filesystem.HadoopPinotFS successfully initialized HadoopPinotFS Start pushing segments: []... to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@5d28bcd5] for table poc_test_table```
@dunithd: Is it possible to share the *hdfsBatchIngestionSpec1.yaml* with us?
@nair.a: BatchIngestionSpec File: ```executionFrameworkSpec: name: 'standalone' segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner' segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner' segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner' segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner' jobType: SegmentCreationAndTarPush inputDirURI: '' outputDirURI: '' overwriteOutput: true pinotFSSpecs: - scheme: hdfs className: org.apache.pinot.plugin.filesystem.HadoopPinotFS configs: hadoop.conf.path: '/root/hadoop-3.0.0/etc/hadoop/' recordReaderSpec: dataFormat: 'parquet' className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader' tableSpec: tableName: 'poc_test_table' schemaURI: '' tableConfigURI: '' pinotClusterSpecs: - controllerURI: '' pushJobSpec: pushParallelism: 2 pushAttempts: 1```
@adireddijagadesh: @nair.a Given `hadoop.conf.path`: ‘/root/hadoop-3.0.0/etc/hadoop/’ should contain hadoop XML configuration files such as hdfs-site.xml, core-site.xml. Can you recheck provided path contains config files or `/root/hadoop-3.0.0/etc/hadoop/conf/` is correct path
@nair.a: yes, its present. From the logs, it seems , script is able to connect to hadoop cluster. since it has listed the file.
@ken: I think you might be missing some important logging output, given the log4j warnings. Also, what version of Pinot are you running? Finally, what happens if you try running a job for just segment generation, and do it locally (download the Parquet file, and use local FS for input/output)?
@nair.a: Will check logging conf once. We are running Pinot 0.8. Haven't done ingestion with LocalFS. will try
@adireddijagadesh: Any logs related to Starting/Ending of Segment Index Creator
@ansi395958: @ansi395958 has joined the channel
@lars-kristian_svenoy: Hey everyone. Quick question; When querying for a specific time range in Pinot, is it more efficient to use the primary time column defined in the segmentsConfig, or is it equivalent to using any other time column? From the docs it seems to indicate that the primary time column is only used for retention purposes, meaning that querying for another timestamp should be fine too. In my case, I am creating a copy of the primary timestamp, reducing the granularity of it, and calling it `daysSinceEpoch`, as I want to query for entities within certain days. ```"ingestionConfig": { "transformConfigs": [ { "columnName": "daysSinceEpoch", "transformFunction": "toEpochDays(documentTimestamp)" } ], ...``` Additionally, for the RealtimeToOfflineSegmentsTask, I am using this value for deduplication purposes. In the schema: ```"primaryKeyColumns": ["customerId", "machineId", "daysSinceEpoch"] ...``` This is because for each event, I only want to keep the latest in a day. Here’s the RealtimeToOfflineSegmentsTask ``` "RealtimeToOfflineSegmentsTask": { "bucketTimePeriod": "1d", "bufferTimePeriod": "2d", "mergeType": "dedup", "maxNumRecordsPerSegment": 10000000, "roundBucketTimePeriod": "1h" }``` In the realtime table, I am also filtering out any events older than 14 days (Where documentTimestamp is the actual primary timeColumnName) ```"filterConfig": { "filterFunction": "Groovy({documentTimestamp < (new Date() - 14).getTime()}, documentTimestamp)" },``` Does that make sense?
@npawar: you can use any time column. you’re right that primary time column is mainly used for things like retention
@lars-kristian_svenoy: That’s great, thank you @npawar :slightly_smiling_face: I had assumed as much
@npawar: you cannot really define your own primary keys for realtimeToOfflineSegments task dedup mode. It will dedup only if the entire row is same
@lars-kristian_svenoy: Oh, it doesn’t use the primary key defined in the schema?
@npawar: the primary Key columns field you see is for the upsertts feature. it doesnt take any effect for realtimeToOffline
@lars-kristian_svenoy: aahh
@lars-kristian_svenoy: Is there any reason why?
@npawar: dedup is a relatively new feature in realtimeToOffline task. This version only does the full row dedup. We’d need to add a lot more config and code, to support the next level of smarter dedup
@npawar: regarding filtering out events greater than 14d, you can just set table retention to 14d? any reason you’re using the filter function instead?
@lars-kristian_svenoy: I sometimes get old events coming in through kafka which I don’t want to include in my segments
@mayanks: @lars-kristian_svenoy You can filter those rows at ingestion time:
@ashok.rex.2009: @ashok.rex.2009 has joined the channel
@troy: @troy has joined the channel
@mercyshans: hi, team, any insight on this SQL issue I am trying to use `distinctCount` aggregation function to count under different conditions ```select distinctCount(case when condition1 then colA else null end) as condition1Count, distinctCount(case when condition2 then colA else null end) as condition2Count, distinctCount(case when condition3 then colA else null end) as condition3Count from tableA``` colA is type int or String. but looks like it’s not supported in pinot cause null is not supported in the selection query Will there be a future support for this.
@xiangfu0: It requires same type for functions to be applied. You can cast them to string always
@mercyshans: do you mean change the `null` to `'null'`? I tried that by then `'null'` is counted as one distinct value
@xiangfu0: yes, right now the work around is to handle null as one distinct value. The real null support in aggregation function will be supported later
@mercyshans: ok, thanks
@sam: @sam has joined the channel

#custom-aggregators

@kis: @kis has joined the channel

#pinot-dev

@ashok.rex.2009: @ashok.rex.2009 has joined the channel

#pql-sql-regression

@kis: @kis has joined the channel

#thirdeye-pinot

@pyne.suvodeep: Hi @shreya.chakraborty Please create a github issue in itself.

#getting-started

@kangren.chia: will using `IdSet` with “NOT IN” clause have any unintended performance impact? e.g. `select * from table where userid not in IDSET(...)`

#releases

@sam: @sam has joined the channel

#debug_upsert

@kkmagic99: @kkmagic99 has joined the channel

#pinot-docsrus

@bagi.priyank: @bagi.priyank has joined the channel
@bagi.priyank: @bagi.priyank has left the channel
--------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pinot.apache.org For additional commands, e-mail: dev-h...@pinot.apache.org