#general


@karinwolok1: Please help us welcome our newest Pinot community members!!! :wine_glass: :wave: We'd love to know - Where are you from? What do you do? What brought you here? @jingyigong98 @ashwin.thobbi @m.gautam @nolan.bebarta @santand @brayan1213 @robinvarghese19 @zjffdu @uditcr710107 @ajaysinha26 @syusuf01 @wiqistar @ali.yilmaz @stefan.harinko @shubhamg931.dev @guzeloglusoner @fandalon @lgabriellp @saurabh.dwivedy779 @ssanjay @baliga @xiaobing @subinthattaparambil @surajkmth29 @devdsolutionist @dhruv.jrt @jayzhan211 @bruce.ritchie @saraalibajaba558 @mailsanjay.ms @tmacksf @rkabir @s.spinatelli @sunilkumar.tc @savingoyal @alnourzarroug @arun.ak37526 @joshdnv2 @richballa
@r.clark: :thread: Complex schema (un-nesting json) not showing up in table
  @r.clark: Our data looks like this: ```{ "one": "one", "two": "two", "three": "three", "fourTimestamp": "1593549705711", "payload": { "context": { "one": "one", "two": "two" }, "message": { "one": "one", "two": "two" } }, "five": "five" }```
  @r.clark: Previously, my schema correctly showed `"one", "two", "three", "fourTimestamp"` in the table.
  @r.clark: I added a new table config to include `ingestionConfig` for the first time: ```{ "tableName": "tableName", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "fourTimestamp", "timeType": "MILLISECONDS", "schemaName": "schemaName", "replicasPerPartition": "1" }, "tenants": {}, "tableIndexConfig": { "loadMode": "MMAP", "streamConfigs": { "streamType": "kinesis", "stream.kinesis.topic.name": "stream-name", "region": "us-east-1", "shardIteratorType": "AFTER_SEQUENCE_NUMBER", "stream.kinesis.consumer.type": "lowlevel", "stream.kinesis.fetch.timeout.millis": "30000", "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory", "realtime.segment.flush.threshold.size": "1000000", "realtime.segment.flush.threshold.time": "6h" } }, "ingestionConfig": { "complexTypeConfig": { "delimiter": ".", "fieldsToUnnest": [ "payload.connection", "payload.message" ], "collectionNotUnnestedToJson": "NON_PRIMITIVE" } }, "metadata": { "customConfigs": {} } }```
  @r.clark: Then I changed the schema to include the nested objects, as well as `"five"` , which is not nested. ```{ "schemaName": "mobileEvent", "dimensionFieldSpecs": [ { "name": "one", "dataType": "STRING" }, { "name": "two", "dataType": "STRING" }, { "name": "five", "dataType": "STRING" }, { "name": "payload.context.one", "dataType": "STRING" }, { "name": "payload.context.two", "dataType": "STRING" }, { "name": "payload.message.one", "dataType": "STRING" }, { "name": "payload.message.two", "dataType": "STRING" } ], "metricFieldSpecs": [ { "name": "three", "dataType": "INT" } ], "dateTimeFieldSpecs": [ { "name": "fourTimestamp", "dataType": "STRING", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS" } ] }```
  @r.clark: The new schema shows up in the UI to the left of the table, but none of the new additions are showing up in the table. Even `"five"` which is not nested. Is there something invalid about my complexTypeConfig?
  @mayanks: @jackie.jxt could you take a look?
  @jackie.jxt: @r.clark Did you re-generate the segments?
  @r.clark: nope. I just did `./pinot-admin.sh AddTable -tableConfigFile` and `./pinot-admin.sh AddSchema -schemaFile`
  @r.clark: Jackie, now I see them (I did nothing yet) but they are all null.
  @yupeng: note that complex type handling is on master and will be released in 0.8
  @yupeng: but not available in 0.7.1
  @r.clark: OK. So I should expect it not to work until 0.8 is released? Any idea when that will be?
  @yupeng: @mayanks ^
  @mayanks: We are about to start the 0.8 release work shortly, so it may be a few weeks. The delay is because we are also working on Apache graduation, and the release process for an incubating project is somewhat different from that of a top-level project.
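  A rough sketch of what to expect once 0.8 is available: assuming nested maps are flattened using the configured `.` delimiter (with `fieldsToUnnest` only needed for array fields), the sample payload above would flatten into columns along the lines of ```{ "one": "one", "two": "two", "three": "three", "fourTimestamp": "1593549705711", "payload.context.one": "one", "payload.context.two": "two", "payload.message.one": "one", "payload.message.two": "two", "five": "five" }``` The dotted column names here are illustrative, derived from the sample document rather than confirmed output.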
@surajkmth29: Hi All, I have written down my little understanding of Apache Pinot (tables and segments) and tried to put it in simple and fun terms. Would love it if you folks could check it out and help me build many such articles around Pinot. PS: If there are any comments/suggestions on the details of the blog, please drop a comment so that we can make it better and more accessible to the Pinot community
  @kennybastani: If this hasn't happened already, I think @matt can help this get more traction on our social channels. I didn't bring my laptop with me on vacation (it has my login for the Twitter account). Also, I'm sure Karin would be available tomorrow to help get this out in the open. Thanks so much for writing this!
  @kennybastani: cc @karinwolok1
  @karinwolok1: Whoa! This is so cool!! That's awesome that you took this initiative on your own! <3 We can def share that! Do you use twitter, @surajkmth29? @allison can we promote this on Pinot Twitter?
@g.kishore: hey @surajkmth29, thanks a lot for authoring this blog post. I thoroughly enjoyed reading it and the kitchen analogy. Some of us who know Apache Pinot in and out might think we are excellent chefs :wine_glass::wine_glass::wine_glass:
@jinghui.wang: @jinghui.wang has joined the channel
@uparekh: @uparekh has joined the channel
@abhijeet.kushe: @abhijeet.kushe has joined the channel
@abhijeet.kushe: I am interested in getting the latest updates on kinesis-integration. This issue mentions joining #kinesis-integration but I don't see the channel here in Slack. Can someone point me to the right place to get more details?
  @mayanks: I believe this is available now, cc @kharekartik @npawar
  @npawar: Yes it is. You'll find a page about it in the docs. It's a very new feature, so you might face some hiccups along the way. Would be great if you can try it out
  @abhijeet.kushe: Yes, I did try it out @npawar, but I was not able to consume any messages from a Kinesis stream after creating a schema and a table. I am trying to implement this use case. I am new here to Pinot, so I wanted to see if anyone has reported any issues
  @r.clark: Hi Abhijeet, I was able to connect to Kinesis. One extra step I needed to do was grant EKS permission to read the stream (if you are using EKS).
  @npawar: Thanks @r.clark. In addition to trying that, can you share your table config here, @abhijeet.kushe? Have you set these properties: . And any exceptions in the logs?
  @abhijeet.kushe: I tried ingesting to a local Pinot instance using 1-hour AWS STS credentials and a token. I was able to connect to Kinesis, but when I added some records to Kinesis and queried through the Pinot console, I did not get any records. This is the table config. ```{ "tableName": "transcript", "tableType": "REALTIME", "segmentsConfig": { "timeColumnName": "timestampInEpoch", "timeType": "MILLISECONDS", "schemaName": "transcript", "replicasPerPartition": "1" }, "tenants": {}, "tableIndexConfig": { "loadMode": "MMAP", "streamConfigs": { "streamType": "kinesis", "stream.kinesis.topic.name": "cdp_metric_events_poc", "region": "us-east-1", "endpoint": "", "shardIteratorType": "AFTER_SEQUENCE_NUMBER", "stream.kinesis.consumer.type": "lowlevel", "stream.kinesis.fetch.timeout.millis": "30000", "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory", "realtime.segment.flush.threshold.size": "1000000", "realtime.segment.flush.threshold.time": "1s" } }, "metadata": { "customConfigs": {} } }```
  @abhijeet.kushe: No exception in the logs
  @abhijeet.kushe: Ok I changed the realtime.segment.flush.threshold.time to 1000 and it did consume records
  @abhijeet.kushe: @npawar you talked about facing some hiccups. Wanted to know if we can use this to implement a feature in our prod env
  @mayanks: @abhijeet.kushe yes please, we are here to help to get you there
  @abhijeet.kushe: thanks @mayanks
  @npawar: Typically the time should be set to something like 2h or 6h depending on the rate of ingestion. Otherwise you will end up with too many small Pinot segments
  @abhijeet.kushe: I see, so we do have an SLA; 2 hours will be slow. The max we can go to is 15 min
  @abhijeet.kushe: Is there a way to address the small segment issue by having near-realtime ingestion followed by a nightly batch ingestion which might compact these segments?
  @npawar: the 2h or 6h will not affect the freshness of the data. you will still be able to query the records as soon as they are ingested
  @npawar: this setting only determines at what cadence the in-memory consumed events are converted to a Pinot segment on disk
  @abhijeet.kushe: ok I see, so the records will be consumed in real time but they will be flushed to disk every 6h
  @npawar: yes
  @npawar: this video might help with some realtime ingestion general concepts:
  @abhijeet.kushe: Thanks will go through it
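  Putting the advice above together with the earlier table config, the streamConfigs block would look roughly like this (a sketch; only the flush threshold time changes, and records stay queryable as soon as they are consumed): ```"streamConfigs": { "streamType": "kinesis", "stream.kinesis.topic.name": "cdp_metric_events_poc", "region": "us-east-1", "shardIteratorType": "AFTER_SEQUENCE_NUMBER", "stream.kinesis.consumer.type": "lowlevel", "stream.kinesis.fetch.timeout.millis": "30000", "stream.kinesis.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder", "stream.kinesis.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kinesis.KinesisConsumerFactory", "realtime.segment.flush.threshold.size": "1000000", "realtime.segment.flush.threshold.time": "6h" }```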
@ismvarunsharma: @ismvarunsharma has joined the channel
@rachel.pedreschi: @rachel.pedreschi has joined the channel
@neilteng233: Hi, I am interested in how the system time is synced across nodes. I pass a Presto query like `date > now() - interval'30' minute` to Pinot. How much can I rely on the "now()" function? Is it translated to an exact time in Presto and then passed to Pinot? And how much difference can there be across different Pinot nodes?
  @mayanks: Good question, now() is computed at the broker that receives the query. And then all the servers go by that time.
  @neilteng233: Do we have any mechanism in Pinot that syncs the time across different brokers? Or does it just assume the system time is up-to-date and in sync on each server, with something else keeping the system time up-to-date?
  @g.kishore: yes, we assume the system time is up-to-date across all brokers.
  @g.kishore: typically, syncing time across servers is done at the server level and is independent of the programs that run on those servers.
  @g.kishore: something like NTP
@m.gautam: @r.clark and I are facing an issue with select * on Pinot: select * is not reflecting all the columns of the table, but when we select individual column names, they are reflected correctly. Has anyone else faced such a problem?
  @mayanks: Do all your segments in the table have all the columns?
  @mayanks: One possibility is that you added new columns later on and did not perform a backfill, or a reload (to auto-backfill with default values)?
  @g.kishore: @mayanks now that most Pinot users have a schema, it might be a good idea to derive the list of columns for select * from the schema instead of looking at individual segments
  @mayanks: Yes, good idea @g.kishore. Will file an issue.
  @m.gautam: @r.clark
  @mayanks: @m.gautam Do you think you can help file an issue detailing the problem you ran into? That way, it will be associated with the actual issue that happened, instead of an enhancement request.
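  For context on the reload suggestion above, a minimal sketch of adding a column to the schema with an explicit default (the column name and default value here are hypothetical): ```{ "name": "newColumn", "dataType": "STRING", "defaultNullValue": "unknown" }``` After updating the schema, reloading the table's segments backfills the new column with the default value in already-created segments.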

#random


@jinghui.wang: @jinghui.wang has joined the channel
@uparekh: @uparekh has joined the channel
@abhijeet.kushe: @abhijeet.kushe has joined the channel
@ismvarunsharma: @ismvarunsharma has joined the channel
@rachel.pedreschi: @rachel.pedreschi has joined the channel

#troubleshooting


@jinghui.wang: @jinghui.wang has joined the channel
@yash.agarwal: Hey team, I am seeing a huge variation in performance between the following queries. ```select distinct DATETIMECONVERT(transaction_date, '1:DAYS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', '1:DAYS') from transactions limit 1000 -- 80+ seconds select distinct transaction_date from transactions limit 1000 -- 3.5 seconds``` Can you help with how to optimize this? In the meantime we have added another column in the `yyyy-MM-dd` format to support the same.
  @mayanks: How many docs does the filter select, and how many distinct values? Also, what are your JVM settings?
  @yash.agarwal: There is no filter .. about 22 billion records. And 700 distinct values.
  @yash.agarwal:
  @mayanks: Yeah, so I think DateTimeConvert on 22B records might be slow.
  @mayanks: Can you try count(*) + group by `DATETIMECONVERT(transaction_date, '1:DAYS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd', '1:DAYS') ` to get the distinct values?
  @mayanks: It might be slightly faster, but not sure by how much. Also, adding another column will make it super fast.
  @jackie.jxt: I don't think doing an aggregation group-by can be faster than distinct though. The cost is mainly in the transform, as we can see distinct itself is quite fast (second query)
  @jackie.jxt: Ideally we should support doing the transform on the broker side after getting the distinct results, which avoids the per-record transform
  @jackie.jxt: Currently, one workaround would be to create a derived column for this transformation, and directly query the derived column
  @jackie.jxt: You may create the derived column by adding an ingestion transform and doing a table reload:
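  As an illustration of that workaround (the derived column name and transform function below are assumptions, not from the thread), the ingestion transform could look roughly like ```{ "ingestionConfig": { "transformConfigs": [ { "columnName": "transaction_date_str", "transformFunction": "toDateTime(fromEpochDays(transaction_date), 'yyyy-MM-dd')" } ] } }``` with `transaction_date_str` also added to the schema as a STRING dimension before reloading the table, so queries can select the pre-computed value directly.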
@uparekh: @uparekh has joined the channel
@abhijeet.kushe: @abhijeet.kushe has joined the channel
@ismvarunsharma: @ismvarunsharma has joined the channel
@rachel.pedreschi: @rachel.pedreschi has joined the channel

#pinot-dev


@khushbu.agarwal: Hi, I had a query. If we lose configuration data in ZooKeeper in the Pinot cluster, is there any recovery method in Pinot? If the data is stored in a deep store like S3, can we rebuild the cluster with the data in the deep store?
@g.kishore: ZooKeeper can be configured to store its snapshot in a deep store like S3
@g.kishore: and you can revive/restart ZooKeeper by providing that snapshot as input