#general


@ashwin: @ashwin has joined the channel
@mukeshcsahutech: @mukeshcsahutech has joined the channel
@srirajanivetha: @srirajanivetha has joined the channel
@pedro.cls93: @pedro.cls93 has joined the channel
@pedro.cls93: Hello, does Pinot support updating an existing schema & table definition? I have a dimension which is a string representation of a JSON object. The schema of this JSON payload is dynamic: some inner fields exist for some rows but not others, and they will change over time. I have a business requirement to deconstruct the JSON so that users can use the inner fields in queries. I've seen that it is possible to deconstruct JSON fields: but my question is whether Pinot allows this deconstruction to change over time. Thank you.
  @mayanks: Pinot allows changing table configs and schemas. However, schema changes need to be backward compatible. For example, you cannot change a column's type from string to integer.
  @pedro.cls93: Will Pinot reprocess existing segments in a table to match the new schema?
  @mayanks: So if you add a new column, for example, there is a segment reload API you need to call, and it will populate old segments with the new column set to its default value
  @pedro.cls93: Alright, thank you for the information!
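A hedged sketch of the reload call @mayanks refers to, assuming a controller reachable at localhost:9000 and a hypothetical table name `myTable`:
```
# Sketch only: after adding the new column to the schema and table config,
# ask the controller to reload the table's segments so that old segments
# pick up the new column with its default value.
curl -X POST "http://localhost:9000/segments/myTable/reload" -H "accept: application/json"
```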
@sleepythread: A couple of starter questions. Let's assume we have a table in HDFS which gets loaded every 30 min with the following structure, e.g.: /tmp/event/dt=2021-01-01/batch_id=2021-01-01_01_00_00 1. How do we incrementally load the data into Pinot, atomically? 2. Let's assume we have to fix historical data. How do we reload an older batch (which is already loaded into Pinot), e.g.: /tmp/event/dt=2020-01-01/batch_id=2020-01-01_01_00_00 ? 3. Is there a way to directly build a Pinot segment from a Spark DataFrame? Is there any specific interface I can implement in our existing Spark app?
  @mayanks: 1. You can have a scheduled batch job that incrementally pushes data to Pinot as it arrives in HDFS. Curious though, if it is every 30 min, do you have a stream pipeline that Pinot can ingest from directly? 2. Historical segments can be overwritten in Pinot. Any segment pushed to Pinot that has the same name as an existing segment will overwrite the existing one; you just need to ensure that they are for the same time period. 3. Haven't looked at Spark DataFrames, but for segment generation from any format you just need to implement the RecordReader interface. @jlli do we have this in OSS?
  @jlli: Yes, we already have a Spark job ready in our OSS. @sleepythread you can check `PinotSparkJobLauncher` or `IntermediateSegmentTest` to see how it can be used in your Spark app
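For question 1, a rough sketch of scheduling an incremental push with the standalone ingestion job launcher; the `LaunchDataIngestionJob` command exists in `pinot-admin.sh`, while the job-spec path and the idea of pointing its `inputDirURI` at the latest batch directory are illustrative assumptions:
```
# Hypothetical wrapper/cron step: after each HDFS batch lands, run a batch
# ingestion job whose spec points at the new /tmp/event/dt=.../batch_id=...
# directory and pushes the generated segments to the controller. Re-running
# it for an old batch that produces the same segment names overwrites those
# segments (question 2).
bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /path/to/ingestion-job-spec.yaml
```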
@lars.zwaan: @lars.zwaan has joined the channel
@pedro.cls93: Hello again, is it normal, when creating a realtime table in Pinot's UI, to receive a popup saying the table has been saved but see no entry for it in the UI?
  @pedro.cls93: I get this log from the controller process: `2021/04/23 14:19:38.925 WARN [HelixHelper] [grizzly-http-server-1] Idempotent or null ideal state update for resource brokerResource, skipping update.` It seems the action is redundant and Pinot already has a table defined, but then why does nothing show in the UI?
  @pedro.cls93: Using the REST API to list all tables in the cluster: `curl -X GET "" -H "accept: application/json"` the server tells me there are no tables. The response is: ```{ "tables": [] }```
  @pedro.cls93: Broker reports: ```Caught exception while processing transition from OFFLINE to ONLINE for table: HitExecutionView_REALTIME java.lang.IllegalStateException: Failed to find ideal state for table: HitExecutionView_REALTIME```
  @pedro.cls93: Anyone familiar with this exception? cc @mayanks
  @pedro.cls93: cc @ricardo.bernardino
  @mayanks: No, once created, the table should show up.
  @mayanks: How did you start the cluster and what's your VM size? If you used quick-start, note that the Xmx there is very small and not enough for anything larger than what the example quick-start loads
  @pedro.cls93: This is a K8s cluster. Which process do you mean? The controller?
  @pedro.cls93: I've noticed that the message popups in the cluster manager are not accurate. I've tried to create a table with a partition level of 3. My cluster has only 1 zookeeper and 1 broker instance, so it should not be possible to create the table. I get some exceptions in the broker's logs, but that's it. Nothing in the UI reports an error.
  @pedro.cls93: A similar issue occurs when trying to modify a realtime table definition: I tried to change the kafka topic, the UI reported a successful change, I reloaded the segments, and nothing happened. If I create another table from scratch with the new kafka topic, it works.
@vinayakb: @vinayakb has joined the channel

#random


@ashwin: @ashwin has joined the channel
@mukeshcsahutech: @mukeshcsahutech has joined the channel
@srirajanivetha: @srirajanivetha has joined the channel
@pedro.cls93: @pedro.cls93 has joined the channel
@lars.zwaan: @lars.zwaan has joined the channel
@vinayakb: @vinayakb has joined the channel

#troubleshooting


@ashwin: @ashwin has joined the channel
@mukeshcsahutech: @mukeshcsahutech has joined the channel
@srirajanivetha: @srirajanivetha has joined the channel
@pedro.cls93: @pedro.cls93 has joined the channel
@lars.zwaan: @lars.zwaan has joined the channel
@ravikumar.m: I am doing a PoC. Till now I have used the quick_start_stream code and did some work with it. But when I stop Pinot, everything (tables, schema, and data) is gone. How do I make the data persist locally, so that when I restart Pinot I can still see the tables, schema, and data? *Need your help*
  @mayanks: If you delete the cluster (which will happen if you delete ZK as part of stopping Pinot) then everything is deleted.
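A hedged sketch of running the components separately with persistent directories instead of the quick-start, so that cluster state survives restarts. The `pinot-admin.sh` sub-commands exist, but the exact flag names and the paths below are from memory and should be checked against `bin/pinot-admin.sh <command> -help`:
```
# Keep ZooKeeper (cluster metadata) and the controller/server data dirs on
# durable local paths so tables, schemas and segments survive a restart.
bin/pinot-admin.sh StartZookeeper -zkPort 2181 -dataDir /data/pinot/zk
bin/pinot-admin.sh StartController -zkAddress localhost:2181 -controllerPort 9000 -dataDir /data/pinot/controller
bin/pinot-admin.sh StartBroker -zkAddress localhost:2181
bin/pinot-admin.sh StartServer -zkAddress localhost:2181 -dataDir /data/pinot/server/index -segmentDir /data/pinot/server/segments
```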
@vinayakb: @vinayakb has joined the channel

#getting-started


@lars.zwaan: @lars.zwaan has joined the channel

#pinot-rack-awareness


@jackie.jxt: @jackie.jxt has joined the channel

#minion-improvements


@laxman: @npawar and @jackie.jxt: After spending some time understanding the feature through the design doc and code, and after tweaking the table configuration, I'm now able to get this running. Thanks for your help and guidance.
@npawar: what did you have to tweak?
@npawar: would you mind adding any missing steps to the documentation?
@laxman: Had to add this config to the controller, as @jackie.jxt already pointed out: ```controller.task.scheduler.enabled=true``` I also fixed the following config in my REALTIME table, as it was 2d earlier: ``` "realtime.segment.flush.threshold.time": "6h",```
@laxman: Using the following task config now in my REALTIME table config ``` "task": { "taskTypeConfigsMap": { "RealtimeToOfflineSegmentsTask": { "bucketTimePeriod": "6h", "bufferTimePeriod": "6h" } } }```
@laxman: > would you mind adding any missing steps to the documentation? Sure. Will definitely do it after I test this e2e. Please give me a few days.
@npawar: what’s the granularity of your time column?
@laxman: millis. here is my tableconfig
@laxman: ```{ "REALTIME": { "tableName": "domainEventView_REALTIME", "tableType": "REALTIME", "segmentsConfig": { "schemaName": "domainEventView", "timeType": "MILLISECONDS", "retentionTimeUnit": "DAYS", "retentionTimeValue": "5", "timeColumnName": "event_time_millis", "replication": "2", "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy", "segmentPushType": "APPEND", "replicasPerPartition": "2" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" }, "tableIndexConfig": { "invertedIndexColumns": [ "api_id", "category", "security_event_category", "customer_id" ], "rangeIndexColumns": [ "event_time_millis" ], "enableDynamicStarTreeCreation": false, "aggregateMetrics": false, "nullHandlingEnabled": false, "autoGeneratedInvertedIndex": false, "createInvertedIndexDuringSegmentGeneration": false, "bloomFilterColumns": [], "loadMode": "MMAP", "streamConfigs": { "realtime.segment.flush.threshold.rows": "0", "stream.kafka.hlc.zk.connect.string": "zookeeper:2181", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.avro.confluent.KafkaConfluentSchemaRegistryAvroMessageDecoder", "streamType": "kafka", "stream.kafka.decoder.prop.schema.registry.rest.url": "", "realtime.segment.flush.threshold.segment.size": "50M", "stream.kafka.consumer.type": "LowLevel", "stream.kafka.broker.list": "bootstrap:9092", "realtime.segment.flush.threshold.time": "6h", "stream.kafka.zk.broker.url": "zookeeper:2181", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.consumer.prop.auto.offset.reset": "largest", "stream.kafka.topic.name": "normalized-domain-events" }, "noDictionaryColumns": [], "enableDefaultStarTree": false }, "metadata": {}, "task": { "taskTypeConfigsMap": { "RealtimeToOfflineSegmentsTask": { "bucketTimePeriod": "6h", "bufferTimePeriod": "6h" } } }, "isDimTable": false } }```
@laxman: ```"timeType": "MILLISECONDS", "timeColumnName": "event_time_millis",```
@npawar: cool. I would suggest making bufferTimePeriod slightly more than 6h. Otherwise there would be some time ranges which are in the offline table but are also still being consumed in realtime, and that will mess up the broker time boundary
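For example, a hedged variant of the task config shown earlier with the buffer bumped past the 6h flush threshold (the 7h value is only an illustrative choice):
```
"task": {
  "taskTypeConfigsMap": {
    "RealtimeToOfflineSegmentsTask": {
      "bucketTimePeriod": "6h",
      "bufferTimePeriod": "7h"
    }
  }
}
```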
@laxman: okay. Please take a look at this once
@laxman: *Query* ```select $segmentName, ToDateTime(min(event_time_millis),'yyyy-MM-dd-HH-MM-ss') start_time, ToDateTime(max(event_time_millis) ,'yyyy-MM-dd-HH-MM-ss') end_time, count(*) from domainEventView group by $segmentName order by min(event_time_millis) limit 100``` *Output* ```$segmentName,start_time,end_time,count(*) "domainEventView_1618704000458_1618790386513_0","2021-04-18-00-04-00","2021-04-18-23-04-46","49707" "domainEventView_1618768844758_1618790386513_0","2021-04-18-18-04-44","2021-04-18-23-04-46","10892" "domainEventView_1618790405342_1618811999512_0","2021-04-19-00-04-05","2021-04-19-05-04-59","12219" "domainEventView_1618790405342_1618876799369_0","2021-04-19-00-04-05","2021-04-19-23-04-59","58919" "domainEventView_1618812003034_1618833595768_0","2021-04-19-06-04-03","2021-04-19-11-04-55","13563" "domainEventView_1618833602675_1618855199240_0","2021-04-19-12-04-02","2021-04-19-17-04-59","13885" "domainEventView_1618855200892_1618876799369_0","2021-04-19-18-04-00","2021-04-19-23-04-59","19252" "domainEventView_1618876801392_1618963198060_0","2021-04-20-00-04-01","2021-04-20-23-04-58","80864" "domainEventView_1618876801392_1618898396994_0","2021-04-20-00-04-01","2021-04-20-05-04-56","27863" "domainEventView_1618898401682_1618919999889_0","2021-04-20-06-04-01","2021-04-20-11-04-59","19495" "domainEventView_1618920001882_1618941594448_0","2021-04-20-12-04-01","2021-04-20-17-04-54","19989" "domainEventView_1618941604011_1618963198060_0","2021-04-20-18-04-04","2021-04-20-23-04-58","13517" "domainEventView_1618963200391_1618984797539_0","2021-04-21-00-04-00","2021-04-21-05-04-57","16244" "domainEventView_1618984802790_1619006396884_0","2021-04-21-06-04-02","2021-04-21-11-04-56","19706" "domainEventView_1619006401293_1619027957526_0","2021-04-21-12-04-01","2021-04-21-17-04-17","71129" "domainEventView_1619029429177_1619049599214_0","2021-04-21-18-04-49","2021-04-21-22-04-58","317084" "domainEventView__5__29__20210421T2252Z","2021-04-21-22-04-59","2021-04-22-05-04-48","49672" "domainEventView__0__100__20210421T1858Z","2021-04-21-23-04-00","2021-04-21-23-04-23","985" "domainEventView__1__27__20210421T2256Z","2021-04-21-23-04-01","2021-04-22-05-04-25","50446" "domainEventView__6__100__20210421T1923Z","2021-04-21-23-04-02","2021-04-21-23-04-06","4964" "domainEventView__4__100__20210421T1858Z","2021-04-21-23-04-03","2021-04-21-23-04-14","73" "domainEventView__3__28__20210421T1906Z","2021-04-21-23-04-08","2021-04-21-23-04-34","1641" "domainEventView__2__102__20210421T2251Z","2021-04-21-23-04-10","2021-04-22-05-04-24","49461" "domainEventView__4__101__20210421T2300Z","2021-04-21-23-04-14","2021-04-22-05-04-54","50625" "domainEventView__7__30__20210421T2258Z","2021-04-21-23-04-16","2021-04-22-05-04-59","50443" "domainEventView__0__101__20210421T2307Z","2021-04-21-23-04-23","2021-04-22-10-04-25","50625" "domainEventView__3__29__20210421T2308Z","2021-04-21-23-04-34","2021-04-22-05-04-12","50625" "domainEventView__6__101__20210421T2343Z","2021-04-21-23-04-06","2021-04-22-16-04-32","50625" "domainEventView__5__30__20210422T0537Z","2021-04-22-04-04-11","2021-04-23-21-04-44","24678" "domainEventView__2__103__20210422T0538Z","2021-04-22-04-04-38","2021-04-23-21-04-51","23689" "domainEventView__1__28__20210422T0539Z","2021-04-22-04-04-37","2021-04-23-21-04-42","23964" "domainEventView__4__102__20210422T0540Z","2021-04-22-04-04-20","2021-04-23-21-04-54","23567" "domainEventView__7__31__20210422T0544Z","2021-04-22-04-04-59","2021-04-23-21-04-49","22906" 
"domainEventView__3__30__20210422T0547Z","2021-04-22-04-04-09","2021-04-23-21-04-55","21771" "domainEventView__0__102__20210422T1017Z","2021-04-22-10-04-25","2021-04-23-21-04-34","18624" "domainEventView__6__102__20210422T1646Z","2021-04-22-16-04-32","2021-04-23-21-04-20","15901"```
@laxman: Here, I don’t see any overlap between OFFLINE and REALTIME.
@laxman: > I would suggest making bufferTimePeriod slightly more than 6h. Otherwise there would be some time ranges which are in the offline, but are also being consumed in realtime, and that will mess up the broker time boundary iiuc, this can happen when there is more than 6 hours of lag while consuming the kafka stream.
@laxman: is that correct?
@laxman: I have a couple of basic questions after using this feature. I tried but failed to figure them out from the documentation. If you can, please respond when you have a couple of minutes. • With the configs I have (bucket time period: 6h, buffer time period: 6h, segment flush threshold time: 6h), I expected the REALTIME table to have only 12 hours of data. However, from the above query results I can see a lot more data in the REALTIME table (from the 21st onwards). What am I missing here? Why has data from the 21st not yet moved to OFFLINE? • Is there any documentation about Pinot internal tables/columns (data dictionary) like the $segmentName we used in the above query? Druid exposes an internal table which gives all metadata about segments via SQL, which allows us to run some ad hoc analytics/queries on segment data. Do we have anything similar?
  @jackie.jxt: 1. REALTIME segments are not physically moved. You need to set the retention to a shorter period to remove them 2. We have 3 virtual columns in Pinot: `$hostName`, `$segmentName` and `$docId`. Pinot does not support querying segment metadata via SQL; the metadata can only be queried via the REST endpoint (see the example after this thread)
  @jackie.jxt: Added some documentation here:
  @laxman: okay. thanks @jackie.jxt
  @laxman: > REALTIME segment is not physically moved. You need to set retention to a shorter period to remove them I initially assumed we were deleting/disabling the REALTIME segments after conversion. However, after inspecting the code, I realized that we are not deleting REALTIME segments after conversion. But then why, in my results, do REALTIME (21st onwards) and OFFLINE (18th to 21st) have non-overlapping data?
  @jackie.jxt: We maintain a time boundary between offline and realtime table, and ensure the same data is only queried once. You can read more about hybrid table here:
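As a hedged illustration of the REST route @jackie.jxt mentions above for segment metadata: the segment name below is taken from the query output earlier in this thread, and the controller is assumed to be reachable on the default port 9000.
```
# Sketch only: fetch metadata for one segment of the table via the controller API.
curl -X GET "http://localhost:9000/segments/domainEventView/domainEventView__0__100__20210421T1858Z/metadata" \
  -H "accept: application/json"
```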
@laxman: I want to create OFFLINE segments with discrete intervals (like for every n hours of a day). Is it feasible to achieve this? What's the right task/table config for the above table?
@jackie.jxt: What is "discrete intervals"? The `bucketTimePeriod` allows you to define the time range for each offline segment. If you configure it as 6 hours, then every day there will be 4 offline segments generated
@laxman: By discrete I meant non-overlapping as you explained. I want to create 4 segments per day with time intervals [0-6), [6,12), [12, 18), [18,24). Understood now. Thanks Jackie.