#general
@tejaswini.iiitn: Any open source dashboard which I can integrate with Pinot?
@mark.needham: @dunithd also showed how to integrate with Redash
@brett.kishkis: @brett.kishkis has joined the channel
#random
@brett.kishkis: @brett.kishkis has joined the channel
#troubleshooting
@lars-kristian_svenoy: Hello everyone :wave: I'm seeing a problem where Pinot is not able to ingest a JSON object, it just shows up as null in the table... Will post more details in thread
@lars-kristian_svenoy: Here is sample data from Kafka: ```{
  "objectId": "00000000-0000-0000-0000-000000000000",
  "jsonObject": {
    "values": [
      { "id": "bob", "names": ["a", "b", "c", "d", "e"] }
    ]
  }
}```
@lars-kristian_svenoy: And the schema: ```{
  "schemaName": "myObjects",
  "dimensionFieldSpecs": [
    { "name": "objectId", "dataType": "STRING" },
    { "name": "jsonObject", "dataType": "JSON" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "lastModified", "dataType": "LONG", "format": "1:MILLISECONDS:EPOCH", "granularity": "1:DAYS" }
  ]
}```
@lars-kristian_svenoy: I've got jsonObject in noDictionaryColumns and in jsonIndexColumns
@lars-kristian_svenoy: I'm using Pinot 0.10.0, any ideas what's wrong?
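For anyone hitting something similar: a quick way to confirm what actually landed in the column is to query it directly. A minimal sketch against the table above (the JSON path and the 'missing' default are illustrative, not from the thread): ```-- If ingestion worked, jsonObject should come back as the raw JSON
-- rather than null. JSON_EXTRACT_SCALAR(column, jsonPath, resultType, default)
-- returns the default when the path cannot be resolved.
SELECT objectId,
       jsonObject,
       JSON_EXTRACT_SCALAR(jsonObject, '$.values[0].id', 'STRING', 'missing') AS firstId
FROM myObjects
LIMIT 10```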
@saurabhd336: Hi @lars-kristian_svenoy. The json object does not seem to have "lastModified" field. Have you intentionally truncated the object? Or is it actually not part of the objects you're trying to ingest?
@lars-kristian_svenoy: I truncated it yes
@lars-kristian_svenoy: It's all there
@lars-kristian_svenoy: All the other data in my schema is showing up
@saurabhd336: Could you share your table config @lars-kristian_svenoy? With the following config I was able to successfully ingest the data: ```{
  "tableName": "myObject",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "lastModified",
    "timeType": "MILLISECONDS",
    "schemaName": "myObjects",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "object-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.time": "5000",
      "realtime.segment.flush.threshold.rows": "1",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": { "customConfigs": {} }
}```
@lars-kristian_svenoy: Are you using 0.10.0 or master?
@lars-kristian_svenoy: ```{ "REALTIME": { "tableName": "myObjects", "tableType": "REALTIME", "segmentsConfig": { "timeType": "MILLISECONDS", "schemaName": "myObjects", "retentionTimeUnit": "DAYS", "retentionTimeValue": "365", "timeColumnName": "lastModified", "allowNullTimeValue": false, "replicasPerPartition": "2" }, "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" }, "tableIndexConfig": { "rangeIndexVersion": 2, "jsonIndexColumns": [ "jsonObject" ], "autoGeneratedInvertedIndex": false, "createInvertedIndexDuringSegmentGeneration": false, "loadMode": "MMAP", "noDictionaryColumns": [ "lastModified", "jsonObject" ], "enableDefaultStarTree": false, "enableDynamicStarTreeCreation": false, "segmentPartitionConfig": { "columnPartitionMap": { "objectId": { "functionName": "Murmur", "numPartitions": 2 } } }, "aggregateMetrics": false, "nullHandlingEnabled": false }, "metadata": { "customConfigs": {} }, "routing": { "segmentPrunerTypes": [ "partition" ], "instanceSelectorType": "replicaGroup" }, "instanceAssignmentConfigMap": { "CONSUMING": { "tagPoolConfig": { "tag": "DefaultTenant", "poolBased": false, "numPools": 0 }, "replicaGroupPartitionConfig": { "replicaGroupBased": true, "numInstances": 0, "numReplicaGroups": 2, "numInstancesPerReplicaGroup": 8, "numPartitions": 0, "numInstancesPerPartition": 0 } } }, "upsertConfig": { "mode": "NONE", "hashFunction": "NONE" }, "ingestionConfig": { "streamIngestionConfig": { "streamConfigMaps": [ { "streamType": "kafka", "stream.kafka.consumer.type": "lowlevel", "stream.kafka.topic.name": "my_objects", "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder", "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory", "stream.kafka.broker.list": ".....", "realtime.segment.flush.threshold.rows": "0", "realtime.segment.flush.threshold.time": "24h", "realtime.segment.flush.threshold.segment.size": "200M", "realtime.segment.flush.autotune.initialRows": "2000000", "stream.kafka.consumer.prop.auto.offset.reset": "smallest" } ] }, "transformConfigs": [], "complexTypeConfig": {} }, "isDimTable": false } }```
@harish.bohara: Any idea why Pinot server JVM usage grows as the record count grows? It does grow slowly, but this way it will eventually reach high heap usage. (I do have 5-6 inverted and sorted indexes in my table, on very low cardinality columns.)
• Using off-heap and MMAP for segments in my setup
• Have ~100-150 segments
• 500-600M rows, continuing to grow by over 150M per day
• 6 server nodes; 8GB is given to the JVM and the rest is available to off-heap
What I expected: memory grows, and as segments go to disk it comes back to the lower bound. I expected this cycle to continue and the lower bound of memory to remain constant. Am I missing some setting?
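For context, the off-heap allocation mentioned above is a server-level setting, while MMAP load mode is set per table ("loadMode": "MMAP" in tableIndexConfig); a minimal sketch of the relevant server config (keys are from the Pinot docs, values are assumptions matching this setup): ```# pinot-server.conf (sketch)
# Allocate consuming-segment write buffers off the JVM heap.
pinot.server.instance.realtime.alloc.offheap=true
# Back those buffers with memory-mapped files rather than direct allocation.
pinot.server.instance.realtime.alloc.offheap.direct=false```
The 8GB heap itself comes from the JVM flags (e.g. -Xms8G -Xmx8G), not from this file.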
@mayanks: Which version of Pinot? Also, are you seeing this increase only for the server, or for broker/controller as well? If all, perhaps it's related to Prometheus
@harish.bohara: 0.10.0
@harish.bohara: Should I remove Prometheus and try?
@mayanks: First check if all components show the increase
@harish.bohara: Controller and broker look OK to me (I see controller and broker come back to roughly the same lower bound after GC, then go up and return to about the same value).
@mayanks: But you are just ingesting data, so broker should not have any spikes at all
@mayanks: It should be flatlined
@mayanks: Maybe try without Prometheus, just to debug
@harish.bohara: Sure
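For reference, Pinot's Prometheus metrics are typically exposed by attaching the JMX exporter as a javaagent at startup, so a Prometheus-free debug run just means dropping that flag. A sketch (jar path, port, and config file name are assumptions): ```# Usual setup, with the exporter attached:
# JAVA_OPTS="-Xms8G -Xmx8G -javaagent:/opt/pinot/etc/jmx_prometheus_javaagent.jar=8008:/opt/pinot/etc/pinot.yml"
# Debug run without it:
JAVA_OPTS="-Xms8G -Xmx8G" bin/pinot-admin.sh StartServer -configFileName conf/pinot-server.conf```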
@brett.kishkis: @brett.kishkis has joined the channel
#pinot-dev
@zaikhan: Hello Team, I have raised a PR to support refreshing completed realtime segments. This will allow us to purge/modify records in the purge minion task. Please review the PR
@mayanks: Could you add some description in the PR?
@zaid.mohemmad: @zaid.mohemmad has joined the channel
#getting-started
@arekchmura: Hi, I have a question about time-related columns in the default StarTree index configuration. The documentation says: > Here we assume that time columns will be included in most queries as the range filter column and/or the group by column, so for better performance, we always include them as the last elements in the _dimensionsSplitOrder_ How does putting time columns as the last elements improve performance? Is it related to the number of nodes that need to be processed to solve a given query?
@arekchmura: And a similar question, what's the reason behind sorting the dimensions by descending cardinality (in the default configuration)? I'm trying to understand why it would be better to sort them by descending cardinality rather than by ascending cardinality.
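For anyone following along, the setting in question is the starTreeIndexConfigs entry in the table config; a minimal sketch with hypothetical columns (country and browser as dimensions, daysSinceEpoch as the time column placed last, clicks as a metric): ```"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": ["country", "browser", "daysSinceEpoch"],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["SUM__clicks"],
    "maxLeafRecords": 10000
  }
]```
One possible intuition: records under each node are sorted by the remaining split-order dimensions, so a time column placed last stays sorted within every leaf, which keeps range filters on it cheap.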
@brett.kishkis: @brett.kishkis has joined the channel
#introductions
@karinwolok1: @karinwolok1 has joined the channel
@tonya: @tonya has joined the channel
@tlberglund: @tlberglund has joined the channel
@mayanks: @mayanks has joined the channel
@vallamsetty: @vallamsetty has joined the channel
@xiangfu0: @xiangfu0 has joined the channel
@npawar: @npawar has joined the channel
@mark.needham: @mark.needham has joined the channel
@dunithd: @dunithd has joined the channel
@greetbot: @greetbot has joined the channel
@chad: @chad has joined the channel
@greetbot: Yes! @chad is here!
@sam: @sam has joined the channel
@troy: @troy has joined the channel
@madison: @madison has joined the channel
@glenn393: @glenn393 has joined the channel
@sandeep908: @sandeep908 has joined the channel
@kulbir.nijjer: @kulbir.nijjer has joined the channel
@greetbot: @troy we will be friends until forever, just you wait and see :smiley:
@greetbot: Yessir! @kulbir.nijjer is here!
@mitchellh: @mitchellh has joined the channel
@greetbot: Yes! @glenn393 is here!
@greetbot: Fantastic! @sandeep908 is here!
@mayanks: Hey community, I am one of the original team members who helped create Apache Pinot, and a PMC member for the project. Looking forward to helping evangelize Apache Pinot and supporting the community.