#general


@hello472: @hello472 has joined the channel
@ysuo: Hi team, how can I check whether the instanceAssignmentConfigMap config has taken effect?
  @mayanks: One somewhat roundabout way of doing so would be to do a dry-run of rebalance and see if it shows any changes in ideal-state. If not, then it has taken effect.
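  (For reference, a dry-run rebalance goes through the controller REST API; a minimal sketch, assuming a table named `myTable` and a controller at `localhost:9000`:)
  ```
  curl -X POST "http://localhost:9000/tables/myTable/rebalance?type=OFFLINE&dryRun=true" \
    -H "accept: application/json"
  ```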
@harish.bohara: I have 2-3 fields in metricFieldSpecs. These columns capture the time taken to do some operation (e.g. time taken from sending to delivery of an item). Any idea how to get a histogram of this data (to be used in Superset)?
  @mayanks: Afaik, there isn’t an inbuilt histogram function. You could use percentileTDigest to get fast percentiles for a histogram. cc: @jackie.jxt @kharekartik
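  (A sketch of that suggestion — `PERCENTILETDIGEST(column, percentile)` is Pinot's t-digest percentile aggregation; the table and latency column here are hypothetical:)
  ```
  SELECT PERCENTILETDIGEST(delivery_latency_ms, 50) AS p50,
         PERCENTILETDIGEST(delivery_latency_ms, 90) AS p90,
         PERCENTILETDIGEST(delivery_latency_ms, 99) AS p99
  FROM myTable
  ```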
@saurabhkumarsharma96: @saurabhkumarsharma96 has joined the channel
@email2sandhu01: @email2sandhu01 has joined the channel
@karinwolok1: If anyone wants to submit a talk:
@maarten: @maarten has joined the channel

#random


@hello472: @hello472 has joined the channel
@saurabhkumarsharma96: @saurabhkumarsharma96 has joined the channel
@email2sandhu01: @email2sandhu01 has joined the channel
@maarten: @maarten has joined the channel

#troubleshooting


@wcxzjtz: hello, wondering how I can check if a query is using the range index. I added the index config like the following, but from the tracing info I didn’t see `RangeIndexBasedFilterOperator` being used. ``` "rangeIndexColumns": [ "some_column" ],``` btw, we are using Pinot 0.8
  @wcxzjtz: actually, I see it now. thanks. but it looks like it only works for offline tables?
  @wcxzjtz: @richard892 when you have time.
  @wcxzjtz: hold on. there may be some issue with my data.
  @xiangfu0: better to check with @richard892; likely that feature is not enabled.
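  (For context, `rangeIndexColumns` sits under `tableIndexConfig` in the table config; `some_column` is the poster's placeholder:)
  ```
  "tableIndexConfig": {
    "rangeIndexColumns": [
      "some_column"
    ]
  }
  ```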
@hello472: @hello472 has joined the channel
@ysuo: Hi, we deployed Presto based on the Helm file. It seems like OFFSET is not enabled. Any idea how to enable it?
  @kharekartik: @haitao
  @haitao: @xiangfu0 has more knowledge about the helm chart
  @xiangfu0: what is the offset you are referring to?
  @ysuo: Hi, when using `offset num1 limit num2` in a Presto query, it returned "Offset support is not enabled": ```presto:default> select * from table_name offset 10 limit 10; Query 20220516_231652_00449_7gh97 failed: Offset support is not enabled```
  @xiangfu0: oh, use `limit 10,10`
  @xiangfu0: also, pinot doesn’t support offset without ordering
  @xiangfu0: so just `select *` won’t give you consistent results
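  (Putting those together — stable pagination needs an explicit ordering plus the `LIMIT offset, count` form suggested above; the ordering column is a placeholder:)
  ```
  SELECT * FROM table_name ORDER BY id LIMIT 10, 10
  ```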
  @xiangfu0: why do you need offset 10?
  @ysuo: Hi, it’s the Presto query that returned "offset not enabled". We need offset for pagination.
  @xiangfu0: Yeah, check the Presto query syntax
  @xiangfu0: Note that pagination is not enabled
  @xiangfu0: So the results are not stable
  @xiangfu0: Better to fetch enough rows and cache them on the front-end
  @ysuo: :ok_hand:
  @ysuo: Thanks.
@dadelcas: Hello, I've got an issue with a realtime table which is consuming from a topic with 16 partitions. Pinot is consuming from all partitions except one, and I can't find any issues in the logs. Is there a way to force Pinot to consume from that partition? I've tried rebalancing the servers and reloading all segments, but it still won't consume from this one partition.
  @saurabhd336: Can you check if there are CONSUMING segments for all your partitions in ZK? You can use the controller UI to check that. Here's an example. It's under IDEALSTATES -> <tableName>_REALTIME. You should ideally have one segment per partition with state as "CONSUMING". If there are some partitions for which you don't have consuming segments, you might have to manually create them to resume consumption.
  @saurabhd336: The key in mapFields is of the format `<tableName>__<partitionGroupId>__<sequenceNumber>__<dateTime>`
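  (So a consuming segment for, say, partition 7 would be named something like `myTable__7__0__20220517T0000Z` — table name and timestamp invented for illustration.)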
  @kharekartik: Also, you can trigger the `RealtimeSegmentValidation` task to detect new partitions. This can be done via an API call to the controller: `curl -X GET "" -H "accept: application/json"`
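  (The URL above was omitted; on recent versions the controller's periodic-task endpoint can run this validation — the exact path and parameters below are an assumption, with host and table name as placeholders:)
  ```
  curl -X GET "http://localhost:9000/periodictask/run?taskname=RealtimeSegmentValidationManager&tableName=myTable_REALTIME" \
    -H "accept: application/json"
  ```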
  @dadelcas: I had checked ideal states to confirm the table wasn't consuming; there are no entries in ZK for this partition. I would rather avoid doing operations at this level.
  @kharekartik: @navi.trinity can you help here? What could cause only one partition to not show up?
  @dadelcas: I've run the segment validation task but still no luck
  @saurabhd336: Was any segment delete command run for this partition's segment @dadelcas?
  @saurabhd336: Or are there any segments in OFFLINE state?
  @dadelcas: There are no segments for this partition, nor have any been deleted
@saurabhkumarsharma96: @saurabhkumarsharma96 has joined the channel
@nair.a: Hi team, regarding the lookup/dimension table and array data type use case: we have created a dimension table with the following schema: ```{
  "schemaName": "test_dim_tags",
  "dimensionFieldSpecs": [
    { "name": "id", "dataType": "INT" },
    { "name": "tag_name", "dataType": "STRING", "singleValueField": false }
  ],
  "primaryKeyColumns": [ "id" ]
}``` Now when we use this table in a lookup with the fact table, the query returns no data or throws a NullPointerException. We wanted to use Pinot's array explode functionality along with lookup. Can someone please help us understand?
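  (For context, a lookup against that dimension table would take roughly this shape — the fact table and its `tag_id` join column are hypothetical; `LOOKUP(dimTable, dimColumn, joinKey, factJoinValue)` is the documented transform:)
  ```
  SELECT fact_id, LOOKUP('test_dim_tags', 'tag_name', 'id', tag_id) AS tags
  FROM fact_table
  ```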
  @richard892: I believe this is a feature gap in lookup
  @richard892: I'll take a look and see if there are barriers to adding it
  @nair.a: sure thanks @richard892
  @richard892: featurewise it looks good, do you have a stack trace for the NPE?
  @nair.a: ```[
  { "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)", "errorCode": 200 },
  { "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:242)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)", "errorCode": 200 },
  { "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)", "errorCode": 200 },
  { "message": "QueryExecutionError:\nProcessingException(errorCode:450, message:InternalError:\njava.lang.NullPointerException\n\tat org.apache.pinot.core.operator.combine.GroupByOrderByCombineOperator.mergeResults(GroupByOrderByCombineOperator.java:236)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:119)\n\tat org.apache.pinot.core.operator.combine.BaseCombineOperator.getNextBlock(BaseCombineOperator.java:50)", "errorCode": 200 }
  ]```
  @richard892: ok, this is most likely caused by the query being slow
  @richard892: are these lookups in unfiltered group bys?
  @nair.a: we had a few filter conditions, if that's what you are asking.
  @richard892: can you remove the lookup from the query and post the response metadata (numDocsScanned etc.) please?
  @nair.a: ```"exceptions": [], "numServersQueried": 12, "numServersResponded": 12, "numSegmentsQueried": 569, "numSegmentsProcessed": 32, "numSegmentsMatched": 32, "numConsumingSegmentsQueried": 4, "numDocsScanned": 37273560, "numEntriesScannedInFilter": 88491445, "numEntriesScannedPostFilter": 260914920, "numGroupsLimitReached": false, "totalDocs": 5011102229, "timeUsedMs": 595, "offlineThreadCpuTimeNs": 0, "realtimeThreadCpuTimeNs": 0, "offlineSystemActivitiesCpuTimeNs": 0, "realtimeSystemActivitiesCpuTimeNs": 0, "offlineResponseSerializationCpuTimeNs": 0, "realtimeResponseSerializationCpuTimeNs": 0, "offlineTotalCpuTimeNs": 0, "realtimeTotalCpuTimeNs": 0, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 1652704731377, "numRowsResultSet": 350```
  @richard892: ok, so it's quite a heavy query, and the lookup will make that worse because the approach employed is not very efficient. That makes a timeout, rather than feature incompleteness, the more likely diagnosis.
  @richard892: all I can say is that lookup isn't powerful enough to power anything but the simplest and lightest-weight join use cases, but the multi-stage query engine will solve problems like this one
  @nair.a: that's great. looking forward to it.
  @nair.a: @richard892 one more thing with the dimension table: lookups start to return null after some time, and we have to rerun the ingestion job to fix this. Any known reason?
@email2sandhu01: @email2sandhu01 has joined the channel
@maarten: @maarten has joined the channel
@stuart.millholland: So I've set up my controller/minions/servers to use a GCS bucket in a GKE environment. Is there an easy, push-button way to test that the GCS bucket permissions and such are working correctly? I don't have any data yet, so I'm curious if there's a way to test that things are working.
  @mayanks: Check the controller/server logs to see how PinotFS is initialized.
  @stuart.millholland: logs don't have any complaints
  @mayanks: Do the logs contain something like: `Initializing PinotFS for scheme` for the right deep-store (GCS)?
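  (For reference, GCS as deep store is wired up through controller config entries along these lines — bucket, project, and key path are placeholders; server configs mirror these with the `pinot.server.` prefix:)
  ```
  controller.data.dir=gs://my-bucket/pinot-data
  pinot.controller.storage.factory.class.gs=org.apache.pinot.plugin.filesystem.GcsPinotFS
  pinot.controller.storage.factory.gs.projectId=my-project
  pinot.controller.storage.factory.gs.gcpKey=/path/to/credentials.json
  pinot.controller.segment.fetcher.protocols=file,http,gs
  pinot.controller.segment.fetcher.gs.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
  ```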

#getting-started


@hello472: @hello472 has joined the channel
@filipdolinski: Hi all,
@filipdolinski: I am looking for a Spark connector for writing data to Pinot. I saw on GitHub that write support will be available in the future. Do you have any news about it, or tips on how to deal with this? Thank you in advance!
  @kharekartik: we currently don't support writing Spark dataframes/RDDs directly to Pinot. However, you can use our Spark plugin to read your data and dump it into Pinot's storage. You can find the documentation here -
  @kharekartik: Example recipe -
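  (A rough sketch of that path — launching a Pinot batch ingestion job on Spark with the standard launcher class; jar locations and the job spec file are placeholders, per the batch ingestion docs:)
  ```
  spark-submit \
    --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
    --master "local[2]" \
    --conf "spark.driver.extraClassPath=/path/to/pinot-plugins/*" \
    local:///path/to/pinot-all-jar-with-dependencies.jar \
    -jobSpecFile /path/to/sparkIngestionJobSpec.yaml
  ```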
@saurabhkumarsharma96: @saurabhkumarsharma96 has joined the channel
@email2sandhu01: @email2sandhu01 has joined the channel
@maarten: @maarten has joined the channel
@rbobbala: Hello team, I'm new to Apache Pinot. I have set up my Pinot cluster on my local laptop using KinD and Helm. My question is: what is the best way to automate the upload of new schema, table, and job (realtime & batch ingestion) files to Pinot?
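  (For reference, the usual building blocks for scripting this are the controller REST endpoints; a minimal sketch, with host and file names as placeholders:)
  ```
  # upload or update a schema
  curl -F schemaName=@my_schema.json "http://localhost:9000/schemas"
  # create a table from a table config
  curl -X POST -H "Content-Type: application/json" -d @my_table.json "http://localhost:9000/tables"
  ```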

#introductions


@hello472: @hello472 has joined the channel
@saurabhkumarsharma96: @saurabhkumarsharma96 has joined the channel
@email2sandhu01: @email2sandhu01 has joined the channel
@maarten: @maarten has joined the channel