#general


@humengyuk18: How do I cast a string value to int or long using sql in pinot?
  @jackie.jxt: Are you planning to aggregate on it? If so, it should be auto-casted
  @jackie.jxt: E.g. `select sum(numericStringCol) from myTable`
  @humengyuk18: I see, but this is a multi-value column. I got it working using a Groovy UDF.
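For reference, a minimal sketch of the Groovy approach mentioned above for casting a multi-value string column, assuming a placeholder column `mvStringCol` and Pinot's inbuilt GROOVY function, whose first argument is JSON metadata describing the return type:
```SELECT GROOVY(
  '{"returnType":"LONG","isSingleValue":false}',
  'arg0.collect { Long.parseLong(it) }',
  mvStringCol
)
FROM myTable```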
@patidar.rahul8392: ```{ "ID":"I", "ct_ts":"2021-05-12 08:13:09.000190", "time_ts":"2021-05-12T08:13:16.152000", "Pid":"111111", "Task":{ "STARTDATETIME":"20210512081308", "Id1":"111111111111", "Id2":null } }```
@patidar.rahul8392: Hi everyone, kindly suggest how to write the JSON schema for this sample data. I am able to see the data for the first 3 columns in Pinot, but cannot see any data for STARTDATETIME, Id1 and Id2.
@patidar.rahul8392: Kindly suggest how to access the data for the remaining 3 fields.
@patidar.rahul8392: Hi, I am following @npawar's tutorial and trying to build my JSON schema. It works fine when the data is flat like in the tutorial, but in my data all of the remaining fields are nested inside one Task object. Kindly suggest.
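One possible approach, sketched here as an assumption about the setup above: keep STARTDATETIME, Id1 and Id2 as top-level columns in the Pinot schema and flatten them out of the nested Task object with ingestion transforms in the table config (the target column names below are placeholders):
```"ingestionConfig": {
  "transformConfigs": [
    { "columnName": "startDateTime", "transformFunction": "jsonPathString(Task, '$.STARTDATETIME')" },
    { "columnName": "id1", "transformFunction": "jsonPathString(Task, '$.Id1')" },
    { "columnName": "id2", "transformFunction": "jsonPathString(Task, '$.Id2')" }
  ]
}```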
@kmvb.tau: Does Pinot support partitioning only for realtime tables? For the offline table, all partition data is written to the same segment file. Segment metadata.properties: `column.RELATEDID.partitionFunction = Murmur` `column.RELATEDID.numPartitions = 10` `column.RELATEDID.partitionValues = 0,1,2,3,4,5,6,7,8,9` Note: Running data ingestion using pinot-admin.sh LaunchDataIngestionJob
  @mayanks: For offline, Pinot expects the data to be already partitioned. In the future we will add the capability for Pinot to sort and partition the data during offline ingestion.
  @kmvb.tau: Does Pinot support partitioning on multiple columns?
  @mayanks: No
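For completeness, the partitioning declaration itself lives in the table config's tableIndexConfig for both table types; a sketch assuming the column from the metadata.properties above (for an offline table the input data must already be partitioned this way, as noted):
```"tableIndexConfig": {
  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "RELATEDID": {
        "functionName": "Murmur",
        "numPartitions": 10
      }
    }
  }
}```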
@riteshrathi: @riteshrathi has joined the channel
@humengyuk18: Why is the datetimeconvert transform function much slower than time_floor in Druid? We are migrating our Druid table to Pinot, but found the datetime transform and granularity handling to be very slow compared to Druid.
  @g.kishore: what is the conversion? simple date format to epoch?
  @humengyuk18: Epoch to epoch with 15 minutes granularity.
  @g.kishore: that should be fast, what's the query?
  @humengyuk18: ```SELECT DATETIMECONVERT(datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES'), '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES'), COUNT(*) FROM product_log WHERE datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') >= 1620830640000 AND datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') < 1620917040000 AND method = 'DeviceInternalService.CheckDeviceInSameGroup' AND container_name = 'whale-device' AND error > '0' GROUP BY DATETIMECONVERT(datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES'), '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '1:MINUTES') ORDER BY COUNT(*) DESC```
  @humengyuk18: ```SELECT datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES'), COUNT(*) FROM product_log WHERE datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') >= 1620830760000 AND datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') < 1620917160000 AND method = 'DeviceInternalService.CheckDeviceInSameGroup' AND container_name = 'whale-device' AND error > '0' GROUP BY datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') ORDER BY COUNT(*) DESC``` The cost is similar if I deselect the 1 minute time grain in Superset.
  @humengyuk18: Druid returned result under 1 second, but Pinot cost 5-6 seconds.
  @g.kishore: Can you paste the response?
  @g.kishore: I think segment pruning is not happening.
  @humengyuk18: ```{ "resultTable": { "dataSchema": { "columnDataTypes": [ "LONG", "LONG" ], "columnNames": [ "datetimeconvert(__time,'1:MILLISECONDS:EPOCH','1:MILLISECONDS:EPOCH','30:MINUTES')", "count(*)" ] }, "rows": [ [ 1620873000000, 180 ], [ 1620869400000, 179 ], [ 1620871200000, 178 ], [ 1620894600000, 172 ], [ 1620892800000, 166 ], [ 1620874800000, 164 ], [ 1620876600000, 163 ], [ 1620896400000, 163 ], [ 1620867600000, 162 ], [ 1620885600000, 161 ] ] }, "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 41, "numSegmentsProcessed": 41, "numSegmentsMatched": 12, "numConsumingSegmentsQueried": 3, "numDocsScanned": 7706, "numEntriesScannedInFilter": 195554753, "numEntriesScannedPostFilter": 7706, "numGroupsLimitReached": false, "totalDocs": 165272282, "timeUsedMs": 2335, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 1620917392724 }```
  @g.kishore: ```"numSegmentsProcessed": 41, "numSegmentsMatched": 12, "numConsumingSegmentsQueried": 3, "numDocsScanned": 7706, "numEntriesScannedInFilter": 195554753, "numEntriesScannedPostFilter": 7706,```
  @g.kishore: this should be processed in milliseconds
  @g.kishore: @jackie.jxt looks like the pruning is not efficient if the UDF is in the predicate
  @g.kishore: @humengyuk18 can you file an issue?
  @humengyuk18: ok, I will file an issue for this.
  @g.kishore: ```SELECT datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES'), COUNT(*) FROM product_log WHERE __time >= 1620830760000 AND __time < 1620917160000 AND method = 'DeviceInternalService.CheckDeviceInSameGroup' AND container_name = 'whale-device' AND error > '0' GROUP BY datetimeconvert(__time, '1:MILLISECONDS:EPOCH', '1:MILLISECONDS:EPOCH', '30:MINUTES') ORDER BY COUNT(*) DESC```
  @g.kishore: What's the time for this query?
  @humengyuk18: ```{ "resultTable": { "dataSchema": { "columnDataTypes": [ "LONG", "LONG" ], "columnNames": [ "datetimeconvert(__time,'1:MILLISECONDS:EPOCH','1:MILLISECONDS:EPOCH','30:MINUTES')", "count(*)" ] }, "rows": [ [ 1620873000000, 180 ], [ 1620869400000, 179 ], [ 1620871200000, 178 ], [ 1620894600000, 172 ], [ 1620892800000, 166 ], [ 1620874800000, 164 ], [ 1620876600000, 163 ], [ 1620896400000, 163 ], [ 1620867600000, 162 ], [ 1620865800000, 161 ] ] }, "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 41, "numSegmentsProcessed": 12, "numSegmentsMatched": 12, "numConsumingSegmentsQueried": 3, "numDocsScanned": 7770, "numEntriesScannedInFilter": 68503679, "numEntriesScannedPostFilter": 7770, "numGroupsLimitReached": false, "totalDocs": 165381107, "timeUsedMs": 647, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 1620917833431 }``` This one only takes 647ms to finish.
@kelvin: @kelvin has joined the channel
@humengyuk18: ```{ "resultTable": { "dataSchema": { "columnDataTypes": [ "LONG", "LONG" ], "columnNames": [ "datetimeconvert(__time,'1:MILLISECONDS:EPOCH','1:MILLISECONDS:EPOCH','30:MINUTES')", "count(*)" ] }, "rows": [ [ 1620873000000, 180 ], [ 1620869400000, 179 ], [ 1620871200000, 178 ], [ 1620894600000, 172 ], [ 1620892800000, 166 ], [ 1620874800000, 164 ], [ 1620876600000, 163 ], [ 1620896400000, 163 ], [ 1620867600000, 162 ], [ 1620865800000, 161 ] ] }, "exceptions": [], "numServersQueried": 1, "numServersResponded": 1, "numSegmentsQueried": 41, "numSegmentsProcessed": 12, "numSegmentsMatched": 12, "numConsumingSegmentsQueried": 3, "numDocsScanned": 7770, "numEntriesScannedInFilter": 68503679, "numEntriesScannedPostFilter": 7770, "numGroupsLimitReached": false, "totalDocs": 165381107, "timeUsedMs": 647, "segmentStatistics": [], "traceInfo": {}, "minConsumingFreshnessTimeMs": 1620917833431 }``` This one only takes 647ms to finish.
@pedro.cls93: Hello, if I update an existing realtime table with a new transformConfig, is the transformed field automatically computed for existing segments, or is there some endpoint I need to call to force that computation? Same question for altering an existing transformConfig.
  @mayanks: Afaik auto recompute of transforms is not supported.
  @pedro.cls93: So I need to recreate the table to force the update?
  @mayanks: Yeah. @npawar can you also confirm
  @npawar: You have to call the reload segments API
  @mayanks: @npawar Is that supported for realtime also?
  @npawar: It should work same for the completed segments
  @mayanks: I see. I stand corrected then.
  @mayanks: @pedro.cls93
  @npawar: Note that it will only work if the arguments to the new transform function are already present
  @mayanks: Could we document this @npawar
  @pedro.cls93: Perfect, sounds just like my use-case thank you both!
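For reference, the reload segments API mentioned above is exposed on the controller; a sketch of the relevant endpoints (table and segment names are placeholders):
```POST /segments/{tableName}/reload
POST /segments/{tableName}/{segmentName}/reload```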
@pedro.cls93: What are the performance implications of defining a dimension field in a schema as a string with a max length of Integer.MaxValue (2GB if all the space is fully used)?
  @pedro.cls93: This field is a stringified json, usually between 20 chars and 4000 chars (~20% of the time). What is the impact of json indexing by it?
  @mayanks: Will there be filtering on this column? If so, may be better to use JSON if it is indeed JSON.
  @pedro.cls93: There could potentially be some filtering, yes, though right now the use-case is simply to extract fields within this json field. What do you mean by "use JSON if it is indeed JSON"?
  @mayanks: Use JSON indexing
  @mayanks:
  @pedro.cls93: I had it initially, but by default Pinot's strings are capped at 512 chars in length. For the index to truly be useful I need to set this field's length to Integer.MaxValue (2^31-1 chars). I'm trying to understand if having fields this large should be a concern.
  @mayanks: What's the max size?
  @mayanks: Raw strings don't have an inv index at the moment, which is why I suggested JSON indexing
  @pedro.cls93: 2³¹ -1, realistically I think the largest I've seen was around ~20k chars
  @mayanks: Yeah, it will have a perf impact. Do you really need the entire 2GB JSON blob? Or are there specific fields you can extract out of it?
  @pedro.cls93: Can you quantify (very roughly) what the perf impact would be?
  @pedro.cls93: The issue is that the JSON blob is dynamic; there is no fixed schema. In some cases we have text translations in the payload, which can be quite large.
  @mayanks: I don't have any concrete numbers for that. I'll have to see the overall data size, and query
  @mayanks: Do you need those text translations?
  @mayanks: If not, maybe filter them out.
  @pedro.cls93: My consumers do, yes.
  @mayanks: I see
  @mayanks: What's your overall latency SLA requirement?
  @pedro.cls93: < 1 second for queries which access json blob. <100ms for top-level concrete fields.
  @mayanks: Ok, may have to do some quick test for the largest json blob.
  @pedro.cls93: Preliminary results tell me the largest json blob is 40500 chars long.
  @mayanks: I meant like actually querying
  @pedro.cls93: I am, it just takes a while to get long-term historical data.
  @mayanks: I'd say just do one day may be? then we can extrapolate?
  @pedro.cls93: Largest in the past month was 40k
  @mayanks: Of course, with actual data, it will be more accurate.
  @pedro.cls93: Unfortunately I can not share the actual data, it may contain PII
  @mayanks: Oh I was suggesting for you to run it
  @pedro.cls93: Ah I see now what you meant, I'll try that tomorrow probably.
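To make the JSON-index route discussed above concrete, a minimal sketch assuming a placeholder column named `payload` and a hypothetical `$.status` field: raise maxLength in the schema's dimensionFieldSpecs, list the column under jsonIndexColumns in tableIndexConfig, and then filter with JSON_MATCH and read fields with JSON_EXTRACT_SCALAR.
```"dimensionFieldSpecs": [
  { "name": "payload", "dataType": "STRING", "maxLength": 2147483647 }
]```
```"tableIndexConfig": {
  "jsonIndexColumns": ["payload"]
}```
```SELECT JSON_EXTRACT_SCALAR(payload, '$.status', 'STRING')
FROM myTable
WHERE JSON_MATCH(payload, '"$.status"=''active''')```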
@kr.vikas: @kr.vikas has joined the channel
@avasudevan: I have added a few new columns to the Pinot table and Pinot schema. In order for the new columns to be populated, I did `Reload All Segments` for the table in the UI. 2 questions here: • Is `Reload All Segments`, which re-indexes the data, the right approach to re-populate the new columns? • I don't see the progress of `Reload All Segments`. I see this PR is completed; which release is it part of?
  @mayanks: Yes, reload all segments is the right approach
  @mayanks: The PR should be part of 0.7.1 (latest)
  @avasudevan: I see `GET /segments/{tableName}/reload-status` as the API endpoint in the PR documentation, but I don't see that endpoint in my swagger. Could you help me here with the right endpoint to call to get the status of `Reload All Segments`?
  @avasudevan: @mayanks ^^
  @mayanks: Hmm, I only see `segments/{tableName}/metadata` to fetch all segment metadata. I didn't find the reload status in the PR
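For reference, the metadata endpoint mentioned above can be hit on the controller to check whether the reloaded segments picked up the new columns (a sketch; the table name is a placeholder):
```GET /segments/{tableName}/metadata```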

#random


@riteshrathi: @riteshrathi has joined the channel
@kelvin: @kelvin has joined the channel
@kr.vikas: @kr.vikas has joined the channel

#troubleshooting


@riteshrathi: @riteshrathi has joined the channel
@kelvin: @kelvin has joined the channel
@kr.vikas: @kr.vikas has joined the channel

#docs


@keweishang: @keweishang has joined the channel
@keweishang: @keweishang has left the channel

#pinot-dev


@keweishang: @keweishang has joined the channel