#general


@fding: added an integration to this channel:
@nutbiz: @nutbiz has joined the channel
@g.kishore: removed an integration from this channel:
@aroopganguly: @aroopganguly has joined the channel

#random


@nutbiz: @nutbiz has joined the channel
@aroopganguly: @aroopganguly has joined the channel

#troubleshooting


@diogo.baeder: Quick question: is it possible to run a query on a certain table, but with an IN_SUBQUERY where the subquery is run against another table?
  @mayanks: Not at the moment, no
  @diogo.baeder: Thanks. Are there any plans to support it though?
  @mayanks: Likely yes, but no concrete timeline at the moment. This is venturing into full support for nested queries and joins
  @diogo.baeder: Got it, makes sense. Thanks man! :slightly_smiling_face:
  @g.kishore: @diogo.baeder can you please try the query... I don't remember restricting IN_SUBQUERY to the same table.
  @mayanks: Yes, please try it out @diogo.baeder, I might have missed something here.
  @diogo.baeder: Ah, nice, I'll try it out then :slightly_smiling_face:
  @diogo.baeder: Well my friends it just so happens that... YES, it does support subqueries to different tables!!! Yay!!! \o/
  @diogo.baeder: For reference, this is my query:
  ```
  select weight, caseid as weights_caseid
  from weights
  where in_subquery(caseid, 'select id_set(caseid) from brands_metrics where brand_id = 1000226 and date_ > 20150501 and date_ < 20150515 limit 1000000') = 1
    and in_subquery(caseid, 'select id_set(caseid) from filters where name = ''gender'' and values_ = 2 and date_ > 20150501 and date_ < 20150515 limit 1000000') = 1
    and date_ > 20150501 and date_ < 20150515
  limit 1000000
  ```
  and I get results! Guys, this is amazing news, because the fact that this works means that I don't even need to use Trino or anything like that - I can just do the aggregation of the data in the Pinot Broker itself! And it runs much faster than in Trino, actually :slightly_smiling_face:
  @mayanks: Thanks for confirming @diogo.baeder, we will add this to the doc. cc: @mark.needham
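  For reference alongside the doc update, a stripped-down sketch of the cross-table IN_SUBQUERY pattern confirmed above; `fact_table`, `dim_table`, `join_key` and `some_filter` are illustrative names, not tables from this thread:
  ```
  -- the inner query builds an ID_SET of join keys from another table;
  -- the outer filter keeps rows whose key appears in that set
  -- (table and column names here are placeholders)
  select col_a, col_b
  from fact_table
  where in_subquery(join_key, 'select id_set(join_key) from dim_table where some_filter = 42 limit 1000000') = 1
  limit 10
  ```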
  @diogo.baeder: Dude, I am *SO* happy with this... it's a game-changer for me, really! :heart: Next on: I'll change our structure and do some more complex stuff with Groovy in order to have some calculations done over JSON columns. And let me say this again: Pinot is ridiculously fast, OMG!!!
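  As a rough illustration of that kind of calculation over a JSON column, Pinot's `groovy(...)` and `jsonExtractScalar(...)` functions can be combined at query time, assuming Groovy functions are enabled on the cluster; the JSON path and script below are made up for illustration:
  ```
  -- extract one value from the JSON column, then post-process it with a Groovy script
  -- ('$.first_1000226' and the 'arg0 * 2.0' script are illustrative, not from this thread)
  select groovy('{"returnType":"DOUBLE","isSingleValue":true}',
                'arg0 * 2.0',
                jsonExtractScalar(brands_responses, '$.first_1000226', 'DOUBLE', 0)) as doubled_response
  from cases
  limit 10
  ```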
  @diogo.baeder: By the way: the query above ran in 28ms on my computer, with 1 year worth of data, but where each segment contains data for 1 day only - in total I have 202 segments (it's where we have data). If I was doing monthly merge roll-ups, this would be even faster. I can already see this awesome database becoming our next data source for our main data backend! :slightly_smiling_face:
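  On the monthly merge roll-ups: a rough sketch of what that could look like with the Minion MergeRollupTask, assuming a Minion is running; the level name `1month` and the periods are illustrative and worth checking against the merge-rollup docs before use:
  ```
  "task": {
    "taskTypeConfigsMap": {
      "MergeRollupTask": {
        "1month.mergeType": "rollup",
        "1month.bucketTimePeriod": "30d",
        "1month.bufferTimePeriod": "30d"
      }
    }
  }
  ```
  This block goes into the table config; with `mergeType` set to `rollup`, the daily segments in each bucket are merged and metric columns aggregated.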
  @mayanks: Glad you are able to use Pinot to solve your use cases.
  @diogo.baeder: Yep. We already use it for internal analytics, which is important but not mission-critical. Now this experiment I'm doing is to serve our main product, so if everything works, we'll put up another cluster, but a much more robust one, with more and beefier nodes.
@nutbiz: @nutbiz has joined the channel
@diogo.baeder: Me again. New issue, not sure if it's me or if it's a genuine bug: I'm trying to ingest JSON data for one of my columns, but I keep getting an error for that column: `Cannot read single-value from Collection:`. More in the thread.
  @diogo.baeder: This is a part of the exception, which I believe should contain the necessary info to investigate it:
  ```
  Caused by: java.lang.IllegalStateException: Cannot read single-value from Collection: [1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 9, 1, 1] for column: brands_responses
      at shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:721) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
      at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.standardizeCollection(DataTypeTransformer.java:176) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
      at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.standardize(DataTypeTransformer.java:119) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
      at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:63) ~[pinot-all-0.10.0-jar-with-dependencies.jar:0.10.0-30c4635bfeee88f88aa9c9f63b93bcd4a650607f]
      ... 13 more
  2022/04/25 00:19:39.735 ERROR [SegmentGenerationJobRunner] [pool-2-thread-1] Failed to generate Pinot segment for file - file:/sensitive-data/outputs/cases/br/20150501.json
  ```
  @diogo.baeder: This is a JSON file I'm trying to consume:
  ```
  [
    {
      "brands_responses": {
        "first_1000226": 2,
        "second_1000226": 1,
        "third_1000226": 1,
        "fourth_1000226": 1,
        "fifth_1000226": 9,
        "sixth_1000226": 2,
        "seventh_1000226": 1,
        "eighth_1000226": 1,
        "ninth_1000226": 1,
        "tenth_1000226": 1,
        "eleventh_1000226": 1,
        "twelfth_1000226": 1,
        "thirteenth_1000226": 1
      },
      "caseid": 251214750,
      "date_": 20150501,
      "pmxid": 52735743,
      "region": "br",
      "sector_id": 1010,
      "uuid": "6702e33a-e961-4f62-b9df-2d65e4fe3fd5",
      "weight": 0.935066
    }
  ]
  ```
  @diogo.baeder: This is my schema:
  ```
  {
    "schemaName": "cases_schema",
    "dimensionFieldSpecs": [
      { "name": "brands_responses", "dataType": "JSON", "maxLength": 2147483647 },
      { "name": "caseid", "dataType": "INT" },
      { "name": "pmxid", "dataType": "INT" },
      { "name": "region", "dataType": "STRING" },
      { "name": "sector_id", "dataType": "INT" },
      { "name": "uuid", "dataType": "STRING" }
    ],
    "metricFieldSpecs": [
      { "name": "weight", "dataType": "FLOAT" }
    ],
    "dateTimeFieldSpecs": [
      { "name": "date_", "dataType": "INT", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd", "granularity": "1:DAYS" }
    ]
  }
  ```
  @diogo.baeder: Sorry, while gathering the information I think I found what the issue is, hold on...
  @diogo.baeder: Nah, it's still erroring. I had forgotten to add the fields to `noDictionaryColumns`, but even with that added it still errors out
  @diogo.baeder: This is the table definition I send to Pinot when creating it:
  ```
  {
    "tableName": "cases",
    "tableType": "OFFLINE",
    "segmentsConfig": {
      "schemaName": "cases_schema",
      "timeColumnName": "date_",
      "timeType": "DAYS",
      "replicasPerPartition": "1",
      "replication": "1"
    },
    "tableIndexConfig": {
      "loadMode": "MMAP",
      "noDictionaryColumns": [ "brands_responses" ],
      "jsonIndexColumns": [],
      "invertedIndexColumns": [],
      "nullHandlingEnabled": true,
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "region": { "functionName": "Murmur", "numPartitions": 400 }
        }
      }
    },
    "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
    "metadata": { "customConfigs": {} },
    "routing": {
      "instanceSelectorType": "balanced",
      "segmentPrunerTypes": [ "partition", "time" ]
    },
    "transformConfigs": [
      { "columnName": "brands_responses", "transformFunction": "jsonFormat(\"brands_responses\")" }
    ]
  }
  ```
  @diogo.baeder: I noticed, however, that `transformConfigs` is missing from the table definition when looking at how the table got created
  @diogo.baeder: This is the table config I see in the incubator UI:
  ```
  {
    "OFFLINE": {
      "tableName": "cases_OFFLINE",
      "tableType": "OFFLINE",
      "segmentsConfig": {
        "timeType": "DAYS",
        "schemaName": "cases_schema",
        "replication": "1",
        "timeColumnName": "date_",
        "allowNullTimeValue": false,
        "replicasPerPartition": "1"
      },
      "tenants": { "broker": "DefaultTenant", "server": "DefaultTenant" },
      "tableIndexConfig": {
        "invertedIndexColumns": [],
        "noDictionaryColumns": [ "brands_responses" ],
        "segmentPartitionConfig": {
          "columnPartitionMap": {
            "region": { "functionName": "Murmur", "numPartitions": 400 }
          }
        },
        "rangeIndexVersion": 2,
        "jsonIndexColumns": [],
        "autoGeneratedInvertedIndex": false,
        "createInvertedIndexDuringSegmentGeneration": false,
        "loadMode": "MMAP",
        "enableDefaultStarTree": false,
        "enableDynamicStarTreeCreation": false,
        "aggregateMetrics": false,
        "nullHandlingEnabled": true
      },
      "metadata": { "customConfigs": {} },
      "routing": {
        "segmentPrunerTypes": [ "partition", "time" ],
        "instanceSelectorType": "balanced"
      },
      "isDimTable": false
    }
  }
  ```
  @diogo.baeder: Alright, I figured out that I was missing an `ingestionConfig` as part of the table config - the documentation about JSON indexing is wrong; it doesn't mention this field. But even using this field, it doesn't work. If I send the correct payload I get:
  ```
  {
    "code": 400,
    "error": "Arguments of a transform function '[brands_responses]' cannot contain the destination column 'brands_responses'"
  }
  ```
  @diogo.baeder: Alright, it's working now. I figured out that the source field from which the JSON is transformed cannot have the same name as the destination column. I wish Pinot were able to just do the transformation for us without all this extra configuration, though.
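  To spell out the shape that ended up working (as a sketch, not the exact config used here): the destination column gets a different name from the raw field, and the transform lives under `ingestionConfig`. The `_json` suffix below is only an illustration:
  ```
  "ingestionConfig": {
    "transformConfigs": [
      {
        "columnName": "brands_responses_json",
        "transformFunction": "jsonFormat(brands_responses)"
      }
    ]
  }
  ```
  Here `brands_responses` is the field name in the input files and `brands_responses_json` is the JSON column declared in the schema, so the transform's argument no longer matches its destination column and the 400 error above goes away.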
  @diogo.baeder: Fixed on my side. I just opened a ticket to propose changes to the docs to improve them:
@aroopganguly: @aroopganguly has joined the channel
@very312: Hi, does schema evolution only work for batch tables? We wanted to add a new column to a hybrid table (offline table + realtime streaming table) which includes upsert columns. When we added the new column to the schema config, we also needed to change the table config, since the new column needs to be set to "OVERWRITE". That meant we had to delete the original realtime table and re-create it. Is that the right way of creating new columns in a streaming upsert table? The documentation only shows batch tables.
  @mayanks: Schema evolution works for hybrid tables as well. As for upsert cc: @yupeng
  @very312: When I tried the steps below in order, here is the error message. cc. @yupeng
  • schema a' add (i.e. schema a modify)
  • table A' modify (i.e. table A modify - since we needed to set the test_evolution column to "OVERWRITE" in the table config)
  • run bin/pinot-admin.sh AddTable -schemaFile a_schema.json -tableConfigFile a_table.json -exec;
  • send new msg
  ```
  [
    {
      "message": "MergeResponseError:\nData schema mismatch between merged block: [card_alert(STRING),card_bin(STRING),card_id(STRING),card_issuer(STRING),card_number(STRING),card_type(STRING),create_time(TIMESTAMP),delete_time(TIMESTAMP),test_evolution(INT),update_time(TIMESTAMP),user_id(STRING)] and block to merge: [card_alert(STRING),card_bin(STRING),card_id(STRING),card_issuer(STRING),card_number(STRING),card_type(STRING),create_time(TIMESTAMP),delete_time(TIMESTAMP),update_time(TIMESTAMP),user_id(STRING)], drop block to merge",
      "errorCode": 500
    }
  ]
  ```
  @very312: Can you tell us the process of adding a new column which would be overwritten, using the meetupRsvp example?
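  For the non-upsert part of this, a sketch of the usual schema-evolution flow against the controller REST API, assuming a controller at `localhost:9000`; the schema, file, and table names below are placeholders, not the ones from this thread:
  ```
  # 1. Update the existing schema in place (with the new column added to the schema JSON)
  curl -X PUT -H "Content-Type: application/json" -d @my_schema.json \
    localhost:9000/schemas/my_schema

  # 2. Reload segments so existing data picks up the new column with default values
  curl -X POST "localhost:9000/segments/mytable/reload?type=REALTIME"
  ```
  This covers plain schema evolution; the upsert OVERWRITE behavior above is a separate table-config concern.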

#tmp


@fding: added an integration to this channel:
@g.kishore: removed an integration from this channel:

#getting-started


@nutbiz: @nutbiz has joined the channel
@aroopganguly: @aroopganguly has joined the channel