#general


@surajkmth29: Hi Folks, is there a constraint on which tenants dimension tables can be assigned to? Does the lookup UDF join work when a table and a dim table are in different tenants?
  @mayanks: The dim table has to be colocated with the table it is joined to (on the same server nodes). However, note that a single node can have multiple tenant tags
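For reference, Pinot expresses this join with the `LOOKUP` UDF rather than a SQL JOIN. A minimal sketch, with hypothetical table names `orders` (fact) and `customerDim` (a dimension table keyed on `customerId`):
```sql
-- "orders" and "customerDim" are hypothetical names; customerDim must be a
-- dimension table whose segments are colocated with the segments of orders.
-- Signature: LOOKUP(dimTableName, dimColumnToFetch, dimJoinKey, factJoinKeyValue)
SELECT customerId,
       LOOKUP('customerDim', 'customerName', 'customerId', customerId) AS customerName,
       amount
FROM orders
LIMIT 10
```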
@diana.arnos: @diana.arnos has joined the channel
@mapshen: Are we considering upgrading the Kafka lib version to something newer, like 2.4+?

#random


@diana.arnos: @diana.arnos has joined the channel

#troubleshooting


@diana.arnos: @diana.arnos has joined the channel
@mapshen: Any idea how to troubleshoot this error message? > 2021/11/19 22:36:05.496 INFO [CurrentStateComputationStage] [HelixController-pipeline-task-pinot-prod-(aa26cf97_TASK)] Event aa26cf97_TASK : Ignore a pending message ee7f9ef0-1de9-4737-b0b4-db4a4e1b9073 for a non-exist resource table0_REALTIME and partition table0__0__0__20211119T2150Z
  @mayanks: Doesn't seem like an error message? Did you have a table that you deleted?
  @mapshen: the table just got created, but the incoming messages simply got ignored
  @mapshen: i can see the table configs in Zk

#getting-started


@bagi.priyank: i am noticing that disk on a controller instance starts filling up pretty fast. what can i do to slow it down?
  @xiangfu0: Are you using cloud? If so, you can configure a deep store to offload controller disk usage to s3/gcs
  @bagi.priyank: yes, running via launcher script on aws ec2 instances for now. what is it writing to disk?
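For context, without a deep store the controller keeps copies of segments on its local disk. A minimal sketch of controller properties for offloading to S3 using Pinot's S3 filesystem plugin (bucket, path, and region below are placeholders):
```
controller.data.dir=s3://my-pinot-bucket/pinot-segments
controller.local.temp.dir=/tmp/pinot-tmp-data
pinot.controller.storage.factory.class.s3=org.apache.pinot.plugin.filesystem.S3PinotFS
pinot.controller.storage.factory.s3.region=us-east-1
pinot.controller.segment.fetcher.protocols=file,http,s3
pinot.controller.segment.fetcher.s3.class=org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher
```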
@bagi.priyank: is
```json
"starTreeIndexConfigs": [
  {
    "dimensionsSplitOrder": [
      "filter_field_1",
      "filter_field_2",
      "filter_field_3",
      "event_ts"
    ],
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": [
      "DISTINCTCOUNTHLL__metric_field_1",
      "SUM__metric_field_2"
    ],
    "maxLeafRecords": 10000
  }
]
```
the correct star-tree config for queries like
```sql
SELECT DATETIMECONVERT(event_ts, '1:MICROSECONDS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH tz(UTC)', '1:HOURS') AS time_interval,
       SUM(metric_field_1)
FROM table_name
WHERE filter_field_1 = 'value_1' AND filter_field_2 = 'value_2'
GROUP BY time_interval
ORDER BY time_interval, SUM(metric_field_2) DESC
```
and
```sql
SELECT DATETIMECONVERT(event_ts, '1:MICROSECONDS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH tz(UTC)', '1:HOURS') AS time_interval,
       filter_field_3,
       DISTINCT_COUNT_HLL(metric_field_1),
       SUM(metric_field_2)
FROM table_name
WHERE filter_field_1 = 'value_1' AND filter_field_2 = 'value_2'
GROUP BY time_interval, filter_field_3
ORDER BY time_interval, SUM(metric_field_2) DESC
```
@bagi.priyank: i am seeing slow query performance (~3-5 seconds) and wondering if i am setting up the index right
@bagi.priyank: i am ingesting close to 5 million rows per minute...and wondering if this is happening because of too many segments?
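One mismatch worth checking in the config above: the first query aggregates `SUM(metric_field_1)`, but `functionColumnPairs` only lists `DISTINCTCOUNTHLL__metric_field_1` and `SUM__metric_field_2`, so the star-tree cannot serve that aggregation. A sketch of the pairs that would cover both queries, assuming the rest of the config stays the same:
```json
"functionColumnPairs": [
  "DISTINCTCOUNTHLL__metric_field_1",
  "SUM__metric_field_1",
  "SUM__metric_field_2"
]
```
Also, `event_ts` at raw microsecond granularity in `dimensionsSplitOrder` keeps record cardinality very high, which limits how much the star-tree can pre-aggregate; deriving a coarser time column (e.g. hourly) at ingestion and using that in the split order is a common mitigation.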
@diana.arnos: @diana.arnos has joined the channel
@diana.arnos: Hello there :wave: I'm developing something that uses Pinot, consuming straight from a new Kafka topic. I was able to run everything I need and it is beautiful (thanks for the work on this project :muscle: ) Now I'm trying to improve some things in my project and wondered if there is a way to use a schema registry instead of keeping the table schema inside the project itself. What I would like to happen: I have a JSON schema for the topic Pinot will consume from, and instead of manually editing/creating the table schema, I would like Pinot to read the JSON schema from my registry and automagically use it when ingesting. I'm not sure if the configs `stream.kafka.decoder.prop.schema.registry.rest.url` and `stream.kafka.decoder.prop.schema.registry.schema.name` could help me achieve this.
  @richard892: I have been working on JSON schema inference recently; I'm curious how you would weigh using a JSON schema vs inference
  @richard892: I guess if you have the schema already you don't want Pinot to mess around figuring it out. One of the problems with JSON schema is it allows variant types, so e.g. a field can be a string or a double, and this sort of thing can actually be better handled in some ways with inference (if the values for the field are only ever doubles then it will be inferred as a double, or if the string values have temporal locality then they can be handled better than creating a sparse column for a handful of values)
  @diana.arnos: What exactly do you mean by work with inference? What I would like to do is to not have to keep the table schema together with my project. I would like to point to a registry URL and a schema name in the table config and have Pinot find its way from there.
  @richard892: > What exactly do you mean by work with inference? I have been working on a feature to infer schemas from JSON data, which is aimed at use cases where there is no schema for the records, but it could also support your use case reasonably well.
  @richard892: if you had access to such a feature, would you still want to be able to point pinot to the schema?
  @diana.arnos: Probably not, inference would solve my problem. But _maybe_ it would have some trouble dealing with the date format I need to use: RFC3339, which is basically a string with a very specific format...
  @richard892: exactly, inferring dates is one of the headaches
  @diana.arnos: right now I have my dateTime field with the following config:
```json
{
  "name": "operationDate",
  "dataType": "STRING",
  "format": "1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "granularity": "1:MILLISECONDS"
}
```
A pretty dirty and ugly workaround would be to infer it as a string and query it using the now-available `LIKE` operation. So if I wanted to group data by month, I could try something like `WHERE operationDate LIKE '2021-10%'`. It could work...
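A less lossy alternative to `LIKE`, given the SIMPLE_DATE_FORMAT declared above, would be re-bucketing at query time with `DATETIMECONVERT`. A sketch, where the table name and the `yyyy-MM` output format for month grouping are assumptions:
```sql
-- "myTable" is a hypothetical name; the input format matches the operationDate
-- field config above (single quotes doubled inside the SQL string literal).
SELECT DATETIMECONVERT(operationDate,
         '1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd''T''HH:mm:ss.SSSZ',
         '1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM',
         '1:DAYS') AS month_bucket,
       COUNT(*)
FROM myTable
GROUP BY month_bucket
```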
  @mayanks: Side note, the schema in Pinot can potentially be different from upstream (if you use transforms or derived columns, or don't use some of the columns from upstream).
  @diana.arnos: That's a good point :thinking_face:

#releases


@diana.arnos: @diana.arnos has joined the channel

#pinot-docsrus


@jmeyer: Hello :wave: I discovered today that `CAST` exists. I can't find it in the docs and don't know much about it, but if I can help document it, I could give a hand this weekend
@mark.needham: I guess this would be the best page to document it -
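For reference while documenting it, Pinot's `CAST` follows the standard SQL form. A small sketch (column and table names are hypothetical):
```sql
-- Convert types at query time; "score", "userId", and "myTable" are hypothetical.
SELECT CAST(score AS INT) AS score_int,
       CAST(userId AS STRING) AS user_id_str
FROM myTable
LIMIT 10
```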