#general


@deepakcse2k5: @deepakcse2k5 has joined the channel
@deepakcse2k5: Is an update query possible using Pinot?
  @mayanks: If you mean SQL `update` statement, then no. What's your use case?
  @deepakcse2k5: we are using some update statements, basically to update a new column based on another column in an offline table
  @g.kishore: you can use the derived column feature for that
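  For reference, a minimal sketch of a derived column configured as an ingestion transform in the table config (the column names here are hypothetical; `concat` is one of Pinot's built-in scalar functions): ```"ingestionConfig": {
  "transformConfigs": [
    {
      "columnName": "fullName",
      "transformFunction": "concat(firstName, lastName, ' ')"
    }
  ]
}```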
  @vibhor.jain: Hi All, what is the generally preferred approach for retrofitting old data? I see that MS Teams uses Pinot. If I send a message via Teams and later update it, how can such a use case be handled in Pinot? Suggestions welcome.
  @g.kishore: You can use upsert feature
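  For reference, a minimal sketch of the upsert setup (the primary-key column name is hypothetical). Upsert works on realtime tables only; the schema declares a primary key: ```"primaryKeyColumns": ["messageId"]``` and the realtime table config enables upsert mode: ```"upsertConfig": {
  "mode": "FULL"
}```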
@ravi.maddi: *Is it correct?* I have a column containing a list of integers ("madIds": [1111, 2222, 3444]). For that I am writing the following in the schema config file; please correct and confirm for me. ```{ "name": "madIds", "datatype": "INT", "delimiter":",", "singleValueField":false },```
  @fx19880617: I think this is ok; you don’t need to set the delimiter in the schema. It should be up to the record reader how the data is parsed
  @fx19880617: What’s your data format? If it’s JSON, then the parser should parse it to an array already
  @ravi.maddi: ok, got it, thanks
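  Putting the replies above together, the corrected multi-value field spec would be (note the `dataType` casing and the dropped delimiter): ```{
  "name": "madIds",
  "dataType": "INT",
  "singleValueField": false
}```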
@zzh243402448: @zzh243402448 has joined the channel
@ravi.maddi: @All - how do I write the schema for a *date column*? I have a column with a date: "startDate": "2021-01-04 00:00:00". Need help :slightly_smiling_face:
  @fx19880617: Can you try something like:
  @fx19880617: "dateTimeFieldSpecs": [{ "name": "startDate", "dataType": "STRING", "format" : "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }]
  @fx19880617: FYI :
  @ravi.maddi: Thanks. I have two fields, startDate and endDate, so I have to write this block twice with different names, am I right?
  @ravi.maddi: I have three date columns, so I have written them like this: ```"dateTimeFieldSpecs": [
  { "name": "_source.startDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" },
  { "name": "_source.lastUpdate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "name": "_source.sDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
]``` Can you please correct it? I am getting this error: ```{"code":400,"error":"Cannot find valid fieldSpec for timeColumn: timestamp from the table config: eventflow_REALTIME, in the schema: eventflowstats"}```
  @ravi.maddi: Hi Xiang Fu, can you check once
@ravi.maddi: @All - I added a table using the *addTable* pinot command, but I have since changed the schema. How do I update the existing table that was already added? *How do I update and delete a table here?*
  @fx19880617: You can update the schema using the schema API
  @fx19880617: Try out the controller swagger UI
  @fx19880617: It also generates the corresponding requests
  @ravi.maddi: What happens if I run the same addTable command with the latest schema file? Any idea?
@vibhor.jain: @vibhor.jain has joined the channel
@vibhor.jain: Hi All, what is the generally preferred approach for retrofitting old data in Pinot? I see that MS Teams uses Pinot. If I send a message via Teams and later update it, how can such a use case be handled in Pinot, where there is no update supported? Suggestions welcome.
  @ganesh.github: @vibhor.jain Can you have a look at this?
@ravi.maddi: @All - I am getting an error while starting zookeeper with pinot-admin: ```zookeeper state changed (SyncConnected)
Waiting for keeper state SyncConnected
Terminate ZkClient event thread.
Session: 0x10003506d770000 closed
Start zookeeper at localhost:2181 in thread main
EventThread shut down for session: 0x10003506d770000
Expiring session 0x10002b33f080005, timeout of 30000ms exceeded
Expiring session 0x10002b33f080006, timeout of 30000ms exceeded
Expiring session 0x10002b33f080007, timeout of 30000ms exceeded
Expiring session 0x10002b33f080004, timeout of 30000ms exceeded
Expiring session 0x10002b33f080008, timeout of 30000ms exceeded
Expiring session 0x10002b33f080002, timeout of 30000ms exceeded
Expiring session 0x10002b33f08000b, timeout of 60000ms exceeded``` Any solutions? Need help
@ravi.maddi: One doubt: how can I find out which version of Kafka my local Pinot is using?
@chad.preisler: @chad.preisler has joined the channel
@ravi.maddi: *Hi All,* I have three date columns, so I have written them like this: ```"dateTimeFieldSpecs": [
  { "name": "_source.startDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" },
  { "name": "_source.lastUpdate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "name": "_source.sDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
]``` Can you please correct it? I am getting this error: ```{"code":400,"error":"Cannot find valid fieldSpec for timeColumn: timestamp from the table config: eventflow_REALTIME, in the schema: eventflowstats"}``` Need your help :slightly_smiling_face:
  @g.kishore: Let’s use <#C011C9JHN7R|troubleshooting> for these questions.
  @ravi.maddi: Sure thanks
  @g.kishore: The error message has the info: the time column (timestamp) specified in the table config does not exist in the schema
  @g.kishore: Change the time column in the table config to point to one of these names in the schema
  @ravi.maddi: I don't have any column named 'timestamp'.
@g.kishore: Pinot meetup talk happening now if interested
@karinwolok1: :wave: Welcome all the new Pinot :wine_glass: community members! How did you find out about Pinot? What are you working on? @chad.preisler @vibhor.jain @zzh243402448 @deepakcse2k5 @harshvardhan.surolia @nirav.shah @slatermegank @timebertt @orajason @satish @terodeakshay @abprakash2003 @prshnt.1314 @hussain @shilpa.kumar1222 @thejas.nair @akashkumar @mohamedsultan.ms304 @tamilselvansk23 @nurcahyopujo @prachiprakash80 @matteo.santero @ravi.maddi @morzaria @suresh.k.kode @rrepaka123 @jainendra1607tarun @santosh.rudra @xulinnankai @manish.bhoge @ratchetmdt @james.wes.taylor @contactvivekjain @carlosmanzueta @dileepkumarv.allam
@rkitay: @rkitay has joined the channel
@rkitay: Hi, what data types does `Pinot` support out-of-the-box? I’m guessing `String`, numerics (integers and floating points), `date` and `boolean` - are there any others supported? For example - `ip-address`?
  @g.kishore: no, ip address is not supported.
  @ken:
  @ken: Note there is no specific date type
  @ken: There is a somewhat hidden Boolean type, but it gets mapped to a string internally, I believe.
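  For reference, a sketch of declaring such a column in the schema (the field name is hypothetical); per the note above, it is represented as a string internally: ```{
  "name": "isSecure",
  "dataType": "BOOLEAN"
}```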
  @rkitay: So if I need to keep IP Addresses and support a filter like: ```IP inCidr(212.36.0.0/24)``` What are my options? Do I keep the data in raw `byte[]` format and implement a `UDF` that will perform this filter? Can such a query be sub-second?
  @g.kishore: that's right, but we just use 1 bit to represent it on disk
  @g.kishore: @rkitay we can extend range query to support things like that
  @g.kishore: can you please file an issue for ip indexing? it's a cool use case
  @rkitay: @g.kishore, :slightly_smiling_face: Sure - though I haven’t decided yet if we can invest time in checking `Pinot` at this time.
  @ken: @rkitay I haven’t worked with IP addresses in SQL, but worst case I assume you could store as 32 bit int and do range queries?
  @g.kishore: @rkitay thats totally fine. someone else might ask for it later or a contributor might want to pick it up.
  @rkitay: @ken, for IPv4 - yes. But for IPv6 I need 16 bytes. So either I use a composite field with two `long`s or a `byte []`
@rkitay: Is there any limitation on the size of a single record written into `Pinot`? Our average records are about 6KB when stored in `AVRO`, but can reach up to ~50KB in edge cases
  @mayanks: Pinot is columnar. Is the size due to a wide schema, or to columns that hold large values?
  @mayanks: If the former, no issues. If the latter, what's the data type of those columns?
  @rkitay: A combination - we have about 90 fields, some are numeric, others are short strings - the rest are potentially large strings (e.g. HTTP Request/Response Headers) - that can reach several KB for a single field. Also, we keep nested records within each record - and each outer record can contain several nested records - which also increases the size of a single column
  @rkitay: Also - for some of these fields, we do not need indexing (e.g. Request Headers) - I just need to be able to find them based on other dimensions
  @g.kishore: yes, you can apply snappy compression on such columns
  @g.kishore: indexing a column is optional in Pinot
  @mayanks: For Strings, there is a default max length (iirc 512), but it can be overridden:
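  For reference, two sketches of the knobs mentioned above, using a hypothetical column name. The string length limit is raised per field in the schema: ```{
  "name": "requestHeaders",
  "dataType": "STRING",
  "maxLength": 4096
}``` And storing a column raw (without a dictionary), which is where compression such as snappy applies, can be requested in the table config: ```"tableIndexConfig": {
  "noDictionaryColumns": ["requestHeaders"]
}```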
@xd: @xd has joined the channel
@jamesmills: @jamesmills has joined the channel
@simon.paradis: @simon.paradis has joined the channel

#random


@deepakcse2k5: @deepakcse2k5 has joined the channel
@deepakcse2k5: Is an update query possible using Pinot?
  @fx19880617: you can do pinot upsert with a realtime-only table:
  @deepakcse2k5: is it possible for an offline table?
@zzh243402448: @zzh243402448 has joined the channel
@vibhor.jain: @vibhor.jain has joined the channel
@deepakcse2k5: can we make a ‘date’-related column part of the primary key in pinot?
@chad.preisler: @chad.preisler has joined the channel
@rkitay: @rkitay has joined the channel
@xd: @xd has joined the channel
@jamesmills: @jamesmills has joined the channel
@simon.paradis: @simon.paradis has joined the channel

#feat-text-search


@akashkumar: @akashkumar has joined the channel

#feat-presto-connector


@akashkumar: @akashkumar has joined the channel
@hussain: @hussain has joined the channel

#troubleshooting


@deepakcse2k5: @deepakcse2k5 has joined the channel
@jungmwiner: To run thirdeye locally, refer to the manual below. I used the master branch, and building thirdeye was successful. 1. An error occurs when executing `./run-frontend.sh`: ```Error: Could not find or load main class org.apache.pinot.thirdeye.dashboard.ThirdEyeDashboardApplication``` I tested the same in several environments, and the same problem occurred. 2. The same problem occurs when executing the `./run-backend.sh` script, but the cause seems to be different: there seems to be no `org.apache.pinot.thirdeye.anomaly.ThirdEyeAnomalyApplication` class in the jar file. *Tell me how to fix it, and I'll send you a PR.*
  @fx19880617: can you ask this in the thirdeye slack?
  @jungmwiner: @fx19880617 thank you^^
@deepakcse2k5: Is an update query possible using Pinot?
@zzh243402448: @zzh243402448 has joined the channel
@vibhor.jain: @vibhor.jain has joined the channel
@deepakcse2k5: can we make a ‘date’-related column part of the primary key in pinot?
@chad.preisler: @chad.preisler has joined the channel
@ravi.maddi: *Hi All,* I have three date columns, so I have written them like this: ```"dateTimeFieldSpecs": [
  { "name": "_source.startDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" },
  { "name": "_source.lastUpdate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "name": "_source.sDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
]``` Can you please correct it? I am getting this error: ```{"code":400,"error":"Cannot find valid fieldSpec for timeColumn: timestamp from the table config: eventflow_REALTIME, in the schema: eventflowstats"}``` Need your help :slightly_smiling_face:
  @npawar: can you share the table config?
  @ravi.maddi: Table config: ```{
  "tableName": "eventflow",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "timestamp",
    "timeType": "MILLISECONDS",
    "schemaName": "eventflowstats",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "event_count-topic",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}``` And schema file: ```{
  "schemaName": "eventflowstats",
  "eventflow": [
    { "name": "_index", "dataType": "INT" },
    { "name": "_type", "dataType": "STRING" },
    { "name": "id", "dataType": "INT" }
  ],
  "dateTimeFieldSpecs": [
    { "name": "_source.startDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" },
    { "name": "_source.lastUpdate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
    { "name": "_source.sDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
  ]
}```
  @npawar: In your table config, you've configured "timeColumnName" : "timestamp"
  @npawar: You need to change that to one of the dateTime columns from your schema
  @npawar: Also, in your schema, you have the dimensions under "eventflow" instead of "dimensionFieldSpecs"
  @ravi.maddi: ok, got it, so I have to remove the remaining two and add them as normal fields, am I right?
  @npawar: you can keep all 3 as dateTimeFieldSpecs
  @npawar: but select one of them as the primary time column, and enter that in the tableConfig
  @ravi.maddi: Is this right: ```"timeColumnName": "_source.startDate, _source.lastUpdate, _source.sDate",```
  @ravi.maddi: Is this correct: ```"dateTimeFieldSpecs": [
  { "name": "_source.startDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" },
  { "name": "_source.lastUpdate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" },
  { "name": "_source.sDate", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss", "granularity": "1:DAYS" }
]```
  @npawar: No, you have to put just one column in the tableConfig. ```"timeColumnName": "_source.sDate"```
  @npawar: schema is correct
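  Putting the thread together, a corrected sketch, picking `_source.sDate` as the primary time column (any of the three would do). In the table config: ```"segmentsConfig": {
  "timeColumnName": "_source.sDate",
  "schemaName": "eventflowstats",
  "replicasPerPartition": "1"
}``` And in the schema, the dimensions move from "eventflow" to "dimensionFieldSpecs": ```"dimensionFieldSpecs": [
  { "name": "_index", "dataType": "INT" },
  { "name": "_type", "dataType": "STRING" },
  { "name": "id", "dataType": "INT" }
]```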
  @ravi.maddi: got
  @ravi.maddi: After these changes, I am getting this error: ```Sending request: to controller: localhost, version: Unknown Got Exception to upload Pinot Schema: aschema``` I think the Pinot server went down. Any idea?
  @ravi.maddi: @npawar -- can you check once?
@ravi.maddi: *Hi All* I am getting this error when trying to addTable with pinot-admin: ```Sending request: to controller: localhost, version: Unknown Got Exception to upload Pinot Schema: aschema``` Need help :slightly_smiling_face:
@rkitay: @rkitay has joined the channel
@xd: @xd has joined the channel
@jiatao: Hi, `Pinot Quickstart on JDK 15-ea` is failing for my pr (which only changes log messages). It seems the test is running Java 16 instead of 15: `JAVA_HOME_16.0.0_x64=/opt/hostedtoolcache/jdk/16.0.0/x64`. Any idea how to fix this? The pr test link for reference:
  @fx19880617: Rerunning the test doesn’t help?
  @jiatao: I changed one line and rebased the pr, which triggered the test again, but it's still running jdk 16.
  @fx19880617: I saw it’s failing for other PRs as well
  @fx19880617:
  @fx19880617: guess the issue is on the GitHub Actions side
  @jiatao: I see. Thanks.
  @jiatao: FYI: ^^ @jlli
  @jlli: sorry, I have no context on this issue in the apache pinot repo though..
  @jlli: One thing worth trying is to use the Java 15 GA version instead of EA:
  @jlli: Build passes. @fx19880617 @jiatao we should be good to go with that change now :point_up_2:
  @fx19880617: :thumbsup:
  @jiatao: @jlli Thanks!
@jamesmills: @jamesmills has joined the channel
@simon.paradis: @simon.paradis has joined the channel

#pinot-dev


@akashkumar: @akashkumar has joined the channel
@zzh243402448: @zzh243402448 has joined the channel

#pinot-docs


@zzh243402448: @zzh243402448 has joined the channel

#segment-write-api


@npawar: set up a meeting for 11am. Please move it around if that time doesn't work @yupeng @chinmay.cerebro
@yupeng: sg

#metrics-plugin-impl


@fx19880617: @fx19880617 has joined the channel
@xd: @xd has joined the channel
@jlli: @jlli has joined the channel
@fx19880617:
@fx19880617: we can move the metrics plugin discussion here
@fx19880617: I think Xiaoman has some question about the listener implementation
@xd: I think the major problem here is that old plugins that implemented `MetricsRegistryRegistrationListener` make the server startup process hang, without any clue in our logs to trace it
@xd: I agree that the interface change is a good direction in design, but I am a bit concerned that other Pinot users who did the same thing will have trouble debugging
@xd: It took me quite a few hours of digging until I figured it out
@xd: Even `jstack` does not help
@xd: I don't have a better solution to this though. Maybe proper communication is the only way
@xd: Originally I thought it was another interface change, but here we are dealing with dependencies too, so it is hard to find a good solution
@xd: After reimplementing my plugin, the pinot server now starts properly with my metrics plugin
@jlli: Hey Xiaoman, thanks for reaching out! I understand your concern about removing the methods in `PinotMetricsRegistryListener`. However, `PinotMetricsRegistryListener` is just a wrapper: the wrapper's methods won't be registered with the actual registry; instead, it's the actual yammer listener whose methods get invoked. That's why I think you want to add a method like `void onMetricsRegistryRegistered(MetricsRegistry metricsRegistry);`. But that would make the repo unclean, because we would still have to pull the actual yammer dependencies into pinot's code. One thing I'd suggest is to initialize an actual Yammer listener and pass it as the param to the constructor.
@jlli: this is sample code for how we handle the listener at LinkedIn (not open source); hope it gives you some idea: ```@Override
public void onMetricsRegistryRegistered(final PinotMetricsRegistry metricsRegistry) {
  MetricsRegistryListener metricsRegistryListener = new MetricsRegistryListener() {
    @Override
    public void onMetricAdded(MetricName metricName, Metric metric) {
      // do sth
    }

    @Override
    public void onMetricRemoved(MetricName metricName) {
      // do sth
    }
  };
  metricsRegistry.addListener(new YammerMetricsRegistryListener(metricsRegistryListener));
}```
@xd: Thanks. I have got mine working after recompiling. Mostly it is because plugins are loaded at runtime by reflection, so plugins built against the old Pinot got loaded without any check.
@jlli: I see. Glad that was resolved. Usually the build will fail at compile time, and then we know to make changes to address the new code

#flink-pinot-connector


@fx19880617: @fx19880617 has joined the channel
@npawar: @npawar has joined the channel
@yupeng: @yupeng has joined the channel
@chinmay.cerebro: @chinmay.cerebro has joined the channel
@fx19880617: this is the design doc:
@chinmay.cerebro: :thumbsup:
@chinmay.cerebro: Looks like this doc needs a lot of changes
@yupeng: sure. i can update the doc to reflect the latest discussions