#general


@aalekhsonebhadra: @aalekhsonebhadra has joined the channel
@dilansri: @dilansri has joined the channel
@grace.walkuski: @grace.walkuski has joined the channel
@jasoncrisch: @jasoncrisch has joined the channel
@karinwolok1: Ayyo! :wave: Welcome new :wine_glass: Pinot members! How did you find Pinot? What kind of projects are you working on? @minolino71 @robert.bastian @shakeeburrahman1990 @tamas.nadudvari @aalekhsonebhadra @dilansri @grace.walkuski @jasoncrisch @tariqahmed.farhan @yuzhug @ratna @rabeeb.rahman.225 @sashastic @terrysv @huangzhenqiu0825 @murat.ozcan @ranabanerji @kha.nguyen @thomas.may @edan @gokulrk2696 @aviv4339 @pankaj @bowenli86 @gulshan.yadav @justin.smalkowski @wooodini
@beth: @beth has joined the channel

#random


@aalekhsonebhadra: @aalekhsonebhadra has joined the channel
@dilansri: @dilansri has joined the channel
@grace.walkuski: @grace.walkuski has joined the channel
@jasoncrisch: @jasoncrisch has joined the channel
@beth: @beth has joined the channel

#troubleshooting


@neer.shay: Hi, it seems there are some compatibility issues between Pinot and Superset with regard to the time column. In Pinot, I have it defined like this:
```
"dateTimeFieldSpecs": [ { "name": "ts", "dataType": "STRING", "format": "1:SECONDS:SIMPLE_DATE_FORMAT:\"yyyy-MM-dd HH:mm:ss\"", "granularity": "1:MINUTES" } ]
```
In Superset, I must define the string format in the Python way for it to parse correctly: ```%Y-%m-%d %H:%M:%S``` When I try creating a chart, I get this error:
```
Apache Pinot Error unsupported format character 'Y' (0x59) at index 58 This may be triggered by: Issue 1002 - The database returned an unexpected error.
```
This is because the query gets translated to the following (note that if I remove the `DATETIMECONVERT` and simply use the `ts` column it works fine):
```
SELECT DATETIMECONVERT(ts, '1:SECONDS:SIMPLE_DATE_FORMAT:%Y-%m-%d %H:%M:%S', '1:SECONDS:SIMPLE_DATE_FORMAT:%Y-%m-%d %H:%M:%S', '1:DAYS'), AVG(metric) AS "AVG_1" FROM schema.table WHERE ts >= '2021-02-01 00:00:00' AND ts < '2021-02-08 00:00:00' GROUP BY DATETIMECONVERT(ts, '1:SECONDS:SIMPLE_DATE_FORMAT:%Y-%m-%d %H:%M:%S', '1:SECONDS:SIMPLE_DATE_FORMAT:%Y-%m-%d %H:%M:%S', '1:DAYS') LIMIT 50000;
```
*Has anyone encountered something similar? What is the solution?*
  @fx19880617: hmm, I think the issue is that the date format in the Pinot query should be `yyyy-MM-dd HH:mm:ss`, not `%Y-%m-%d %H:%M:%S`
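  A minimal sketch (not from the thread) making the point above concrete: Pinot's SIMPLE_DATE_FORMAT takes Java date patterns, so the strftime-style codes coming from Superset would have to be translated before they reach Pinot. The table and column names below are copied from the question; the rewritten query itself is untested.
```java
import java.text.SimpleDateFormat;

public class PinotDateFormatSketch {
    // Java date pattern that Pinot's SIMPLE_DATE_FORMAT expects;
    // the Python/strftime equivalent is %Y-%m-%d %H:%M:%S.
    static final String JAVA_PATTERN = "yyyy-MM-dd HH:mm:ss";

    // The query from the question with the Java pattern substituted into DATETIMECONVERT.
    static final String QUERY =
        "SELECT DATETIMECONVERT(ts, "
            + "'1:SECONDS:SIMPLE_DATE_FORMAT:" + JAVA_PATTERN + "', "
            + "'1:SECONDS:SIMPLE_DATE_FORMAT:" + JAVA_PATTERN + "', '1:DAYS'), "
            + "AVG(metric) AS \"AVG_1\" FROM schema.table "
            + "WHERE ts >= '2021-02-01 00:00:00' AND ts < '2021-02-08 00:00:00' "
            + "GROUP BY DATETIMECONVERT(ts, "
            + "'1:SECONDS:SIMPLE_DATE_FORMAT:" + JAVA_PATTERN + "', "
            + "'1:SECONDS:SIMPLE_DATE_FORMAT:" + JAVA_PATTERN + "', '1:DAYS') LIMIT 50000";

    public static void main(String[] args) throws Exception {
        // Sanity check: the Java pattern parses the literal timestamps used in the WHERE clause.
        System.out.println(new SimpleDateFormat(JAVA_PATTERN).parse("2021-02-01 00:00:00"));
        System.out.println(QUERY);
    }
}
```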
  @fx19880617: Also can you provide: 1. what’s the superset version? 2. what’s the working query on your side? 3. can you give an example value of your column ts?
  @neer.shay: 1. 0.999.0dev 2. This query works ```SELECT ts, AVG(metric) AS "AVG_1" FROM schema.table WHERE ts >= '2021-02-01 00:00:00' AND ts < '2021-02-08 00:00:00' GROUP BY ts LIMIT 50000;``` 3. See filter in the query
  @fx19880617: thanks, also how do you define this ts column in superset? can you put a screen shot here
  @fx19880617: for Superset, are you using a Docker image or did you build your own? If Docker, what's the image tag?
  @neer.shay: Here's the definition of the column in superset
  @neer.shay: the docker image tag is from
  @fx19880617: ic, so you are building it from your own superset image
  @fx19880617: i will check on that
@aalekhsonebhadra: @aalekhsonebhadra has joined the channel
@tanmay.movva: Hello, is there an API to reset a table when its offset is out of range for consumption? If not, does disabling and enabling the table trigger an offset reset?
  @tanmay.movva: Does `/segments/{tableNameWithType}/reset` reset the consumption offset for the table?
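  For reference, a minimal sketch of invoking that endpoint against a controller; the controller address and table name are placeholders, and the thread does not confirm whether this call rewinds the consuming offset (or that it is exposed as a POST).
```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResetSegmentsSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder controller address and table name; adjust for your cluster.
        String controller = "http://localhost:9000";
        String tableNameWithType = "myTable_REALTIME";

        // Assumes the reset endpoint is exposed as a POST on the controller.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(controller + "/segments/" + tableNameWithType + "/reset"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```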
  @g.kishore: What do you mean out of range?
  @tanmay.movva: We have migrated to a new Kafka cluster, but Pinot is trying to consume from the offsets of the previously committed segments, which is expected. In the new cluster the offsets start from 0 again, so we want Pinot to reset and start consuming from the earliest (smallest) offset.
  @g.kishore: Ah.. what about the previous data?
  @tanmay.movva: We drained out the data from the previous cluster. If we start reading data from the starting offset of the new cluster, then there shouldn't be any data loss.
  @tanmay.movva: The only change I made to table config was for kafka bootstrap servers. It was an endpoint change.
  @tanmay.movva: This is what I see in the logs.
  @g.kishore: Got it
  @g.kishore: Yeah, it’s behaving according to the design and as you mentioned it’s using the offset from the previous segment
  @g.kishore: Do you have hybrid table or real-time only?
  @tanmay.movva: real time only
  @g.kishore: The number of partitions remains the same?
  @tanmay.movva: Yes. Topic names and partitions are the same as earlier.
  @g.kishore: Ok, there is a way to achieve this but it might require a complex sequence of steps
  @g.kishore: Can you file an issue.. I will add the steps there
  @tanmay.movva: Sure. Doing it.
  @g.kishore: Thanks
  @g.kishore: Why did you change the Kafka cluster?
  @tanmay.movva: We have upgraded the kafka version.
  @g.kishore: Okay.. upgrading resets the offsets?
  @tanmay.movva: Umm.. not sure. But we created a new cluster with the newer version and are migrating applications to it. There are many applications still using the cluster with the older version, so we had to change the endpoint and recreate the topics in the new cluster.
  @tanmay.movva: Filed an issue related to this -
  @tanmay.movva: This is exactly what I was looking for - . I was assuming the periodic realtime segment validation task would take care of these situations too.
  @g.kishore: makes sense
  @g.kishore: @ssubrama @npawar ^^
  @ssubrama: Disclaimer: This has never been tried before. @tanmay.movva you probably have a bunch of segments in CONSUMING state right now. If you reset the start offset of these segments to 0 in the metadata and change the state from CONSUMING to OFFLINE, then the next time realtime segment data manager runs, it should create new segments at earliest offset available in each partition. Again, never tried before.
  @g.kishore: I suggested that and it worked
  @ssubrama: ah ok nice
  @g.kishore: what would have been better is to have an API around that: commit the existing segment and have the new segment start from the earliest or the latest offset
  @g.kishore: basically a way to override the offsets
  @g.kishore: it helps in cases where there was bad data
  @ssubrama: API is good, but we need to think through the case when there are multiple controllers. In a distributed controller system, the realtime segment repair job can run in one controller, and the API can be fielded by another controller, causing potential race conditions. I think we need to think this through well before starting to build APIs. One way is to provide a primitive to disable all operations for a table -- automatic repair, segment completion, etc. -- do whatever we want to do, and then re-enable the operations. Again, this is just a top-level idea. Devil is in the details.
  @g.kishore: yes, a global way to pause operations for a table is a good thing to have
@dilansri: @dilansri has joined the channel
@grace.walkuski: @grace.walkuski has joined the channel
@contact: Hey, quick question that I couldn't find anything about in the docs: I have a realtime table with a consuming segment and I would like to stop the consumption and save the segment to deep storage without creating a new consuming segment. My use case is simply to be able to stop ingesting new events in order to do some tasks like updating the server or other maintenance. Thanks!
  @tanmay.movva: If you are not playing around with topic offsets, then Disable Table -> maintenance -> Enable Table should help you. Be aware that disabling a table makes it unavailable for querying as well.
  @g.kishore: we have talked about adding pause/unpause operations multiple times... we are yet to agree on a safe way to achieve this @ssubrama ^^
  @contact: Sadly I'm playing around with topic offsets because I use the GCP Pub/Sub system (which doesn't have any partition/offset). I wrote a plugin that essentially fakes offsets to be able to use LLC segments
  @ssubrama: @contact you should be able to update the server even as rows are getting consumed, since consumption will resume from where it left off before maintenance.
  @contact: @ssubrama That's not what I observe; the consuming segment isn't fsync'ed to disk, so when the server restarts the segment is empty
  @contact: And I think that's what is described there: (`In either mode, Pinot servers store the ingested rows in volatile memory until either one of the following conditions are met:`)
  @ssubrama: On a restart it will start consuming from where the previous segment completed. There is no fsync going on anytime.
  @contact: So if I understand correctly, you are saying that when shutting down the server, it should complete the segment?
  @ssubrama: No. Let me try with an example. Assume we have messages in offsets 4, 8, 12, ... in partition 13 of a stream. The server consumed until offset 48 and completed segment 33, and saved it in segment store. It starts to consume from offset 52 for segment 34. Let us say it consumes offsets 52 and 56, and the server is restarted. The messages consumed for 52 and 56 (from volatile memory) will be discarded. When the server comes back, it will start consuming from 52 again. Hope this helps.
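  A toy illustration (plain Java, not Pinot code) of the behavior described above, using the same offsets: rows held only in volatile memory are discarded on restart, and consumption resumes from the consuming segment's start offset.
```java
public class RestartOffsetSketch {
    public static void main(String[] args) {
        long committedEndOffset = 48;           // segment 33 was committed up to offset 48
        long consumingSegmentStart = 52;        // segment 34 starts consuming at offset 52
        long[] heldInVolatileMemory = {52, 56}; // rows consumed for segment 34, never persisted

        // On restart the rows held only in volatile memory are discarded ...
        heldInVolatileMemory = new long[0];

        // ... and consumption resumes from the consuming segment's start offset (52), not 60.
        System.out.println("Committed up to offset " + committedEndOffset
                + "; resume consuming from offset " + consumingSegmentStart
                + "; rows still buffered: " + heldInVolatileMemory.length);
    }
}
```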
  @contact: Okay, got it, that was my understanding for a system like Kafka. However, as I said above, we "fake" offsets because GCP Pub/Sub doesn't implement them, so it doesn't work for us. But thanks for the answers.
  @ssubrama: While a primitive like "commit segment now, and then stop further consumption" may help some, a few things are still unclear to me. (1) How is this sustainable in a production environment when servers may be restarted anytime? (2) How do you "fake" offsets? Continuing from my previous example, if after consuming offset 52 Pinot gets a command to commit now and hold peace, Pinot creates a segment with 52, and sets the next offset to consume from as 56 but disables consumption. If, after maintenance, your offsets are not valid anymore, will it still not be a problem?
  @contact: 1. I would guess that this would only be useful for maintenance, but you are right that offsets fix this problem (except maybe pausing while performing multiple restarts) 2. Clearly that's a hack (if you are interested, the code is there: ). In my case that would not matter because the plugin reads the latest messages (since there is no concept of an offset to read from)
  @ssubrama: So, if I understand the implementation right, you disregard the offset passed in, and just get the next set of messages. In other words, the "offset" is maintained by the stream partition, and it does not provide for a way to consume a message multiple times. Am I right? In that case, the way to phrase the problem is to support a stream like this. While the solution (I agree, a hack) you have is a good demo, any reasonable installation will require clear handling of the case when a server's power is pulled. Can you file an issue to support streams like this? (It would help if you point to this aspect and any other differences that you know off-hand.) Clearly, having pause/restart will not help in the larger case. Do you agree?
  @contact: Totally agree with you. I think we'll need to transition to a different pub/sub system to handle this correctly (fortunately we were expecting to do this anyway).
  @contact: I guess I could open a ticket so people looking to integrate with the GCP Pub/Sub system will have all the information needed (and maybe someday Pinot will handle it somewhat)
@grace.walkuski: Hi! I’m following the instructions to set up my environment and when I run , I get the following error. ```[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.1:npm (npm install) on project pinot-controller: Failed to run task: 'npm install' failed. (error code 1) -> [Help 1]``` The pinot-controller package seems to be a Java project, so why is it trying to run `npm install`? How do I get around this? Thanks!
  @ken: FWIW I’ve also seen this error when running the recommended `mvn clean install -DskipTests -Pbin-dist` (from the top), and all required Java artifacts do get created, so I’ve ignored the issue. As to why it’s running `npm install`, I think that’s for the controller UI.
  @grace.walkuski: Yep, I get the same thing when I run that command too
  @g.kishore: yes, it's for the controller UI
  @g.kishore: can you try the command suggested by @ken
  @grace.walkuski: Yep, I get the same error
  @g.kishore: which os?
  @grace.walkuski: macOS Catalina 10.15.7
@jasoncrisch: @jasoncrisch has joined the channel
@beth: @beth has joined the channel

#pinot-dev


@grace.walkuski: @grace.walkuski has joined the channel
@grace.walkuski: Hi! :wave: I’m attempting to integrate the Pinot JDBC with so it can build SQL and manage connections for us. This is not officially supported at this point, but it looks like the only change that would be needed to support it is to add the `execute()` function (no params) to the . There are already several `execute(…)` functions with params, and an `executeQuery()` function with no params. Is there a reason this hasn’t been done already? Or why was the `executeQuery()` function named differently from the rest? Currently Jooq calls `execute()` and it hits since it hasn’t been overridden. I’m happy to make a PR to add this functionality but wanted to check first. Thanks!
  @grace.walkuski: Could I just add this? ```@Override public ResultSet execute() throws SQLException { executeQuery(); }```
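  A compiling version of that override (a sketch only, intended to live inside PinotPreparedStatement) would need to follow the JDBC signature, where `PreparedStatement#execute()` returns a boolean rather than a `ResultSet`; whether the driver then exposes the results via `getResultSet()` would still need checking.
```java
@Override
public boolean execute() throws SQLException {
    // java.sql.PreparedStatement#execute() returns true when the result is a ResultSet.
    // Delegate to the existing no-arg executeQuery() and report that a result set exists.
    executeQuery();
    return true;
}
```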
  @grace.walkuski: I have created an issue:
@g.kishore: @kharekartik ^^
@kharekartik: @kharekartik has joined the channel

#discuss-validation


@chinmay.cerebro: ok - ready for review now:

#getting-started


@grace.walkuski: @grace.walkuski has joined the channel