#general


@karinwolok1: :musical_note: _If you like Pinot and you know it, tell your friends_ :notes: :rolling_on_the_floor_laughing: In all seriousness, check out this great set of blog posts by @npawar and @chinmay.cerebro on "_What Makes Apache Pinot so Fast._" They worked pretty hard on this, so if you love Pinot (and love our PMCs and committers) feel free to share with your network! :two_hearts:
@bcwong: How do I do “array_agg” in Pinot? (The agg function should just return an array of all the elements.) Should I write my own custom agg function, or run it in Presto? Thanks!
  @g.kishore: sumMV?
  @bcwong: I’m aggregating on a non-numeric column. For example: ```select student, array_agg(course) from transcript group by 1``` would produce something like: ```John, [English, Math]
Mary, [Math]```
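For reference: Pinot's built-in multi-value aggregations (e.g. `sumMV`) are numeric only, and there is no built-in `array_agg` for string columns as of this discussion, so running the aggregation in Presto/Trino (both support `array_agg` natively) is the simpler option. A sketch, assuming the Pinot connector is mounted as catalog `pinot` with schema `default` (names depend on your deployment):
```
-- Runs in Presto/Trino against the Pinot connector; the catalog/schema
-- names below are assumptions.
SELECT student, array_agg(course) AS courses
FROM pinot."default".transcript
GROUP BY student;
```
The alternative is writing a custom aggregation function in Pinot's Java codebase, which is more work but keeps the query inside Pinot.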
@ayush.network: @ayush.network has joined the channel
@xinxinzhenbang: @xinxinzhenbang has joined the channel
@diogo.baeder: Hi folks, I have some questions about migrating data from a previous database into Pinot: In my project, we'll start publishing data to a Pinot realtime table, but we also need to port historical data. For historical data, do you recommend using an OFFLINE table to be used in conjunction with the REALTIME table, or is it fine to port the historical data to the REALTIME table directly? What are the pros and cons for each approach? Thanks!
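For context on the first option: the common pattern for backfill is a hybrid table, i.e. an OFFLINE table and a REALTIME table sharing the same name; the broker queries both and uses a time boundary so rows are not double-counted. A minimal sketch of the OFFLINE side (the table name, time column, and replication below are placeholders):
```
{
  "tableName": "events",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "replication": "1"
  },
  "tenants": {},
  "tableIndexConfig": { "loadMode": "MMAP" },
  "metadata": {}
}
```
Roughly: the hybrid approach keeps bulk backfill out of the Kafka stream and lets offline segments be built, sized, and replaced independently, at the cost of a second table config to manage; replaying history into the REALTIME table avoids that, but pushes all historical data through the stream and gives less control over segment layout.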

#random


@ayush.network: @ayush.network has joined the channel
@xinxinzhenbang: @xinxinzhenbang has joined the channel

#troubleshooting


@alihaydar.atil: Hey everyone, I have a few questions regarding deep storage:
• Is it possible to use the Linux filesystem as deep storage? If so, how can I configure it?
• What is actually stored in the folder that the `controller.data.dir` property points to?
• Is the peer-download functionality still supported in version 0.7.1?
I would appreciate it if you could share your knowledge with me
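On the first two points: any filesystem path mounted on the controller (local disk or NFS) can act as deep storage, and `controller.data.dir` is where the controller keeps the permanent copies of completed segments, organized per table, for servers to download when they need to rebuild a segment. A sketch of the relevant controller properties (the mount path is an assumption; `LocalPinotFS` is the default handler for `file:` URIs, so the second line is usually optional):
```
# Deep-store root: permanent copies of completed segments live here.
controller.data.dir=/mnt/pinot-deepstore
# Default filesystem plugin for file: URIs (class name per the 0.7.x layout).
pinot.controller.storage.factory.class.file=org.apache.pinot.spi.filesystem.LocalPinotFS
```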
@mapshen: We run a realtime table `table1` with fields `X` in `upsert` mode. When a new field `Y` is added to the schema, a simple query ```select * from table1 limit 10```   in the Pinot explorer will return the following error: ```[ { "message": "MergeResponseError:\nData schema mismatch between merged block: [X(DOUBLE)] and block to merge: [X(DOUBLE),Y(DOUBLE)], drop block to merge", "errorCode": 500 } ]``` However, the following query would work as expected ```select * from table1 limit 10 option (skipUpsert=True)``` Has anyone seen this before?
  @mayanks: @jackie.jxt @yupeng ^^
  @jackie.jxt: Have you reloaded the table after adding the column?
  @mapshen: @jackie.jxt Hi again! Here is my response to you on GitHub: > Yes we did. We don’t see this error with a regular table. This only manifests when `upsert` is on.
  @mapshen: > That said, maybe we are doing the reload incorrectly? Can you let us know the right way to do it?
  @jackie.jxt: The current reload has limited support for consuming segments, which might not work properly with upsert enabled. Restarting the servers will recreate the consuming segment and apply the new schema.
  @jackie.jxt: @yupeng Have you run into this issue before?
  @mapshen: we are way past creating a consuming segment already
  @mapshen: could you also advise on the right way to restart the server so that a consuming segment can be closed and persisted to disk?
  @jackie.jxt: It does not need to be persisted to disk. We want to destroy the in-memory one and re-create one when the server starts
  @jackie.jxt: Sending a signal to kill the server should be okay
  @mapshen: Okay, so an offset is not committed to Kafka until a segment is built? Good to know
  @mapshen: Back to the main question: numerous new segments have already been created since the new field was added, so the consuming segment should not be the issue
  @jackie.jxt: During consumption, there is nothing persisted. Pinot uses low-level Kafka consumer, and maintains the committed segment offset in ZK. It does not commit the offset to Kafka
  @jackie.jxt: Since there are already new consuming segments created, can you try reloading the segments again and see if the problem is resolved?
  @mapshen: Already did last night. It didn’t work. Unless we did it wrong using the reload API
  @jackie.jxt: Do you have a lot of segments? If not, you may try this query to figure out which segments do not have the newly added column: `SELECT MAX(Y) FROM table GROUP BY $segmentName LIMIT 10000`
  @mapshen: We have thousands of segments, but only 13 were returned by the above query in the upsert-enabled table, and all 13 had this field
  @jackie.jxt: The segments without this column won't be returned. You can also try `SELECT DISTINCT $segmentName FROM table LIMIT 10000` to get all the segments
  @mapshen: I guess upsert-enabled tables are unique. You need to run `SELECT MAX(Y) FROM table GROUP BY $segmentName LIMIT 100000 option (skipUpsert=true)` to return all the segments
  @mapshen: alright, so there are 15k segments in total and about 3k don’t have this field. Guess we were doing reload incorrectly? Could you help advise on the right way to do it? Further, would you be able to explain why we don’t encounter this error if we do `select * from table1 limit 10 option (skipUpsert=True)`?
  @yupeng: That’s surprising because the upsert impl is decoupled from the schema evolution
  @yupeng: It would be helpful to see if there are any errors in the server log
  @jackie.jxt: We actually want to find all segments with valid docs after upsert, and `SELECT DISTINCT $segmentName FROM table LIMIT 10000` should give you that
  @jackie.jxt: `select * from table1 limit 10` will read the first 10 valid docs, and with upsert enabled it might need to read more than one segment, thus causing the merge conflict
  @mapshen: @yupeng there is actually no error in the logs. Also FYI, if we run the query via Trino, no such error is returned.
  @mapshen: @jackie.jxt i answered that. 15k in total
  @jackie.jxt: I mean with upsert on
  @mapshen: yes
  @jackie.jxt: What? With upsert off you got 15k right?
  @jackie.jxt: With upsert on you should get less
  @mapshen: I got the same count
  @jackie.jxt: How about `SELECT COUNT(*) FROM table GROUP BY $segmentName LIMIT 10000`?
  @jackie.jxt: Anyway, I think reload is somehow not done correctly. Let's reload again and see if the issue is fixed
  @mapshen: `SELECT COUNT(*) FROM table GROUP BY $segmentName LIMIT 10000` returns 15k rows
  @mapshen: @jackie.jxt again, mind telling us the correct way to reload?
  @jackie.jxt: You can use the cluster manager UI to do the reload
  @jackie.jxt: Or use the REST API: `POST /tables/{tableName}/segments/reload`
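As a concrete example, the reload call above can be issued with curl (host, port, and table name are placeholders; 9000 is the default controller port):
```
curl -X POST "http://localhost:9000/tables/table1/segments/reload"
```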
@ayush.network: @ayush.network has joined the channel
@xinxinzhenbang: @xinxinzhenbang has joined the channel

#pinot-dev


@joseph.roldan: @joseph.roldan has joined the channel

#getting-started


@joseph.roldan: @joseph.roldan has joined the channel