#general


@alicelyu: @alicelyu has joined the channel
@ilchernenko: @ilchernenko has joined the channel
@prshnt.1314: Hi all, is there a way in Pinot to create a table from another table with some schema changes and import partial data into the new one?
  @mayanks: Not at the moment. What's your use case?
  @prshnt.1314: We want to tie our client-end queries to Pinot based on user-inputted fields during the data prep stage.
  @prshnt.1314: If there are any changes to the Presto query, then modify the Pinot table accordingly
  @mayanks: Since Pinot is columnar, what's the issue with having a single table?
  @mayanks: Or do you need the new table to have modified columns (as in transformations applied)?
  @prshnt.1314: Yes, the new table could be anything; the schema would be derived and transformations applied in a previous layer based on user input
  @mayanks: I see. At this point there isn't a way to do so. But you could file an issue so we can track it.
  @prshnt.1314: Sure thanks will file tomorrow
  @mayanks: If you want to do it yourself, we can discuss if a minion task can be used.
  @prshnt.1314: Oh yes, I'm pretty much still thinking through the use case. Once I have the definition clear I'll come back to it
  @mayanks: Great, let us know when you are ready
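In standard-SQL terms, the ask in this thread is roughly the following sketch; Pinot did not support this at the time, and the table and column names here are hypothetical:

```sql
-- Hypothetical illustration only (not supported in Pinot at the time):
-- derive a new table from an existing one, with a schema change
-- (a transformed column) and only a slice of the data.
CREATE TABLE orders_v2 AS
SELECT orderId,
       LOWER(customerEmail) AS customerEmailNormalized  -- schema change via transformation
FROM orders
WHERE orderTime >= 1609459200000;                       -- partial data import
```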
@sg: @sg has joined the channel
@hochuen.wong: @hochuen.wong has joined the channel

#random


@alicelyu: @alicelyu has joined the channel
@ilchernenko: @ilchernenko has joined the channel
@sg: @sg has joined the channel
@hochuen.wong: @hochuen.wong has joined the channel

#troubleshooting


@alicelyu: @alicelyu has joined the channel
@ilchernenko: @ilchernenko has joined the channel
@sg: @sg has joined the channel
@hochuen.wong: @hochuen.wong has joined the channel

#feat-partial-upsert


@yupeng: @tingchen @qiaochu @jackie.jxt there is a hole in the case of partial upsert + real-time segment replacement. i put down the problematic example in this doc
@yupeng: please take a look, and see if we can find a solution
@jackie.jxt: The problem arises because we allow changing history by replacing the segments
@jackie.jxt: In order to do such changes, I feel we have to override the values with the latest timestamp
@jackie.jxt: There is actually another hole from real-time segment replacement even without partial upsert. I'll put that into the doc
@yupeng: i think for full upsert, it's possible to reload all segments to derive the latest state
@yupeng: but for partial upsert, we lose some info in the current proposal/implementation
@yupeng: namely, we do not know which columns should be re-derived
@yupeng: because we store only the merged end output
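The problematic example lives in the linked doc, but here is a minimal sketch of the kind of hole being described, with a hypothetical primary key and columns:

```sql
-- Partial upsert merges each incoming event into the previous state and
-- stores the merged result:
--   event 1 (segment A): {pk: 1, city: 'SF', score: null}
--   event 2 (segment B): {pk: 1, city: null, score: 42}
--                        -> stored in segment B as {pk: 1, city: 'SF', score: 42}
--
-- Backfill then replaces segment A with a corrected record:
--   {pk: 1, city: 'LA', score: null}
-- Replaying all segments still yields city = 'SF', because segment B holds
-- only the merged end output; the fact that event 2 never set `city`
-- (i.e. which columns would need to be re-derived) was lost at ingestion time.
```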
@jackie.jxt: From the data-correction angle, what if we always correct data by sending another kafka event?
@jackie.jxt: Because of this hole, I'm starting to wonder whether replacing segments is the correct way to correct upsert records
@yupeng: segment replacement is more performant
@yupeng: the nice thing about the current full-upsert support is that we can always derive the current state by replaying all the segments (i.e. on load)
@yupeng: however, partial upsert cannot, due to the missing info
@jackie.jxt: Replaying all the segments when replacing segment could also be very costly
@jackie.jxt: We do segment replacement because the records are immutable, which is not the case for upsert
@jackie.jxt: We might want to revisit our solution... we didn't consider these holes in the first place
@yupeng: how costly? is it a reload of all segments?
@jackie.jxt: Yes, reloading all segments. Also, during the reload, there will be inconsistency
@jackie.jxt: For a small table that won't be an issue; for a large table it might take a while
@jackie.jxt: If it is point correction (single record), doing it via kafka should be better
@jackie.jxt: We need to introduce a way to invalidate a record (this should have already been included in the partial-upsert design)
@yupeng: for point correction, i agree, doing it via kafka is the way
@yupeng: however, there is still need for backfill
@yupeng: i think we still need to fill this hole
@yupeng: we discussed this reloading before, as well as the inconsistency
@yupeng: i think it'll take seconds to scan them all
@yupeng: even for full upsert, we still need to do the reload
@g.kishore: can we meet?
@yupeng: yeah, what time works for you folks?
@jackie.jxt: I'm available most afternoons after 3
@yupeng: how about Wed 3-4?
@jackie.jxt: Works for me

#fix-numerical-predicate


@amrish.k.lal: On the broker side, the optimizer is being called twice (once for BrokerRequest and once for PinotQuery) in the `BaseBrokerRequestHandler.optimize` function. I am wondering if I am missing something here (or if this is a bug), because this appears to just repeat the same work?
@jackie.jxt: @amrish.k.lal Good question. That is because we need to keep backward compatibility while migrating from PQL (BrokerRequest) to SQL (PinotQuery). During the migration, the server side used to mix query execution across both request formats, so we had to optimize both sides. In the last release we completely decoupled the query execution for PQL and SQL, and I'm planning to decouple the broker-side code as well in the weeks following the `0.7.1` release.
@amrish.k.lal: Ok, so I will mainly look at the SQL (PinotQuery) rewrite and ignore the PQL (BrokerRequest) optimize method.
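For context on the two request formats, a rough illustration of the same group-by in each; the table name is hypothetical:

```sql
-- PQL (BrokerRequest): group-by results are capped with TOP
SELECT COUNT(*) FROM myTable GROUP BY country TOP 10

-- SQL (PinotQuery): standard projection, ordering, and LIMIT
SELECT country, COUNT(*)
FROM myTable
GROUP BY country
ORDER BY COUNT(*) DESC
LIMIT 10
```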

#complex-type-support


@yupeng: @npawar @g.kishore to confirm, there is no such flattening logic today in the ingestion flow, right?
@amrish.k.lal: @amrish.k.lal has joined the channel
@npawar: you can use jsonFormat to store the entire complex object as a string, and then use jsonExtractScalar/json index. Or, you can use jsonPath to extract the fields you want.
@npawar: so an additional need for flattening shouldn't exist, as we have many options to go about it
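A quick sketch of the first option, assuming a hypothetical table `people` whose `person` column holds the complex object as a JSON string:

```sql
-- Extract a nested field at query time with the built-in
-- jsonExtractScalar(column, jsonPath, resultType) function.
SELECT jsonExtractScalar(person, '$.address.city', 'STRING') AS city,
       COUNT(*)
FROM people
GROUP BY jsonExtractScalar(person, '$.address.city', 'STRING')
```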
@npawar: i haven't read the proposal yet, dunno if what is being proposed is already handled. I will look today
@g.kishore: I think Yupeng is talking about generic way to flatten everything and turn them into columns
@g.kishore: or at least flatten a specific field
@npawar: use jsonPath?
@g.kishore: no
@g.kishore: in the decoder
@npawar: why does it need to be in the decoder?
@g.kishore: this is similar to the walmart use case, where an order record needs to be transformed into multiple rows, one for each line item
@npawar: oh multiple rows
@npawar: i didn't catch that
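A hedged sketch of that use case, with hypothetical record and column names: the decoder would explode each order event into one row per line item, after which per-item analytics are plain column queries:

```sql
-- Ingested order event (one Kafka message):
--   {"orderId": 7, "lineItems": [{"sku": "a", "qty": 2},
--                                {"sku": "b", "qty": 1}]}
-- After flattening, the segment holds two rows:
--   orderId | lineItem_sku | lineItem_qty
--         7 |          "a" |            2
--         7 |          "b" |            1

SELECT lineItem_sku, SUM(lineItem_qty) AS totalQty
FROM orders
GROUP BY lineItem_sku
```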
@steotia: I think the proposal is to flatten it during ingestion and store all the leaves as primitive columns, so that later on all analytics (group by etc.) can be performed on inner/nested data
@npawar: this ^^ is not multiple rows
@g.kishore: quick zoom call?
@steotia: does 10 work? I am about to go into a meeting at 9:30 in 2 mins
@yupeng: i can do 10 too. but we need to give neha some time to read the proposal
@steotia: yes, I just skimmed through it... need to read it completely. will give some comments and maybe then we can discuss sometime today
@npawar: i can do 10
@g.kishore: ok
@g.kishore: i have a conflict at 10
@g.kishore: but you guys go ahead and will sync up later
@npawar: how about later in the afternoon then? so i’ll also read the proposal properly
@yupeng: i'm avail 1-2pm, 2:30-4pm
@amrish.k.lal: We recently completed an evaluation of JSON query and type support in Pinot; let me see how I can share the findings here. Just want to make sure that we are all in sync on the future direction.
@amrish.k.lal: @ssubrama Adding Subbu
@ssubrama: @ssubrama has joined the channel
@g.kishore: thanks for sharing json doc
@yupeng: @amrish.k.lal that's a nice doc. thanks. btw, @g.kishore @jackie.jxt, is there a plan to backfill the json indexing design doc?
@g.kishore: will be great if someone can pick it up
@g.kishore: we have 3 docs now, we should try to consolidate
@amrish.k.lal: @g.kishore Yeah, I think that is a good idea. @yupeng are we meeting this afternoon to discuss your design?
@steotia: Somewhere we should capture the desired end state. I think the current state is captured by both Yupeng and Amrish. It would be great to pen down and discuss how we want this feature to evolve and what it will look like in the ideal state
@amrish.k.lal: Yeah, I think it will be important to know where we are heading :slightly_smiling_face:
@g.kishore: there is indexing and storage
@yupeng: would 3pm work for everyone?
  @amrish.k.lal: ok from my and @steotia's side.
  @yupeng: @g.kishore @npawar would this work for you?
  @yupeng: i can set up a zoom invite
  @npawar: include jackie since there’s discussion about json index
  @npawar: @jackie.jxt
  @g.kishore: 3:30 for me
  @yupeng: this is the link:
  @yupeng: i need to drop at 4pm, so shall we start at 3, and kishore you can join at 3:30?
  @jackie.jxt: 3:00 works for me
@g.kishore: indexing is mostly there, with a few tactical things left that Amrish has articulated very well
@g.kishore: we need a new doc for storage
@tingchen: @tingchen has joined the channel
@changliu: @changliu has joined the channel
@yupeng: plz join