Hi David,

Thanks for sharing your thoughts!

> It sounds like you might already have an end-to-end solution in mind. It
> would be really helpful if you could put that into writing so we can all
> align our thinking.

It makes sense to create a high level vision.

> I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll
> just replicate it” without proper discussion. We’ve had similar
> conversations before.

I think we've already had this conversation in the case of the delegation
token framework, and I can say the same here. There is no intention to take
things over blindly, but it's no shame to be inspired by solutions that users
welcome. The intention is the same as in the scalable authentication area,
where Flink is now ahead of Spark.

> Would it be too much to ask for a FLIP that outlines the overall vision
> (without delving too deeply into the details) to ensure everyone is aligned
> and moving in the same direction?

That's a fair point and a constructive way to proceed. I'm going to come back
with the details...

BR,
G

On Fri, Aug 9, 2024 at 1:36 PM David Morávek <d...@apache.org> wrote:

> Hi Gabor,
>
> Thanks for taking the initiative on this. It’s clear that significant
> improvements are needed in this area, and parsing state files can be
> incredibly challenging, even for those who are well-versed in it.
>
> > Just to make it crystal clear, I’m not shooting for an ad-hoc tiny fix
> > but started a path where we fill each and every gap which will end up in a
> > functionality and UX bar just like the Spark solution.
>
> It sounds like you might already have an end-to-end solution in mind. It
> would be really helpful if you could put that into writing so we can all
> align our thinking.
>
> I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll
> just replicate it” without proper discussion. We’ve had similar
> conversations before.
>
> > But this doesn’t mean we create a single giga big FLIP after several
> > months of discussion.
>
> I don’t think anyone is asking for a massive FLIP after lengthy
> discussions, but having a document that outlines the overall vision could
> be incredibly valuable, especially in a distributed setting. It also opens
> the door for others to contribute to and shape this shared vision, which is
> a core principle of community-driven open-source development.
>
> Would it be too much to ask for a FLIP that outlines the overall vision
> (without delving too deeply into the details) to ensure everyone is aligned
> and moving in the same direction?
>
> Best,
> D.
>
> On Fri, Aug 9, 2024 at 11:44 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
> > Hi Zakelly,
> >
> > > I'd suggest we could think of this as a whole
> >
> > In general I think we have the same idea in our mind about considering the
> > state observability as a whole, just we need to agree about the physical
> > task scheduling.
> >
> > > But such a solution requires more design and discussion
> >
> > I can't even agree more. But this doesn't mean we create a single giga big
> > FLIP after several months of discussion.
> >
> > > Regarding the current issue you are facing, here's my idea
> >
> > Just to make it crystal clear, I'm not shooting for ad-hoc tiny fix but
> > started a path where we fill each and every gap which will end-up in a
> > solution where we hit the functionality and UX bar just like the Spark
> > solution.
> > From plan and code perspective I'm more ahead of this FLIP.
> > So when you aim for different task scheduling then make your exact
> > suggestion instead of providing hacks.
> >
> > If I assume correctly you suggest to create a FLIP where we define and
> > agree all the missing pieces in a single giga big FLIP, right?
> > I would say there are obvious missing pieces which are clear that they
> > needed. Just like in PRs the more consumable pieces we have
> > the better because this single change is about 1k lines of code.
> > Having an overkill FLIP/PR can end up in feature creep which I think
> > is disadvantageous.
> >
> > Of course this doesn't exclude the possibility that we start general more
> > high level discussion about the whole state observability story.
> > Here are my high level conceptual points (I consider roughly each point as
> > a separate FLIP):
> > * Store human readable IDs for operators in metadata
> > * Expose the metadata as data stream
> > * Store state with user defined schemas as self containing entity
> > * SQL integration
> > * State metastore with all the created checkpoints/savepoints
> > * State file cleanup strategy in case of failure
> > * Optional: Some extra tool like metadata explorer
> >
> > That said I suggest to split the higher level discussion from this FLIP in
> > a separate thread.
> >
> > BR,
> > G
> >
> >
> > On Fri, Aug 9, 2024 at 10:17 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> >
> > > Hi Márton and Gabor,
> > >
> > > Thanks for sharing context!
> > >
> > > Yes, I'd admit that users need a more friendly way to explore states. And
> > > it seems Flink lacks something like the state metadata store. I'd suggest
> > > we could think of this as a whole, to store enough information for
> > > querying, including operator names, uids, hashes, as well as the state
> > > types or descriptors. Moreover we provide a tool to list those metadata. My
> > > thoughts is to provide a complete solution instead of adding one or two
> > > specific data alongside the checkpoint. WDTY? I believe with the state
> > > schema queryable, the State Processor API could become more powerful and
> > > easier to use.
> > >
> > > But such a solution requires more design and discussion. Regarding the
> > > current issue you are facing, here's my idea: If you could get access to
> > > the web UI, you can get the hash (vertex id) in the url by clicking and
> > > zooming in on the operator you want to query.
> > > IIUC, this hash can be used
> > > to query the state. Is this feasible? Additionally, I think we could add
> > > user-defined UIDs on the web UI and related REST APIs. Thus users could
> > > easily identify an operator by uid, or get the uid of an operator.
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Thu, Aug 8, 2024 at 11:03 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > > > Hi Zakelly,
> > > >
> > > > Thanks for the feedback, let me elaborate on this.
> > > >
> > > > In short Databricks has created a much more user friendly solution[1] for
> > > > state observability (based on Flink's state processor API) than what we
> > > > have now.
> > > >
> > > > Up until now our state processor API was good enough but now we're lagging
> > > > behind. We see users (just like Spark) where the first class citizen is the
> > > > state itself and they're
> > > > pointing to the new Spark solution. Since the state became first class
> > > > citizen there is a natural need to use it for business logic validation,
> > > > debugging, explanatory browsing, etc...
> > > >
> > > > The main message here is that there are cases where users are not able to
> > > > identify operators because hash is a one way conversion.
> > > > I'm open to any suggestion but somehow the initial operator human readable
> > > > identifier must be available. Let me come up with examples where
> > > > users are completely blind.
> > > >
> > > > > Are you saying the user can set the operator uid but then doesn't know
> > > > > what they set when debugging?
> > > >
> > > > There are cases where the user is setting the UID in the job, such case
> > > > it's not user friendly to parse git repos but doable.
> > > > But there are cases where the user has limited or no control related UIDs:
> > > > * SQL jobs are generating operators with meaningful names, but I think it's
> > > > not realistic to enforce users to understand all the internals of Flink SQL
> > > > implementation (which operator named where and how).
> > > > * Iceberg is using the given UID as prefix and generating more operators
> > > > with it
> > > > * Weak justification but exists: Since operator name and UID are both
> > > > optional some of the users are setting name only. Such case Flink generates
> > > > a random hash, where only name can give some pointers.
> > > >
> > > > Hope I've given better context.
> > > >
> > > > [1]
> > > > https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source
> > > >
> > > > BR,
> > > > G
> > > >
> > > > On Thu, Aug 8, 2024 at 12:06 PM Zakelly Lan <zakelly....@gmail.com> wrote:
> > > >
> > > > > Hi Gabor,
> > > > >
> > > > > Thanks for the proposal! However, I find it a little strange. Are you
> > > > > saying the user can set the operator uid but then doesn't know what they
> > > > > set when debugging? Otherwise, is the `OperatorIdentifier.forUid("my-uid")`
> > > > > feasible? I understand your point about potential cross-team work, but the
> > > > > person may not be able to debug code that was not written by them. Things
> > > > > get complex in this scenario. Could you provide more details about the
> > > > > issue you are facing?
> > > > >
> > > > > Regarding the checkpoint, it is not designed to be self-contained or
> > > > > human-readable. I suggest not introducing such columns for debugging
> > > > > purposes.
> > > > >
> > > > > Best,
> > > > > Zakelly
> > > > >
> > > > > On Wed, Aug 7, 2024 at 10:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Devs,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-474: Store operator name and UID
> > > > > > in state metadata[1].
> > > > > >
> > > > > > In short users are interested in what kind of operators are inside a
> > > > > > checkpoint data which can be enhanced from user experience perspective. The
> > > > > > details can be found in FLIP-474[1].
> > > > > >
> > > > > > Please share your thoughts on this.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-474%3A+Store+operator+name+and+UID+in+state+metadata
> > > > > >
> > > > > > BR,
> > > > > > G