Hi David,

Thanks for sharing your thoughts!

> It sounds like you might already have an end-to-end solution in mind. It
> would be really helpful if you could put that into writing so we can all
> align our thinking.

It makes sense to create a high-level vision.

> I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll
> just replicate it” without proper discussion. We’ve had similar
> conversations before.

I think we've already had this conversation in the case of the delegation
token framework, and I can say the same here. There is no intention to take
things over blindly, but there is no shame in being inspired by solutions
that users welcome. The intention is the same as in the scalable
authentication area, where Flink is now ahead of Spark.

> Would it be too much to ask for a FLIP that outlines the overall vision
> (without delving too deeply into the details) to ensure everyone is aligned
> and moving in the same direction?

That's a fair point and a constructive way to proceed.
I'm going to come back with the details...

BR,
G


On Fri, Aug 9, 2024 at 1:36 PM David Morávek <d...@apache.org> wrote:

> Hi Gabor,
>
> Thanks for taking the initiative on this. It’s clear that significant
> improvements are needed in this area, and parsing state files can be
> incredibly challenging, even for those who are well-versed in it.
>
> > Just to make it crystal clear, I’m not shooting for an ad-hoc tiny fix
> > but started a path where we fill each and every gap which will end up in a
> > functionality and UX bar just like the Spark solution.
>
> It sounds like you might already have an end-to-end solution in mind. It
> would be really helpful if you could put that into writing so we can all
> align our thinking.
>
> I’m not a fan of the mindset of “this is how it was done in Spark, so we’ll
> just replicate it” without proper discussion. We’ve had similar
> conversations before.
>
> > But this doesn’t mean we create a single giga big FLIP after several
> > months of discussion.
>
> I don’t think anyone is asking for a massive FLIP after lengthy
> discussions, but having a document that outlines the overall vision could
> be incredibly valuable, especially in a distributed setting. It also opens
> the door for others to contribute to and shape this shared vision, which is
> a core principle of community-driven open-source development.
>
> Would it be too much to ask for a FLIP that outlines the overall vision
> (without delving too deeply into the details) to ensure everyone is aligned
> and moving in the same direction?
>
> Best,
> D.
>
> On Fri, Aug 9, 2024 at 11:44 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
> > Hi Zakelly,
> >
> > > I'd suggest we could think of this as a whole
> >
> > In general I think we have the same idea in our mind about considering the
> > state observability as a whole, just we need to agree about the physical
> > task scheduling.
> >
> > > But such a solution requires more design and discussion
> >
> > I couldn't agree more. But this doesn't mean we create a single giga big
> > FLIP after several months of discussion.
> >
> > > Regarding the current issue you are facing, here's my idea
> >
> > Just to make it crystal clear, I'm not shooting for an ad-hoc tiny fix but
> > started a path where we fill each and every gap, which will end up in a
> > solution where we hit the functionality and UX bar just like the Spark
> > solution. From a plan and code perspective I'm further ahead than this FLIP.
> > So if you are aiming for different task scheduling, then please make an
> > exact suggestion instead of providing hacks.
> >
> > If I assume correctly, you suggest creating a FLIP where we define and
> > agree on all the missing pieces in a single giga big FLIP, right?
> > I would say there are obvious missing pieces which are clearly needed.
> > Just like in PRs, the more consumable pieces we have the better, because
> > this single change is about 1k lines of code. Having an overkill FLIP/PR
> > can end up in feature creep, which I think is disadvantageous.
> >
> > Of course this doesn't exclude the possibility that we start a more general,
> > high-level discussion about the whole state observability story.
> > Here are my high-level conceptual points (I consider roughly each point
> > a separate FLIP):
> > * Store human readable IDs for operators in metadata
> > * Expose the metadata as data stream
> > * Store state with user defined schemas as self containing entity
> > * SQL integration
> > * State metastore with all the created checkpoints/savepoints
> > * State file cleanup strategy in case of failure
> > * Optional: Some extra tool like metadata explorer
> >
> > That said, I suggest splitting the higher-level discussion from this FLIP
> > into a separate thread.
> >
> > BR,
> > G
> >
> >
> > On Fri, Aug 9, 2024 at 10:17 AM Zakelly Lan <zakelly....@gmail.com>
> > wrote:
> >
> > > Hi Márton and Gabor,
> > >
> > > Thanks for sharing context!
> > >
> > > Yes, I'd admit that users need a more friendly way to explore states. And
> > > it seems Flink lacks something like a state metadata store. I'd suggest
> > > we could think of this as a whole, to store enough information for
> > > querying, including operator names, uids, hashes, as well as the state
> > > types or descriptors. Moreover, we could provide a tool to list that
> > > metadata. My thought is to provide a complete solution instead of adding
> > > one or two specific pieces of data alongside the checkpoint. WDYT? I
> > > believe with the state schema queryable, the State Processor API could
> > > become more powerful and easier to use.
> > >
> > > But such a solution requires more design and discussion. Regarding the
> > > current issue you are facing, here's my idea: If you can get access to
> > > the web UI, you can get the hash (vertex id) from the URL by clicking and
> > > zooming in on the operator you want to query. IIUC, this hash can be used
> > > to query the state. Is this feasible? Additionally, I think we could add
> > > user-defined UIDs to the web UI and related REST APIs. Thus users could
> > > easily identify an operator by uid, or get the uid of an operator.
> > >
> > > Best,
> > > Zakelly
> > >
> > > On Thu, Aug 8, 2024 at 11:03 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > > > Hi Zakelly,
> > > >
> > > > Thanks for the feedback, let me elaborate on this.
> > > >
> > > > In short, Databricks has created a much more user-friendly solution[1]
> > > > for state observability (based on Flink's state processor API) than what
> > > > we have now.
> > > >
> > > > Up until now our state processor API was good enough, but now we're
> > > > lagging behind. We see users (just like in Spark) for whom the state
> > > > itself is the first-class citizen, and they're pointing to the new
> > > > Spark solution. Since the state became a first-class citizen, there is
> > > > a natural need to use it for business logic validation, debugging,
> > > > exploratory browsing, etc.
> > > >
> > > > The main message here is that there are cases where users are not able
> > > > to identify operators because the hash is a one-way conversion.
> > > > I'm open to any suggestion, but somehow the operator's original
> > > > human-readable identifier must be available. Let me come up with
> > > > examples where users are completely blind.
> > > >
> > > > > Are you saying the user can set the operator uid but then doesn't
> > > > > know what they set when debugging?
> > > >
> > > > There are cases where the user sets the UID in the job; in such a case
> > > > it's not user friendly to parse git repos, but it is doable.
> > > > But there are cases where the user has limited or no control over UIDs:
> > > > * SQL jobs generate operators with meaningful names, but I think it's
> > > > not realistic to force users to understand all the internals of the
> > > > Flink SQL implementation (which operator is named where and how).
> > > > * Iceberg uses the given UID as a prefix and generates more operators
> > > > with it.
> > > > * Weak justification, but it exists: since operator name and UID are
> > > > both optional, some users set the name only. In such a case Flink
> > > > generates a random hash, where only the name can give some pointers.
> > > >
> > > > Hope I've given better context.
> > > >
> > > > [1]
> > > > https://www.databricks.com/blog/announcing-state-reader-api-new-statestore-data-source
> > > >
> > > > BR,
> > > > G
> > > >
> > > >
> > > >
> > > > On Thu, Aug 8, 2024 at 12:06 PM Zakelly Lan <zakelly....@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Gabor,
> > > > >
> > > > > Thanks for the proposal! However, I find it a little strange. Are you
> > > > > saying the user can set the operator uid but then doesn't know what
> > > > > they set when debugging? Otherwise, is
> > > > > `OperatorIdentifier.forUid("my-uid")` feasible? I understand your point
> > > > > about potential cross-team work, but the person may not be able to
> > > > > debug code that was not written by them. Things get complex in this
> > > > > scenario. Could you provide more details about the issue you are
> > > > > facing?
> > > > >
> > > > > Regarding the checkpoint, it is not designed to be self-contained or
> > > > > human-readable. I suggest not introducing such columns for debugging
> > > > > purposes.
> > > > >
> > > > >
> > > > > Best,
> > > > > Zakelly
> > > > >
> > > > > On Wed, Aug 7, 2024 at 10:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Devs,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-474: Store operator name
> > > > > > and UID in state metadata[1].
> > > > > >
> > > > > > In short, users are interested in what kind of operators are inside
> > > > > > checkpoint data, which can be improved from a user experience
> > > > > > perspective. The details can be found in FLIP-474[1].
> > > > > >
> > > > > > Please share your thoughts on this.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-474%3A+Store+operator+name+and+UID+in+state+metadata
> > > > > >
> > > > > > BR,
> > > > > > G
> > > > > >
> > > > >
> > > >
> > >
> >
>
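
The point made in the thread, that an operator hash is a one-way conversion,
can be sketched as below. This is only an illustration under stated
assumptions: `operator_hash` uses MD5 from the Python standard library as a
stand-in and is not Flink's actual hashing, `build_metadata` and the operator
names are hypothetical, and only the UID `my-uid` comes from the thread. It
shows why a lookup table stored next to the state would let a user get from a
hash back to a readable name and UID.

```python
import hashlib


def operator_hash(uid: str) -> str:
    """Stand-in for a uid -> operator hash mapping (NOT Flink's real hash)."""
    return hashlib.md5(uid.encode("utf-8")).hexdigest()


def build_metadata(operators):
    """Hypothetical side table: operator hash -> human-readable name and UID."""
    return {
        operator_hash(uid): {"uid": uid, "name": name}
        for uid, name in operators
    }


# Hypothetical operators; only "my-uid" appears in the thread.
metadata = build_metadata([
    ("my-uid", "Deduplicate"),
    ("sink-uid", "Iceberg sink"),
])

# Given only a hash, the UID cannot be computed back from it,
# but it can be looked up if the mapping was stored.
h = operator_hash("my-uid")
print(h, "->", metadata[h])
```

Without such a stored mapping, the only way back from a hash is to guess
candidate UIDs and hash each of them, which is what users are left with when
they have no control over the UIDs.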
