Hi everyone,

Piotr and I had a long discussion about the checkpoint duration issue. We
think the lambda serialization approach I proposed in my last mail may
introduce more issues, the most serious being that users would not be able
to modify the code in a serialized lambda to apply a bug fix.

But fortunately we found a possible solution. By introducing a set of
declarative APIs and a `declareProcess()` function that users implement in a
newly introduced AbstractStreamOperator, we can obtain the declaration of
record processing at runtime, broken down into requests and callbacks
(lambdas) as introduced in FLIP-424. Thus we avoid lambda (de)serialization
entirely and instead retrieve the callbacks each time before a task runs.
The next step is to provide an API that lets users assign a unique id to
each state request and callback, or assigns one automatically by declaration
order. With that id we can find the corresponding callback at runtime for
each restored state request, and the whole pipeline can be resumed.
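To make the idea concrete, here is a minimal sketch of the matching
mechanism. This is illustrative pseudocode, not the actual Flink API: the
names `Declaration`, `declare`, and `declare_process` are assumptions, and
only the id-based matching reflects the proposal above.

```python
# Illustrative model: callbacks are re-declared on every task start and
# matched to restored in-flight state requests by a stable id assigned
# in declaration order (or explicitly by the user).

class Declaration:
    def __init__(self):
        self.callbacks = {}   # id -> callback
        self._next_id = 0

    def declare(self, callback, explicit_id=None):
        cb_id = explicit_id
        if cb_id is None:
            cb_id = self._next_id   # automatic id by declaration order
            self._next_id += 1
        self.callbacks[cb_id] = cb = callback
        return cb_id

results = []

def declare_process(decl):
    # User code: re-declared before every task run, so a bug fix here takes
    # effect on the next restore -- nothing is serialized but the id.
    decl.declare(lambda value: results.append("processed " + value))

decl = Declaration()
declare_process(decl)
# The checkpoint stored only (key, callback_id) pairs for in-flight requests:
for key, cb_id in [("user-42", 0)]:
    decl.callbacks[cb_id]("value-of-" + key)   # resume with the fresh code
print(results)  # -> ['processed value-of-user-42']
```

The key property is that a lambda never crosses a checkpoint boundary; only
its id does, and the id is re-bound to freshly declared code on restore.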

Note that all of the above is internal and won't be exposed to average
users. Exposing this via the Stream APIs can be discussed later. I will
prepare another FLIP (or FLIPs) describing the whole optimized checkpointing
process, and in the meantime we can proceed with the current FLIPs. The new
FLIP(s) build on top of the current ones, so we can achieve this
incrementally.


Best,
Zakelly

On Thu, Mar 21, 2024 at 12:31 AM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Piotrek,
>
> Thanks for your comments!
>
> As we discussed off-line, you agreed that we can not checkpoint while some
>> records are in the middle of being
>> processed. That we would have to drain the in-progress records before
>> doing
>> the checkpoint. You also argued
>> that this is not a problem, because the size of this buffer can be
>> configured.
>>
>> I'm really afraid of such a solution. I've seen in the past plenty of
>> times, that whenever Flink has to drain some
>> buffered records, eventually that always breaks timely checkpointing (and
>> hence ability for Flink to rescale in
>> a timely manner). Even a single record with a `flatMap` like operator
>> currently in Flink causes problems during
>> back pressure. That's especially true for example for processing
>> watermarks. At the same time, I don't see how
>> this value could be configured by even Flink's power users, let alone an
>> average user. The size of that in-flight
>> buffer not only depends on a particular query/job, but also the "good"
>> value changes dynamically over time,
>> and can change very rapidly. Sudden spikes of records or backpressure,
>> some
>> hiccup during emitting watermarks,
>> all of those could change in an instant the theoretically optimal buffer
>> size of let's say "6000" records, down to "1".
>> And when those changes happen, those are the exact times when timely
>> checkpointing matters the most.
>> If the load pattern suddenly changes, and checkpointing takes suddenly
>> tens
>> of minutes instead of a couple of
>> seconds, it means you can not use rescaling and you are forced to
>> overprovision the resources. And there are also
>> other issues if checkpointing takes too long.
>>
>
> I'm going to clarify some misunderstandings here. First of all, is the sync
> phase of checkpointing in the current plan longer than in the synchronous
> execution model? The answer is yes; it is a trade-off for the parallel
> execution model, and I think the cost is worth the improvement. Now the
> question is, how much longer are we talking about? The PoC result I
> provided is that it takes 3 seconds to drain 6000 records of a simple job,
> which I consider acceptable. You might point out that long watermark/timer
> processing could still make the checkpoint wait, so I propose several ways
> to optimize this:
>
>    1. Instead of only controlling the in-flight records, we could also
>    control the in-flight watermarks.
>    2. Since we have broken down record processing into several state
>    requests with at most one subsequent callback each, the checkpoint can
>    proceed after the currently RUNNING requests (NOT records) and their
>    callbacks finish. This means that even with many records in-flight (in
>    'processElement' here), the checkpoint can proceed once only a small
>    group of state requests finishes. These requests are batched into 1 or 2
>    multiGets to RocksDB, which takes much less time. Moreover, timer
>    processing is also split into several requests, so the checkpoint won't
>    wait for all timers to finish. The attached picture describes this idea.
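To illustrate the second point with a simplified model (the class and method
names here are assumptions for illustration, not Flink internals; `AEC`
later in this thread refers to the real component):

```python
# Illustrative model: the sync phase of a checkpoint drains only the small
# set of RUNNING state requests; records still waiting to enter
# processElement are snapshotted as-is instead of being fully drained.

class AsyncExecutionController:
    def __init__(self):
        self.blocked_records = []    # records not yet in processElement
        self.running_requests = []   # state requests currently in flight

    def on_checkpoint(self, backend):
        # Batch outstanding reads into one multiGet-style call ...
        keys = [r["key"] for r in self.running_requests]
        values = backend.multi_get(keys)
        for req, value in zip(self.running_requests, values):
            req["callback"](value)   # at most one callback per request
        self.running_requests.clear()
        # ... then snapshot blocked records as state; no full drain needed.
        return list(self.blocked_records)

class FakeBackend:
    def multi_get(self, keys):
        return ["v:" + k for k in keys]

aec = AsyncExecutionController()
aec.blocked_records = ["r1", "r2", "r3"]
done = []
aec.running_requests = [{"key": "k1", "callback": done.append}]
snapshot = aec.on_checkpoint(FakeBackend())
print(done)      # -> ['v:k1']: the single in-flight request completed
print(snapshot)  # -> ['r1', 'r2', 'r3']: saved without being processed
```

The point is that the sync-phase wait scales with the number of in-flight
state requests, not with the number of buffered records.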
>
> And the size of this buffer can be configured. I'm not counting on average
> users to configure it well; I'm just saying that we'd better not focus on
> absolute numbers from the PoC or specific cases, since we can always
> provide a conservative default value that is acceptable for most cases. An
> adaptive buffer size is also worth a try if we provide a conservative
> strategy.
>
> Besides, I don't understand why the load pattern, spikes, or backpressure
> would affect this. We are controlling the records that can enter
> 'processElement' and the state requests that can fire in parallel; no
> matter how high the load spikes, excess records are blocked outside. It is
> relatively stable within the proposed execution model itself. The unaligned
> barrier will skip those inputs in the queue as before.
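A simplified model of that admission control (names are illustrative
assumptions, not Flink code): a fixed-capacity gate in front of
`processElement` keeps the in-flight count bounded regardless of input rate.

```python
# Illustrative sketch: load spikes queue up outside the gate, so the number
# of records inside processElement stays bounded and stable; an unaligned
# barrier can overtake the waiting records as before.
from collections import deque

class InFlightGate:
    def __init__(self, capacity):
        self.capacity = capacity
        self.in_flight = 0
        self.waiting = deque()

    def offer(self, record):
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return record            # admitted into processElement
        self.waiting.append(record)  # blocked outside the execution model
        return None

    def complete_one(self):
        # A record finished processing; admit the next waiting one, if any.
        self.in_flight -= 1
        if self.waiting:
            return self.offer(self.waiting.popleft())
        return None

gate = InFlightGate(capacity=2)
admitted = [r for r in ("a", "b", "c", "d") if gate.offer(r) is not None]
print(admitted)            # -> ['a', 'b']: only 2 admitted despite the spike
print(len(gate.waiting))   # -> 2: the rest wait outside
```

However large the spike, the work visible to the checkpoint is capped by
`capacity`, which is why the model is insensitive to load patterns.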
>
>
> At the same time, I still don't understand why we can not implement things
>> incrementally? First
>> let's start with the current API, without the need to rewrite all of the
>> operators, we can asynchronously fetch whole
>> state for a given record using its key. That should already vastly improve
>> many things, and this way we could
>> perform a checkpoint without a need of draining the in-progress/in-flight
>> buffer. We could roll that version out,
>> test it out in practice, and then we could see if the fine grained state
>> access is really needed. Otherwise it sounds
>> to me like a premature optimization, that requires us to not only rewrite
>> a
>> lot of the code, but also to later maintain
>> it, even if it ultimately proves to be not needed. Which I of course can
>> not be certain but I have a feeling that it
>> might be the case.
>
>
> The disaggregated state management we propose targets challenges including
> but not limited to the following:
>
>    1. Local disk constraints, including limited I/O and space. (discussed
>    in FLIP-423)
>    2. Decoupling I/O resources from pre-allocated CPU resources, to make
>    good use of both (motivation of FLIP-424).
>    3. Elasticity of scaling I/O or storage capacity.
>
> Thus our plan depends on DFS, using the local disk only as an optional
> cache. The pre-fetching plan you mentioned still binds I/O to CPU
> resources, and will consume even more I/O loading unnecessary state, which
> makes things worse. Please note that we are not targeting scenarios that
> local state already handles well, and our goal is not to replace local
> state.
>
> And if manpower is a big concern of yours, I would say many of my
> colleagues could help contribute in runtime or SQL operators. It is
> experimental, on a separate code path from the local state, and will be
> recommended for use only once we prove it mature.
>
>
> Thanks & Best,
> Zakelly
>
> On Wed, Mar 20, 2024 at 10:04 PM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hey Zakelly!
>>
>> Sorry for the late reply. I still have concerns about the proposed
>> solution, with my main concerns coming from
>> the implications of the asynchronous state access API on the checkpointing
>> and responsiveness of Flink.
>>
>> >> What also worries me a lot in this fine grained model is the effect on
>> the checkpointing times.
>> >
>> > Your concerns are very reasonable. Faster checkpointing is always a core
>> advantage of disaggregated state,
>> > but only for the async phase. There will be some complexity introduced
>> by
>> in-flight requests, but I'd suggest
>> > a checkpoint containing those in-flight state requests as part of the
>> state, to accelerate the sync phase by
>> > skipping the buffer draining. This makes the buffer size have little
>> impact on checkpoint time. And all the
>> > changes keep within the execution model we proposed while the checkpoint
>> barrier alignment or handling
>> > will not be touched in our proposal, so I guess the complexity is
>> relatively controllable. I have faith in that :)
>>
>> As we discussed off-line, you agreed that we can not checkpoint while some
>> records are in the middle of being
>> processed. That we would have to drain the in-progress records before
>> doing
>> the checkpoint. You also argued
>> that this is not a problem, because the size of this buffer can be
>> configured.
>>
>> I'm really afraid of such a solution. I've seen in the past plenty of
>> times, that whenever Flink has to drain some
>> buffered records, eventually that always breaks timely checkpointing (and
>> hence ability for Flink to rescale in
>> a timely manner). Even a single record with a `flatMap` like operator
>> currently in Flink causes problems during
>> back pressure. That's especially true for example for processing
>> watermarks. At the same time, I don't see how
>> this value could be configured by even Flink's power users, let alone an
>> average user. The size of that in-flight
>> buffer not only depends on a particular query/job, but also the "good"
>> value changes dynamically over time,
>> and can change very rapidly. Sudden spikes of records or backpressure,
>> some
>> hiccup during emitting watermarks,
>> all of those could change in an instant the theoretically optimal buffer
>> size of let's say "6000" records, down to "1".
>> And when those changes happen, those are the exact times when timely
>> checkpointing matters the most.
>> If the load pattern suddenly changes, and checkpointing takes suddenly
>> tens
>> of minutes instead of a couple of
>> seconds, it means you can not use rescaling and you are forced to
>> overprovision the resources. And there are also
>> other issues if checkpointing takes too long.
>>
>> At the same time, I still don't understand why we can not implement things
>> incrementally? First
>> let's start with the current API, without the need to rewrite all of the
>> operators, we can asynchronously fetch whole
>> state for a given record using its key. That should already vastly improve
>> many things, and this way we could
>> perform a checkpoint without a need of draining the in-progress/in-flight
>> buffer. We could roll that version out,
>> test it out in practice, and then we could see if the fine grained state
>> access is really needed. Otherwise it sounds
>> to me like a premature optimization, that requires us to not only rewrite
>> a
>> lot of the code, but also to later maintain
>> it, even if it ultimately proves to be not needed. Which I of course can
>> not be certain but I have a feeling that it
>> might be the case.
>>
>> Best,
>> Piotrek
>>
>> wt., 19 mar 2024 o 10:42 Zakelly Lan <zakelly....@gmail.com> napisał(a):
>>
>> > Hi everyone,
>> >
>> > Thanks for your valuable feedback!
>> >
>> > Our discussions have been going on for a while and are nearing a
>> > consensus. So I would like to start a vote after 72 hours.
>> >
>> > Please let me know if you have any concerns, thanks!
>> >
>> >
>> > Best,
>> > Zakelly
>> >
>> > On Tue, Mar 19, 2024 at 3:37 PM Zakelly Lan <zakelly....@gmail.com>
>> wrote:
>> >
>> > > Hi Yunfeng,
>> > >
>> > > Thanks for the suggestion!
>> > >
>> > > I will reorganize the FLIP-425 accordingly.
>> > >
>> > >
>> > > Best,
>> > > Zakelly
>> > >
>> > > On Tue, Mar 19, 2024 at 3:20 PM Yunfeng Zhou <
>> > flink.zhouyunf...@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi Xintong and Zakelly,
>> > >>
>> > >> > 2. Regarding Strictly-ordered and Out-of-order of Watermarks
>> > >> I agree with it that watermarks can use only out-of-order mode for
>> > >> now, because there is still not a concrete example showing the
>> > >> correctness risk about it. However, the strictly-ordered mode should
>> > >> still be supported as the default option for non-record event types
>> > >> other than watermark, at least for checkpoint barriers.
>> > >>
>> > >> I noticed that this information has already been documented in "For
>> > >> other non-record events, such as RecordAttributes ...", but it's at
>> > >> the bottom of the "Watermark" section, which might not be very
>> > >> obvious. Thus it might be better to reorganize the FLIP to better
>> > >> claim that the two order modes are designed for all non-record
>> events,
>> > >> and which mode this FLIP would choose for each type of event.
>> > >>
>> > >> Best,
>> > >> Yunfeng
>> > >>
>> > >> On Tue, Mar 19, 2024 at 1:09 PM Xintong Song <tonysong...@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > Thanks for the quick response. Sounds good to me.
>> > >> >
>> > >> > Best,
>> > >> >
>> > >> > Xintong
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Tue, Mar 19, 2024 at 1:03 PM Zakelly Lan <zakelly....@gmail.com
>> >
>> > >> wrote:
>> > >> >
>> > >> > > Hi Xintong,
>> > >> > >
>> > >> > > Thanks for sharing your thoughts!
>> > >> > >
>> > >> > > 1. Regarding Record-ordered and State-ordered of processElement.
>> > >> > > >
>> > >> > > > I understand that while State-ordered likely provides better
>> > >> performance,
>> > >> > > > Record-ordered is sometimes required for correctness. The
>> question
>> > >> is how
>> > >> > > > should a user choose between these two modes? My concern is
>> that
>> > >> such a
>> > >> > > > decision may require users to have in-depth knowledge about the
>> > >> Flink
>> > >> > > > internals, and may lead to correctness issues if State-ordered
>> is
>> > >> chosen
>> > >> > > > improperly.
>> > >> > > >
>> > >> > > > I'd suggest not to expose such a knob, at least in the first
>> > >> version.
>> > >> > > That
>> > >> > > > means always use Record-ordered for custom operators / UDFs,
>> and
>> > >> keep
>> > >> > > > State-ordered for internal usages (built-in operators) only.
>> > >> > > >
>> > >> > >
>> > >> > > Indeed, users may not be able to choose the mode properly. I
>> agree
>> > to
>> > >> keep
>> > >> > > such options for internal use.
>> > >> > >
>> > >> > >
>> > >> > > 2. Regarding Strictly-ordered and Out-of-order of Watermarks.
>> > >> > > >
>> > >> > > > I'm not entirely sure about Strictly-ordered being the
>> default, or
>> > >> even
>> > >> > > > being supported. From my understanding, a Watermark(T) only
>> > >> suggests that
>> > >> > > > all records with event time before T has arrived, and it has
>> > >> nothing to
>> > >> > > do
>> > >> > > > with whether records with event time after T has arrived or
>> not.
>> > >> From
>> > >> > > that
>> > >> > > > perspective, preventing certain records from arriving before a
>> > >> Watermark
>> > >> > > is
>> > >> > > > never supported. I also cannot come up with any use case where
>> > >> > > > Strictly-ordered is necessary. This implies the same issue as
>> 1):
>> > >> how
>> > >> > > does
>> > >> > > > the user choose between the two modes?
>> > >> > > >
>> > >> > > > I'd suggest not expose the knob to users and only support
>> > >> Out-of-order,
>> > >> > > > until we see a concrete use case that Strictly-ordered is
>> needed.
>> > >> > > >
>> > >> > >
>> > >> > > The semantics of watermarks do not define the sequence between a
>> > >> watermark
>> > >> > > and subsequent records. For the most part, this is
>> inconsequential,
>> > >> except
>> > >> > > it may affect some current users who have previously relied on
>> the
>> > >> implicit
>> > >> > > assumption of an ordered execution. I'd be fine with initially
>> > >> supporting
>> > >> > > only out-of-order processing. We may consider exposing the
>> > >> > > 'Strictly-ordered' mode once we encounter a concrete use case
>> that
>> > >> > > necessitates it.
>> > >> > >
>> > >> > >
>> > >> > > My philosophies behind not exposing the two config options are:
>> > >> > > > - There are already too many options in Flink that barely know
>> how
>> > >> to use
>> > >> > > > them. I think Flink should try as much as possible to decide
>> its
>> > own
>> > >> > > > behavior, rather than throwing all the decisions to the users.
>> > >> > > > - It's much harder to take back knobs than to introduce them.
>> > >> Therefore,
>> > >> > > > options should be introduced only if concrete use cases are
>> > >> identified.
>> > >> > > >
>> > >> > >
>> > >> > > I agree to keep minimal configurable items especially for the
>> MVP.
>> > >> Given
>> > >> > > that we have the opportunity to refine the functionality before
>> the
>> > >> > > framework transitions from @Experimental to @PublicEvolving, it
>> > makes
>> > >> sense
>> > >> > > to refrain from presenting user-facing options until we have
>> ensured
>> > >> > > their necessity.
>> > >> > >
>> > >> > >
>> > >> > > Best,
>> > >> > > Zakelly
>> > >> > >
>> > >> > > On Tue, Mar 19, 2024 at 12:06 PM Xintong Song <
>> > tonysong...@gmail.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Sorry for joining the discussion late.
>> > >> > > >
>> > >> > > > I have two questions about FLIP-425.
>> > >> > > >
>> > >> > > > 1. Regarding Record-ordered and State-ordered of
>> processElement.
>> > >> > > >
>> > >> > > > I understand that while State-ordered likely provides better
>> > >> performance,
>> > >> > > > Record-ordered is sometimes required for correctness. The
>> question
>> > >> is how
>> > >> > > > should a user choose between these two modes? My concern is
>> that
>> > >> such a
>> > >> > > > decision may require users to have in-depth knowledge about the
>> > >> Flink
>> > >> > > > internals, and may lead to correctness issues if State-ordered
>> is
>> > >> chosen
>> > >> > > > improperly.
>> > >> > > >
>> > >> > > > I'd suggest not to expose such a knob, at least in the first
>> > >> version.
>> > >> > > That
>> > >> > > > means always use Record-ordered for custom operators / UDFs,
>> and
>> > >> keep
>> > >> > > > State-ordered for internal usages (built-in operators) only.
>> > >> > > >
>> > >> > > > 2. Regarding Strictly-ordered and Out-of-order of Watermarks.
>> > >> > > >
>> > >> > > > I'm not entirely sure about Strictly-ordered being the
>> default, or
>> > >> even
>> > >> > > > being supported. From my understanding, a Watermark(T) only
>> > >> suggests that
>> > >> > > > all records with event time before T has arrived, and it has
>> > >> nothing to
>> > >> > > do
>> > >> > > > with whether records with event time after T has arrived or
>> not.
>> > >> From
>> > >> > > that
>> > >> > > > perspective, preventing certain records from arriving before a
>> > >> Watermark
>> > >> > > is
>> > >> > > > never supported. I also cannot come up with any use case where
>> > >> > > > Strictly-ordered is necessary. This implies the same issue as
>> 1):
>> > >> how
>> > >> > > does
>> > >> > > > the user choose between the two modes?
>> > >> > > >
>> > >> > > > I'd suggest not expose the knob to users and only support
>> > >> Out-of-order,
>> > >> > > > until we see a concrete use case that Strictly-ordered is
>> needed.
>> > >> > > >
>> > >> > > >
>> > >> > > > My philosophies behind not exposing the two config options are:
>> > >> > > > - There are already too many options in Flink that barely know
>> how
>> > >> to use
>> > >> > > > them. I think Flink should try as much as possible to decide
>> its
>> > own
>> > >> > > > behavior, rather than throwing all the decisions to the users.
>> > >> > > > - It's much harder to take back knobs than to introduce them.
>> > >> Therefore,
>> > >> > > > options should be introduced only if concrete use cases are
>> > >> identified.
>> > >> > > >
>> > >> > > > WDYT?
>> > >> > > >
>> > >> > > > Best,
>> > >> > > >
>> > >> > > > Xintong
>> > >> > > >
>> > >> > > >
>> > >> > > >
>> > >> > > > On Fri, Mar 8, 2024 at 2:45 AM Jing Ge
>> <j...@ververica.com.invalid
>> > >
>> > >> > > wrote:
>> > >> > > >
>> > >> > > > > +1 for Gyula's suggestion. I just finished FLIP-423 which
>> > >> introduced
>> > >> > > the
>> > >> > > > > intention of the big change and high level architecture.
>> Great
>> > >> content
>> > >> > > > btw!
>> > >> > > > > The only public interface change for this FLIP is one new
>> config
>> > >> to use
>> > >> > > > > ForSt. It makes sense to have one dedicated discussion thread
>> > for
>> > >> each
>> > >> > > > > concrete system design.
>> > >> > > > >
>> > >> > > > > @Zakelly The links in your mail do not work except the last
>> one,
>> > >> > > because
>> > >> > > > > the FLIP-xxx has been included into the url like
>> > >> > > > >
>> > >> > >
>> > >>
>> >
>> https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864FLIP-425
>> > >> > > > .
>> > >> > > > >
>> > >> > > > > NIT fix:
>> > >> > > > >
>> > >> > > > > FLIP-424:
>> > >> > > >
>> https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
>> > >> > > > >
>> > >> > > > > FLIP-425:
>> > >> > > >
>> https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
>> > >> > > > >
>> > >> > > > > FLIP-426:
>> > >> > > >
>> https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
>> > >> > > > >
>> > >> > > > > FLIP-427:
>> > >> > > >
>> https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
>> > >> > > > >
>> > >> > > > > FLIP-428:
>> > >> > > >
>> https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
>> > >> > > > >
>> > >> > > > > Best regards,
>> > >> > > > > Jing
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > On Thu, Mar 7, 2024 at 10:14 AM Zakelly Lan <
>> > >> zakelly....@gmail.com>
>> > >> > > > wrote:
>> > >> > > > >
>> > >> > > > > > Hi everyone,
>> > >> > > > > >
>> > >> > > > > > Thank you all for a lively discussion here, and it is a
>> good
>> > >> time to
>> > >> > > > move
>> > >> > > > > > forward to more detailed discussions. Thus we open several
>> > >> threads
>> > >> > > for
>> > >> > > > > > sub-FLIPs:
>> > >> > > > > >
>> > >> > > > > > FLIP-424:
>> > >> > > > >
>> > https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864
>> > >> > > > > > FLIP-425
>> > >> > > > > > <
>> > >> > > > >
>> > >> > >
>> > >>
>> >
>> https://lists.apache.org/thread/nmd9qd0k8l94ygcfgllxms49wmtz1864FLIP-425
>> > >> > > > >:
>> > >> > > > > >
>> > >> https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0h
>> > >> > > > > > FLIP-426
>> > >> > > > > > <
>> > >> > > > >
>> > >> > >
>> > >>
>> >
>> https://lists.apache.org/thread/wxn1j848fnfkqsnrs947wh1wmj8n8z0hFLIP-426
>> > >> > > > >:
>> > >> > > > > >
>> > >> https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrf
>> > >> > > > > > FLIP-427
>> > >> > > > > > <
>> > >> > > > >
>> > >> > >
>> > >>
>> >
>> https://lists.apache.org/thread/bt931focfl9971cwq194trmf3pkdsxrfFLIP-427
>> > >> > > > >:
>> > >> > > > > >
>> > >> https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
>> > >> > > > > > FLIP-428
>> > >> > > > > > <
>> > >> > > > >
>> > >> > >
>> > >>
>> >
>> https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ftFLIP-428
>> > >> > > > >:
>> > >> > > > > >
>> > >> https://lists.apache.org/thread/vr8f91p715ct4lop6b3nr0fh4z5p312b
>> > >> > > > > >
>> > >> > > > > > If you want to talk about the overall architecture,
>> roadmap,
>> > >> > > milestones
>> > >> > > > > or
>> > >> > > > > > something related with multiple FLIPs, please post it here.
>> > >> Otherwise
>> > >> > > > you
>> > >> > > > > > can discuss some details in separate mails. Let's try to
>> avoid
>> > >> > > repeated
>> > >> > > > > > discussion in different threads. I will sync important
>> > messages
>> > >> here
>> > >> > > if
>> > >> > > > > > there are any in the above threads.
>> > >> > > > > >
>> > >> > > > > > And reply to @Jeyhun: We will ensure the content between
>> those
>> > >> FLIPs
>> > >> > > is
>> > >> > > > > > consistent.
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > Best,
>> > >> > > > > > Zakelly
>> > >> > > > > >
>> > >> > > > > > On Thu, Mar 7, 2024 at 2:16 PM Yuan Mei <
>> > yuanmei.w...@gmail.com
>> > >> >
>> > >> > > > wrote:
>> > >> > > > > >
>> > >> > > > > > > I have been a bit busy these few weeks and sorry for
>> > >> responding
>> > >> > > late.
>> > >> > > > > > >
>> > >> > > > > > > The original thinking of keeping discussion within one
>> > thread
>> > >> is
>> > >> > > for
>> > >> > > > > > easier
>> > >> > > > > > > tracking and avoid for repeated discussion in different
>> > >> threads.
>> > >> > > > > > >
>> > >> > > > > > > For details, It might be good to start in different
>> threads
>> > if
>> > >> > > > needed.
>> > >> > > > > > >
>> > >> > > > > > > We will think of a way to better organize the discussion.
>> > >> > > > > > >
>> > >> > > > > > > Best
>> > >> > > > > > > Yuan
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > > On Thu, Mar 7, 2024 at 4:38 AM Jeyhun Karimov <
>> > >> > > je.kari...@gmail.com>
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Hi,
>> > >> > > > > > > >
>> > >> > > > > > > > + 1 for the suggestion.
>> > >> > > > > > > > Maybe we can start the discussion with the FLIPs with minimum
>> > >> > > > dependencies
>> > >> > > > > > > (from
>> > >> > > > > > > > the other new/proposed FLIPs).
>> > >> > > > > > > > Based on our discussion on a particular FLIP, the
>> > >> subsequent (or
>> > >> > > > its
>> > >> > > > > > > > dependent) FLIP(s) can be updated accordingly?
>> > >> > > > > > > >
>> > >> > > > > > > > Regards,
>> > >> > > > > > > > Jeyhun
>> > >> > > > > > > >
>> > >> > > > > > > > On Wed, Mar 6, 2024 at 5:34 PM Gyula Fóra <
>> > >> gyula.f...@gmail.com>
>> > >> > > > > > wrote:
>> > >> > > > > > > >
>> > >> > > > > > > > > Hey all!
>> > >> > > > > > > > >
>> > >> > > > > > > > > This is a massive improvement / work. I just started
>> > going
>> > >> > > > through
>> > >> > > > > > the
>> > >> > > > > > > > > Flips and have a more or less meta comment.
>> > >> > > > > > > > >
>> > >> > > > > > > > > While it's good to keep the overall architecture
>> > >> discussion
>> > >> > > > here, I
>> > >> > > > > > > think
>> > >> > > > > > > > > we should still have separate discussions for each
>> FLIP
>> > >> where
>> > >> > > we
>> > >> > > > > can
>> > >> > > > > > > > > discuss interface details etc. With so much content
>> if
>> > we
>> > >> start
>> > >> > > > > > adding
>> > >> > > > > > > > > minor comments here that will lead to nowhere but
>> those
>> > >> > > > discussions
>> > >> > > > > > are
>> > >> > > > > > > > > still important and we should have them in separate
>> > >> threads
>> > >> > > (one
>> > >> > > > > for
>> > >> > > > > > > each
>> > >> > > > > > > > > FLIP)
>> > >> > > > > > > > >
>> > >> > > > > > > > > What do you think?
>> > >> > > > > > > > > Gyula
>> > >> > > > > > > > >
>> > >> > > > > > > > > On Wed, Mar 6, 2024 at 8:50 AM Yanfei Lei <
>> > >> fredia...@gmail.com
>> > >> > > >
>> > >> > > > > > wrote:
>> > >> > > > > > > > >
>> > >> > > > > > > > > > Hi team,
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > Thanks for your discussion. Regarding FLIP-425, we
>> > have
>> > >> > > > > > supplemented
>> > >> > > > > > > > > > several updates to answer high-frequency questions:
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > 1. We captured a flame graph of the Hashmap state
>> > >> backend in
>> > >> > > > > > > > > > "Synchronous execution with asynchronous APIs"[1],
>> > which
>> > >> > > > reveals
>> > >> > > > > > that
>> > >> > > > > > > > > > the framework overhead (including reference
>> counting,
>> > >> > > > > > future-related
>> > >> > > > > > > > > > code and so on) consumes about 9% of the keyed
>> > operator
>> > >> CPU
>> > >> > > > time.
>> > >> > > > > > > > > > 2. We added a set of comparative experiments for
>> > >> watermark
>> > >> > > > > > > processing,
>> > >> > > > > > > > > > the performance of Out-Of-Order mode is 70% better
>> > than
>> > >> > > > > > > > > > strictly-ordered mode under ~140MB state size.
>> > >> Instructions
>> > >> > > on
>> > >> > > > > how
>> > >> > > > > > to
>> > >> > > > > > > > > > run this test have also been added[2].
>> > >> > > > > > > > > > 3. Regarding the order of StreamRecord, whether it
>> has
>> > >> state
>> > >> > > > > access
>> > >> > > > > > > or
>> > >> > > > > > > > > > not. We supplemented a new *Strict order of
>> > >> > > > 'processElement'*[3].
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > [1]
>> > >> > > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-SynchronousexecutionwithasynchronousAPIs
>> > >> > > > > > > > > > [2]
>> > >> > > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-Strictly-orderedmodevs.Out-of-ordermodeforwatermarkprocessing
>> > >> > > > > > > > > > [3]
>> > >> > > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-425%3A+Asynchronous+Execution+Model#FLIP425:AsynchronousExecutionModel-ElementOrder
>> > >> > > > > > > > > >
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > Best regards,
>> > >> > > > > > > > > > Yanfei
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > Yunfeng Zhou <flink.zhouyunf...@gmail.com>
>> > 于2024年3月5日周二
>> > >> > > > 09:25写道:
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > Hi Zakelly,
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > > 5. I'm not very sure ... revisiting this later
>> > >> since it
>> > >> > > is
>> > >> > > > > not
>> > >> > > > > > > > > > important.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > It seems that we still have some details to
>> confirm
>> > >> about
>> > >> > > > this
>> > >> > > > > > > > > > > question. Let's postpone this to after the
>> critical
>> > >> parts
>> > >> > > of
>> > >> > > > > the
>> > >> > > > > > > > > > > design are settled.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > > 8. Yes, we had considered ... metrics should be
>> > like
>> > >> > > > > > afterwards.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > Oh sorry I missed FLIP-431. I'm fine with
>> discussing
>> > >> this
>> > >> > > > topic
>> > >> > > > > > in
>> > >> > > > > > > > > > milestone 2.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > Looking forward to the detailed design about the
>> > >> strict
>> > >> > > mode
>> > >> > > > > > > between
>> > >> > > > > > > > > > > same-key records and the benchmark results about
>> the
>> > >> epoch
>> > >> > > > > > > mechanism.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > Best regards,
>> > >> > > > > > > > > > > Yunfeng
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > On Mon, Mar 4, 2024 at 7:59 PM Zakelly Lan <
>> > >> > > > > > zakelly....@gmail.com>
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > Hi Yunfeng,
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > For 1:
>> > >> > > > > > > > > > > > I had a discussion with Lincoln Lee, and I realized it is a common
>> > >> > > > > > > > > > > > case that same-key records should be blocked before
>> > >> > > > > > > > > > > > `processElement`, as that is easier for users to understand. Thus I
>> > >> > > > > > > > > > > > will introduce a strict mode for this and make it the default. My
>> > >> > > > > > > > > > > > rough idea is just like yours: invoke some method of the AEC
>> > >> > > > > > > > > > > > instance before `processElement`. The detailed design will be
>> > >> > > > > > > > > > > > described in the FLIP later.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > For 2:
>> > >> > > > > > > > > > > > I agree with you. We could throw exceptions for
>> > now
>> > >> and
>> > >> > > > > > optimize
>> > >> > > > > > > > this
>> > >> > > > > > > > > > later.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > For 5:
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> It might be better to move the default values
>> to
>> > >> the
>> > >> > > > > Proposed
>> > >> > > > > > > > > Changes
>> > >> > > > > > > > > > > >> section instead of making them public for
>> now, as
>> > >> there
>> > >> > > > will
>> > >> > > > > > be
>> > >> > > > > > > > > > > >> compatibility issues once we want to
>> dynamically
>> > >> adjust
>> > >> > > > the
>> > >> > > > > > > > > thresholds
>> > >> > > > > > > > > > > >> and timeouts in future.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > Agreed. The whole framework is under experiment
>> > >> until we
>> > >> > > > > think
>> > >> > > > > > it
>> > >> > > > > > > > is
>> > >> > > > > > > > > > complete in 2.0 or later. The default value should
>> be
>> > >> better
>> > >> > > > > > > determined
>> > >> > > > > > > > > > with more testing results and production
>> experience.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >> The configuration
>> execution.async-state.enabled
>> > >> seems
>> > >> > > > > > > unnecessary,
>> > >> > > > > > > > > as
>> > >> > > > > > > > > > > >> the infrastructure may automatically get this
>> > >> > > information
>> > >> > > > > from
>> > >> > > > > > > the
>> > >> > > > > > > > > > > >> detailed state backend configurations. We may
>> > >> revisit
>> > >> > > this
>> > >> > > > > > part
>> > >> > > > > > > > > after
>> > >> > > > > > > > > > > >> the core designs have reached an agreement.
>> WDYT?
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > I'm not sure whether there is any use case where users write their
>> > >> > > > > > > > > > > > code with the async APIs but run the job synchronously. The first
>> > >> > > > > > > > > > > > two scenarios that come to mind are benchmarking and jobs with a
>> > >> > > > > > > > > > > > small state, where users don't want to rewrite their code. It is
>> > >> > > > > > > > > > > > actually easy to support, so I'd suggest providing it. But I'm fine
>> > >> > > > > > > > > > > > with revisiting this later since it is not important. WDYT?
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > For 8:
>> > >> > > > > > > > > > > > Yes, we had considered the I/O metrics group, especially
>> > >> > > > > > > > > > > > back-pressure, idle, and task busy time per second. In the current
>> > >> > > > > > > > > > > > plan we can access state during back-pressure, which means those
>> > >> > > > > > > > > > > > input-related metrics should be redefined. I suggest we discuss
>> > >> > > > > > > > > > > > these existing metrics, as well as the new metrics to be introduced
>> > >> > > > > > > > > > > > in FLIP-431, later in milestone 2: we have basically finished the
>> > >> > > > > > > > > > > > framework, so by then we will have a better view of what the
>> > >> > > > > > > > > > > > metrics should look like. WDYT?
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > Best,
>> > >> > > > > > > > > > > > Zakelly
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > On Mon, Mar 4, 2024 at 6:49 PM Yunfeng Zhou <
>> > >> > > > > > > > > > flink.zhouyunf...@gmail.com> wrote:
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Hi Zakelly,
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Thanks for the responses!
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> > 1. I will discuss this with some expert SQL
>> > >> > > developers.
>> > >> > > > > ...
>> > >> > > > > > > mode
>> > >> > > > > > > > > > for StreamRecord processing.
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> In DataStream API there should also be use
>> cases
>> > >> when
>> > >> > > the
>> > >> > > > > > order
>> > >> > > > > > > of
>> > >> > > > > > > > > > > >> output is strictly required. I agree with it
>> that
>> > >> SQL
>> > >> > > > > experts
>> > >> > > > > > > may
>> > >> > > > > > > > > help
>> > >> > > > > > > > > > > >> provide more concrete use cases that can
>> > >> accelerate our
>> > >> > > > > > > > discussion,
>> > >> > > > > > > > > > > >> but please allow me to search for DataStream
>> use
>> > >> cases
>> > >> > > > that
>> > >> > > > > > can
>> > >> > > > > > > > > prove
>> > >> > > > > > > > > > > >> the necessity of this strict order
>> preservation
>> > >> mode, if
>> > >> > > > > > answers
>> > >> > > > > > > > > from
>> > >> > > > > > > > > > > >> SQL experts are shown to be negative.
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> For your convenience, my current rough idea is
>> > >> that we
>> > >> > > can
>> > >> > > > > > add a
>> > >> > > > > > > > > > > >> module between the Input(s) and
>> processElement()
>> > >> module
>> > >> > > in
>> > >> > > > > > Fig 2
>> > >> > > > > > > > of
>> > >> > > > > > > > > > > >> FLIP-425. The module will be responsible for
>> > >> caching
>> > >> > > > records
>> > >> > > > > > > whose
>> > >> > > > > > > > > > > >> keys collide with in-flight records, and AEC
>> will
>> > >> only
>> > >> > > be
>> > >> > > > > > > > > responsible
>> > >> > > > > > > > > > > >> for handling async state calls, without
>> knowing
>> > the
>> > >> > > record
>> > >> > > > > > each
>> > >> > > > > > > > call
>> > >> > > > > > > > > > > >> belongs to. We may revisit this topic once the
>> > >> necessity
>> > >> > > > of
>> > >> > > > > > the
>> > >> > > > > > > > > strict
>> > >> > > > > > > > > > > >> order mode is clarified.
>> > >> > > > > > > > > > > >>
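A minimal sketch of that key-accounting idea (hypothetical names, not actual FLIP-425 classes): records whose keys collide with in-flight records are buffered, and replayed in arrival order once processing of the key finishes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KeyAccountingSketch {

    // Keys currently occupied by an in-flight record.
    private final Set<Object> inFlightKeys = new HashSet<>();
    // Records blocked because their key collides with an in-flight record.
    private final Map<Object, Deque<Object>> blocked = new HashMap<>();

    /** Returns true if the record may enter processElement now; otherwise buffers it. */
    public boolean tryOccupy(Object key, Object record) {
        if (inFlightKeys.add(key)) {
            return true;
        }
        blocked.computeIfAbsent(key, k -> new ArrayDeque<>()).add(record);
        return false;
    }

    /**
     * Called when processing of a key finishes. Returns the next blocked
     * record for that key (the key stays occupied), or null if none.
     */
    public Object release(Object key) {
        Deque<Object> queue = blocked.get(key);
        if (queue != null && !queue.isEmpty()) {
            return queue.poll();
        }
        inFlightKeys.remove(key);
        return null;
    }
}
```

With this, AEC only handles async state calls, while the module above enforces same-key ordering in front of `processElement`.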
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> > 2. The amount of parallel StateRequests ...
>> > >> instead of
>> > >> > > > > > > invoking
>> > >> > > > > > > > > > yield
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Your suggestions generally appeal to me. I
>> think
>> > >> we may
>> > >> > > > let
>> > >> > > > > > > > > > > >> corresponding Flink jobs fail with OOM for
>> now,
>> > >> since
>> > >> > > the
>> > >> > > > > > > majority
>> > >> > > > > > > > > of
>> > >> > > > > > > > > > > >> a StateRequest should just be references to
>> > >> existing
>> > >> > > Java
>> > >> > > > > > > objects,
>> > >> > > > > > > > > > > >> which only occupies very small memory space
>> and
>> > can
>> > >> > > hardly
>> > >> > > > > > cause
>> > >> > > > > > > > OOM
>> > >> > > > > > > > > > > >> in common cases. We can monitor the pending
>> > >> > > StateRequests
>> > >> > > > > and
>> > >> > > > > > if
>> > >> > > > > > > > > there
>> > >> > > > > > > > > > > >> is really a risk of OOM in extreme cases, we
>> can
>> > >> throw
>> > >> > > > > > > Exceptions
>> > >> > > > > > > > > with
>> > >> > > > > > > > > > > >> proper messages notifying users what to do,
>> like
>> > >> > > > increasing
>> > >> > > > > > > memory
>> > >> > > > > > > > > > > >> through configurations.
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Your suggestions to adjust threshold
>> adaptively
>> > or
>> > >> to
>> > >> > > use
>> > >> > > > > the
>> > >> > > > > > > > > blocking
>> > >> > > > > > > > > > > >> buffer sounds good, and in my opinion we can
>> > >> postpone
>> > >> > > them
>> > >> > > > > to
>> > >> > > > > > > > future
>> > >> > > > > > > > > > > >> FLIPs since they seem to only benefit users in
>> > rare
>> > >> > > cases.
>> > >> > > > > > Given
>> > >> > > > > > > > > that
>> > >> > > > > > > > > > > >> FLIP-423~428 has already been a big enough
>> > design,
>> > >> it
>> > >> > > > might
>> > >> > > > > be
>> > >> > > > > > > > > better
>> > >> > > > > > > > > > > >> to focus on the most critical design for now
>> and
>> > >> > > postpone
>> > >> > > > > > > > > > > >> optimizations like this. WDYT?
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> > 5. Sure, we will introduce new configs as
>> well
>> > as
>> > >> > > their
>> > >> > > > > > > default
>> > >> > > > > > > > > > value.
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Thanks for adding the default values and the
>> > values
>> > >> > > > > themselves
>> > >> > > > > > > > LGTM.
>> > >> > > > > > > > > > > >> It might be better to move the default values
>> to
>> > >> the
>> > >> > > > > Proposed
>> > >> > > > > > > > > Changes
>> > >> > > > > > > > > > > >> section instead of making them public for
>> now, as
>> > >> there
>> > >> > > > will
>> > >> > > > > > be
>> > >> > > > > > > > > > > >> compatibility issues once we want to
>> dynamically
>> > >> adjust
>> > >> > > > the
>> > >> > > > > > > > > thresholds
>> > >> > > > > > > > > > > >> and timeouts in future.
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> The configuration
>> execution.async-state.enabled
>> > >> seems
>> > >> > > > > > > unnecessary,
>> > >> > > > > > > > > as
>> > >> > > > > > > > > > > >> the infrastructure may automatically get this
>> > >> > > information
>> > >> > > > > from
>> > >> > > > > > > the
>> > >> > > > > > > > > > > >> detailed state backend configurations. We may
>> > >> revisit
>> > >> > > this
>> > >> > > > > > part
>> > >> > > > > > > > > after
>> > >> > > > > > > > > > > >> the core designs have reached an agreement.
>> WDYT?
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Besides, inspired by Jeyhun's comments, it
>> comes
>> > >> to me
>> > >> > > > that
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> 8. Should this FLIP introduce metrics that
>> > measure
>> > >> the
>> > >> > > > time
>> > >> > > > > a
>> > >> > > > > > > > Flink
>> > >> > > > > > > > > > > >> job is back-pressured by State IOs? Under the
>> > >> current
>> > >> > > > > design,
>> > >> > > > > > > this
>> > >> > > > > > > > > > > >> metric could measure the time when the
>> blocking
>> > >> buffer
>> > >> > > is
>> > >> > > > > full
>> > >> > > > > > > and
>> > >> > > > > > > > > > > >> yield() cannot get callbacks to process, which
>> > >> means the
>> > >> > > > > > > operator
>> > >> > > > > > > > is
>> > >> > > > > > > > > > > >> fully waiting for state responses.
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> Best regards,
>> > >> > > > > > > > > > > >> Yunfeng
>> > >> > > > > > > > > > > >>
>> > >> > > > > > > > > > > >> On Mon, Mar 4, 2024 at 12:33 PM Zakelly Lan <
>> > >> > > > > > > > zakelly....@gmail.com>
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Hi Yunfeng,
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Thanks for your detailed comments!
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 1. Why do we need a close() method on
>> > >> StateIterator?
>> > >> > > > This
>> > >> > > > > > > > method
>> > >> > > > > > > > > > seems
>> > >> > > > > > > > > > > >> >> unused in the usage example codes.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > `close()` was introduced to release internal resources, but users
>> > >> > > > > > > > > > > >> > do not seem to need to call it themselves, so I have removed it.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 2. In FutureUtils.combineAll()'s JavaDoc,
>> it
>> > is
>> > >> > > stated
>> > >> > > > > that
>> > >> > > > > > > "No
>> > >> > > > > > > > > > null
>> > >> > > > > > > > > > > >> >> entries are allowed". It might be better to
>> > >> further
>> > >> > > > > explain
>> > >> > > > > > > > what
>> > >> > > > > > > > > > will
>> > >> > > > > > > > > > > >> >> happen if a null value is passed, ignoring
>> the
>> > >> value
>> > >> > > in
>> > >> > > > > the
>> > >> > > > > > > > > > returned
>> > >> > > > > > > > > > > >> >> Collection or throwing exceptions. Given
>> that
>> > >> > > > > > > > > > > >> >> FutureUtils.emptyFuture() can be returned
>> in
>> > the
>> > >> > > > example
>> > >> > > > > > > code,
>> > >> > > > > > > > I
>> > >> > > > > > > > > > > >> >> suppose the former one might be correct.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > The statement "No null entries are allowed" refers to the
>> > >> > > > > > > > > > > >> > parameter: passing a collection such as [null, stateFuture1,
>> > >> > > > > > > > > > > >> > stateFuture2] is not allowed, and an exception will be thrown.
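For illustration, the intended semantics could be sketched on plain JDK CompletableFutures as follows (a hypothetical stand-in, not Flink's actual FutureUtils implementation):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class CombineAllSketch {

    /**
     * Combines a collection of futures into one future holding all results.
     * Null entries in the input collection are rejected up front.
     */
    public static <T> CompletableFuture<List<T>> combineAll(
            Collection<CompletableFuture<T>> futures) {
        for (CompletableFuture<T> f : futures) {
            if (f == null) {
                throw new IllegalArgumentException("No null entries are allowed");
            }
        }
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<T> results = new ArrayList<>();
                    for (CompletableFuture<T> f : futures) {
                        results.add(f.join()); // all futures are completed here
                    }
                    return results;
                });
    }
}
```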
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 1. According to Fig 2 of this FLIP, ... .
>> This
>> > >> > > > situation
>> > >> > > > > > > should
>> > >> > > > > > > > > be
>> > >> > > > > > > > > > > >> >> avoided and the order of same-key records
>> > >> should be
>> > >> > > > > > strictly
>> > >> > > > > > > > > > > >> >> preserved.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > I will discuss this with some expert SQL
>> > >> developers.
>> > >> > > And
>> > >> > > > > if
>> > >> > > > > > it
>> > >> > > > > > > > is
>> > >> > > > > > > > > > valid and common, I suggest a strict order
>> > preservation
>> > >> mode
>> > >> > > > for
>> > >> > > > > > > > > > StreamRecord processing. WDYT?
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 2. The FLIP says that StateRequests
>> submitted
>> > by
>> > >> > > > > Callbacks
>> > >> > > > > > > will
>> > >> > > > > > > > > not
>> > >> > > > > > > > > > > >> >> invoke further yield() methods. Given that
>> > >> yield() is
>> > >> > > > > used
>> > >> > > > > > > when
>> > >> > > > > > > > > > there
>> > >> > > > > > > > > > > >> >> is "too much" in-flight data, does it mean
>> > >> > > > StateRequests
>> > >> > > > > > > > > submitted
>> > >> > > > > > > > > > by
>> > >> > > > > > > > > > > >> >> Callbacks will never be "too much"? What if
>> > the
>> > >> total
>> > >> > > > > > number
>> > >> > > > > > > of
>> > >> > > > > > > > > > > >> >> StateRequests exceed the capacity of Flink
>> > >> operator's
>> > >> > > > > > memory
>> > >> > > > > > > > > space?
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > The number of parallel StateRequests for one StreamRecord cannot
>> > >> > > > > > > > > > > >> > be determined in advance, since that code is written by users. So
>> > >> > > > > > > > > > > >> > the in-flight requests may indeed become "too many" and may cause
>> > >> > > > > > > > > > > >> > OOM. Users should re-configure their job to control the number of
>> > >> > > > > > > > > > > >> > in-flight StreamRecords. And I suggest two ways to avoid this:
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > 1. Adaptively adjust the number of in-flight StreamRecords
>> > >> > > > > > > > > > > >> > according to the historical number of StateRequests.
>> > >> > > > > > > > > > > >> > 2. Also cap the number of StateRequests that can execute in
>> > >> > > > > > > > > > > >> > parallel for each StreamRecord; if the cap is exceeded, put the
>> > >> > > > > > > > > > > >> > new StateRequest into the blocking buffer to wait for execution
>> > >> > > > > > > > > > > >> > (instead of invoking yield()).
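The second way could be sketched like this (hypothetical names, a stand-in for the real AEC logic): requests beyond the per-record cap go into a buffer instead of running, and a slot is handed to a buffered request whenever one completes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RequestThrottleSketch {

    private final int maxInFlight;
    private int inFlight = 0;
    // Overflow requests wait here instead of triggering yield().
    private final Deque<Runnable> blockingBuffer = new ArrayDeque<>();

    public RequestThrottleSketch(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    /** Submits a state request; runs it if under the cap, otherwise buffers it. */
    public void submit(Runnable stateRequest) {
        if (inFlight < maxInFlight) {
            inFlight++;
            stateRequest.run(); // stand-in for handing off to the async executor
        } else {
            blockingBuffer.add(stateRequest);
        }
    }

    /** Called when a request completes; drains one buffered request if present. */
    public void onComplete() {
        inFlight--;
        Runnable next = blockingBuffer.poll();
        if (next != null) {
            inFlight++;
            next.run();
        }
    }

    public int buffered() {
        return blockingBuffer.size();
    }
}
```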
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > WDYT?
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 3.1 I'm concerned that the out-of-order
>> > >> execution
>> > >> > > mode,
>> > >> > > > > > along
>> > >> > > > > > > > > with
>> > >> > > > > > > > > > the
>> > >> > > > > > > > > > > >> >> epoch mechanism, would bring more
>> complexity
>> > to
>> > >> the
>> > >> > > > > > execution
>> > >> > > > > > > > > model
>> > >> > > > > > > > > > > >> >> than the performance improvement it
>> promises.
>> > >> Could
>> > >> > > we
>> > >> > > > > add
>> > >> > > > > > > some
>> > >> > > > > > > > > > > >> >> benchmark results proving the benefit of
>> this
>> > >> mode?
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Agreed, will do.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 3.2 The FLIP might need to add a public API
>> > >> section
>> > >> > > > > > > describing
>> > >> > > > > > > > > how
>> > >> > > > > > > > > > > >> >> users or developers can switch between
>> these
>> > two
>> > >> > > > > execution
>> > >> > > > > > > > modes.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Good point. We will add a Public API
>> section.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 3.3 Apart from the watermark and checkpoint
>> > >> mentioned
>> > >> > > > in
>> > >> > > > > > this
>> > >> > > > > > > > > FLIP,
>> > >> > > > > > > > > > > >> >> there are also more other events that might
>> > >> appear in
>> > >> > > > the
>> > >> > > > > > > > stream
>> > >> > > > > > > > > of
>> > >> > > > > > > > > > > >> >> data records. It might be better to
>> generalize
>> > >> the
>> > >> > > > > > execution
>> > >> > > > > > > > mode
>> > >> > > > > > > > > > > >> >> mechanism to handle all possible events.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Yes, I missed this point. Thanks for the
>> > >> reminder.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 4. It might be better to treat
>> > >> callback-handling as a
>> > >> > > > > > > > > > > >> >> MailboxDefaultAction, instead of Mails, to
>> > >> avoid the
>> > >> > > > > > overhead
>> > >> > > > > > > > of
>> > >> > > > > > > > > > > >> >> repeatedly creating Mail objects.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > I think the intermediate wrapper for a callback cannot be omitted,
>> > >> > > > > > > > > > > >> > since there is a context switch before each execution. The
>> > >> > > > > > > > > > > >> > MailboxDefaultAction is processInput in most cases, right? The
>> > >> > > > > > > > > > > >> > callback, however, should be executed with higher priority. I'd
>> > >> > > > > > > > > > > >> > suggest not changing the basic logic of the Mailbox and the
>> > >> > > > > > > > > > > >> > default action, since it is performance-critical. But yes, we will
>> > >> > > > > > > > > > > >> > try our best to avoid creating intermediate objects.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 5. Could this FLIP provide the current
>> default
>> > >> values
>> > >> > > > for
>> > >> > > > > > > > things
>> > >> > > > > > > > > > like
>> > >> > > > > > > > > > > >> >> active buffer size thresholds and timeouts?
>> > >> These
>> > >> > > could
>> > >> > > > > > help
>> > >> > > > > > > > with
>> > >> > > > > > > > > > > >> >> memory consumption and latency analysis.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Sure, we will introduce new configs as well
>> as
>> > >> their
>> > >> > > > > default
>> > >> > > > > > > > > value.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 6. Why do we need to record the hashcode
>> of a
>> > >> record
>> > >> > > in
>> > >> > > > > its
>> > >> > > > > > > > > > > >> >> RecordContext? It seems not used.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > The context switch before each callback execution involves
>> > >> > > > > > > > > > > >> > setCurrentKey, where the hashCode would otherwise be re-calculated.
>> > >> > > > > > > > > > > >> > We cache it to speed this up.
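A rough sketch of that caching (hypothetical `RecordContextSketch`, not the actual RecordContext class): the key hash is computed once when the record enters the operator and reused by every later context switch.

```java
public class RecordContextSketch {

    private final Object key;
    // Computed once on entry, so setCurrentKey before each callback
    // can reuse it instead of recomputing key.hashCode().
    private final int cachedKeyHash;

    public RecordContextSketch(Object key) {
        this.key = key;
        this.cachedKeyHash = key.hashCode();
    }

    public Object key() {
        return key;
    }

    public int cachedKeyHash() {
        return cachedKeyHash;
    }
}
```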
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >> 7. In "timers can be stored on the JVM
>> heap or
>> > >> > > > RocksDB",
>> > >> > > > > > the
>> > >> > > > > > > > link
>> > >> > > > > > > > > > > >> >> points to a document in flink-1.15. It
>> might
>> > be
>> > >> > > better
>> > >> > > > to
>> > >> > > > > > > > verify
>> > >> > > > > > > > > > the
>> > >> > > > > > > > > > > >> >> referenced content is still valid in the
>> > latest
>> > >> Flink
>> > >> > > > and
>> > >> > > > > > > > update
>> > >> > > > > > > > > > the
>> > >> > > > > > > > > > > >> >> link accordingly. Same for other
>> references if
>> > >> any.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Thanks for the reminder! Will check.
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > Thanks a lot & Best,
>> > >> > > > > > > > > > > >> > Zakelly
>> > >> > > > > > > > > > > >> >
>> > >> > > > > > > > > > > >> > On Sat, Mar 2, 2024 at 6:18 AM Jeyhun
>> Karimov <
>> > >> > > > > > > > > je.kari...@gmail.com>
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> Hi,
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> Thanks for the great proposals. I have a few comments:
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> - Backpressure Handling. Flink's original
>> > >> > > backpressure
>> > >> > > > > > > handling
>> > >> > > > > > > > > is
>> > >> > > > > > > > > > quite
>> > >> > > > > > > > > > > >> >> robust and the semantics is quite "simple"
>> > >> (simple is
>> > >> > > > > > > > beautiful).
>> > >> > > > > > > > > > > >> >> This mechanism has proven to perform
>> > >> better/robust
>> > >> > > than
>> > >> > > > > the
>> > >> > > > > > > > other
>> > >> > > > > > > > > > open
>> > >> > > > > > > > > > > >> >> source streaming systems, where they were
>> > >> relying on
>> > >> > > > some
>> > >> > > > > > > > > loopback
>> > >> > > > > > > > > > > >> >> information.
>> > >> > > > > > > > > > > >> >> Now that the proposal also relies on
>> loopback
>> > >> (yield
>> > >> > > in
>> > >> > > > > > this
>> > >> > > > > > > > > > case), it is
>> > >> > > > > > > > > > > >> >> not clear how well the new backpressure
>> > handling
>> > >> > > > proposed
>> > >> > > > > > in
>> > >> > > > > > > > > > FLIP-425 is
>> > >> > > > > > > > > > > >> >> robust and handle fluctuating workloads.
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> - Watermark/Timer Handling: Similar
>> arguments
>> > >> apply
>> > >> > > for
>> > >> > > > > > > > watermark
>> > >> > > > > > > > > > and timer
>> > >> > > > > > > > > > > >> >> handling. IMHO, we need more benchmarks
>> > showing
>> > >> the
>> > >> > > > > > overhead
>> > >> > > > > > > > > > > >> >> of epoch management with different
>> parameters
>> > >> (e.g.,
>> > >> > > > > window
>> > >> > > > > > > > size,
>> > >> > > > > > > > > > watermark
>> > >> > > > > > > > > > > >> >> strategy, etc)
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> - DFS consistency guarantees. The proposal
>> in
>> > >> > > FLIP-427
>> > >> > > > is
>> > >> > > > > > > > > > DFS-agnostic.
>> > >> > > > > > > > > > > >> >> However, different cloud providers have
>> > >> different
>> > >> > > > storage
>> > >> > > > > > > > > > consistency
>> > >> > > > > > > > > > > >> >> models.
>> > >> > > > > > > > > > > >> >> How do we want to deal with them?
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >>  Regards,
>> > >> > > > > > > > > > > >> >> Jeyhun
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> On Fri, Mar 1, 2024 at 6:08 PM Zakelly Lan
>> <
>> > >> > > > > > > > > zakelly....@gmail.com>
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > > > >> >>
>> > >> > > > > > > > > > > >> >> > Thanks Piotr for sharing your thoughts!
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > I guess it depends how we would like to
>> > treat
>> > >> the
>> > >> > > > local
>> > >> > > > > > > > disks.
>> > >> > > > > > > > > > I've always
>> > >> > > > > > > > > > > >> >> > > thought about them that almost always
>> > >> eventually
>> > >> > > > all
>> > >> > > > > > > state
>> > >> > > > > > > > > > from the DFS
>> > >> > > > > > > > > > > >> >> > > should end up cached in the local
>> disks.
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > OK, I got it. In our proposal we treat the local disk as an
>> > >> > > > > > > > > > > >> >> > optional cache, so the basic design handles the case where state
>> > >> > > > > > > > > > > >> >> > resides in DFS only. It is a more 'cloud-native' approach that
>> > >> > > > > > > > > > > >> >> > does not rely on any local storage assumptions, which allows
>> > >> > > > > > > > > > > >> >> > users to dynamically adjust the capacity or I/O bound of remote
>> > >> > > > > > > > > > > >> >> > storage to gain performance or save cost, even without a job
>> > >> > > > > > > > > > > >> >> > restart.
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > In
>> > >> > > > > > > > > > > >> >> > > the currently proposed more fine
>> grained
>> > >> > > solution,
>> > >> > > > > you
>> > >> > > > > > > > make a
>> > >> > > > > > > > > > single
>> > >> > > > > > > > > > > >> >> > > request to DFS per each state access.
>> > >> > > > > > > > > > > >> >> > >
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > Ah, that's not accurate. We actually buffer the state requests
>> > >> > > > > > > > > > > >> >> > and process them in a batch, so multiple requests correspond to
>> > >> > > > > > > > > > > >> >> > one DFS access (one block access for multiple keys, performed by
>> > >> > > > > > > > > > > >> >> > RocksDB).
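As an illustration of this batching (hypothetical names, with a plain in-memory map standing in for the RocksDB/DFS-backed store): gets are queued per batch, and one flush serves the whole batch with a single underlying access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class BatchingSketch {

    private final List<Map.Entry<String, Consumer<String>>> pending = new ArrayList<>();
    private final Map<String, String> store; // stand-in for the remote store
    private int storeAccesses = 0;

    public BatchingSketch(Map<String, String> store) {
        this.store = store;
    }

    /** Enqueues an async get; the callback fires when the batch is flushed. */
    public void asyncGet(String key, Consumer<String> callback) {
        pending.add(Map.entry(key, callback));
    }

    /** Flushes all buffered requests as one multiGet-style store access. */
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        storeAccesses++; // one underlying access serves the whole batch
        for (Map.Entry<String, Consumer<String>> entry : pending) {
            entry.getValue().accept(store.get(entry.getKey()));
        }
        pending.clear();
    }

    public int storeAccesses() {
        return storeAccesses;
    }
}
```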
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > In that benchmark you mentioned, are you
>> > >> requesting
>> > >> > > > the
>> > >> > > > > > > state
>> > >> > > > > > > > > > > >> >> > > asynchronously from local disks into
>> > >> memory? If
>> > >> > > the
>> > >> > > > > > > benefit
>> > >> > > > > > > > > > comes from
>> > >> > > > > > > > > > > >> >> > > parallel I/O, then I would expect the
>> > >> benefit to
>> > >> > > > > > > > > > disappear/shrink when
>> > >> > > > > > > > > > > >> >> > > running multiple subtasks on the same
>> > >> machine, as
>> > >> > > > > they
>> > >> > > > > > > > would
>> > >> > > > > > > > > > be making
>> > >> > > > > > > > > > > >> >> > > their own parallel requests, right?
>> Also
>> > >> enabling
>> > >> > > > > > > > > > checkpointing would
>> > >> > > > > > > > > > > >> >> > > further cut into the available I/O
>> budget.
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > That's an interesting topic. Our proposal is specifically aimed
>> > >> > > > > > > > > > > >> >> > at the scenario where the machine I/O is not fully loaded but
>> > >> > > > > > > > > > > >> >> > I/O latency has become a bottleneck for each subtask. The
>> > >> > > > > > > > > > > >> >> > distributed file system is a prime example of such a scenario:
>> > >> > > > > > > > > > > >> >> > abundant and easily scalable I/O bandwidth, coupled with higher
>> > >> > > > > > > > > > > >> >> > I/O latency. You could also increase the parallelism of a job to
>> > >> > > > > > > > > > > >> >> > enhance performance, but that wastes more CPU and memory on
>> > >> > > > > > > > > > > >> >> > building up more subtasks, which is one drawback of nodes that
>> > >> > > > > > > > > > > >> >> > tightly couple computation and storage. In our proposal, by
>> > >> > > > > > > > > > > >> >> > contrast, parallel I/O runs with all the callbacks in one task,
>> > >> > > > > > > > > > > >> >> > so pre-allocated computational resources are better utilized. It
>> > >> > > > > > > > > > > >> >> > is a much more lightweight way to perform parallel I/O.
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > Just with what granularity those async
>> > >> requests
>> > >> > > > should
>> > >> > > > > be
>> > >> > > > > > > > made.
>> > >> > > > > > > > > > > >> >> > > Making state access asynchronous is
>> > >> definitely
>> > >> > > the
>> > >> > > > > > right
>> > >> > > > > > > > way
>> > >> > > > > > > > > > to go!
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > I think the current proposal is based on
>> > such
>> > >> core
>> > >> > > > > ideas:
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> >    - A pure cloud-native disaggregated
>> > state.
>> > >> > > > > > > > > > > >> >> >    - Fully utilize the given resources
>> and
>> > >> try not
>> > >> > > to
>> > >> > > > > > waste
>> > >> > > > > > > > > them
>> > >> > > > > > > > > > (including
>> > >> > > > > > > > > > > >> >> >    I/O).
>> > >> > > > > > > > > > > >> >> >    - The ability to scale isolated
>> resources
>> > >> (I/O
>> > >> > > or
>> > >> > > > > CPU
>> > >> > > > > > or
>> > >> > > > > > > > > > memory)
>> > >> > > > > > > > > > > >> >> >    independently.
>> > >> > > > > > > > > > > >> >> >
>> > >> > > > > > > > > > > >> >> > We think a fine-grained granularity is more in line with those ideas,
>> > >> > > > > > > > > > > >> >> > especially without local disk assumptions
>> > and
>> > >> > > without
>> > >> > > > > any
>> > >> > > > > > > > waste
>> > >> > > > > > > > > > of I/O by
>> > >> > > > > > > > > > > >> >> > prefetching. Please note that it is not a
>> > >> > > replacement
>> > >> > > > > of
>> > >> > > > > > > the
>> > >> > > > > > > > > > original local
>> > >> > > > > > > > > > > >> >> > state with synchronous execution. Instead
>> > >> this is a
>> > >> > > > > > > solution
>> > >> > > > > > > > > > embracing the
>> > >> > > > > > > > > > > >> >> > cloud-native era, providing much more
>> > >> scalability
>> > >> > > and
>> > >> > > > > > > > resource
>> > >> > > > > > > > > > efficiency
>> > >> > > > > > > > > > > >> >> > when handling a *huge state*.
>> > >> > > > > > > > > > > >> >> >
>
>> What also worries me a lot in this fine grained model is the effect
>> on the checkpointing times.
>
> Your concerns are very reasonable. Faster checkpointing is always a
> core advantage of disaggregated state, but only for the async phase.
> There will be some complexity introduced by in-flight requests, but
> I'd suggest a checkpoint contain those in-flight state requests as
> part of the state, to accelerate the sync phase by skipping the
> buffer draining. This makes the buffer size have little impact on
> checkpoint time. And all the changes stay within the execution model
> we proposed, while the checkpoint barrier alignment and handling will
> not be touched in our proposal, so I guess the complexity is
> relatively controllable. I have faith in that :)
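The idea of snapshotting in-flight requests instead of draining them can be pictured with a small sketch in plain Java. All names here (`PendingRequest`, `InFlightRequestBuffer`) are hypothetical illustrations, not the actual FLIP-425 classes: only the request payload is snapshotted, while callbacks are re-resolved from operator code on restore.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a buffered state request. Only the
// (key, payload) pair needs to be persisted; the callback itself is
// re-attached from the operator code after restore.
record PendingRequest(String key, String payload) implements Serializable {}

class InFlightRequestBuffer {
    private final ArrayDeque<PendingRequest> buffer = new ArrayDeque<>();

    void enqueue(PendingRequest r) { buffer.add(r); }

    // Sync phase of a checkpoint: snapshot the pending requests as part
    // of the state instead of draining (processing) them, so checkpoint
    // time does not grow with the buffer size.
    List<PendingRequest> snapshot() { return new ArrayList<>(buffer); }

    // On restore, requests are re-enqueued and processing resumes.
    void restore(List<PendingRequest> saved) {
        buffer.clear();
        buffer.addAll(saved);
    }

    int size() { return buffer.size(); }
}
```

The key property of this shape is that `snapshot()` is O(buffer size) copying rather than O(buffer size) state-backend round trips, which is what keeps the sync phase short.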
>
>> Also regarding the overheads, it would be great if you could provide
>> profiling results of the benchmarks that you conducted to verify the
>> results. Or maybe if you could describe the steps to reproduce the
>> results? Especially "Hashmap (sync)" vs "Hashmap with async API".
>
> Yes, we could profile the benchmarks. And for the comparison of
> "Hashmap (sync)" vs "Hashmap with async API", we conducted a
> WordCount job written with the async APIs but with async execution
> disabled, by directly completing the future using sync state access.
> This evaluates the overhead of the newly introduced modules like the
> 'AEC' in sync execution (even though they are not designed for it).
> The code will be provided later. For other results of our PoC[1], you
> can follow the instructions here[2] to reproduce them. Since the
> compilation may take some effort, we will directly provide the jar
> for testing next week.
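The "async API, sync execution" setup described here can be sketched in a self-contained way (`AsyncValueStore`, `SyncBackedStore` and `AsyncWordCount` are hypothetical stand-ins, not the benchmark code): the job is written against a future-returning API, but every future is completed immediately from an in-memory map, so any slowdown versus a plain synchronous hashmap backend is overhead of the async machinery itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Future-returning state API, mirroring the shape of an async state API.
interface AsyncValueStore {
    CompletableFuture<Integer> get(String key);
    CompletableFuture<Void> put(String key, int value);
}

// "Async API, sync execution": every future is completed directly from
// an in-memory hashmap, so the asynchronous programming model is kept
// while all state access is actually synchronous.
class SyncBackedStore implements AsyncValueStore {
    private final Map<String, Integer> map = new HashMap<>();

    public CompletableFuture<Integer> get(String key) {
        return CompletableFuture.completedFuture(map.get(key));
    }

    public CompletableFuture<Void> put(String key, int value) {
        map.put(key, value);
        return CompletableFuture.completedFuture(null);
    }
}

// A WordCount-style per-record update written against the async API.
class AsyncWordCount {
    static CompletableFuture<Integer> addOne(AsyncValueStore store, String word) {
        return store.get(word).thenCompose(old -> {
            int next = (old == null ? 0 : old) + 1;
            return store.put(word, next).thenApply(ignored -> next);
        });
    }
}
```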
>
> And @Yunfeng Zhou, I have noticed your mail, but it is a bit late in
> my local time and the next few days are weekends, so I will reply to
> you later. Thanks for your response!
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-PoCResults
> [2]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-Appendix:HowtorunthePoC
>
> Best,
> Zakelly
>
> On Fri, Mar 1, 2024 at 6:38 PM Yunfeng Zhou
> <flink.zhouyunf...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for proposing this design! I just read FLIP-424 and FLIP-425
>> and have some questions about the proposed changes.
>>
>> For Async API (FLIP-424)
>>
>> 1. Why do we need a close() method on StateIterator? This method
>> seems unused in the usage example code.
>>
>> 2. In FutureUtils.combineAll()'s JavaDoc, it is stated that "No null
>> entries are allowed". It might be better to further explain what
>> will happen if a null value is passed: ignoring the value in the
>> returned Collection, or throwing an exception. Given that
>> FutureUtils.emptyFuture() can be returned in the example code, I
>> suppose the former might be correct.
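For illustration, a combineAll-style helper can make the "no null entries" contract explicit. This sketch (`CombineAllSketch` is hypothetical, not Flink's actual FutureUtils) rejects null futures eagerly, which is one of the two behaviors the question asks the FLIP to pin down; the alternative would be to skip null entries when building the result.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class CombineAllSketch {
    // Combine a collection of futures into one future of all results.
    // Per the "No null entries are allowed" contract discussed above,
    // null entries are rejected eagerly rather than silently ignored.
    static <T> CompletableFuture<Collection<T>> combineAll(
            Collection<CompletableFuture<T>> futures) {
        for (CompletableFuture<T> f : futures) {
            if (f == null) {
                throw new NullPointerException("No null entries are allowed");
            }
        }
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    // All futures are complete here; join() cannot block.
                    List<T> results = new ArrayList<>(futures.size());
                    for (CompletableFuture<T> f : futures) {
                        results.add(f.join());
                    }
                    return results;
                });
    }
}
```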
>>
>> For Async Execution (FLIP-425)
>>
>> 1. According to Fig 2 of this FLIP, if a recordB has its key collide
>> with an ongoing recordA, its processElement() method can still be
>> triggered immediately, and then it might be moved to the blocking
>> buffer in the AEC if it involves state operations. This means that
>> recordB's output will precede recordA's output in downstream
>> operators if recordA involves state operations while recordB does
>> not. This will harm the correctness of Flink jobs in some use cases.
>> For example, in dim table join cases, recordA could be a delete
>> operation that involves state access, while recordB could be an
>> insert operation that needs to visit external storage without state
>> access. If recordB's output precedes recordA's, then an entry that
>> is supposed to finally exist with recordB's value in the sink table
>> might actually be deleted according to recordA's command. This
>> situation should be avoided, and the order of same-key records
>> should be strictly preserved.
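The strict same-key ordering requested here could be enforced by gating every record, stateless or not, behind earlier in-flight records of the same key. A minimal sketch (`KeyOrderGate` is hypothetical, not the FLIP's blocking-buffer implementation; records are assumed unique for simplicity):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Per-key ordering gate: a record is emitted only once every earlier
// in-flight record with the same key has been emitted, so a fast
// stateless recordB can never overtake an earlier state-accessing
// recordA with the same key.
class KeyOrderGate {
    private final Map<String, ArrayDeque<String>> inFlight = new HashMap<>();
    private final Set<String> done = new HashSet<>();
    final List<String> emitted = new ArrayList<>();

    // Record enters the operator; its per-key arrival order is recorded.
    void arrive(String key, String record) {
        inFlight.computeIfAbsent(key, k -> new ArrayDeque<>()).add(record);
    }

    // Record's processing finished (possibly out of arrival order); emit
    // it, plus any already-finished successors, once it reaches the head
    // of its key's queue.
    void finish(String key, String record) {
        done.add(record);
        ArrayDeque<String> q = inFlight.get(key);
        while (!q.isEmpty() && done.contains(q.peekFirst())) {
            emitted.add(q.pollFirst());
        }
    }
}
```

In the dim-table-join example above, the fast insert (recordB) stays buffered until the slower state-accessing delete (recordA) completes, preserving arrival order downstream.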
>>
>> 2. The FLIP says that StateRequests submitted by Callbacks will not
>> invoke further yield() methods. Given that yield() is used when
>> there is "too much" in-flight data, does it mean that StateRequests
>> submitted by Callbacks will never be "too much"? What if the total
>> number of StateRequests exceeds the capacity of the Flink operator's
>> memory space?
>>
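The concern can be pictured with a bounded in-flight counter (a hypothetical sketch, not the FLIP's code): record-triggered requests are capped and must yield at capacity, while callback-triggered requests bypass the cap, which is exactly why something else must bound their growth.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Bounded in-flight request counter. Requests from processElement()
// are admitted only under capacity (the caller yields otherwise); per
// the FLIP, requests submitted by callbacks skip this gate, so the
// counter can exceed the configured capacity.
class InFlightLimiter {
    private final int capacity;
    private final AtomicInteger inFlight = new AtomicInteger();

    InFlightLimiter(int capacity) { this.capacity = capacity; }

    // Record-triggered request: admitted only under capacity.
    boolean tryAcquireFromRecord() {
        int cur = inFlight.get();
        return cur < capacity && inFlight.compareAndSet(cur, cur + 1);
    }

    // Callback-triggered request: always admitted (never yields).
    void acquireFromCallback() { inFlight.incrementAndGet(); }

    void release() { inFlight.decrementAndGet(); }

    int current() { return inFlight.get(); }
}
```

The question above amounts to asking what, in this picture, prevents `current()` from growing without bound when callbacks keep producing follow-up requests.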
>> 3. In the "Watermark" section, this FLIP provides an out-of-order
>> execution mode apart from the default strictly-ordered mode, which
>> can optimize performance by allowing more concurrent executions.
>>
>> 3.1 I'm concerned that the out-of-order execution mode, along with
>> the epoch mechanism, would bring more complexity to the execution
>> model than the performance improvement it promises. Could we add
>> some benchmark results proving the benefit of this mode?
>>
>> 3.2 The FLIP might need to add a public API section describing how
>> users or developers can switch between these two execution modes.
>>
>> 3.3 Apart from the watermark and checkpoint mentioned in this FLIP,
>> there are also other events that might appear in the stream of data
>> records. It might be better to generalize the execution-mode
>> mechanism to handle all possible events.
>>
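The epoch mechanism referenced in 3.1 can be sketched as reference-counting the records between two watermarks (`EpochTracker` is hypothetical, not the FLIP's implementation): a watermark is emitted only after every record of its epoch completes, and the strictly-ordered vs. out-of-order distinction is about whether later records may start before that happens.

```java
import java.util.ArrayList;
import java.util.List;

// Epoch sketch: records between two watermarks share an epoch, and the
// watermark is held back until all records of its epoch finish their
// (possibly asynchronous) processing.
class EpochTracker {
    private int ongoing = 0;           // unfinished records of the epoch
    private Runnable pendingWatermark; // emission deferred until drained
    final List<Long> emittedWatermarks = new ArrayList<>();

    void recordArrived() { ongoing++; }

    void recordFinished() {
        ongoing--;
        if (ongoing == 0 && pendingWatermark != null) {
            pendingWatermark.run();
            pendingWatermark = null;
        }
    }

    // A watermark closes the current epoch: emit immediately if the
    // epoch is drained, otherwise defer until the last record finishes.
    void watermarkArrived(long ts) {
        Runnable emit = () -> emittedWatermarks.add(ts);
        if (ongoing == 0) emit.run();
        else pendingWatermark = emit;
    }
}
```

A fuller version would queue multiple pending watermarks (one per open epoch); this sketch covers a single epoch boundary, which is enough to see the bookkeeping the mode adds.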
>>
>> 4. It might be better to treat callback-handling as a
>> MailboxDefaultAction, instead of Mails, to avoid the overhead of
>> repeatedly creating Mail objects.
>>
>> 5. Could this FLIP provide the current default values for things
>> like active buffer size thresholds and timeouts? These could help
>> with memory consumption and latency analysis.
>>
>> 6. Why do we need to record the hashcode of a record in its
>> RecordContext? It seems unused.
>>
>> 7. In "timers can be stored on the JVM heap or RocksDB", the link
>> points to a document in flink-1.15. It might be better to verify
>> that the referenced content is still valid in the latest Flink and
>> update the link accordingly. Same for other references, if any.
>>
>> Best,
>> Yunfeng Zhou
>>
>> On Thu, Feb 29, 2024 at 2:17 PM Yuan Mei <yuanmei.w...@gmail.com>
>> wrote:
>>
>>> Hi Devs,
>>>
>>> This is a joint work of Yuan Mei, Zakelly Lan, Jinzhong Li,
>>> Hangxiang Yu, Yanfei Lei and Feng Wang. We'd like to start a
>>> discussion about introducing Disaggregated State Storage and
>>> Management in Flink 2.0.
>>>
>>> The past decade has witnessed a dramatic shift in Flink's
>>> deployment mode, workload patterns, and hardware improvements.
>>> We've moved from the map-reduce era, where workers are
>>> computation-storage tightly coupled nodes, to a cloud-native world
>>> where containerized deployments on Kubernetes have become standard.
>>> To enable Flink's cloud-native future, we introduce Disaggregated
>>> State Storage and Management, which uses DFS as primary storage in
>>> Flink 2.0, as promised in the Flink 2.0 Roadmap.
>>>
>>> Design details can be found in FLIP-423[1].
>>>
>>> This new architecture is aimed at solving the following challenges
>>> brought by the cloud-native era for Flink:
>>> 1. Local disk constraints in containerization
>>> 2. Spiky resource usage caused by compaction in the current state
>>> model
>>> 3. Fast rescaling for jobs with large states (hundreds of
>>> terabytes)
>>> 4. Light and fast checkpoints in a native way
>>>
>>> More specifically, we want to reach a consensus on the following
>>> issues in this discussion:
>>>
>>> 1. Overall design
>>> 2. Proposed changes
>>> 3. Design details to achieve Milestone 1
>>>
>>> In M1, we aim to achieve an end-to-end baseline version using DFS
>>> as primary storage and complete the core functionalities,
>>> including:
>>>
>>> - Asynchronous State APIs (FLIP-424)[2]: Introduce new APIs for
>>> asynchronous state access.
>>> - Asynchronous Execution Model (FLIP-425)[3]: Implement a
>>> non-blocking execution model leveraging the asynchronous APIs
>>> introduced in FLIP-424.
>>> - Grouping Remote State Access (FLIP-426)[4]: Enable retrieval of
>>> remote state data in batches to avoid unnecessary round-trip costs
>>> for remote access.
>>> - Disaggregated State Store (FLIP-427)[5]: Introduce the initial
>>> version of the ForSt disaggregated state store.
>>> - Fault Tolerance/Rescale Integration (FLIP-428)[6]: Integrate
>>> checkpointing mechanisms with the disaggregated state store for
>>> fault tolerance and fast rescaling.
>>>
>>> We will vote on each FLIP in a separate thread to make sure each
>>> FLIP reaches a consensus. But we want to keep the discussion within
>>> a focused thread (this thread) for easier tracking of context, to
>>> avoid duplicated questions/discussions, and to think of the
>>> problem/solution in the full picture.
>>>
>>> Looking forward to your feedback.
>>>
>>> Best,
>>> Yuan, Zakelly, Jinzhong, Hangxiang, Yanfei and Feng
>>>
>>> [1] https://cwiki.apache.org/confluence/x/R4p3EQ
>>> [2] https://cwiki.apache.org/confluence/x/SYp3EQ
>>> [3] https://cwiki.apache.org/confluence/x/S4p3EQ
>>> [4] https://cwiki.apache.org/confluence/x/TYp3EQ
>>> [5] https://cwiki.apache.org/confluence/x/T4p3EQ
>>> [6] https://cwiki.apache.org/confluence/x/UYp3EQ