Re: [DISCUSS] FLIP-325: Support configuring end-to-end allowed latency

Dong Lin Thu, 06 Jul 2023 05:38:29 -0700

Hi Shammon,

Thanks for your comments. Please see my reply inline.



On Thu, Jul 6, 2023 at 12:47 PM Shammon FY <[email protected]> wrote:

> Hi,
>
> Thanks for your reply @Dong. I really agree with Piotr's points and I
> would like to share some thoughts from my side.
>
> About the latency for mini-batch mechanism in Flink SQL, I still think the
> description in the FLIP is not right. If there are N operators and the
> whole process time for data in the job is `t`, then the latency in
> mini-batch will be `table.exec.mini-batch.allow-latency`+`t`, not `
> table.exec.mini-batch.allow-latency`*N. I think this is one of the
> foundations of this FLIP, and you may need to confirm it again.
>

Given that we agree to have a mechanism to support end-to-end latency for
DataStream programs, I think the exact semantics of
table.exec.mini-batch.allow-latency will not affect the motivation or API
design of this FLIP. I have updated the FLIP to remove any mention of
table.exec.mini-batch.allow-latency.


>
> I think supporting similar mechanisms in the runtime and balance latency
> and throughput dynamically for all flink jobs is a very good idea, and I
> have some questions for that.
>
> 1. We encounter a situation where the workload is high when processing
> snapshot data and we need mini-batch in sql for performance reason. But the
> workload is low when processing delta data, we need to automatically adjust
> the mini-batch SQL for them, or even cancel the mini-batch during delta
> processing. I think this FLIP meets our needs, but I think we need a
> general solution which covers all source types in flink, and the
> `isBacklog` in the FLIP is only one strategy.
>

The focus of this FLIP is to allow Flink runtime to adjust the behavior of
operators (e.g. the buffer time) based on the IsBacklog status of sources
and the user-specified execution.end-to-end-latency (effective only when
there is no backlog). The FLIP assumes there is already a strategy for
sources to determine the IsProcessingBacklog status without adding more
strategies.

I agree it is useful to introduce more strategies to determine the
IsProcessingBacklog status for sources. We can determine the
IsProcessingBacklog status based on the backpressure metrics, the
event-time watermark lag, or anything we find reasonable. I would like to
work on this in follow-up FLIPs so that we don't take on too many things
in the same FLIP.

Would this be OK with you?


> From the FLIP I think there should be two parts: dynamic trigger flush
> event in JM and dynamic trigger flush operations in Operator. We need to
> introduce much more general interfaces for them, such as
> `DynamicFlushStrategy` in JM and `DynamicFlushOperation` in TM? As Piotr
> mentioned above, we can collect many information from TM locally such as
> backpressure, queue size and `Operator` can decide whether to buffer data
> or process it immediately.  JM is also the same, it can decide to send
> flush events on a regular basis or send them based on the collected metrics
> information and other information, such as the isBacklog in the FLIP.
>
> 2. I really don't get enough benefits for `RecordAttribute` in the FLIP and
> as Piotr mentioned above too, it will generate a large number of messages,
>

If there is any sentence in the FLIP that suggests we will emit a lot of
RecordAttributes, I apologize for that and will fix it.

Currently, the FLIP provides the `Output#emitRecordAttributes()` for
operators (e.g. source reader) to emit RecordAttributes. The FLIP leaves
the operator to decide the frequency and value of the emitted
RecordAttributes.

Our plan is to let SourceReader emit RecordAttributes only when its value
(e.g. isBacklog) differs from the value of the RecordAttributes it has
emitted earlier. It should avoid resending RecordAttributes with the same
value, similar to how Flink currently avoids resending
Watermark/WatermarkStatus with the same value.
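
To make the intended behavior concrete, here is a rough sketch of the
"emit only on change" rule. The class names below are hypothetical
stand-ins written for this email, not the classes proposed in the FLIP or
any existing Flink API, and the list stands in for the real output channel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the RecordAttributes proposed in the FLIP.
final class RecordAttributesSketch {
    final boolean isBacklog;

    RecordAttributesSketch(boolean isBacklog) {
        this.isBacklog = isBacklog;
    }
}

// Hypothetical helper a source reader could use to avoid re-emitting
// attributes whose value has not changed.
final class DedupingAttributesEmitter {
    // null means nothing has been emitted yet.
    private Boolean lastEmitted = null;

    // Stand-in for the downstream output; records what was actually emitted.
    final List<RecordAttributesSketch> emitted = new ArrayList<>();

    void maybeEmit(boolean isBacklog) {
        if (lastEmitted == null || lastEmitted.booleanValue() != isBacklog) {
            emitted.add(new RecordAttributesSketch(isBacklog));
            lastEmitted = isBacklog;
        }
        // Otherwise the value is unchanged and nothing is sent downstream.
    }
}
```

With this rule, the number of emitted RecordAttributes is bounded by the
number of times the backlog status flips, not by the number of records.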

Would it address your concern?


> affecting performance. FLIP mentions that it will be applied to Operator
> and Sink, I try to understand it's role and please correct me if I'm wrong.
> a) It tells the Operator and Sink that current most of data they are
> processing are from snapshot and are "insert" data? For the out of order in
> flink, the Operator and Sink may receive "upsert" data from other sources.
>

The RecordAttributes currently proposed in the FLIP only provides the
IsBacklog information, which tells the operator (including sink operator)
whether the records received after this RecordAttributes event are
"backlog". Note that snapshot (e.g. MySQL CDC snapshot) is one particular
case which can be classified as backlog. But we might introduce more
strategies to classify records as backlog in the future.

Currently, RecordAttributes does not specify whether the following records
are insert-only or upsert. We might introduce such an attribute if there
is a good use-case for having it.


> b) Do Operators and Sink perform any very special operations in the above
> situations? What are the benefits of this special operations for "most data
> are insert"?
>

Hmm.. I don't think the FLIP says something like "most data are insert".
Could you clarify which part of the FLIP you are talking about?

I was told that Hudi Sink can have much higher throughput if all its inputs
are insert-only. One point in the FLIP is that Hudi Sink can take advantage
of the features proposed in the FLIP to increase its throughput when sources
are reading backlog data (e.g. MySQL CDC snapshot).



> c) I think the operator and sink can collect the above information locally
> when it receives each record without the need for `RecordAttribute` even
> when we need some special operations.
>

Let's say the job is reading from MySQL CDC, we want to process records in
snapshot phase with high throughput (and likely high latency), and process
records in the binlog phase with low processing latency.

We can achieve this goal by propagating RecordAttributes from source to
downstream operators. This allows an operator to know exactly that those
records received before a RecordAttributes(isBacklog=false) can be
processed with high latency, and those records received after a
RecordAttributes(isBacklog=false) should be processed with low latency.
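
As a rough illustration, an operator could switch its buffering behavior
like this. It is only a sketch under my own assumptions: the class and
method names are hypothetical (not the FLIP's final API), records are plain
strings, and the operator is assumed to start in non-backlog mode until
told otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical operator sketch; not a Flink class.
final class BacklogAwareBufferingOperator {
    // Assumption: no backlog until a RecordAttributes event says otherwise.
    private boolean backlog = false;
    private final List<String> buffer = new ArrayList<>();

    // Stand-in for the downstream output.
    final List<String> output = new ArrayList<>();

    // Called when a RecordAttributes event arrives in-band with the records.
    void processRecordAttributes(boolean isBacklog) {
        if (backlog && !isBacklog) {
            // Leaving the backlog phase: emit everything buffered so far.
            flush();
        }
        backlog = isBacklog;
    }

    void processElement(String record) {
        if (backlog) {
            buffer.add(record);  // favor throughput, accept higher latency
        } else {
            output.add(record);  // favor latency, emit immediately
        }
    }

    void flush() {
        output.addAll(buffer);
        buffer.clear();
    }

    int bufferedCount() {
        return buffer.size();
    }
}
```

Because the event travels in the same channel as the records, every record
that arrives before RecordAttributes(isBacklog=false) is buffered and every
record after it is emitted immediately.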

Could you help explain how to achieve this goal without `RecordAttribute`?


> d) Even if we do need a `RecordAttribute` events in Operator and Sink, I
> think broadcast them from JM is a better choice.
>

Suppose we broadcast it from JM. When MySQL CDC source switches from the
snapshot phase to binlog phase, the RecordAttributes(IsBacklog=false) might
arrive at an operator while it is still processing a queue of records from
the snapshot phase, causing the operator to reduce its buffer time (and
throughput) earlier than expected. It might not be a big deal when this
interval is short. But given that propagating RecordAttributes from sources
to downstream operators does not have much overhead (no more than the
periodic Watermark), it seems useful to make the operator behavior more
accurate.
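
The ordering difference can be shown with a toy model. This is only a
sketch with hypothetical names. It assumes the worst case for the JM
broadcast, namely that the signal reaches the operator before it has
drained any of the snapshot records already sitting in its input queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of signal ordering; not Flink code.
final class SignalOrderingSketch {

    // Returns how many snapshot records end up being processed in
    // low-latency mode.
    //   queuedSnapshotRecords: snapshot records still queued at the operator
    //                          when the source switches to the binlog phase.
    //   outOfBand: true models a JM broadcast that overtakes the queue;
    //              false models a marker that travels behind the records.
    static int snapshotRecordsInLowLatencyMode(int queuedSnapshotRecords, boolean outOfBand) {
        Deque<String> input = new ArrayDeque<>();
        for (int i = 0; i < queuedSnapshotRecords; i++) {
            input.add("snapshot");
        }

        boolean lowLatencyMode = false;
        if (outOfBand) {
            // The signal bypasses the queue and takes effect immediately.
            lowLatencyMode = true;
        } else {
            // The marker is enqueued behind the snapshot records.
            input.add("MARKER");
        }
        input.add("binlog");

        int count = 0;
        while (!input.isEmpty()) {
            String element = input.poll();
            if (element.equals("MARKER")) {
                lowLatencyMode = true;
            } else if (element.equals("snapshot") && lowLatencyMode) {
                count++;
            }
        }
        return count;
    }
}
```

In this model the in-band marker never causes a snapshot record to be
handled in low-latency mode, while the out-of-band signal can affect every
record still queued at the time of the switch.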

What do you think?


>
> 3. For the flush event, I also have some questions. What type of operators
> need to buffer data and flush them based on the flush events? In SQL
>

As you mentioned above, typically those operators that rely heavily on
the state backend (e.g. co-group, aggregate, join) can increase throughput by
buffering data. Operators which cannot benefit from flush, such as Map, do
not need to be updated.

It is mentioned in the FLIP that we will "update operators in Flink to
override processRecordAttributes() if that helps improve job performance
without sacrificing correctness (e.g. processing latency)".


> mini-batch mechanism, similar processing will be added for the aggregate
> and join operators, while for operators such as map, it is not necessary.
> How can we identify different operator in the runtime layer (`Input` and
> `TwoInputStreamOperator`)? I think buffer data in Map/FlatMap/Filter
>

I am not sure I understand this question. Are you asking how to
automatically identify the operators that can benefit from the flush
operation?

I think we will need to manually identify and optimize operators on a
case-by-case basis. There is no good systematic way to automatically
identify and optimize all such operators. An operator's implementation
needs to be manually updated and benchmarked before we commit the
corresponding code change.

What do you think?

Best,
Dong


> Operator is not a good idea which makes data no longer flowing.
>

>
> Best,
> Shammon FY
>
>
> On Thu, Jul 6, 2023 at 1:54 AM Piotr Nowojski <[email protected]>
> wrote:
>
> > Hi,
> >
> > Thanks for this proposal, this is a very much needed thing that should be
> > addressed in Flink.
> >
> > I think there is one thing that has been discussed neither here nor in
> > FLIP-309. Given that we have
> > three dimensions:
> > - e2e latency/checkpointing interval
> > - enabling some kind of batching/buffering on the operator level
> > - how much resources we want to allocate to the job
> >
> > How do we want Flink to adjust itself between those three? For example:
> > a) Should we assume that given Job has a fixed amount of assigned
> > resources and make it paramount that
> >   Flink doesn't exceed those available resources? So in case of
> > backpressure, we
> >   should extend checkpointing intervals, emit records less frequently and
> > in batches.
> > b) Or should we assume that the amount of resources is flexible (up to a
> > point?), and the desired e2e latency
> >   is the paramount aspect? So in case of backpressure, we should still
> > adhere to the configured e2e latency,
> >   and wait for the user or autoscaler to scale up the job?
> >
> > In case of a), I think the concept of "isProcessingBacklog" is not
> needed,
> > we could steer the behaviour only
> > using the backpressure information.
> >
> > On the other hand, in case of b), "isProcessingBacklog" information might
> > be helpful, to let Flink know that
> > we can safely decrease the e2e latency/checkpoint interval even if there
> > is no backpressure, to use fewer
> > resources (and let the autoscaler scale down the job).
> >
> > Do we want to have both, or only one of those? Do a) and b) complement
> one
> > another? If job is backpressured,
> > we should follow a) and expose to autoscaler/users information "Hey! I'm
> > barely keeping up! I need more resources!".
> > While, when there is no backpressure and latency doesn't matter
> > (isProcessingBacklog=true), we can limit the resource
> > usage.
> >
> > And a couple of more concrete remarks about the current proposal.
> >
> > 1.
> >
> > > I think the goal is to allow users to specify an end-to-end latency
> > budget for the job.
> >
> > I fully agree with this, but in that case, why are you proposing to add
> > `execution.flush.interval`? That's
> > yet another parameter that would affect e2e latency, without actually
> > defining it. We already have things
> > like: execution.checkpointing.interval, execution.buffer-timeout. I'm
> > pretty sure very few Flink users would be
> > able to configure or understand all of them.
> >
> > I think we should simplify configuration and try to define
> > "execution.end-to-end-latency" so the runtime
> > could derive other things from this new configuration.
> >
> > 2. How do you envision `#flush()` and `#snapshotState()` to be connected?
> > So far, `#snapshotState()`
> > was considered as a kind of `#flush()` signal. Do we need both? Shouldn't
> > `#flush()` be implicitly or
> > explicitly attached to the `#snapshotState()` call?
> >
> > 3. What about unaligned checkpoints if we have separate `#flush()`
> > event/signal?
> >
> > 4. How should this be working in at-least-once mode (especially sources
> > that are configured to be working
> > in at-least-once mode)?
> >
> > 5. How is this FLIP connected with FLIP-327? I think they are trying to
> > achieve basically the same thing:
> > optimise when data should be flushed/committed to balance between
> > throughput and latency.
> >
> > 6.
> >
> > > Add RecordAttributesBuilder and RecordAttributes that extends
> > StreamElement to provide operator with essential
> > > information about the records they receive, such as whether the records
> > are already stale due to backlog.
> >
> > Passing along `RecordAttribute` for every `StreamElement` would be an
> > extremely inefficient solution.
> >
> > If at all, this should be a marker propagated through the JobGraph via
> > Events or sent from JM to TMs via an RPC
> > that would mark "backlog processing started/ended". Note that Events
> might
> > be costly, as they need to be
> > broadcasted. So with a job having 5 keyBy exchanges and parallelism of
> > 1000, the number of events sent is
> > ~4 000 000, while the number of RPCs would be only 5000.
> >
> > In case we want to only check for the backpressure, we don't need any
> > extra signal. Operators/subtasks can
> > get that information very easily from the TMs runtime.
> >
> > Best,
> > Piotrek
> >
> > czw., 29 cze 2023 o 17:19 Dong Lin <[email protected]> napisał(a):
> >
> >> Hi Shammon,
> >>
> >> Thanks for your comments. Please see my reply inline.
> >>
> >> On Thu, Jun 29, 2023 at 6:01 PM Shammon FY <[email protected]> wrote:
> >>
> >> > Hi Dong and Yunfeng,
> >> >
> >> > Thanks for bringing up this discussion.
> >> >
> >> > As described in the FLIP, the differences between `end-to-end latency`
> >> and
> >> > `table.exec.mini-batch.allow-latency` are: "It allows users to specify
> >> the
> >> > end-to-end latency, whereas table.exec.mini-batch.allow-latency
> applies
> >> to
> >> > each operator. If there are N operators on the path from source to
> sink,
> >> > the end-to-end latency could be up to
> >> table.exec.mini-batch.allow-latency *
> >> > N".
> >> >
> >> > If I understand correctly, `table.exec.mini-batch.allow-latency` is
> also
> >> > applied to the end-to-end latency for a job, maybe @Jack Wu can give
> >> more
> >> > information.
> >> >
> >>
> >> Based on what I can tell from the doc/code and offline discussion, I
> >> believe table.exec.mini-batch.allow-latency is not applied to the
> >> end-to-end latency for a job.
> >>
> >> It is mentioned here
> >> <
> >>
> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/config/
> >> >
> >> that
> >> table.exec.mini-batch.allow-latency is "the maximum latency can be used
> >> for
> >> MiniBatch to buffer input records". I think we should have mentioned
> that
> >> the config is applied to the end-to-end latency in this doc if it is
> >> indeed
> >> the case.
> >>
> >>
> >> > So, from my perspective, and please correct me if I misunderstand,
> the
> >> > targets of this FLIP may include the following:
> >> >
> >> > 1. Support a mechanism like  `mini-batch` in SQL for `DataStream`,
> which
> >> > will collect data in the operator and flush data when it receives a
> >> `flush`
> >> > event, in the FLIP it is `FlushEvent`.
> >> >
> >>
> >> I think the goal is to allow users to specify an end-to-end latency
> budget
> >> for the job. IMO it is quite different from the `mini-batch` in SQL.
> >>
> >>
> >> >
> >> > 2. Support dynamic `latency` according to the progress of job, such as
> >> > snapshot stage and after that.
> >> >
> >> > To do that, I have some questions:
> >> >
> >> > 1. I didn't understand the purpose of public interface
> >> `RecordAttributes`.
> >> > I think `FlushEvent` in the FLIP is enough, and different
> >> > `DynamicFlushStrategy` can be added to generate flush events to
> address
> >> > different needs, such as a static interval similar to mini-batch in
> SQL
> >> or
> >> > collect the information `isProcessingBacklog` and metrics to generate
> >> > `FlushEvent` which is described in your FLIP? If hudi sink needs the
> >> > `isBacklog` flag, the hudi `SplitEnumerator` can create an operator
> >> event
> >> > and send it to hudi source reader.
> >> >
> >>
> >> Suppose we only have FlushEvent, then operators (e.g. Hudi Sink) will
> not
> >> know they can buffer data in the following scenario:
> >>
> >> - execution.allowed-latency is not configured and use the default value
> >> null.
> >> - The job is reading from HybridSource and HybridSource says
> >> isBacklog=true.
> >>
> >> Also note that Hudi Sink might not be the only operators that can
> benefit
> >> from knowing isBacklog=true. Other sinks and aggregation operators (e.g.
> >> CoGroup) can also increase throughput by buffering/sorting records when
> >> there is backlog. So it seems simpler to pass RecordAttributes to these
> >> operators than asking every operator developer to create operator event
> >> etc.
> >>
> >>
> >> >
> >> > 2. How is this new mechanism unified with SQL's mini-batch mechanism?
> As
> >> > far as I am concerned, SQL implements mini-batch mechanism based on
> >> > watermark, I think it is very unreasonable to have two different
> >> > implementation in SQL and DataStream.
> >> >
> >>
> >> I think we can deprecate table.exec.mini-batch.allow-latency later
> >> once execution.allowed-latency is ready for production usage. This is
> >> mentioned in the "Compatibility, Deprecation, and Migration Plan"
> section.
> >>
> >> If there is a config that supports user specifying the e2e latency, it
> is
> >> probably reasonable for this config to work for both DataStream and SQL.
> >>
> >>
> >> > 3. I notice that the `CheckpointCoordinator` will generate
> `FlushEvent`,
> >> > which information about `FlushEvent` will be stored in
> >> >
> >>
> >> CheckpointCoordinator might need to send FlushEvent before triggering
> >> checkpoint in order to deal with the two-phase commit sinks. The
> algorithm
> >> is specified in the "Proposed Changes" section.
> >>
> >>
> >> > `Checkpoint`? What is the alignment strategy for FlushEvent in the
> >> > operator? The operator will flush the data when it receives all
> >> > `FlushEvent` from upstream with the same ID or do flush for each
> >> > `FlushEvent`? Can you give more detailed proposal about that? We also
> >> have
> >> > a demand for this piece, thanks
> >> >
> >>
> >> After an operator has received a FlushEvent:
> >> - If the ID of the received FlushEvent is larger than the largest ID
> this
> >> operator has received, then flush() is triggered for this operator and
> the
> >> operator should broadcast FlushEvent to downstream operators.
> >> - Otherwise, this FlushEvent is ignored.
> >>
> >> This behavior is specified in the Java doc of the FlushEvent.
> >>
> >> Can you see if this answers your questions?
> >>
> >> Best,
> >> Dong
> >>
> >>
> >> >
> >> >
> >> > Best,
> >> > Shammon FY
> >> >
> >> >
> >> >
> >> > On Thu, Jun 29, 2023 at 4:35 PM Martijn Visser <
> >> [email protected]>
> >> > wrote:
> >> >
> >> >> Hi Dong and Yunfeng,
> >> >>
> >> >> Thanks for the FLIP. What's not clear for me is what's the expected
> >> >> behaviour when the allowed latency can't be met, for whatever reason.
> >> >> Given that we're talking about an "allowed latency", it implies that
> >> >> something has gone wrong and should fail? Isn't this more a minimum
> >> >> latency that you're proposing?
> >> >>
> >> >> There's also the part about the Hudi Sink processing records
> >> >> immediately upon arrival. Given that the SinkV2 API provides the
> >> >> ability for custom post and pre-commit topologies [1], specifically
> >> >> targeted to avoid generating multiple small files, why isn't that
> >> >> sufficient for the Hudi Sink? It would be great to see that added
> >> >> under Rejected Alternatives if this is indeed not sufficient.
> >> >>
> >> >> Best regards,
> >> >>
> >> >> Martijn
> >> >>
> >> >> [1]
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction
> >> >>
> >> >> On Sun, Jun 25, 2023 at 4:25 AM Yunfeng Zhou
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> > Hi all,
> >> >> >
> >> >> > Dong(cc'ed) and I are opening this thread to discuss our proposal
> to
> >> >> > support configuring end-to-end allowed latency for Flink jobs,
> which
> >> >> > has been documented in FLIP-325
> >> >> > <
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-325%3A+Support+configuring+end-to-end+allowed+latency
> >> >> >.
> >> >> >
> >> >> > By configuring the latency requirement for a Flink job, users would
> >> be
> >> >> > able to optimize the throughput and overhead of the job while still
> >> >> > acceptably increasing latency. This approach is particularly useful
> >> >> > when dealing with records that do not require immediate processing
> >> and
> >> >> > emission upon arrival.
> >> >> >
> >> >> > Please refer to the FLIP document for more details about the
> proposed
> >> >> > design and implementation. We welcome any feedback and opinions on
> >> >> > this proposal.
> >> >> >
> >> >> > Best regards.
> >> >> >
> >> >> > Dong and Yunfeng
> >> >>
> >> >
> >>
> >
>
