Re: [DISCUSS] FLIP-309: Enable operators to trigger checkpoints dynamically

Hang Ruan Wed, 28 Jun 2023 01:30:00 -0700

Thanks to Dong and Yunfeng for their work.

The FLIP looks good to me. This new version is easier to understand.


Best,
Hang

On Tue, Jun 27, 2023 at 4:53 PM Dong Lin <[email protected]> wrote:

> Thanks Jack, Jingsong, and Zhu for the review!
>
> Thanks Zhu for the suggestion. I have updated the configuration name as
> suggested.
>
> On Tue, Jun 27, 2023 at 4:45 PM Zhu Zhu <[email protected]> wrote:
>
> > Thanks Dong and Yunfeng for creating this FLIP and driving this
> > discussion.
> >
> > The new design looks generally good to me. Increasing the checkpoint
> > interval when the job is processing backlogs is easier for users to
> > understand and can help in more scenarios.
> >
> > I have one comment about the new configuration.
> > Naming the new configuration
> > "execution.checkpointing.interval-during-backlog" would be better
> > according to Flink config naming convention.
> > This is also because nested config keys should be avoided. See
> > FLINK-29372 for more details.
> >
> > Thanks,
> > Zhu
> >
> > On Tue, Jun 27, 2023 at 3:45 PM Jingsong Li <[email protected]> wrote:
> > >
> > > Looks good to me!
> > >
> > > Thanks Dong, Yunfeng and all for your discussion and design.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jun 27, 2023 at 3:35 PM Jark Wu <[email protected]> wrote:
> > > >
> > > > Thank you Dong for driving this FLIP.
> > > >
> > > > The new design looks good to me!
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > > On Jun 27, 2023, at 2:38 PM, Dong Lin <[email protected]> wrote:
> > > > >
> > > > > Thank you Leonard for the review!
> > > > >
> > > > > Hi Piotr, do you have any comments on the latest proposal?
> > > > >
> > > > > I am wondering if it is OK to start the voting thread this week.
> > > > >
> > > > > On Mon, Jun 26, 2023 at 4:10 PM Leonard Xu <[email protected]> wrote:
> > > > >
> > > > >> Thanks Dong for driving this FLIP forward!
> > > > >>
> > > > > >> Introducing the `backlog status` concept for Flink jobs makes sense
> > > > > >> to me for the following reasons:
> > > > > >>
> > > > > >> From a concept/API design perspective, it is more general and natural
> > > > > >> than the proposals above, as it can be used in HybridSource for
> > > > > >> bounded records, in CDC sources for history snapshots, and in general
> > > > > >> sources like KafkaSource for historical messages.
> > > > > >>
> > > > > >> From a use-case/requirements perspective, I've seen many users
> > > > > >> manually set a larger checkpoint interval during backfilling and then
> > > > > >> set a shorter checkpoint interval for real-time processing in their
> > > > > >> production environments as a Flink application optimization. Now the
> > > > > >> Flink framework can apply this optimization without requiring the user
> > > > > >> to change the checkpoint interval and restart the job multiple times.
> > > > > >>
> > > > > >> Following the support for a larger checkpoint interval for jobs under
> > > > > >> backlog status in the current FLIP, we can explore supporting larger
> > > > > >> parallelism/memory/CPU for jobs under backlog status in the future.
> > > > >>
> > > > >> In short, the updated FLIP looks good to me.
> > > > >>
> > > > >>
> > > > >> Best,
> > > > >> Leonard
> > > > >>
> > > > >>
> > > > > >>> On Jun 22, 2023, at 12:07 PM, Dong Lin <[email protected]> wrote:
> > > > >>>
> > > > >>> Hi Piotr,
> > > > >>>
> > > > >>> Thanks again for proposing the isProcessingBacklog concept.
> > > > >>>
> > > > > >>> After discussing with Becket Qin and thinking about this more, I agree it
> > > > > >>> is a better idea to add a top-level concept to all source operators to
> > > > > >>> address the target use-case.
> > > > > >>>
> > > > > >>> The main reason that changed my mind is that isProcessingBacklog can be
> > > > > >>> described as an inherent/natural attribute of every source instance, and
> > > > > >>> its semantics do not need to depend on any specific checkpointing policy.
> > > > > >>> Also, we can hardcode the isProcessingBacklog behavior for the sources we
> > > > > >>> have considered so far (e.g. HybridSource and MySQL CDC source) without
> > > > > >>> asking users to explicitly configure the per-source behavior, which indeed
> > > > > >>> provides a better user experience.
> > > > > >>>
> > > > > >>> I have updated the FLIP based on the latest suggestions. The latest FLIP no
> > > > > >>> longer introduces per-source config that can be used by end-users. While I
> > > > > >>> agree with you that CheckpointTrigger can be a useful feature to address
> > > > > >>> additional use-cases, I am not sure it is necessary for the use-case
> > > > > >>> targeted by FLIP-309. Maybe we can introduce CheckpointTrigger separately
> > > > > >>> in another FLIP?
> > > > >>>
> > > > >>> Can you help take another look at the updated FLIP?
> > > > >>>
> > > > >>> Best,
> > > > >>> Dong
> > > > >>>
> > > > >>>
> > > > >>>
> > > > > >>> On Fri, Jun 16, 2023 at 11:59 PM Piotr Nowojski <[email protected]> wrote:
> > > > >>>
> > > > >>>> Hi Dong,
> > > > >>>>
> > > > > >>>>> Suppose there are 1000 subtasks and each subtask has a 1% chance of being
> > > > > >>>>> "backpressured" at a given time (due to random traffic spikes). Then at any
> > > > > >>>>> given time, the chance of the job being considered not-backpressured =
> > > > > >>>>> (1-0.01)^1000. Since we evaluate the backpressure metric once a second, the
> > > > > >>>>> estimated time for the job to be considered not-backpressured is roughly
> > > > > >>>>> 1 / ((1-0.01)^1000) = 23163 sec = 6.4 hours.
> > > > >>>>>
> > > > >>>>> This means that the job will effectively always use the longer
> > > > >>>>> checkpointing interval. It looks like a real concern, right?
> > > > >>>>
> > > > > >>>> Sorry, I don't understand where you are getting those numbers from.
> > > > > >>>> Instead of trying to find loophole after loophole, could you try to think
> > > > > >>>> about how a given loophole could be improved/solved?
> > > > >>>>
> > > > >>>>> Hmm... I honestly think it will be useful to know the APIs due
> > to the
> > > > >>>>> following reasons.
> > > > >>>>
> > > > >>>> Please propose something. I don't think it's needed.
> > > > >>>>
> > > > >>>>> - For the use-case mentioned in FLIP-309 motivation section,
> > would the
> > > > >>>> APIs
> > > > >>>>> of this alternative approach be more or less usable?
> > > > >>>>
> > > > >>>> Everything that you originally wanted to achieve in FLIP-309,
> you
> > could
> > > > >> do
> > > > >>>> as well in my proposal.
> > > > >>>> Vide my many mentions of the "hacky solution".
> > > > >>>>
> > > > >>>>> - Can these APIs reliably address the extra use-case (e.g.
> allow
> > > > >>>>> checkpointing interval to change dynamically even during the
> > unbounded
> > > > >>>>> phase) as it claims?
> > > > >>>>
> > > > >>>> I don't see why not.
> > > > >>>>
> > > > >>>>> - Can these APIs be decoupled from the APIs currently proposed
> in
> > > > >>>> FLIP-309?
> > > > >>>>
> > > > >>>> Yes
> > > > >>>>
> > > > >>>>> For example, if the APIs of this alternative approach can be
> > decoupled
> > > > >>>> from
> > > > >>>>> the APIs currently proposed in FLIP-309, then it might be
> > reasonable to
> > > > >>>>> work on this extra use-case with a more advanced/complicated
> > design
> > > > >>>>> separately in a followup work.
> > > > >>>>
> > > > > >>>> As I voiced my concerns previously, the current design of FLIP-309 would
> > > > > >>>> clog the public API and in the long run confuse the users. IMO it's
> > > > > >>>> addressing the problem in the wrong place.
> > > > >>>>
> > > > >>>>> Hmm.. do you mean we can do the following:
> > > > >>>>> - Have all source operators emit a metric named
> > "processingBacklog".
> > > > >>>>> - Add a job-level config that specifies "the checkpointing
> > interval to
> > > > >> be
> > > > >>>>> used when any source is processing backlog".
> > > > >>>>> - The JM collects the "processingBacklog" periodically from all
> > source
> > > > >>>>> operators and uses the newly added config value as appropriate.
> > > > >>>>
> > > > >>>> Yes.
> > > > >>>>
> > > > >>>>> The challenge with this approach is that we need to define the
> > > > >> semantics
> > > > >>>> of
> > > > >>>>> this "processingBacklog" metric and have all source operators
> > > > >>>>> implement this metric. I am not sure we are able to do this yet
> > without
> > > > >>>>> having users explicitly provide this information on a
> per-source
> > basis.
> > > > >>>>>
> > > > >>>>> Suppose the job read from a bounded Kafka source, should it
> emit
> > > > >>>>> "processingBacklog=true"? If yes, then the job might use long
> > > > >>>> checkpointing
> > > > >>>>> interval even
> > > > >>>>> if the job is asked to process data starting from now to the
> > next 1
> > > > >> hour.
> > > > >>>>> If no, then the job might use the short checkpointing interval
> > > > >>>>> even if the job is asked to re-process data starting from 7
> days
> > ago.
> > > > >>>>
> > > > > >>>> Yes. The same can be said of your proposal. Your proposal has the very same
> > > > > >>>> issues: every source would have to implement it differently, most sources
> > > > > >>>> would have no idea how to properly calculate the new requested checkpoint
> > > > > >>>> interval, for those that do know how to do that, the user would have to
> > > > > >>>> configure every source individually, and yet again we would end up with a
> > > > > >>>> system that works only partially in some special use cases (HybridSource),
> > > > > >>>> which confuses the users even more.
> > > > >>>>
> > > > > >>>> That's why I think the more generic solution, working primarily on the same
> > > > > >>>> metrics that are used by various auto scaling solutions (like the Flink K8s
> > > > > >>>> operator's autoscaler), would be better. I proposed the hacky solution to:
> > > > > >>>> 1. show you that the generic solution is simply a superset of your proposal
> > > > > >>>> 2. show that, if you are adamant that busyness/backpressured/records
> > > > > >>>>    processing rate/pending records metrics wouldn't cover your use case
> > > > > >>>>    sufficiently (imo they can), you can very easily enhance this algorithm
> > > > > >>>>    by using some hints from the sources, like "processingBacklog==true", to
> > > > > >>>>    short circuit the main algorithm if `processingBacklog` is available.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Piotrek
> > > > >>>>
> > > > >>>>
> > > > > >>>> On Fri, Jun 16, 2023 at 4:45 AM Dong Lin <[email protected]> wrote:
> > > > >>>>
> > > > >>>>> Hi again Piotr,
> > > > >>>>>
> > > > >>>>> Thank you for the reply. Please see my reply inline.
> > > > >>>>>
> > > > > >>>>> On Fri, Jun 16, 2023 at 12:11 AM Piotr Nowojski <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>>> Hi again Dong,
> > > > >>>>>>
> > > > >>>>>>> I understand that JM will get the backpressure-related
> metrics
> > every
> > > > >>>>> time
> > > > >>>>>>> the RestServerEndpoint receives the REST request to get these
> > > > >>>> metrics.
> > > > >>>>>> But
> > > > >>>>>>> I am not sure if RestServerEndpoint is already always
> > receiving the
> > > > >>>>> REST
> > > > >>>>>>> metrics at regular interval (suppose there is no human
> manually
> > > > >>>>>>> opening/clicking the Flink Web UI). And if it does, what is
> the
> > > > >>>>> interval?
> > > > >>>>>>
> > > > > >>>>>> Good catch, I thought that metrics were pre-emptively sent to the JM every
> > > > > >>>>>> 10 seconds. Indeed that's not the case at the moment, and that would have
> > > > > >>>>>> to be improved.
> > > > >>>>>>
> > > > >>>>>>> I would be surprised if Flink is already paying this much
> > overhead
> > > > >>>> just
> > > > >>>>>> for
> > > > >>>>>>> metrics monitoring. That is the main reason I still doubt it
> > is true.
> > > > >>>>> Can
> > > > >>>>>>> you show where this 100 ms is currently configured?
> > > > >>>>>>>
> > > > >>>>>>> Alternatively, maybe you mean that we should add extra code
> to
> > invoke
> > > > >>>>> the
> > > > >>>>>>> REST API at 100 ms interval. Then that means we need to
> > considerably
> > > > >>>>>>> increase the network/cpu overhead at JM, where the overhead
> > will
> > > > >>>>> increase
> > > > >>>>>>> as the number of TM/slots increase, which may pose risk to
> the
> > > > >>>>>> scalability
> > > > >>>>>>> of the proposed design. I am not sure we should do this. What
> > do you
> > > > >>>>>> think?
> > > > >>>>>>
> > > > > >>>>>> Sorry. I didn't mean the metric should be reported every 100ms. I meant that
> > > > > >>>>>> "backPressuredTimeMsPerSecond (metric) would report (a value of) 100ms/s."
> > > > > >>>>>> once per metric interval (10s?).
> > > > >>>>>>
> > > > >>>>>
> > > > > >>>>> Suppose there are 1000 subtasks and each subtask has a 1% chance of being
> > > > > >>>>> "backpressured" at a given time (due to random traffic spikes). Then at any
> > > > > >>>>> given time, the chance of the job being considered not-backpressured =
> > > > > >>>>> (1-0.01)^1000. Since we evaluate the backpressure metric once a second, the
> > > > > >>>>> estimated time for the job to be considered not-backpressured is roughly
> > > > > >>>>> 1 / ((1-0.01)^1000) = 23163 sec = 6.4 hours.
> > > > >>>>>
> > > > >>>>> This means that the job will effectively always use the longer
> > > > >>>>> checkpointing interval. It looks like a real concern, right?
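
As a quick check of the arithmetic above, here is a minimal sketch. It assumes, as the example does, that the 1000 subtasks are backpressured independently with probability 1% each and that the job-level state is sampled once per second. The variable names are illustrative only and are not part of Flink.

```python
# Probability that none of the subtasks is backpressured at a single sample,
# assuming the subtasks are independent (an assumption of the example).
num_subtasks = 1000
p_backpressured = 0.01

p_job_not_backpressured = (1 - p_backpressured) ** num_subtasks

# With one sample per second, the expected wait until the first
# "not backpressured" sample is the mean of a geometric distribution: 1 / p.
expected_wait_seconds = 1 / p_job_not_backpressured
expected_wait_hours = expected_wait_seconds / 3600

print(f"P(job not backpressured) = {p_job_not_backpressured:.3e}")
print(f"expected wait = {expected_wait_seconds:.0f} s = {expected_wait_hours:.1f} h")
```

The result is on the order of 23,000 seconds, roughly 6.4 hours, which is where the numbers in the paragraph come from.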
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>>> - What is the interface of this CheckpointTrigger? For
> > example, are
> > > > >>>> we
> > > > >>>>>>> going to give CheckpointTrigger a context that it can use to
> > fetch
> > > > >>>>>>> arbitrary metric values? This can help us understand what
> > information
> > > > >>>>>> this
> > > > >>>>>>> user-defined CheckpointTrigger can use to make the checkpoint
> > > > >>>> decision.
> > > > >>>>>>
> > > > >>>>>> I honestly don't think this is important at this stage of the
> > > > >>>> discussion.
> > > > >>>>>> It could have
> > > > >>>>>> whatever interface we would deem to be best. Required things:
> > > > >>>>>>
> > > > >>>>>> - access to at least a subset of metrics that the given
> > > > >>>>> `CheckpointTrigger`
> > > > >>>>>> requests,
> > > > >>>>>> for example via some registration mechanism, so we don't have
> to
> > > > >>>> fetch
> > > > >>>>>> all of the
> > > > >>>>>> metrics all the time from TMs.
> > > > >>>>>> - some way to influence `CheckpointCoordinator`. Either via
> > manually
> > > > >>>>>> triggering
> > > > >>>>>> checkpoints, and/or ability to change the checkpointing
> > interval.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> Hmm... I honestly think it will be useful to know the APIs due
> > to the
> > > > >>>>> following reasons.
> > > > >>>>>
> > > > >>>>> We would need to know the concrete APIs to gauge the following:
> > > > >>>>> - For the use-case mentioned in FLIP-309 motivation section,
> > would the
> > > > >>>> APIs
> > > > >>>>> of this alternative approach be more or less usable?
> > > > >>>>> - Can these APIs reliably address the extra use-case (e.g.
> allow
> > > > >>>>> checkpointing interval to change dynamically even during the
> > unbounded
> > > > >>>>> phase) as it claims?
> > > > >>>>> - Can these APIs be decoupled from the APIs currently proposed
> in
> > > > >>>> FLIP-309?
> > > > >>>>>
> > > > >>>>> For example, if the APIs of this alternative approach can be
> > decoupled
> > > > >>>> from
> > > > >>>>> the APIs currently proposed in FLIP-309, then it might be
> > reasonable to
> > > > >>>>> work on this extra use-case with a more advanced/complicated
> > design
> > > > >>>>> separately in a followup work.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>>> - Where is this CheckpointTrigger running? For example, is it
> > going
> > > > >>>> to
> > > > >>>>>> run
> > > > >>>>>>> on the subtask of every source operator? Or is it going to
> run
> > on the
> > > > >>>>> JM?
> > > > >>>>>>
> > > > >>>>>> IMO on the JM.
> > > > >>>>>>
> > > > >>>>>>> - Are we going to provide a default implementation of this
> > > > >>>>>>> CheckpointTrigger in Flink that implements the algorithm
> > described
> > > > >>>>> below,
> > > > >>>>>>> or do we expect each source operator developer to implement
> > their own
> > > > >>>>>>> CheckpointTrigger?
> > > > >>>>>>
> > > > >>>>>> As I mentioned before, I think we should provide at the very
> > least the
> > > > >>>>>> implementation
> > > > >>>>>> that replaces the current triggering mechanism (statically
> > configured
> > > > >>>>>> checkpointing interval)
> > > > >>>>>> and it would be great to provide the backpressure monitoring
> > trigger
> > > > >> as
> > > > >>>>>> well.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> I agree that if there is a good use-case that can be addressed
> > by the
> > > > >>>>> proposed CheckpointTrigger, then it is reasonable
> > > > >>>>> to add CheckpointTrigger and replace the current triggering
> > mechanism
> > > > >>>> with
> > > > >>>>> it.
> > > > >>>>>
> > > > >>>>> I also agree that we will likely find such a use-case. For
> > example,
> > > > >>>> suppose
> > > > >>>>> the source records have event timestamps, then it is likely
> > > > >>>>> that we can use the trigger to dynamically control the
> > checkpointing
> > > > >>>>> interval based on the difference between the watermark and
> > current
> > > > >> system
> > > > >>>>> time.
> > > > >>>>>
> > > > >>>>> But I am not sure the addition of this CheckpointTrigger should
> > be
> > > > >>>> coupled
> > > > >>>>> with FLIP-309. Whether or not it is coupled probably depends on
> > the
> > > > >>>>> concrete API design around CheckpointTrigger.
> > > > >>>>>
> > > > > >>>>>> If you would be adamant that the backpressure monitoring doesn't cover your
> > > > > >>>>>> use case well enough, I would be ok to provide the hacky version that I
> > > > > >>>>>> also mentioned before:
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>> """
> > > > >>>>>> Especially that if my proposed algorithm wouldn't work good
> > enough,
> > > > >>>> there
> > > > >>>>>> is
> > > > >>>>>> an obvious solution, that any source could add a metric, like
> > let say
> > > > >>>>>> "processingBacklog: true/false", and the `CheckpointTrigger`
> > > > >>>>>> could use this as an override to always switch to the
> > > > >>>>>> "slowCheckpointInterval". I don't think we need it, but that's
> > always
> > > > >>>> an
> > > > >>>>>> option
> > > > >>>>>> that would be basically equivalent to your original proposal.
> > > > >>>>>> """
> > > > >>>>>>
> > > > >>>>>
> > > > > >>>>> Hmm.. do you mean we can do the following:
> > > > > >>>>> - Have all source operators emit a metric named "processingBacklog".
> > > > > >>>>> - Add a job-level config that specifies "the checkpointing interval to be
> > > > > >>>>> used when any source is processing backlog".
> > > > > >>>>> - The JM collects the "processingBacklog" periodically from all source
> > > > > >>>>> operators and uses the newly added config value as appropriate.
> > > > >>>>>
> > > > > >>>>> The challenge with this approach is that we need to define the semantics of
> > > > > >>>>> this "processingBacklog" metric and have all source operators implement
> > > > > >>>>> this metric. I am not sure we are able to do this yet without having users
> > > > > >>>>> explicitly provide this information on a per-source basis.
> > > > > >>>>>
> > > > > >>>>> Suppose the job reads from a bounded Kafka source, should it emit
> > > > > >>>>> "processingBacklog=true"? If yes, then the job might use the long
> > > > > >>>>> checkpointing interval even if the job is asked to process data starting
> > > > > >>>>> from now to the next 1 hour. If no, then the job might use the short
> > > > > >>>>> checkpointing interval even if the job is asked to re-process data
> > > > > >>>>> starting from 7 days ago.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>>
> > > > >>>>>>> - How can users specify the
> > > > >>>>>> fastCheckpointInterval/slowCheckpointInterval?
> > > > >>>>>>> For example, will we provide APIs on the CheckpointTrigger
> that
> > > > >>>>> end-users
> > > > >>>>>>> can use to specify the checkpointing interval? What would
> that
> > look
> > > > >>>>> like?
> > > > >>>>>>
> > > > > >>>>>> Also as I mentioned before, just like metric reporters are configured:
> > > > > >>>>>>
> > > > > >>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/metric_reporters/
> > > > > >>>>>> Every CheckpointTrigger could have its own custom configuration.
> > > > >>>>>>
> > > > >>>>>>> Overall, my gut feel is that the alternative approach based
> on
> > > > >>>>>>> CheckpointTrigger is more complicated
> > > > >>>>>>
> > > > >>>>>> Yes, as usual, more generic things are more complicated, but
> > often
> > > > >> more
> > > > >>>>>> useful in the long run.
> > > > >>>>>>
> > > > >>>>>>> and harder to use.
> > > > >>>>>>
> > > > > >>>>>> I don't agree. Why setting in config
> > > > > >>>>>>
> > > > > >>>>>> execution.checkpointing.trigger: BackPressureMonitoringCheckpointTrigger
> > > > > >>>>>> execution.checkpointing.BackPressureMonitoringCheckpointTrigger.fast-interval: 1s
> > > > > >>>>>> execution.checkpointing.BackPressureMonitoringCheckpointTrigger.slow-interval: 30s
> > > > > >>>>>>
> > > > > >>>>>> that we could even provide a shortcut to the above construct via:
> > > > > >>>>>>
> > > > > >>>>>> execution.checkpointing.fast-interval: 1s
> > > > > >>>>>> execution.checkpointing.slow-interval: 30s
> > > > > >>>>>>
> > > > > >>>>>> is harder compared to setting two/three checkpoint intervals, one in the
> > > > > >>>>>> config/or via `env.enableCheckpointing(x)`,
> > > > > >>>>>> secondly passing one/two (fast/slow) values on the source itself?
> > > > >>>>>>
> > > > >>>>>
> > > > > >>>>> If we can address the use-case by providing just the two job-level configs
> > > > > >>>>> as described above, I agree it will indeed be simpler.
> > > > > >>>>>
> > > > > >>>>> I have tried to achieve this goal. But the caveat is that it requires much
> > > > > >>>>> more work than described above in order to give the configs well-defined
> > > > > >>>>> semantics. So I find it simpler to just use the approach in FLIP-309.
> > > > > >>>>>
> > > > > >>>>> Let me explain my concern below. It will be great if you or someone else
> > > > > >>>>> can help provide a solution.
> > > > > >>>>>
> > > > > >>>>> 1) We need to clearly document when the fast-interval and slow-interval
> > > > > >>>>> will be used so that users can derive the expected behavior of the job and
> > > > > >>>>> be able to configure these values.
> > > > > >>>>>
> > > > > >>>>> 2) The trigger of the fast/slow interval depends on the behavior of the
> > > > > >>>>> source (e.g. MySQL CDC, HybridSource). However, no existing concept of the
> > > > > >>>>> source operator (e.g. boundedness) can describe the target behavior. For
> > > > > >>>>> example, MySQL CDC internally has two phases, namely the snapshot phase and
> > > > > >>>>> the binlog phase, which are not explicitly exposed to its users via the
> > > > > >>>>> source operator API. And we probably should not enumerate all internal
> > > > > >>>>> phases of all source operators that are affected by the fast/slow interval.
> > > > > >>>>>
> > > > > >>>>> 3) An alternative approach might be to define a new concept (e.g.
> > > > > >>>>> processingBacklog) that is applied to all source operators. Then the
> > > > > >>>>> fast/slow interval's documentation can depend on this concept. That means
> > > > > >>>>> we have to add a top-level concept (similar to source boundedness) and
> > > > > >>>>> require all source operators to specify how they enforce this concept (e.g.
> > > > > >>>>> FileSystemSource always emits processingBacklog=true). And there might be
> > > > > >>>>> cases where the source itself (e.g. a bounded Kafka Source) can not
> > > > > >>>>> automatically derive the value of this concept, in which case we need to
> > > > > >>>>> provide an option for users to explicitly specify the value for this
> > > > > >>>>> concept on a per-source basis.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>>> And it probably also has the issues of "having two places to
> > > > >>>> configure
> > > > >>>>>> checkpointing
> > > > >>>>>>> interval" and "giving flexibility for every source to
> > implement a
> > > > >>>>>> different
> > > > >>>>>>> API" (as mentioned below).
> > > > >>>>>>
> > > > >>>>>> No, it doesn't.
> > > > >>>>>>
> > > > >>>>>>> IMO, it is a hard-requirement for the user-facing API to be
> > > > >>>>>>> clearly defined and users should be able to use the API
> without
> > > > >>>> concern
> > > > >>>>>> of
> > > > >>>>>>> regression. And this requirement is more important than the
> > other
> > > > >>>> goals
> > > > >>>>>>> discussed above because it is related to the
> > stability/performance of
> > > > >>>>> the
> > > > >>>>>>> production job. What do you think?
> > > > >>>>>>
> > > > >>>>>> I don't agree with this. There are many things that work
> > something in
> > > > >>>>>> between perfectly and well enough
> > > > >>>>>> in some fraction of use cases (maybe in 99%, maybe 95% or
> maybe
> > 60%),
> > > > >>>>> while
> > > > >>>>>> still being very useful.
> > > > >>>>>> Good examples are: selection of state backend, unaligned
> > checkpoints,
> > > > >>>>>> buffer debloating but frankly if I go
> > > > >>>>>> through list of currently available config options, something
> > like
> > > > >> half
> > > > >>>>> of
> > > > >>>>>> them can cause regressions. Heck,
> > > > >>>>>> even Flink itself doesn't work perfectly in 100% of the use
> > cases, due
> > > > >>>>> to a
> > > > >>>>>> variety of design choices. Of
> > > > >>>>>> course, the more use cases are fine with said feature, the
> > better, but
> > > > >>>> we
> > > > >>>>>> shouldn't fixate to perfectly cover
> > > > >>>>>> 100% of the cases, as that's impossible.
> > > > >>>>>>
> > > > >>>>>> In this particular case, if back pressure monitoring  trigger
> > can work
> > > > >>>>> well
> > > > >>>>>> enough in 95% of cases, I would
> > > > >>>>>> say that's already better than the originally proposed
> > alternative,
> > > > >>>> which
> > > > >>>>>> doesn't work at all if user has a large
> > > > >>>>>> backlog to reprocess from Kafka, including when using
> > HybridSource
> > > > >>>> AFTER
> > > > >>>>>> the switch to Kafka has
> > > > >>>>>> happened. For the remaining 5%, we should try to improve the
> > behaviour
> > > > >>>>> over
> > > > >>>>>> time, but ultimately, users can
> > > > >>>>>> decide to just run a fixed checkpoint interval (or at worst
> use
> > the
> > > > >>>> hacky
> > > > >>>>>> checkpoint trigger that I mentioned
> > > > >>>>>> before a couple of times).
> > > > >>>>>>
> > > > > >>>>>> Also, to be pedantic, if a user naively sets slow-interval in your proposal
> > > > > >>>>>> to 30 minutes, when that user's job fails on average every 15-20 minutes,
> > > > > >>>>>> the job can end up in a state in which it can not make any progress. This
> > > > > >>>>>> arguably is quite a serious regression.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> I probably should not say it is "hard requirement". After all
> > there are
> > > > >>>>> pros/cons. We will need to consider implementation complexity,
> > > > >> usability,
> > > > >>>>> extensibility etc.
> > > > >>>>>
> > > > >>>>> I just don't think we should take it for granted to introduce
> > > > >> regression
> > > > >>>>> for one use-case in order to support another use-case. If we
> can
> > not
> > > > >> find
> > > > >>>>> an algorithm/solution that addresses
> > > > >>>>> both use-case well, I hope we can be open to tackle them
> > separately so
> > > > >>>> that
> > > > >>>>> users can choose the option that best fits their needs.
> > > > >>>>>
> > > > > >>>>> All else being equal, I think it is preferable for the user-facing API to
> > > > > >>>>> be clearly defined and for users to be able to use the API without concern
> > > > > >>>>> of regression.
> > > > > >>>>>
> > > > > >>>>> Maybe we can list pros/cons for the alternative approaches we have been
> > > > > >>>>> discussing and then choose the best approach. And maybe we will end up
> > > > > >>>>> finding that the use-case which needs CheckpointTrigger can be tackled
> > > > > >>>>> separately from the use-case in FLIP-309.
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>>>> I am not sure if there is a typo. Because if backPressuredTimeMsPerSecond =
> > > > > >>>>>>> 0, then maxRecordsConsumedWithoutBackpressure = numRecordsInPerSecond /
> > > > > >>>>>>> 1000 * metricsUpdateInterval according to the above algorithm.
> > > > > >>>>>>>
> > > > > >>>>>>> Do you mean "maxRecordsConsumedWithoutBackpressure = (numRecordsInPerSecond
> > > > > >>>>>>> / (1 - backPressuredTimeMsPerSecond / 1000)) * metricsUpdateInterval"?
> > > > >>>>>>
> > > > > >>>>>> It looks like there is indeed some mistake in my proposal above. Yours looks
> > > > > >>>>>> more correct, though it probably still needs some safeguard/special handling
> > > > > >>>>>> if `backPressuredTimeMsPerSecond > 950`.
> > > > >>>>>>
> > > > >>>>>>> The only information it can access is the backlog.
> > > > >>>>>>
> > > > >>>>>> Again no. It can access whatever we want to provide to it.
> > > > >>>>>>
> > > > >>>>>> Regarding the rest of your concerns. It's a matter of tweaking
> > the
> > > > >>>>>> parameters and the algorithm itself,
> > > > >>>>>> and how much safety-net do we want to have. Ultimately, I'm
> > pretty
> > > > >> sure
> > > > >>>>>> that's a (for 95-99% of cases)
> > > > >>>>>> solvable problem. If not, there is always the hacky solution,
> > that
> > > > >>>> could
> > > > >>>>> be
> > > > >>>>>> even integrated into this above
> > > > >>>>>> mentioned algorithm as a short circuit to always reach
> > > > >> `slow-interval`.
> > > > >>>>>>
> > > > >>>>>> Apart of that, you picked 3 minutes as the checkpointing
> > interval in
> > > > >>>> your
> > > > >>>>>> counter example. In most cases
> > > > >>>>>> any interval above 1 minute would inflict pretty negligible
> > overheads,
> > > > >>>> so
> > > > >>>>>> all in all, I would doubt there is
> > > > >>>>>> a significant benefit (in most cases) of increasing 3 minute
> > > > >> checkpoint
> > > > >>>>>> interval to anything more, let alone
> > > > >>>>>> 30 minutes.
> > > > >>>>>>
> > > > >>>>>
> > > > > >>>>> I am not sure we should design the algorithm with the assumption that the
> > > > > >>>>> short checkpointing interval will always be higher than 1 minute etc.
> > > > > >>>>>
> > > > > >>>>> I agree the proposed algorithm can solve most cases where the resource is
> > > > > >>>>> sufficient and there is always no backlog in source subtasks. On the other
> > > > > >>>>> hand, what makes SRE life hard is probably the remaining 1-5% cases where
> > > > > >>>>> the traffic is spiky and the cluster is reaching its capacity limit. The
> > > > > >>>>> ability to predict and control a Flink job's behavior (including the
> > > > > >>>>> checkpointing interval) can considerably reduce the burden of managing
> > > > > >>>>> Flink jobs.
> > > > >>>>>
> > > > >>>>> Best,
> > > > >>>>> Dong
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>>
> > > > >>>>>> Best,
> > > > >>>>>> Piotrek
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > > >>>>>> On Sat, Jun 3, 2023 at 05:44 Dong Lin <[email protected]> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi Piotr,
> > > > >>>>>>>
> > > > >>>>>>> Thanks for the explanations. I have some followup questions
> > below.
> > > > >>>>>>>
> > > > > >>>>>>> On Fri, Jun 2, 2023 at 10:55 PM Piotr Nowojski <[email protected]>
> > > > > >>>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi All,
> > > > >>>>>>>>
> > > > > >>>>>>>> Thanks for chipping in to the discussion, Ahmed!
> > > > > >>>>>>>>
> > > > > >>>>>>>> Regarding using the REST API: currently I'm leaning towards
> > > > > >>>>>>>> implementing this feature inside Flink itself, via some
> > > > > >>>>>>>> pluggable interface. A REST API solution would be tempting,
> > > > > >>>>>>>> but I guess not everyone is using the Flink Kubernetes
> > > > > >>>>>>>> Operator.
> > > > >>>>>>>>
> > > > >>>>>>>> @Dong
> > > > >>>>>>>>
> > > > > >>>>>>>>> I am not sure metrics such as isBackPressured are already
> > > > > >>>>>>>>> sent to JM.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Fetching code path on the JM:
> > > > > >>>>>>>>
> > > > > >>>>>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl#queryTmMetricsFuture
> > > > > >>>>>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore#add
> > > > > >>>>>>>>
> > > > > >>>>>>>> Example code path accessing Task level metrics via JM using
> > > > > >>>>>>>> the `MetricStore`:
> > > > > >>>>>>>>
> > > > > >>>>>>>> org.apache.flink.runtime.rest.handler.job.metrics.AggregatingSubtasksMetricsHandler
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> Thanks for the code reference. I checked the code that invokes
> > > > > >>>>>>> these two classes and found the following information:
> > > > > >>>>>>>
> > > > > >>>>>>> - AggregatingSubtasksMetricsHandler#getStores is currently
> > > > > >>>>>>> invoked only when AggregatingJobsMetricsHandler is invoked.
> > > > > >>>>>>> - AggregatingJobsMetricsHandler is only instantiated and
> > > > > >>>>>>> returned by WebMonitorEndpoint#initializeHandlers.
> > > > > >>>>>>> - WebMonitorEndpoint#initializeHandlers is only used by
> > > > > >>>>>>> RestServerEndpoint, and RestServerEndpoint invokes these
> > > > > >>>>>>> handlers in response to external REST requests.
> > > > > >>>>>>>
> > > > > >>>>>>> I understand that JM will get the backpressure-related metrics
> > > > > >>>>>>> every time the RestServerEndpoint receives a REST request for
> > > > > >>>>>>> these metrics. But I am not sure whether RestServerEndpoint
> > > > > >>>>>>> already receives such REST requests at a regular interval
> > > > > >>>>>>> (suppose no human is manually opening/clicking the Flink Web
> > > > > >>>>>>> UI). And if it does, what is the interval?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>>>> For example, let's say every source operator subtask reports
> > > > > >>>>>>>>> this metric to JM once every 10 seconds. There are 100 source
> > > > > >>>>>>>>> subtasks. And each subtask is backpressured roughly 10% of
> > > > > >>>>>>>>> the total time due to traffic spikes (and limited buffer).
> > > > > >>>>>>>>> Then at any given time, there is a 1 - 0.9^100 = 99.997%
> > > > > >>>>>>>>> chance that at least one subtask is backpressured. Then we
> > > > > >>>>>>>>> have to wait for at least 10 seconds to check again.
> > > > >>>>>>>>
> > > > > >>>>>>>> backPressuredTimeMsPerSecond and other related metrics (like
> > > > > >>>>>>>> busyTimeMsPerSecond) are not subject to that problem.
> > > > > >>>>>>>> They are recalculated once every metric fetching interval, and
> > > > > >>>>>>>> they accurately report how much time, on average, the given
> > > > > >>>>>>>> subtask spent busy/idle/backpressured.
> > > > > >>>>>>>> In your example, backPressuredTimeMsPerSecond would report
> > > > > >>>>>>>> 100ms/s.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> Suppose every subtask is already reporting
> > > > > >>>>>>> backPressuredTimeMsPerSecond to JM once every 100 ms. If a job
> > > > > >>>>>>> has 10 operators (that are not chained) and each operator has
> > > > > >>>>>>> 100 subtasks, then JM would need to handle 10000 requests per
> > > > > >>>>>>> second to receive metrics from these 1000 subtasks. It seems
> > > > > >>>>>>> like a non-trivial overhead for medium-to-large sized jobs and
> > > > > >>>>>>> can make JM the performance bottleneck during job execution.
> > > > >>>>>>>
> > > > > >>>>>>> I would be surprised if Flink is already paying this much
> > > > > >>>>>>> overhead just for metrics monitoring. That is the main reason
> > > > > >>>>>>> I still doubt it is true. Can you show where this 100 ms is
> > > > > >>>>>>> currently configured?
> > > > > >>>>>>>
> > > > > >>>>>>> Alternatively, maybe you mean that we should add extra code to
> > > > > >>>>>>> invoke the REST API at a 100 ms interval. That would
> > > > > >>>>>>> considerably increase the network/CPU overhead at JM, and the
> > > > > >>>>>>> overhead would grow as the number of TMs/slots increases,
> > > > > >>>>>>> which may pose a risk to the scalability of the proposed
> > > > > >>>>>>> design. I am not sure we should do this. What do you think?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>> While it will be nice to support additional use-cases
> > > > >>>>>>>>> with one proposal, it is probably also reasonable to make
> > > > >>>>> incremental
> > > > >>>>>>>>> progress and support the low-hanging-fruit use-case first.
> > The
> > > > >>>>> choice
> > > > >>>>>>>>> really depends on the complexity and the importance of
> > supporting
> > > > >>>>> the
> > > > >>>>>>>> extra
> > > > >>>>>>>>> use-cases.
> > > > >>>>>>>>
> > > > > >>>>>>>> That would be true if this were a private implementation
> > > > > >>>>>>>> detail, or if the low-hanging-fruit solution were on the
> > > > > >>>>>>>> direct path to the final solution.
> > > > > >>>>>>>> That's unfortunately not the case here. This will add a
> > > > > >>>>>>>> public-facing API that we will later need to maintain, no
> > > > > >>>>>>>> matter what the final solution will be, and at the moment at
> > > > > >>>>>>>> least I don't see it being related to a "perfect" solution.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Sure. Then let's decide the final solution first.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>> I guess the point is that the suggested approach, which
> > > > >>>> dynamically
> > > > >>>>>>>>> determines the checkpointing interval based on the
> > backpressure,
> > > > >>>>> may
> > > > >>>>>>>> cause
> > > > >>>>>>>>> regression when the checkpointing interval is relatively
> low.
> > > > >>>> This
> > > > >>>>>>> makes
> > > > >>>>>>>> it
> > > > >>>>>>>>> hard for users to enable this feature in production. It is
> > like
> > > > >>>> an
> > > > >>>>>>>>> auto-driving system that is not guaranteed to work
> > > > >>>>>>>>
> > > > > >>>>>>>> Yes, creating a more generic solution that requires less
> > > > > >>>>>>>> configuration is usually more difficult than static
> > > > > >>>>>>>> configuration.
> > > > > >>>>>>>> It doesn't mean we shouldn't try to do that. Especially since,
> > > > > >>>>>>>> if my proposed algorithm didn't work well enough, there is an
> > > > > >>>>>>>> obvious solution: any source could add a metric, like, let's
> > > > > >>>>>>>> say, "processingBacklog: true/false", and the
> > > > > >>>>>>>> `CheckpointTrigger` could use this as an override to always
> > > > > >>>>>>>> switch to the "slowCheckpointInterval". I don't think we need
> > > > > >>>>>>>> it, but that's always an option that would be basically
> > > > > >>>>>>>> equivalent to your original proposal. Or the source could even
> > > > > >>>>>>>> add "suggestedCheckpointInterval : int", and the
> > > > > >>>>>>>> `CheckpointTrigger` could use that value, if present, as a
> > > > > >>>>>>>> hint in one way or another.
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> So far we have talked about the possibility of using a
> > > > > >>>>>>> CheckpointTrigger and mentioned that the CheckpointTrigger
> > > > > >>>>>>> could read metric values.
> > > > >>>>>>>
> > > > >>>>>>> Can you help answer the following questions so that I can
> > understand
> > > > >>>>> the
> > > > >>>>>>> alternative solution more concretely:
> > > > >>>>>>>
> > > > >>>>>>> - What is the interface of this CheckpointTrigger? For
> > example, are
> > > > >>>> we
> > > > >>>>>>> going to give CheckpointTrigger a context that it can use to
> > fetch
> > > > >>>>>>> arbitrary metric values? This can help us understand what
> > information
> > > > >>>>>> this
> > > > >>>>>>> user-defined CheckpointTrigger can use to make the checkpoint
> > > > >>>> decision.
> > > > >>>>>>> - Where is this CheckpointTrigger running? For example, is it
> > going
> > > > >>>> to
> > > > >>>>>> run
> > > > >>>>>>> on the subtask of every source operator? Or is it going to
> run
> > on the
> > > > >>>>> JM?
> > > > >>>>>>> - Are we going to provide a default implementation of this
> > > > >>>>>>> CheckpointTrigger in Flink that implements the algorithm
> > described
> > > > >>>>> below,
> > > > >>>>>>> or do we expect each source operator developer to implement
> > their own
> > > > >>>>>>> CheckpointTrigger?
> > > > >>>>>>> - How can users specify the
> > > > >>>>>> fastCheckpointInterval/slowCheckpointInterval?
> > > > >>>>>>> For example, will we provide APIs on the CheckpointTrigger
> that
> > > > >>>>> end-users
> > > > >>>>>>> can use to specify the checkpointing interval? What would
> that
> > look
> > > > >>>>> like?
> > > > >>>>>>>
> > > > >>>>>>> Overall, my gut feel is that the alternative approach based
> on
> > > > >>>>>>> CheckpointTrigger is more complicated and harder to use. And
> it
> > > > >>>>> probably
> > > > >>>>>>> also has the issues of "having two places to configure
> > checkpointing
> > > > >>>>>>> interval" and "giving flexibility for every source to
> > implement a
> > > > >>>>>> different
> > > > >>>>>>> API" (as mentioned below).
> > > > >>>>>>>
> > > > >>>>>>> Maybe we can evaluate it more after knowing the answers to
> the
> > above
> > > > >>>>>>> questions.
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > > >>>>>>>>> On the other hand, the approach currently proposed in the
> > > > > >>>>>>>>> FLIP is much simpler as it does not depend on backpressure.
> > > > > >>>>>>>>> Users specify the extra interval requirement on the specific
> > > > > >>>>>>>>> sources (e.g. HybridSource, MySQL CDC Source) and can easily
> > > > > >>>>>>>>> know the checkpointing interval that will be used in the
> > > > > >>>>>>>>> continuous phase of the corresponding source. This is pretty
> > > > > >>>>>>>>> much the same as how users use the existing
> > > > > >>>>>>>>> execution.checkpointing.interval config. So there is no
> > > > > >>>>>>>>> extra concern of regression caused by this approach.
> > > > >>>>>>>>
> > > > > >>>>>>>> To an extent, but as I have already mentioned, I really,
> > > > > >>>>>>>> really do not like the idea of:
> > > > > >>>>>>>> - having two places to configure the checkpointing interval
> > > > > >>>>>>>> (config file and in the Source builders)
> > > > > >>>>>>>> - giving flexibility for every source to implement a
> > > > > >>>>>>>> different API for that purpose
> > > > > >>>>>>>> - creating a solution that is not generic enough, so that we
> > > > > >>>>>>>> will need a completely different mechanism in the future
> > > > > >>>>>>>> anyway
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> Yeah, I understand different developers might have different
> > > > > >>>>>>> concerns/tastes for these APIs. Ultimately, there might not be
> > > > > >>>>>>> a perfect solution and we have to choose based on the
> > > > > >>>>>>> pros/cons of these solutions.
> > > > > >>>>>>>
> > > > > >>>>>>> I agree with you that, all things being equal, it is
> > > > > >>>>>>> preferable to 1) have one place to configure checkpointing
> > > > > >>>>>>> intervals, 2) have all source operators use the same API, and
> > > > > >>>>>>> 3) create a solution that is generic and long lasting. Note
> > > > > >>>>>>> that these three goals affect the usability and extensibility
> > > > > >>>>>>> of the API, but not necessarily the stability/performance of
> > > > > >>>>>>> the production job.
> > > > > >>>>>>>
> > > > > >>>>>>> BTW, there are also other preferable goals. For example, it is
> > > > > >>>>>>> very useful for the job's behavior to be predictable and
> > > > > >>>>>>> interpretable so that SREs can operate/debug Flink more
> > > > > >>>>>>> easily. We can list these pros/cons altogether later.
> > > > > >>>>>>>
> > > > > >>>>>>> I am wondering if we can first agree on the priority of the
> > > > > >>>>>>> goals we want to achieve. IMO, it is a hard requirement for
> > > > > >>>>>>> the user-facing API to be clearly defined, and users should be
> > > > > >>>>>>> able to use the API without concern of regression. And this
> > > > > >>>>>>> requirement is more important than the other goals discussed
> > > > > >>>>>>> above because it is related to the stability/performance of
> > > > > >>>>>>> the production job. What do you think?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>> Sounds good. Looking forward to learning more ideas.
> > > > >>>>>>>>
> > > > > >>>>>>>> I have thought about this a bit more, and I think we don't
> > > > > >>>>>>>> need to check the backpressure status, or how overloaded all
> > > > > >>>>>>>> of the operators are.
> > > > > >>>>>>>> We could just check three things for source operators:
> > > > > >>>>>>>> 1. pendingRecords (backlog length)
> > > > > >>>>>>>> 2. numRecordsInPerSecond
> > > > > >>>>>>>> 3. backPressuredTimeMsPerSecond
> > > > >>>>>>>>
> > > > > >>>>>>>> // int metricsUpdateInterval = 10s // obtained from config
> > > > > >>>>>>>> // Next line calculates how many records we can consume from
> > > > > >>>>>>>> // the backlog, assuming that the reason behind the
> > > > > >>>>>>>> // backpressure magically vanishes. We will use this only as
> > > > > >>>>>>>> // a safeguard against scenarios such as backpressure caused
> > > > > >>>>>>>> // by some intermittent failure/performance degradation.
> > > > > >>>>>>>> maxRecordsConsumedWithoutBackpressure =
> > > > > >>>>>>>>     (numRecordsInPerSecond / (1000 - backPressuredTimeMsPerSecond / 1000))
> > > > > >>>>>>>>     * metricsUpdateInterval
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> I am not sure if there is a typo. Because if
> > > > > >>>>>>> backPressuredTimeMsPerSecond = 0, then
> > > > > >>>>>>> maxRecordsConsumedWithoutBackpressure = numRecordsInPerSecond /
> > > > > >>>>>>> 1000 * metricsUpdateInterval according to the above algorithm.
> > > > > >>>>>>>
> > > > > >>>>>>> Do you mean "maxRecordsConsumedWithoutBackpressure =
> > > > > >>>>>>> (numRecordsInPerSecond / (1 - backPressuredTimeMsPerSecond / 1000))
> > > > > >>>>>>> * metricsUpdateInterval"?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > > >>>>>>>> // We are excluding maxRecordsConsumedWithoutBackpressure from
> > > > > >>>>>>>> // the backlog as a safeguard against intermittent
> > > > > >>>>>>>> // backpressure problems, so that we don't calculate the next
> > > > > >>>>>>>> // checkpoint interval far, far in the future while the
> > > > > >>>>>>>> // backpressure goes away before we recalculate the metrics
> > > > > >>>>>>>> // and the new checkpointing interval.
> > > > > >>>>>>>> timeToConsumeBacklog =
> > > > > >>>>>>>>     (pendingRecords - maxRecordsConsumedWithoutBackpressure)
> > > > > >>>>>>>>     / numRecordsInPerSecond
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > > >>>>>>>> Then we can use those numbers to calculate the desired
> > > > > >>>>>>>> checkpointing interval, for example like this:
> > > > > >>>>>>>>
> > > > > >>>>>>>> // this may need some refining
> > > > > >>>>>>>> long calculatedCheckpointInterval = timeToConsumeBacklog / 10;
> > > > > >>>>>>>> long nextCheckpointInterval =
> > > > > >>>>>>>>     min(max(fastCheckpointInterval, calculatedCheckpointInterval),
> > > > > >>>>>>>>         slowCheckpointInterval);
> > > > > >>>>>>>> long nextCheckpointTs = lastCheckpointTs + nextCheckpointInterval;
> > > > >>>>>>>>
> > > > >>>>>>>> WDYT?
> > > > >>>>>>>
> > > > >>>>>>>
> > > > > >>>>>>> I think the idea of the above algorithm is to lean toward
> > > > > >>>>>>> using the fastCheckpointInterval unless we are very sure the
> > > > > >>>>>>> backlog will take a long time to process. This can alleviate
> > > > > >>>>>>> the concern of regression during the continuous_unbounded
> > > > > >>>>>>> phase since we are more likely to use the
> > > > > >>>>>>> fastCheckpointInterval. However, it can cause regression
> > > > > >>>>>>> during the bounded phase.
> > > > >>>>>>>
> > > > > >>>>>>> I will use a concrete example to explain the risk of
> > > > > >>>>>>> regression:
> > > > > >>>>>>> - The user is using HybridSource to read from HDFS followed by
> > > > > >>>>>>> Kafka. The data in HDFS is old and there is no need for data
> > > > > >>>>>>> freshness for the data in HDFS.
> > > > > >>>>>>> - The user configures the job as below:
> > > > > >>>>>>> - fastCheckpointInterval = 3 minutes
> > > > > >>>>>>> - slowCheckpointInterval = 30 minutes
> > > > > >>>>>>> - metricsUpdateInterval = 100 ms
> > > > > >>>>>>>
> > > > > >>>>>>> Using the above formula, we can see that once pendingRecords
> > > > > >>>>>>> <= numRecordsInPerSecond * 30 minutes,
> > > > > >>>>>>> calculatedCheckpointInterval <= 3 minutes, meaning that we
> > > > > >>>>>>> will use fastCheckpointInterval as the checkpointing interval.
> > > > > >>>>>>> Then in the last 30 minutes of the bounded phase, the
> > > > > >>>>>>> checkpointing frequency will be 10X higher than what the user
> > > > > >>>>>>> wants.
> > > > >>>>>>>
> > > > > >>>>>>> Also note that the same issue would considerably limit the
> > > > > >>>>>>> benefits of the algorithm. For example, during the continuous
> > > > > >>>>>>> phase, the algorithm will only be better than the approach in
> > > > > >>>>>>> FLIP-309 when there is at least 30 minutes' worth of backlog
> > > > > >>>>>>> in the source.
> > > > > >>>>>>>
> > > > > >>>>>>> Sure, having a slower checkpointing interval in this extreme
> > > > > >>>>>>> case (where there is a 30-minute backlog in the
> > > > > >>>>>>> continuous-unbounded phase) is still useful when this happens.
> > > > > >>>>>>> But since this is the uncommon case, and the right solution is
> > > > > >>>>>>> probably to do capacity planning to prevent this from
> > > > > >>>>>>> happening in the first place, I am not sure it is worth
> > > > > >>>>>>> optimizing for this case at the cost of regression in the
> > > > > >>>>>>> bounded phase and reduced operational predictability for users
> > > > > >>>>>>> (e.g. what checkpointing interval should I expect at this
> > > > > >>>>>>> stage of the job).
> > > > >>>>>>>
> > > > > >>>>>>> I think the fundamental issue with this algorithm is that it
> > > > > >>>>>>> is applied to both the bounded phases and the
> > > > > >>>>>>> continuous_unbounded phases without knowing which phase the
> > > > > >>>>>>> job is running in. The only information it can access is the
> > > > > >>>>>>> backlog. But two sources with the same amount of backlog do
> > > > > >>>>>>> not necessarily have the same data freshness requirement.
> > > > > >>>>>>>
> > > > > >>>>>>> In this particular example, users know that the data in HDFS
> > > > > >>>>>>> is very old and there is no need for data freshness. Users can
> > > > > >>>>>>> express such signals via the per-source API proposed in the
> > > > > >>>>>>> FLIP. This is why the current approach in FLIP-309 can be
> > > > > >>>>>>> better in this case.
> > > > >>>>>>>
> > > > >>>>>>> What do you think?
> > > > >>>>>>>
> > > > >>>>>>> Best,
> > > > >>>>>>> Dong
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Best,
> > > > >>>>>>>> Piotrek
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
> >
>
