Re: [DISCUSS] FLIP-309: Enable operators to trigger checkpoints dynamically

Dong Lin Mon, 26 Jun 2023 23:39:11 -0700

Thank you Leonard for the review!

Hi Piotr, do you have any comments on the latest proposal?


I am wondering if it is OK to start the voting thread this week.

On Mon, Jun 26, 2023 at 4:10 PM Leonard Xu <xbjt...@gmail.com> wrote:

> Thanks Dong for driving this FLIP forward!
>
> Introducing  `backlog status` concept for flink job makes sense to me as
> following reasons:
>
> From concept/API design perspective, it’s more general and natural than
> above proposals as it can be used in HybridSource for bounded records, CDC
> Source for history snapshot and general sources like KafkaSource for
> historical messages.
>
> From user cases/requirements, I’ve seen many users manually to set larger
> checkpoint interval during backfilling and then set a shorter checkpoint
> interval for real-time processing in their production environments as a
> flink application optimization. Now, the flink framework can make this
> optimization no longer require the user to set the checkpoint interval and
> restart the job multiple times.
>
> Following supporting using larger checkpoint for job under backlog status
> in current FLIP, we can explore supporting larger parallelism/memory/cpu
> for job under backlog status in the future.
>
> In short, the updated FLIP looks good to me.
>
>
> Best,
> Leonard
>
>
> > On Jun 22, 2023, at 12:07 PM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > Hi Piotr,
> >
> > Thanks again for proposing the isProcessingBacklog concept.
> >
> > After discussing with Becket Qin and thinking about this more, I agree it
> > is a better idea to add a top-level concept to all source operators to
> > address the target use-case.
> >
> > The main reason that changed my mind is that isProcessingBacklog can be
> > described as an inherent/nature attribute of every source instance and
> its
> > semantics does not need to depend on any specific checkpointing policy.
> > Also, we can hardcode the isProcessingBacklog behavior for the sources we
> > have considered so far (e.g. HybridSource and MySQL CDC source) without
> > asking users to explicitly configure the per-source behavior, which
> indeed
> > provides better user experience.
> >
> > I have updated the FLIP based on the latest suggestions. The latest FLIP
> no
> > longer introduces per-source config that can be used by end-users. While
> I
> > agree with you that CheckpointTrigger can be a useful feature to address
> > additional use-cases, I am not sure it is necessary for the use-case
> > targeted by FLIP-309. Maybe we can introduce CheckpointTrigger separately
> > in another FLIP?
> >
> > Can you help take another look at the updated FLIP?
> >
> > Best,
> > Dong
> >
> >
> >
> > On Fri, Jun 16, 2023 at 11:59 PM Piotr Nowojski <pnowoj...@apache.org>
> > wrote:
> >
> >> Hi Dong,
> >>
> >>> Suppose there are 1000 subtask and each subtask has 1% chance of being
> >>> "backpressured" at a given time (due to random traffic spikes). Then at
> >> any
> >>> given time, the chance of the job
> >>> being considered not-backpressured = (1-0.01)^1000. Since we evaluate
> the
> >>> backpressure metric once a second, the estimated time for the job
> >>> to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) =
> 23163
> >>> sec = 6.4 hours.
> >>>
> >>> This means that the job will effectively always use the longer
> >>> checkpointing interval. It looks like a real concern, right?
> >>
> >> Sorry I don't understand where you are getting those numbers from.
> >> Instead of trying to find loophole after loophole, could you try to
> think
> >> how a given loophole could be improved/solved?
> >>
> >>> Hmm... I honestly think it will be useful to know the APIs due to the
> >>> following reasons.
> >>
> >> Please propose something. I don't think it's needed.
> >>
> >>> - For the use-case mentioned in FLIP-309 motivation section, would the
> >> APIs
> >>> of this alternative approach be more or less usable?
> >>
> >> Everything that you originally wanted to achieve in FLIP-309, you could
> do
> >> as well in my proposal.
> >> Vide my many mentions of the "hacky solution".
> >>
> >>> - Can these APIs reliably address the extra use-case (e.g. allow
> >>> checkpointing interval to change dynamically even during the unbounded
> >>> phase) as it claims?
> >>
> >> I don't see why not.
> >>
> >>> - Can these APIs be decoupled from the APIs currently proposed in
> >> FLIP-309?
> >>
> >> Yes
> >>
> >>> For example, if the APIs of this alternative approach can be decoupled
> >> from
> >>> the APIs currently proposed in FLIP-309, then it might be reasonable to
> >>> work on this extra use-case with a more advanced/complicated design
> >>> separately in a followup work.
> >>
> >> As I voiced my concerns previously, the current design of FLIP-309 would
> >> clog the public API and in the long run confuse the users. IMO It's
> >> addressing the
> >> problem in the wrong place.
> >>
> >>> Hmm.. do you mean we can do the following:
> >>> - Have all source operators emit a metric named "processingBacklog".
> >>> - Add a job-level config that specifies "the checkpointing interval to
> be
> >>> used when any source is processing backlog".
> >>> - The JM collects the "processingBacklog" periodically from all source
> >>> operators and uses the newly added config value as appropriate.
> >>
> >> Yes.
> >>
> >>> The challenge with this approach is that we need to define the
> semantics
> >> of
> >>> this "processingBacklog" metric and have all source operators
> >>> implement this metric. I am not sure we are able to do this yet without
> >>> having users explicitly provide this information on a per-source basis.
> >>>
> >>> Suppose the job read from a bounded Kafka source, should it emit
> >>> "processingBacklog=true"? If yes, then the job might use long
> >> checkpointing
> >>> interval even
> >>> if the job is asked to process data starting from now to the next 1
> hour.
> >>> If no, then the job might use the short checkpointing interval
> >>> even if the job is asked to re-process data starting from 7 days ago.
> >>
> >> Yes. The same can be said of your proposal. Your proposal has the very
> same
> >> issues
> >> that every source would have to implement it differently, most sources
> >> would
> >> have no idea how to properly calculate the new requested checkpoint
> >> interval,
> >> for those that do know how to do that, user would have to configure
> every
> >> source
> >> individually and yet again we would end up with a system, that works
> only
> >> partially in
> >> some special use cases (HybridSource), that's confusing the users even
> >> more.
> >>
> >> That's why I think the more generic solution, working primarily on the
> same
> >> metrics that are used by various auto scaling solutions (like Flink K8s
> >> operator's
> >> autosaler) would be better. The hacky solution I proposed to:
> >> 1. show you that the generic solution is simply a superset of your
> proposal
> >> 2. if you are adamant that busyness/backpressured/records processing
> >> rate/pending records
> >>    metrics wouldn't cover your use case sufficiently (imo they can),
> then
> >> you can very easily
> >>    enhance this algorithm with using some hints from the sources. Like
> >> "processingBacklog==true"
> >>    to short circuit the main algorithm, if `processingBacklog` is
> >> available.
> >>
> >> Best,
> >> Piotrek
> >>
> >>
> >> pt., 16 cze 2023 o 04:45 Dong Lin <lindon...@gmail.com> napisał(a):
> >>
> >>> Hi again Piotr,
> >>>
> >>> Thank you for the reply. Please see my reply inline.
> >>>
> >>> On Fri, Jun 16, 2023 at 12:11 AM Piotr Nowojski <
> >> piotr.nowoj...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi again Dong,
> >>>>
> >>>>> I understand that JM will get the backpressure-related metrics every
> >>> time
> >>>>> the RestServerEndpoint receives the REST request to get these
> >> metrics.
> >>>> But
> >>>>> I am not sure if RestServerEndpoint is already always receiving the
> >>> REST
> >>>>> metrics at regular interval (suppose there is no human manually
> >>>>> opening/clicking the Flink Web UI). And if it does, what is the
> >>> interval?
> >>>>
> >>>> Good catch, I've thought that metrics are pre-emptively sent to JM
> >> every
> >>> 10
> >>>> seconds.
> >>>> Indeed that's not the case at the moment, and that would have to be
> >>>> improved.
> >>>>
> >>>>> I would be surprised if Flink is already paying this much overhead
> >> just
> >>>> for
> >>>>> metrics monitoring. That is the main reason I still doubt it is true.
> >>> Can
> >>>>> you show where this 100 ms is currently configured?
> >>>>>
> >>>>> Alternatively, maybe you mean that we should add extra code to invoke
> >>> the
> >>>>> REST API at 100 ms interval. Then that means we need to considerably
> >>>>> increase the network/cpu overhead at JM, where the overhead will
> >>> increase
> >>>>> as the number of TM/slots increase, which may pose risk to the
> >>>> scalability
> >>>>> of the proposed design. I am not sure we should do this. What do you
> >>>> think?
> >>>>
> >>>> Sorry. I didn't mean metric should be reported every 100ms. I meant
> >> that
> >>>> "backPressuredTimeMsPerSecond (metric) would report (a value of)
> >>> 100ms/s."
> >>>> once per metric interval (10s?).
> >>>>
> >>>
> >>> Suppose there are 1000 subtask and each subtask has 1% chance of being
> >>> "backpressured" at a given time (due to random traffic spikes). Then at
> >> any
> >>> given time, the chance of the job
> >>> being considered not-backpressured = (1-0.01)^1000. Since we evaluate
> the
> >>> backpressure metric once a second, the estimated time for the job
> >>> to be considered not-backpressured is roughly 1 / ((1-0.01)^1000) =
> 23163
> >>> sec = 6.4 hours.
> >>>
> >>> This means that the job will effectively always use the longer
> >>> checkpointing interval. It looks like a real concern, right?
> >>>
> >>>
> >>>>> - What is the interface of this CheckpointTrigger? For example, are
> >> we
> >>>>> going to give CheckpointTrigger a context that it can use to fetch
> >>>>> arbitrary metric values? This can help us understand what information
> >>>> this
> >>>>> user-defined CheckpointTrigger can use to make the checkpoint
> >> decision.
> >>>>
> >>>> I honestly don't think this is important at this stage of the
> >> discussion.
> >>>> It could have
> >>>> whatever interface we would deem to be best. Required things:
> >>>>
> >>>> - access to at least a subset of metrics that the given
> >>> `CheckpointTrigger`
> >>>> requests,
> >>>>  for example via some registration mechanism, so we don't have to
> >> fetch
> >>>> all of the
> >>>>  metrics all the time from TMs.
> >>>> - some way to influence `CheckpointCoordinator`. Either via manually
> >>>> triggering
> >>>>  checkpoints, and/or ability to change the checkpointing interval.
> >>>>
> >>>
> >>> Hmm... I honestly think it will be useful to know the APIs due to the
> >>> following reasons.
> >>>
> >>> We would need to know the concrete APIs to gauge the following:
> >>> - For the use-case mentioned in FLIP-309 motivation section, would the
> >> APIs
> >>> of this alternative approach be more or less usable?
> >>> - Can these APIs reliably address the extra use-case (e.g. allow
> >>> checkpointing interval to change dynamically even during the unbounded
> >>> phase) as it claims?
> >>> - Can these APIs be decoupled from the APIs currently proposed in
> >> FLIP-309?
> >>>
> >>> For example, if the APIs of this alternative approach can be decoupled
> >> from
> >>> the APIs currently proposed in FLIP-309, then it might be reasonable to
> >>> work on this extra use-case with a more advanced/complicated design
> >>> separately in a followup work.
> >>>
> >>>
> >>>>> - Where is this CheckpointTrigger running? For example, is it going
> >> to
> >>>> run
> >>>>> on the subtask of every source operator? Or is it going to run on the
> >>> JM?
> >>>>
> >>>> IMO on the JM.
> >>>>
> >>>>> - Are we going to provide a default implementation of this
> >>>>> CheckpointTrigger in Flink that implements the algorithm described
> >>> below,
> >>>>> or do we expect each source operator developer to implement their own
> >>>>> CheckpointTrigger?
> >>>>
> >>>> As I mentioned before, I think we should provide at the very least the
> >>>> implementation
> >>>> that replaces the current triggering mechanism (statically configured
> >>>> checkpointing interval)
> >>>> and it would be great to provide the backpressure monitoring trigger
> as
> >>>> well.
> >>>>
> >>>
> >>> I agree that if there is a good use-case that can be addressed by the
> >>> proposed CheckpointTrigger, then it is reasonable
> >>> to add CheckpointTrigger and replace the current triggering mechanism
> >> with
> >>> it.
> >>>
> >>> I also agree that we will likely find such a use-case. For example,
> >> suppose
> >>> the source records have event timestamps, then it is likely
> >>> that we can use the trigger to dynamically control the checkpointing
> >>> interval based on the difference between the watermark and current
> system
> >>> time.
> >>>
> >>> But I am not sure the addition of this CheckpointTrigger should be
> >> coupled
> >>> with FLIP-309. Whether or not it is coupled probably depends on the
> >>> concrete API design around CheckpointTrigger.
> >>>
> >>> If you would be adamant that the backpressure monitoring doesn't cover
> >> well
> >>>> enough your use case, I would be ok to provide the hacky version that
> I
> >>>> also mentioned
> >>>> before:
> >>>
> >>>
> >>>> """
> >>>> Especially that if my proposed algorithm wouldn't work good enough,
> >> there
> >>>> is
> >>>> an obvious solution, that any source could add a metric, like let say
> >>>> "processingBacklog: true/false", and the `CheckpointTrigger`
> >>>> could use this as an override to always switch to the
> >>>> "slowCheckpointInterval". I don't think we need it, but that's always
> >> an
> >>>> option
> >>>> that would be basically equivalent to your original proposal.
> >>>> """
> >>>>
> >>>
> >>> Hmm.. do you mean we can do the following:
> >>> - Have all source operators emit a metric named "processingBacklog".
> >>> - Add a job-level config that specifies "the checkpointing interval to
> be
> >>> used when any source is processing backlog".
> >>> - The JM collects the "processingBacklog" periodically from all source
> >>> operators and uses the newly added config value as appropriate.
> >>>
> >>> The challenge with this approach is that we need to define the
> semantics
> >> of
> >>> this "processingBacklog" metric and have all source operators
> >>> implement this metric. I am not sure we are able to do this yet without
> >>> having users explicitly provide this information on a per-source basis.
> >>>
> >>> Suppose the job read from a bounded Kafka source, should it emit
> >>> "processingBacklog=true"? If yes, then the job might use long
> >> checkpointing
> >>> interval even
> >>> if the job is asked to process data starting from now to the next 1
> hour.
> >>> If no, then the job might use the short checkpointing interval
> >>> even if the job is asked to re-process data starting from 7 days ago.
> >>>
> >>>
> >>>>
> >>>>> - How can users specify the
> >>>> fastCheckpointInterval/slowCheckpointInterval?
> >>>>> For example, will we provide APIs on the CheckpointTrigger that
> >>> end-users
> >>>>> can use to specify the checkpointing interval? What would that look
> >>> like?
> >>>>
> >>>> Also as I mentioned before, just like metric reporters are configured:
> >>>>
> >>>>
> >>>
> >>
> https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/deployment/metric_reporters/
> >>>> Every CheckpointTrigger could have its own custom configuration.
> >>>>
> >>>>> Overall, my gut feel is that the alternative approach based on
> >>>>> CheckpointTrigger is more complicated
> >>>>
> >>>> Yes, as usual, more generic things are more complicated, but often
> more
> >>>> useful in the long run.
> >>>>
> >>>>> and harder to use.
> >>>>
> >>>> I don't agree. Why setting in config
> >>>>
> >>>> execution.checkpointing.trigger:
> >> BackPressureMonitoringCheckpointTrigger
> >>>>
> >>>>
> >>>
> >>
> execution.checkpointing.BackPressureMonitoringCheckpointTrigger.fast-interval:
> >>>> 1s
> >>>>
> >>>>
> >>>
> >>
> execution.checkpointing.BackPressureMonitoringCheckpointTrigger.slow-interval:
> >>>> 30s
> >>>>
> >>>> that we could even provide a shortcut to the above construct via:
> >>>>
> >>>> execution.checkpointing.fast-interval: 1s
> >>>> execution.checkpointing.slow-interval: 30s
> >>>>
> >>>> is harder compared to setting two/three checkpoint intervals, one in
> >> the
> >>>> config/or via `env.enableCheckpointing(x)`,
> >>>> secondly passing one/two (fast/slow) values on the source itself?
> >>>>
> >>>
> >>> If we can address the use-case by providing just the two job-level
> config
> >>> as described above, I agree it will indeed be simpler.
> >>>
> >>> I have tried to achieve this goal. But the caveat is that it requires
> >> much
> >>> more work than described above in order to give the configs
> well-defined
> >>> semantics. So I find it simpler to just use the approach in FLIP-309.
> >>>
> >>> Let me explain my concern below. It will be great if you or someone
> else
> >>> can help provide a solution.
> >>>
> >>> 1) We need to clearly document when the fast-interval and slow-interval
> >>> will be used so that users can derive the expected behavior of the job
> >> and
> >>> be able to config these values.
> >>>
> >>> 2) The trigger of fast/slow interval depends on the behavior of the
> >> source
> >>> (e.g. MySQL CDC, HybridSource). However, no existing concepts of source
> >>> operator (e.g. boundedness) can describe the target behavior. For
> >> example,
> >>> MySQL CDC internally has two phases, namely snapshot phase and binlog
> >>> phase, which are not explicitly exposed to its users via source
> operator
> >>> API. And we probably should not enumerate all internal phases of all
> >> source
> >>> operators that are affected by fast/slow interval.
> >>>
> >>> 3) An alternative approach might be to define a new concept (e.g.
> >>> processingBacklog) that is applied to all source operators. Then the
> >>> fast/slow interval's documentation can depend on this concept. That
> means
> >>> we have to add a top-level concept (similar to source boundedness) and
> >>> require all source operators to specify how they enforce this concept
> >> (e.g.
> >>> FileSystemSource always emits processingBacklog=true). And there might
> be
> >>> cases where the source itself (e.g. a bounded Kafka Source) can not
> >>> automatically derive the value of this concept, in which case we need
> to
> >>> provide option for users to explicitly specify the value for this
> concept
> >>> on a per-source basis.
> >>>
> >>>
> >>>
> >>>>> And it probably also has the issues of "having two places to
> >> configure
> >>>> checkpointing
> >>>>> interval" and "giving flexibility for every source to implement a
> >>>> different
> >>>>> API" (as mentioned below).
> >>>>
> >>>> No, it doesn't.
> >>>>
> >>>>> IMO, it is a hard-requirement for the user-facing API to be
> >>>>> clearly defined and users should be able to use the API without
> >> concern
> >>>> of
> >>>>> regression. And this requirement is more important than the other
> >> goals
> >>>>> discussed above because it is related to the stability/performance of
> >>> the
> >>>>> production job. What do you think?
> >>>>
> >>>> I don't agree with this. There are many things that work something in
> >>>> between perfectly and well enough
> >>>> in some fraction of use cases (maybe in 99%, maybe 95% or maybe 60%),
> >>> while
> >>>> still being very useful.
> >>>> Good examples are: selection of state backend, unaligned checkpoints,
> >>>> buffer debloating but frankly if I go
> >>>> through list of currently available config options, something like
> half
> >>> of
> >>>> them can cause regressions. Heck,
> >>>> even Flink itself doesn't work perfectly in 100% of the use cases, due
> >>> to a
> >>>> variety of design choices. Of
> >>>> course, the more use cases are fine with said feature, the better, but
> >> we
> >>>> shouldn't fixate to perfectly cover
> >>>> 100% of the cases, as that's impossible.
> >>>>
> >>>> In this particular case, if back pressure monitoring  trigger can work
> >>> well
> >>>> enough in 95% of cases, I would
> >>>> say that's already better than the originally proposed alternative,
> >> which
> >>>> doesn't work at all if user has a large
> >>>> backlog to reprocess from Kafka, including when using HybridSource
> >> AFTER
> >>>> the switch to Kafka has
> >>>> happened. For the remaining 5%, we should try to improve the behaviour
> >>> over
> >>>> time, but ultimately, users can
> >>>> decide to just run a fixed checkpoint interval (or at worst use the
> >> hacky
> >>>> checkpoint trigger that I mentioned
> >>>> before a couple of times).
> >>>>
> >>>> Also to be pedantic, if a user naively selects slow-interval in your
> >>>> proposal to 30 minutes, when that user's
> >>>> job fails on average every 15-20minutes, his job can end up in a state
> >>> that
> >>>> it can not make any progress,
> >>>> this arguably is quite serious regression.
> >>>>
> >>>
> >>> I probably should not say it is "hard requirement". After all there are
> >>> pros/cons. We will need to consider implementation complexity,
> usability,
> >>> extensibility etc.
> >>>
> >>> I just don't think we should take it for granted to introduce
> regression
> >>> for one use-case in order to support another use-case. If we can not
> find
> >>> an algorithm/solution that addresses
> >>> both use-case well, I hope we can be open to tackle them separately so
> >> that
> >>> users can choose the option that best fits their needs.
> >>>
> >>> All things else being equal, I think it is preferred for user-facing
> API
> >> to
> >>> be clearly defined and let users should be able to use the API without
> >>> concern of regression.
> >>>
> >>> Maybe we can list pros/cons for the alternative approaches we have been
> >>> discussing and see choose the best approach. And maybe we will end up
> >>> finding that use-case
> >>> which needs CheckpointTrigger can be tackled separately from the
> use-case
> >>> in FLIP-309.
> >>>
> >>>
> >>>>> I am not sure if there is a typo. Because if
> >>> backPressuredTimeMsPerSecond
> >>>> =
> >>>>> 0, then maxRecordsConsumedWithoutBackpressure =
> >> numRecordsInPerSecond /
> >>>>> 1000 * metricsUpdateInterval according to the above algorithm.
> >>>>>
> >>>>> Do you mean "maxRecordsConsumedWithoutBackpressure =
> >>>> (numRecordsInPerSecond
> >>>>> / (1 - backPressuredTimeMsPerSecond / 1000)) *
> >> metricsUpdateInterval"?
> >>>>
> >>>> It looks like there is indeed some mistake in my proposal above. Yours
> >>> look
> >>>> more correct, it probably
> >>>> still needs some safeguard/special handling if
> >>>> `backPressuredTimeMsPerSecond > 950`
> >>>>
> >>>>> The only information it can access is the backlog.
> >>>>
> >>>> Again no. It can access whatever we want to provide to it.
> >>>>
> >>>> Regarding the rest of your concerns. It's a matter of tweaking the
> >>>> parameters and the algorithm itself,
> >>>> and how much safety-net do we want to have. Ultimately, I'm pretty
> sure
> >>>> that's a (for 95-99% of cases)
> >>>> solvable problem. If not, there is always the hacky solution, that
> >> could
> >>> be
> >>>> even integrated into this above
> >>>> mentioned algorithm as a short circuit to always reach
> `slow-interval`.
> >>>>
> >>>> Apart of that, you picked 3 minutes as the checkpointing interval in
> >> your
> >>>> counter example. In most cases
> >>>> any interval above 1 minute would inflict pretty negligible overheads,
> >> so
> >>>> all in all, I would doubt there is
> >>>> a significant benefit (in most cases) of increasing 3 minute
> checkpoint
> >>>> interval to anything more, let alone
> >>>> 30 minutes.
> >>>>
> >>>
> >>> I am not sure we should design the algorithm with the assumption that
> the
> >>> short checkpointing interval will always be higher than 1 minute etc.
> >>>
> >>> I agree the proposed algorithm can solve most cases where the resource
> is
> >>> sufficient and there is always no backlog in source subtasks. On the
> >> other
> >>> hand, what makes SRE
> >>> life hard is probably the remaining 1-5% cases where the traffic is
> spiky
> >>> and the cluster is reaching its capacity limit. The ability to predict
> >> and
> >>> control Flink job's behavior (including checkpointing interval) can
> >>> considerably reduce the burden of manging Flink jobs.
> >>>
> >>> Best,
> >>> Dong
> >>>
> >>>
> >>>>
> >>>> Best,
> >>>> Piotrek
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> sob., 3 cze 2023 o 05:44 Dong Lin <lindon...@gmail.com> napisał(a):
> >>>>
> >>>>> Hi Piotr,
> >>>>>
> >>>>> Thanks for the explanations. I have some followup questions below.
> >>>>>
> >>>>> On Fri, Jun 2, 2023 at 10:55 PM Piotr Nowojski <pnowoj...@apache.org
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>> Thanks for chipping in the discussion Ahmed!
> >>>>>>
> >>>>>> Regarding using the REST API. Currently I'm leaning towards
> >>>> implementing
> >>>>>> this feature inside the Flink itself, via some pluggable interface.
> >>>>>> REST API solution would be tempting, but I guess not everyone is
> >>> using
> >>>>>> Flink Kubernetes Operator.
> >>>>>>
> >>>>>> @Dong
> >>>>>>
> >>>>>>> I am not sure metrics such as isBackPressured are already sent to
> >>> JM.
> >>>>>>
> >>>>>> Fetching code path on the JM:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl#queryTmMetricsFuture
> >>>>>>
> >> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore#add
> >>>>>>
> >>>>>> Example code path accessing Task level metrics via JM using the
> >>>>>> `MetricStore`:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> org.apache.flink.runtime.rest.handler.job.metrics.AggregatingSubtasksMetricsHandler
> >>>>>>
> >>>>>
> >>>>> Thanks for the code reference. I checked the code that invoked these
> >>> two
> >>>>> classes and found the following information:
> >>>>>
> >>>>> - AggregatingSubtasksMetricsHandler#getStoresis currently invoked
> >> only
> >>>>> when AggregatingJobsMetricsHandler is invoked.
> >>>>> - AggregatingJobsMetricsHandler is only instantiated and returned by
> >>>>> WebMonitorEndpoint#initializeHandlers
> >>>>> - WebMonitorEndpoint#initializeHandlers is only used by
> >>>> RestServerEndpoint.
> >>>>> And RestServerEndpoint invokes these handlers in response to external
> >>>> REST
> >>>>> request.
> >>>>>
> >>>>> I understand that JM will get the backpressure-related metrics every
> >>> time
> >>>>> the RestServerEndpoint receives the REST request to get these
> >> metrics.
> >>>> But
> >>>>> I am not sure if RestServerEndpoint is already always receiving the
> >>> REST
> >>>>> metrics at regular interval (suppose there is no human manually
> >>>>> opening/clicking the Flink Web UI). And if it does, what is the
> >>> interval?
> >>>>>
> >>>>>
> >>>>>
> >>>>>>> For example, let's say every source operator subtask reports this
> >>>>> metric
> >>>>>> to
> >>>>>>> JM once every 10 seconds. There are 100 source subtasks. And each
> >>>>> subtask
> >>>>>>> is backpressured roughly 10% of the total time due to traffic
> >>> spikes
> >>>>> (and
> >>>>>>> limited buffer). Then at any given time, there are 1 - 0.9^100 =
> >>>>> 99.997%
> >>>>>>> chance that there is at least one subtask that is backpressured.
> >>> Then
> >>>>> we
> >>>>>>> have to wait for at least 10 seconds to check again.
> >>>>>>
> >>>>>> backPressuredTimeMsPerSecond and other related metrics (like
> >>>>>> busyTimeMsPerSecond) are not subject to that problem.
> >>>>>> They are recalculated once every metric fetching interval, and they
> >>>>> report
> >>>>>> accurately on average the given subtask spent
> >>>> busy/idling/backpressured.
> >>>>>> In your example, backPressuredTimeMsPerSecond would report 100ms/s.
> >>>>>
> >>>>>
> >>>>> Suppose every subtask is already reporting
> >> backPressuredTimeMsPerSecond
> >>>> to
> >>>>> JM once every 100 ms. If a job has 10 operators (that are not
> >> chained)
> >>>> and
> >>>>> each operator has 100 subtasks, then JM would need to handle 10000
> >>>> requests
> >>>>> per second to receive metrics from these 1000 subtasks. It seems
> >> like a
> >>>>> non-trivial overhead for medium-to-large sized jobs and can make JM
> >> the
> >>>>> performance bottleneck during job execution.
> >>>>>
> >>>>> I would be surprised if Flink is already paying this much overhead
> >> just
> >>>> for
> >>>>> metrics monitoring. That is the main reason I still doubt it is true.
> >>> Can
> >>>>> you show where this 100 ms is currently configured?
> >>>>>
> >>>>> Alternatively, maybe you mean that we should add extra code to invoke
> >>> the
> >>>>> REST API at 100 ms interval. Then that means we need to considerably
> >>>>> increase the network/cpu overhead at JM, where the overhead will
> >>> increase
> >>>>> as the number of TM/slots increase, which may pose risk to the
> >>>> scalability
> >>>>> of the proposed design. I am not sure we should do this. What do you
> >>>> think?
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> While it will be nice to support additional use-cases
> >>>>>>> with one proposal, it is probably also reasonable to make
> >>> incremental
> >>>>>>> progress and support the low-hanging-fruit use-case first. The
> >>> choice
> >>>>>>> really depends on the complexity and the importance of supporting
> >>> the
> >>>>>> extra
> >>>>>>> use-cases.
> >>>>>>
> >>>>>> That would be true, if that was a private implementation detail or
> >> if
> >>>> the
> >>>>>> low-hanging-fruit-solution would be on the direct path to the final
> >>>>>> solution.
> >>>>>> That's unfortunately not the case here. This will add public facing
> >>>> API,
> >>>>>> that we will later need to maintain, no matter what the final
> >>> solution
> >>>>> will
> >>>>>> be,
> >>>>>> and at the moment at least I don't see it being related to a
> >>> "perfect"
> >>>>>> solution.
> >>>>>
> >>>>>
> >>>>> Sure. Then let's decide the final solution first.
> >>>>>
> >>>>>
> >>>>>>> I guess the point is that the suggested approach, which
> >> dynamically
> >>>>>>> determines the checkpointing interval based on the backpressure,
> >>> may
> >>>>>> cause
> >>>>>>> regression when the checkpointing interval is relatively low.
> >> This
> >>>>> makes
> >>>>>> it
> >>>>>>> hard for users to enable this feature in production. It is like
> >> an
> >>>>>>> auto-driving system that is not guaranteed to work
> >>>>>>
> >>>>>> Yes, creating a more generic solution that would require less
> >>>>> configuration
> >>>>>> is usually more difficult then static configurations.
> >>>>>> It doesn't mean we shouldn't try to do that. Especially that if my
> >>>>> proposed
> >>>>>> algorithm wouldn't work good enough, there is
> >>>>>> an obvious solution, that any source could add a metric, like let
> >> say
> >>>>>> "processingBacklog: true/false", and the `CheckpointTrigger`
> >>>>>> could use this as an override to always switch to the
> >>>>>> "slowCheckpointInterval". I don't think we need it, but that's
> >> always
> >>>> an
> >>>>>> option
> >>>>>> that would be basically equivalent to your original proposal. Or
> >> even
> >>>>>> source could add "suggestedCheckpointInterval : int", and
> >>>>>> `CheckpointTrigger` could use that value if present as a hint in
> >> one
> >>>> way
> >>>>> or
> >>>>>> another.
> >>>>>>
> >>>>>
> >>>>> So far we have talked about the possibility of using
> >> CheckpointTrigger
> >>>> and
> >>>>> mentioned the CheckpointTrigger
> >>>>> and read metric values.
> >>>>>
> >>>>> Can you help answer the following questions so that I can understand
> >>> the
> >>>>> alternative solution more concretely:
> >>>>>
> >>>>> - What is the interface of this CheckpointTrigger? For example, are
> >> we
> >>>>> going to give CheckpointTrigger a context that it can use to fetch
> >>>>> arbitrary metric values? This can help us understand what information
> >>>> this
> >>>>> user-defined CheckpointTrigger can use to make the checkpoint
> >> decision.
> >>>>> - Where is this CheckpointTrigger running? For example, is it going
> >> to
> >>>> run
> >>>>> on the subtask of every source operator? Or is it going to run on the
> >>> JM?
> >>>>> - Are we going to provide a default implementation of this
> >>>>> CheckpointTrigger in Flink that implements the algorithm described
> >>> below,
> >>>>> or do we expect each source operator developer to implement their own
> >>>>> CheckpointTrigger?
> >>>>> - How can users specify the
> >>>> fastCheckpointInterval/slowCheckpointInterval?
> >>>>> For example, will we provide APIs on the CheckpointTrigger that
> >>> end-users
> >>>>> can use to specify the checkpointing interval? What would that look
> >>> like?
> >>>>>
> >>>>> Overall, my gut feel is that the alternative approach based on
> >>>>> CheckpointTrigger is more complicated and harder to use. And it
> >>> probably
> >>>>> also has the issues of "having two places to configure checkpointing
> >>>>> interval" and "giving flexibility for every source to implement a
> >>>> different
> >>>>> API" (as mentioned below).
> >>>>>
> >>>>> Maybe we can evaluate it more after knowing the answers to the above
> >>>>> questions.
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> On the other hand, the approach currently proposed in the FLIP is
> >>>> much
> >>>>>>> simpler as it does not depend on backpressure. Users specify the
> >>>> extra
> >>>>>>> interval requirement on the specific sources (e.g. HybridSource,
> >>>> MySQL
> >>>>>> CDC
> >>>>>>> Source) and can easily know the checkpointing interval will be
> >> used
> >>>> on
> >>>>>> the
> >>>>>>> continuous phase of the corresponding source. This is pretty much
> >>>> same
> >>>>> as
> >>>>>>> how users use the existing execution.checkpointing.interval
> >> config.
> >>>> So
> >>>>>>> there is no extra concern of regression caused by this approach.
> >>>>>>
> >>>>>> To an extent, but as I have already previously mentioned I really
> >>>> really
> >>>>> do
> >>>>>> not like idea of:
> >>>>>>  - having two places to configure checkpointing interval (config
> >>> file
> >>>>> and
> >>>>>> in the Source builders)
> >>>>>>  - giving flexibility for every source to implement a different
> >> API
> >>>> for
> >>>>>> that purpose
> >>>>>>  - creating a solution that is not generic enough, so that we will
> >>>> need
> >>>>> a
> >>>>>> completely different mechanism in the future anyway
> >>>>>>
> >>>>>
> >>>>> Yeah, I understand different developers might have different
> >>>>> concerns/tastes for these APIs. Ultimately, there might not be a
> >>> perfect
> >>>>> solution and we have to choose based on the pros/cons of these
> >>> solutions.
> >>>>>
> >>>>> I agree with you that, all things being equal, it is preferable to 1)
> >>>> have
> >>>>> one place to configure checkpointing intervals, 2) have all source
> >>>>> operators use the same API, and 3) create a solution that is generic
> >>> and
> >>>>> last lasting. Note that these three goals affects the usability and
> >>>>> extensibility of the API, but not necessarily the
> >> stability/performance
> >>>> of
> >>>>> the production job.
> >>>>>
> >>>>> BTW, there are also other preferrable goals. For example, it is very
> >>>> useful
> >>>>> for the job's behavior to be predictable and interpretable so that
> >> SRE
> >>>> can
> >>>>> operator/debug the Flink in an easier way. We can list these
> >> pros/cons
> >>>>> altogether later.
> >>>>>
> >>>>> I am wondering if we can first agree on the priority of goals we want
> >>> to
> >>>>> achieve. IMO, it is a hard-requirement for the user-facing API to be
> >>>>> clearly defined and users should be able to use the API without
> >> concern
> >>>> of
> >>>>> regression. And this requirement is more important than the other
> >> goals
> >>>>> discussed above because it is related to the stability/performance of
> >>> the
> >>>>> production job. What do you think?
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> Sounds good. Looking forward to learning more ideas.
> >>>>>>
> >>>>>> I have thought about this a bit more, and I think we don't need to
> >>>> check
> >>>>>> for the backpressure status, or how much overloaded all of the
> >>>> operators
> >>>>>> are.
> >>>>>> We could just check three things for source operators:
> >>>>>> 1. pendingRecords (backlog length)
> >>>>>> 2. numRecordsInPerSecond
> >>>>>> 3. backPressuredTimeMsPerSecond
> >>>>>>
> >>>>>> // int metricsUpdateInterval = 10s // obtained from config
> >>>>>> // Next line calculates how many records can we consume from the
> >>>> backlog,
> >>>>>> assuming
> >>>>>> // that magically the reason behind a backpressure vanishes. We
> >> will
> >>>> use
> >>>>>> this only as
> >>>>>> // a safeguard  against scenarios like for example if backpressure
> >>> was
> >>>>>> caused by some
> >>>>>> // intermittent failure/performance degradation.
> >>>>>> maxRecordsConsumedWithoutBackpressure = (numRecordsInPerSecond /
> >>> (1000
> >>>>>> - backPressuredTimeMsPerSecond / 1000)) * metricsUpdateInterval
> >>>>>>
> >>>>>
> >>>>> I am not sure if there is a typo. Because if
> >>>> backPressuredTimeMsPerSecond =
> >>>>> 0, then maxRecordsConsumedWithoutBackpressure =
> >> numRecordsInPerSecond /
> >>>>> 1000 * metricsUpdateInterval according to the above algorithm.
> >>>>>
> >>>>> Do you mean "maxRecordsConsumedWithoutBackpressure =
> >>>> (numRecordsInPerSecond
> >>>>> / (1 - backPressuredTimeMsPerSecond / 1000)) *
> >> metricsUpdateInterval"?
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> // we are excluding maxRecordsConsumedWithoutBackpressure from the
> >>>>> backlog
> >>>>>> as
> >>>>>> // a safeguard against an intermittent back pressure problems, so
> >>> that
> >>>> we
> >>>>>> don't
> >>>>>> // calculate next checkpoint interval far far in the future, while
> >>> the
> >>>>>> backpressure
> >>>>>> // goes away before we will recalculate metrics and new
> >> checkpointing
> >>>>>> interval
> >>>>>> timeToConsumeBacklog = (pendingRecords -
> >>>>>> maxRecordsConsumedWithoutBackpressure) / numRecordsInPerSecond
> >>>>>>
> >>>>>>
> >>>>>> Then we can use those numbers to calculate desired checkpointed
> >>>> interval
> >>>>>> for example like this:
> >>>>>>
> >>>>>> long calculatedCheckpointInterval = timeToConsumeBacklog / 10;
> >> //this
> >>>> may
> >>>>>> need some refining
> >>>>>> long nextCheckpointInterval = min(max(fastCheckpointInterval,
> >>>>>> calculatedCheckpointInterval), slowCheckpointInterval);
> >>>>>> long nextCheckpointTs = lastCheckpointTs + nextCheckpointInterval;
> >>>>>>
> >>>>>> WDYT?
> >>>>>
> >>>>>
> >>>>> I think the idea of the above algorithm is to incline to use the
> >>>>> fastCheckpointInterval unless we are very sure the backlog will take
> >> a
> >>>> long
> >>>>> time to process. This can alleviate the concern of regression during
> >>> the
> >>>>> continuous_bounded phase since we are more likely to use the
> >>>>> fastCheckpointInterval. However, it can cause regression during the
> >>>> bounded
> >>>>> phase.
> >>>>>
> >>>>> I will use a concrete example to explain the risk of regression:
> >>>>> - The user is using HybridSource to read from HDFS followed by Kafka.
> >>> The
> >>>>> data in HDFS is old and there is no need for data freshness for the
> >>> data
> >>>> in
> >>>>> HDFS.
> >>>>> - The user configures the job as below:
> >>>>>  - fastCheckpointInterval = 3 minutes
> >>>>>  - slowCheckpointInterval = 30 minutes
> >>>>>  - metricsUpdateInterval = 100 ms
> >>>>>
> >>>>> Using the above formulate, we can know that once pendingRecords
> >>>>> <= numRecordsInPerSecond * 30-minutes, then
> >>> calculatedCheckpointInterval
> >>>> <=
> >>>>> 3 minutes, meaning that we will use slowCheckpointInterval as the
> >>>>> checkpointing interval. Then in the last 30 minutes of the bounded
> >>> phase,
> >>>>> the checkpointing frequency will be 10X higher than what the user
> >>> wants.
> >>>>>
> >>>>> Also note that the same issue would also considerably limit the
> >>> benefits
> >>>> of
> >>>>> the algorithm. For example, during the continuous phase, the
> >> algorithm
> >>>> will
> >>>>> only be better than the approach in FLIP-309 when there is at least
> >>>>> 30-minutes worth of backlog in the source.
> >>>>>
> >>>>> Sure, having a slower checkpointing interval in this extreme case
> >>> (where
> >>>>> there is 30-minutes backlog in the continous-unbounded phase) is
> >> still
> >>>>> useful when this happens. But since this is the un-common case, and
> >> the
> >>>>> right solution is probably to do capacity planning to avoid this from
> >>>>> happening in the first place, I am not sure it is worth optimizing
> >> for
> >>>> this
> >>>>> case at the cost of regression in the bounded phase and the reduced
> >>>>> operational predictability for users (e.g. what checkpointing
> >> interval
> >>>>> should I expect at this stage of the job).
> >>>>>
> >>>>> I think the fundamental issue with this algorithm is that it is
> >> applied
> >>>> to
> >>>>> both the bounded phases and the continous_unbounded phases without
> >>>> knowing
> >>>>> which phase the job is running at. The only information it can access
> >>> is
> >>>>> the backlog. But two sources with the same amount of backlog do not
> >>>>> necessarily mean they have the same data freshness requirement.
> >>>>>
> >>>>> In this particular example, users know that the data in HDFS is very
> >>> old
> >>>>> and there is no need for data freshness. Users can express signals
> >> via
> >>>> the
> >>>>> per-source API proposed in the FLIP. This is why the current approach
> >>> in
> >>>>> FLIP-309 can be better in this case.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> Best,
> >>>>> Dong
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Best,
> >>>>>> Piotrek
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] FLIP-309: Enable operators to trigger checkpoints dynamically

Reply via email to