Hi,

I second Chesnay's comment and would like to better understand the
motivation behind this. On the surface it sounds to me like this might
require quite a bit of work for a very narrow use case.

At the same time I have a feeling that you, Yuan, are mixing this feature
request (checkpointing subgraphs/pipeline regions independently) with a
very different issue, that of "task local checkpoints". Those problems
are somewhat similar, but not quite the same.

Best,
Piotrek

On Tue, Feb 8, 2022 at 11:44 Chesnay Schepler <ches...@apache.org> wrote:

> Could someone expand on these operational issues you're facing when
> achieving this via separate jobs?
>
> I feel like we're skipping a step, arguing about solutions without even
> having discussed the underlying problem.
>
> On 08/02/2022 11:25, Gen Luo wrote:
> > Hi,
> >
> > @Yuan
> > Do you mean that there should be no shared state between source subtasks?
> > Sharing state between checkpoints of a specific subtask should be fine.
> >
> > Sharing state between subtasks of a task can be an issue, no matter
> > whether it's a source. That's also what I was afraid of in the previous
> > replies. In short, if the behavior of a pipeline region can somehow
> > influence the state of other pipeline regions, their checkpoints have
> > to be aligned before rescaling.
> >
> > On Tue, Feb 8, 2022 at 5:27 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> >
> >> Hey Folks,
> >>
> >> Thanks for the discussion!
> >>
> >> *Motivation and use cases*
> >> I think the motivation and use cases are very clear and I do not have
> >> doubts on this part.
> >> A typical use case is ETL with two-phase commit, where hundreds of
> >> partitions can be blocked by a single straggler (a single task's
> >> checkpoint abortion, not necessarily a failure, can affect all of them).
> >>
> >> *Source offset redistribution*
> >> As for the known sources & implementations for Flink, I cannot find a
> >> case that does not work, *for now*.
> >> I need to dig a bit more into how splits are tracked: assigned, not yet
> >> successfully processed, successfully processed, etc.
> >> I guess it is a single shared source OperatorCoordinator. And how is
> >> this *shared* state (between tasks) preserved?
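> >>
> >> To make the question concrete, here is a minimal toy model of how I
> >> currently picture that division of state (hypothetical classes, not
> >> Flink's actual FLIP-27 interfaces): the only *shared* state is the
> >> enumerator's set of unassigned splits, while each read position
> >> travels with the split checkpointed by the owning reader subtask:
> >>
> >>     import java.util.ArrayDeque;
> >>     import java.util.ArrayList;
> >>     import java.util.List;
> >>     import java.util.Queue;
> >>
> >>     class Split {
> >>         final String id;
> >>         long position; // read offset; snapshotted together with the split
> >>         Split(String id, long position) { this.id = id; this.position = position; }
> >>     }
> >>
> >>     class Enumerator { // one per source, shared by all source subtasks
> >>         final Queue<Split> unassigned = new ArrayDeque<>();
> >>         List<Split> snapshotState() { // checkpointed by the coordinator
> >>             return new ArrayList<>(unassigned);
> >>         }
> >>     }
> >>
> >>     class Reader { // one per source subtask
> >>         final List<Split> assigned = new ArrayList<>();
> >>         List<Split> snapshotState() { // checkpointed with the subtask state
> >>             return new ArrayList<>(assigned); // positions travel with splits
> >>         }
> >>     }
> >>
> >> If a source kept any cross-subtask state beyond this, combining
> >> checkpoints from different pipelined regions would no longer be safe.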
> >>
> >> *Input partitions/splits treated as completely independent from each other*
> >> About this part I am still not sure, given the possible shared source
> >> state mentioned in the section above.
> >>
> >> To Thomas:
> >>> In Yuan's example, is there a reason why CP8 could not be promoted to
> >>> CP10 by the coordinator for PR2 once it receives the notification that
> >>> CP10 did not complete? It appears that should be possible since in its
> >>> effect it should be no different than no data processed between CP8
> >>> and CP10?
> >> Not sure what "promoted" means here, but
> >> 1. I guess it does not matter whether it is CP8 or CP10 any more if
> >> there is no shared state in the source; as you mentioned, "it should
> >> be no different than no data processed between CP8 and CP10".
> >>
> >> 2. I've noticed from this question that there is a gap between
> >> "*allowing aborted/failed checkpoints in an independent sub-graph*" and
> >> my intention: "*independent sub-graphs checkpointing independently*".
> >>
> >> Best
> >> Yuan
> >>
> >>
> >> On Tue, Feb 8, 2022 at 11:34 AM Gen Luo <luogen...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm thinking about Yuan's case. Let's assume that the case is running
> >>> in current Flink:
> >>> 1. CP8 finishes.
> >>> 2. For some reason, PR2 stops consuming records from the source (but
> >>> is not stuck), and PR1 continues consuming new records.
> >>> 3. CP9 and CP10 finish.
> >>> 4. PR2 starts to consume quickly to catch up with PR1, and reaches the
> >>> same final status as in Yuan's case before CP11 starts.
> >>>
> >>> I suppose that in this case the status of the job can be the same as
> >>> in Yuan's case, and the snapshot (including source states) taken at
> >>> CP10 should be the same as the composed global snapshot in Yuan's
> >>> case, which combines CP10 of PR1 and CP8 of PR2. This should be true
> >>> if neither failed checkpointing nor uncommitted consuming has side
> >>> effects, both of which can break the exactly-once semantics when
> >>> replaying. So I think there should be no difference between rescaling
> >>> the combined global snapshot and the globally taken one, i.e. if the
> >>> input partitions are not independent, we are probably not able to
> >>> rescale the source state in current Flink either.
> >>>
> >>> And @Thomas, I do agree that the operational burden is significantly
> >>> reduced, but I'm a little afraid that checkpointing the subgraphs
> >>> individually may bring most of the runtime overhead back again. Maybe
> >>> we can find a better way to implement this.
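> >>>
> >>> To illustrate what "combining" could mean, here is a hedged sketch
> >>> (hypothetical names, not an actual Flink API) of a coordinator
> >>> composing a global snapshot by taking each region's latest completed
> >>> checkpoint, e.g. CP10 for PR1 and CP8 for PR2:
> >>>
> >>>     import java.util.HashMap;
> >>>     import java.util.Map;
> >>>     import java.util.NavigableMap;
> >>>     import java.util.TreeMap;
> >>>
> >>>     class RegionCheckpoints {
> >>>         // region id -> (checkpoint id -> state handle), completed only
> >>>         final Map<String, NavigableMap<Long, String>> completed = new HashMap<>();
> >>>
> >>>         void onCompleted(String region, long checkpointId, String stateHandle) {
> >>>             completed.computeIfAbsent(region, r -> new TreeMap<>())
> >>>                      .put(checkpointId, stateHandle);
> >>>         }
> >>>
> >>>         // each region contributes its newest completed checkpoint
> >>>         Map<String, String> composeGlobalSnapshot() {
> >>>             Map<String, String> snapshot = new HashMap<>();
> >>>             completed.forEach((region, cps) ->
> >>>                     snapshot.put(region, cps.lastEntry().getValue()));
> >>>             return snapshot;
> >>>         }
> >>>     }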
> >>>
> >>> On Tue, Feb 8, 2022 at 5:11 AM Thomas Weise <t...@apache.org> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Thanks for opening this discussion! The proposed enhancement would be
> >>>> interesting for use cases in our infrastructure as well.
> >>>>
> >>>> There are scenarios where it makes sense to have multiple disconnected
> >>>> subgraphs in a single job because it can significantly reduce the
> >>>> operational burden as well as the runtime overhead. Since we allow
> >>>> subgraphs to recover independently, why not also allow them to make
> >>>> progress independently? That would imply that checkpointing must
> >>>> succeed for unaffected subgraphs, since certain behavior is tied to
> >>>> checkpoint completion, like Kafka offset commits, file output, etc.
> >>>>
> >>>> As for source offset redistribution, the offset/position needs to be
> >>>> tied to the splits, in both FLIP-27 and legacy sources. (This applies
> >>>> to the Kafka and Kinesis legacy sources as well as the FLIP-27 Kafka
> >>>> source.) With the new source framework, it would be hard to implement
> >>>> a source with correct behavior that does not track the position along
> >>>> with the split.
> >>>>
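> >>>> For illustration, a minimal sketch of such a split (modeled loosely
> >>>> on, but not identical to, the Kafka connector's split class) built on
> >>>> the FLIP-27 SourceSplit interface; whichever reader the split lands
> >>>> on after a restore resumes from the stored position:
> >>>>
> >>>>     import org.apache.flink.api.connector.source.SourceSplit;
> >>>>
> >>>>     public class OffsetTrackingSplit implements SourceSplit {
> >>>>         private final String partition;   // e.g. a topic partition
> >>>>         private final long currentOffset; // updated at each checkpoint
> >>>>
> >>>>         public OffsetTrackingSplit(String partition, long currentOffset) {
> >>>>             this.partition = partition;
> >>>>             this.currentOffset = currentOffset;
> >>>>         }
> >>>>
> >>>>         @Override
> >>>>         public String splitId() {
> >>>>             return partition;
> >>>>         }
> >>>>
> >>>>         public long currentOffset() {
> >>>>             return currentOffset;
> >>>>         }
> >>>>     }
> >>>>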
> >>>> In Yuan's example, is there a reason why CP8 could not be promoted to
> >>>> CP10 by the coordinator for PR2 once it receives the notification that
> >>>> CP10 did not complete? It appears that should be possible since in its
> >>>> effect it should be no different than no data processed between CP8
> >>>> and CP10?
> >>>>
> >>>> Thanks,
> >>>> Thomas
> >>>>
> >>>> On Mon, Feb 7, 2022 at 2:36 AM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>>> Thanks for the clarification Yuan and Gen,
> >>>>>
> >>>>> I agree that the checkpointing of the sources needs to support the
> >>>>> rescaling case, otherwise it does not work. Is there currently a
> >>>>> source implementation where this wouldn't work? For Kafka it should
> >>>>> work because we store the offset per assigned partition. For Kinesis
> >>>>> it is probably the same. For the FileSource we store the set of
> >>>>> unread input splits in the source coordinator and the state of the
> >>>>> assigned splits in the sources. This should probably also work since
> >>>>> new splits are only handed out to running tasks.
> >>>>>
> >>>>> Cheers,
> >>>>> Till
> >>>>>
> >>>>> On Mon, Feb 7, 2022 at 10:29 AM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> >>>>>> Hey Till,
> >>>>>>
> >>>>>>> Why rescaling is a problem for pipelined regions/independent
> >>>>>>> execution subgraphs:
> >>>>>>
> >>>>>> Take a simplified example:
> >>>>>> job graph: source (2 instances) -> sink (2 instances)
> >>>>>> execution graph:
> >>>>>> source (1/2) -> sink (1/2)   [pipelined region 1]
> >>>>>> source (2/2) -> sink (2/2)   [pipelined region 2]
> >>>>>>
> >>>>>> Let's assume checkpoints are still triggered globally, meaning
> >>>>>> different pipelined regions share the global checkpoint id (PR1 CP1
> >>>>>> matches with PR2 CP1).
> >>>>>>
> >>>>>> Now let's assume PR1 completes CP10 and PR2 completes CP8.
> >>>>>>
> >>>>>> Let's say we want to rescale to parallelism 3 due to increased input.
> >>>>>> - Notice that we cannot simply rescale based on the latest completed
> >>>>>> checkpoint (CP8), because PR1 has already had data (CP8 -> CP10)
> >>>>>> output externally.
> >>>>>> - Can we take CP10 from PR1 and CP8 from PR2? I think it depends on
> >>>>>> how the source's offset redistribution is implemented.
> >>>>>> The answer is yes if we treat each input partition as independent
> >>>>>> from the others, *but I am not sure whether we can make that
> >>>>>> assumption*.
> >>>>>> If not, the rescaling cannot happen until PR1 and PR2 are aligned on
> >>>>>> their CPs.
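> >>>>>>
> >>>>>> To make the "yes" branch concrete, here is a toy sketch
> >>>>>> (hypothetical types, not Flink code) of rescaling from 2 to 3
> >>>>>> subtasks when each partition carries its own offset: PR1
> >>>>>> contributes partition-0's offset from its CP10, PR2 contributes
> >>>>>> partition-1's offset from its CP8, and the rest is pure
> >>>>>> redistribution:
> >>>>>>
> >>>>>>     import java.util.ArrayList;
> >>>>>>     import java.util.HashMap;
> >>>>>>     import java.util.List;
> >>>>>>     import java.util.Map;
> >>>>>>
> >>>>>>     class PartitionOffset {
> >>>>>>         final String partition;
> >>>>>>         final long offset;
> >>>>>>         PartitionOffset(String partition, long offset) {
> >>>>>>             this.partition = partition;
> >>>>>>             this.offset = offset;
> >>>>>>         }
> >>>>>>         @Override
> >>>>>>         public String toString() { return partition + "@" + offset; }
> >>>>>>     }
> >>>>>>
> >>>>>>     class RescaleExample {
> >>>>>>         public static void main(String[] args) {
> >>>>>>             // offsets restored from different checkpoints per region
> >>>>>>             List<PartitionOffset> restored = List.of(
> >>>>>>                     new PartitionOffset("partition-0", 1000L), // PR1, CP10
> >>>>>>                     new PartitionOffset("partition-1", 800L)); // PR2, CP8
> >>>>>>
> >>>>>>             // round-robin redistribution to the new parallelism of 3
> >>>>>>             Map<Integer, List<PartitionOffset>> assignment = new HashMap<>();
> >>>>>>             for (int i = 0; i < restored.size(); i++) {
> >>>>>>                 assignment.computeIfAbsent(i % 3, k -> new ArrayList<>())
> >>>>>>                           .add(restored.get(i));
> >>>>>>             }
> >>>>>>             System.out.println(assignment); // subtask -> splits to resume
> >>>>>>         }
> >>>>>>     }
> >>>>>>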
> >>>>>> Best
> >>>>>> -Yuan
> >>>>>>
> >>>>>> On Mon, Feb 7, 2022 at 4:17 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> Yuan and Gen, could you elaborate why rescaling is a problem if we
> >>>>>>> say that separate pipelined regions can take checkpoints
> >>>>>>> independently? Conceptually, I somehow think that a pipelined
> >>>>>>> region that is failed and cannot create a new checkpoint is more
> >>>>>>> or less the same as a pipelined region that didn't get new input,
> >>>>>>> or a very very slow pipelined region which couldn't read new
> >>>>>>> records since the last checkpoint (assuming that the checkpoint
> >>>>>>> coordinator can create a global checkpoint by combining individual
> >>>>>>> checkpoints (e.g. taking the last completed checkpoint from each
> >>>>>>> pipelined region)). If this comparison is correct, then this would
> >>>>>>> mean that we would have rescaling problems in the latter two cases
> >>>>>>> as well.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Till
> >>>>>>>
> >>>>>>> On Mon, Feb 7, 2022 at 8:55 AM Gen Luo <luogen...@gmail.com> wrote:
> >>>>>>>> Hi Gyula,
> >>>>>>>>
> >>>>>>>> Thanks for sharing the idea. As Yuan mentioned, I think we can
> >>>>>>>> discuss this within two scopes. One is the job subgraph, the
> >>>>>>>> other is the execution subgraph, which I suppose is the same as a
> >>>>>>>> PipelineRegion.
> >>>>>>>>
> >>>>>>>> An idea is to checkpoint the PipelineRegions individually, for
> >>>>>>>> recovery within a single run.
> >>>>>>>>
> >>>>>>>> Flink now supports PipelineRegion based failover, restoring from
> >>>>>>>> a subset of a global checkpoint snapshot. The checkpoint barriers
> >>>>>>>> are spread within a PipelineRegion, so the checkpointing of
> >>>>>>>> individual PipelineRegions is actually independent. Since the
> >>>>>>>> PipelineRegions are fixed within a single run of a job, we can
> >>>>>>>> checkpoint separated PipelineRegions individually, regardless of
> >>>>>>>> the status of the other PipelineRegions, and use a snapshot of a
> >>>>>>>> failing region to recover, instead of the subset of a global
> >>>>>>>> snapshot. This can support separated job subgraphs as well, since
> >>>>>>>> they will also be separated into different PipelineRegions. I
> >>>>>>>> think this can fulfill your needs.
> >>>>>>>>
> >>>>>>>> In fact the individual snapshots of all PipelineRegions can form
> >>>>>>>> a global snapshot, and the alignment of snapshots of individual
> >>>>>>>> regions is not necessary. But rescaling this global snapshot can
> >>>>>>>> be potentially complex. I think it's better to use the individual
> >>>>>>>> snapshots within a single run, and to take a global
> >>>>>>>> checkpoint/savepoint before restarting the job, whether rescaling
> >>>>>>>> it or not.
> >>>>>>>>
> >>>>>>>> A major issue of this plan is that it breaks the checkpoint
> >>>>>>>> mechanism of Flink. As far as I know, even in approximate
> >>>>>>>> recovery, the snapshot used to recover a single task is still
> >>>>>>>> part of a global snapshot. To implement the individual
> >>>>>>>> checkpointing of PipelineRegions, there may need to be a
> >>>>>>>> checkpoint coordinator for each PipelineRegion, plus a new global
> >>>>>>>> checkpoint coordinator. When the scale goes up, there can be many
> >>>>>>>> individual regions, which can be a big burden for the job
> >>>>>>>> manager. The meaning of the checkpoint id will also change, which
> >>>>>>>> can affect many aspects. There can be lots of work and risks, and
> >>>>>>>> the risks remain even if we only individually checkpoint
> >>>>>>>> separated job subgraphs, since the mechanism is still broken. If
> >>>>>>>> that is what you need, maybe separating them into different jobs
> >>>>>>>> is an easier and better choice, as Caizhi and Yuan mentioned.
> >>>>>>>>
> >>>>>>>> On Mon, Feb 7, 2022 at 11:39 AM Yuan Mei <yuanmei.w...@gmail.com> wrote:
> >>>>>>>>> Hey Gyula,
> >>>>>>>>>
> >>>>>>>>> That's a very interesting idea. The discussion about
> >>>>>>>>> `Individual` vs `Global` checkpoints was raised before, but the
> >>>>>>>>> main concerns were from two aspects:
> >>>>>>>>>
> >>>>>>>>> - Non-deterministic replaying may lead to an inconsistent view
> >>>>>>>>> of the checkpoint
> >>>>>>>>> - It is not easy to form a clear cut between past and future,
> >>>>>>>>> and hence no clear cut of where the source should begin to
> >>>>>>>>> replay from.
> >>>>>>>>> Starting from independent subgraphs as you proposed may be a
> >>>>>>>>> good starting point. However, when we talk about a subgraph, do
> >>>>>>>>> we mean a job subgraph (each vertex is one or more operators) or
> >>>>>>>>> an execution subgraph (each vertex is a task instance)?
> >>>>>>>>>
> >>>>>>>>> If it is a job subgraph, then indeed, why not separate it into
> >>>>>>>>> multiple jobs as Caizhi mentioned.
> >>>>>>>>> If it is an execution subgraph, then it is difficult to handle
> >>>>>>>>> rescaling due to inconsistent views of checkpoints between tasks
> >>>>>>>>> of the same operator.
> >>>>>>>>>
> >>>>>>>>> `Individual/Subgraph Checkpointing` is definitely an interesting
> >>>>>>>>> direction to think about, and I'd love to hear more from you!
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>>
> >>>>>>>>> Yuan
> >>>>>>>>>
> >>>>>>>>> On Mon, Feb 7, 2022 at 10:16 AM Caizhi Weng <tsreape...@gmail.com> wrote:
> >>>>>>>>>> Hi Gyula!
> >>>>>>>>>>
> >>>>>>>>>> Thanks for raising this discussion. I agree that this would be
> >>>>>>>>>> an interesting feature, but I actually have some doubts about
> >>>>>>>>>> the motivation and use case. If there are multiple individual
> >>>>>>>>>> subgraphs in the same job, why not just distribute them across
> >>>>>>>>>> multiple jobs, so that each job contains only one individual
> >>>>>>>>>> graph and can fail without disturbing the others?
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Feb 7, 2022 at 05:22 Gyula Fóra <gyf...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi all!
> >>>>>>>>>>>
> >>>>>>>>>>> At the moment checkpointing only works for healthy jobs with
> >>>>>>>>>>> all tasks running (or some finished). This sounds reasonable
> >>>>>>>>>>> in most cases, but there are a few applications where it would
> >>>>>>>>>>> make sense to checkpoint failing jobs as well.
> >>>>>>>>>>>
> >>>>>>>>>>> Due to how the checkpointing mechanism works, subgraphs that
> >>>>>>>>>>> have a failing task cannot be checkpointed without violating
> >>>>>>>>>>> the exactly-once semantics. However, if the job has multiple
> >>>>>>>>>>> independent subgraphs (that are not connected to each other),
> >>>>>>>>>>> then even if one subgraph is failing, the other, completely
> >>>>>>>>>>> running one could still be checkpointed. In these cases the
> >>>>>>>>>>> tasks of the failing subgraph could simply inherit the last
> >>>>>>>>>>> successful checkpoint metadata (from before they started
> >>>>>>>>>>> failing). This logic would produce a consistent checkpoint.
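> >>>>>>>>>>>
> >>>>>>>>>>> As a rough sketch of the inheritance logic (hypothetical
> >>>>>>>>>>> names, not an actual Flink API): acknowledgments from healthy
> >>>>>>>>>>> tasks are taken as-is, and tasks of the failing subgraph
> >>>>>>>>>>> contribute the state from the last checkpoint they completed:
> >>>>>>>>>>>
> >>>>>>>>>>>     import java.util.HashMap;
> >>>>>>>>>>>     import java.util.Map;
> >>>>>>>>>>>
> >>>>>>>>>>>     class InheritingCheckpointStore {
> >>>>>>>>>>>         // task id -> state handle from the last successful checkpoint
> >>>>>>>>>>>         private final Map<String, String> lastSuccessful = new HashMap<>();
> >>>>>>>>>>>
> >>>>>>>>>>>         // acks: handles from tasks that completed this checkpoint
> >>>>>>>>>>>         Map<String, String> finalizeCheckpoint(Map<String, String> acks) {
> >>>>>>>>>>>             Map<String, String> global = new HashMap<>(lastSuccessful);
> >>>>>>>>>>>             global.putAll(acks); // healthy tasks provide fresh state
> >>>>>>>>>>>             lastSuccessful.putAll(acks); // remember for inheritance
> >>>>>>>>>>>             // consistent only if the failing subgraph is independent
> >>>>>>>>>>>             return global;
> >>>>>>>>>>>         }
> >>>>>>>>>>>     }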
> >>>>>>>>>>>
> >>>>>>>>>>> The job as a whole could then make stateful progress even if
> >>>>>>>>>>> some subgraphs are constantly failing. This can be very
> >>>>>>>>>>> valuable if, for some reason, the job has a larger number of
> >>>>>>>>>>> independent subgraphs that are expected to fail every once in
> >>>>>>>>>>> a while, or if some subgraphs can have longer downtimes that
> >>>>>>>>>>> would otherwise cause the whole job to stall.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Gyula
> >>>>>>>>>>>
>
>
