Could someone expand on these operational issues you're facing when achieving this via separate jobs?

I feel like we're skipping a step, arguing about solutions without even having discussed the underlying problem.

On 08/02/2022 11:25, Gen Luo wrote:
Hi,

@Yuan
Do you mean that there should be no shared state between source subtasks?
Sharing state between checkpoints of a specific subtask should be fine.

Sharing state between subtasks of a task can be an issue, whether or not
it's a source. That's also what I was afraid of in the previous replies. In
short, if the behavior of one pipelined region can somehow influence the
state of other pipelined regions, their checkpoints have to be aligned
before rescaling.

On Tue, Feb 8, 2022 at 5:27 PM Yuan Mei <yuanmei.w...@gmail.com> wrote:

Hey Folks,

Thanks for the discussion!

*Motivation and use cases*
I think the motivation and use cases are very clear and I have no doubts
about this part.
A typical use case is ETL with two-phase commit, where hundreds of partitions
can be blocked by a single straggler (a single task's checkpoint abortion can
affect all of them, not necessarily a failure).

*Source offset redistribution*
As for the known sources and implementations in Flink, I cannot find a case
that does not work, *for now*.
I need to dig a bit more into how splits are tracked: assigned, not yet
successfully processed, successfully processed, etc.
I guess it is a single shared source OperatorCoordinator. And how is this
*shared* state (between tasks) preserved?

*Input partitions/splits treated as completely independent from each other*
I am still not sure about this part if, as mentioned in the section above,
the source has shared state.

To Thomas:
In Yuan's example, is there a reason why CP8 could not be promoted to
CP10 by the coordinator for PR2 once it receives the notification that
CP10 did not complete? It appears that should be possible since in its
effect it should be no different than no data processed between CP8
  and CP10?
I am not sure what "promoted" means here, but:
1. I guess it no longer matters whether it is CP8 or CP10 if there is no
shared state in the source, exactly as you mentioned:
"it should be no different than no data processed between CP8 and CP10"

2. I've noticed from this question that there is a gap between
"*allowing aborted/failed checkpoints in an independent sub-graph*" and
my intention: "*independent sub-graphs checkpointing independently*"

Best
Yuan


On Tue, Feb 8, 2022 at 11:34 AM Gen Luo <luogen...@gmail.com> wrote:

Hi,

I'm thinking about Yuan's case. Let's assume that the case is running in
current Flink:
1. CP8 finishes.
2. For some reason, PR2 stops consuming records from the source (but is not
stuck), while PR1 continues consuming new records.
3. CP9 and CP10 finish.
4. PR2 starts consuming quickly to catch up with PR1, and reaches the same
final state as in Yuan's case before CP11 starts.

I suppose that in this case, the status of the job can be the same as in
Yuan's case, and the snapshot (including source states) taken at CP10
should be the same as the composed global snapshot in Yuan's case, which
combines CP10 of PR1 and CP8 of PR2. This should be true if neither failed
checkpointing nor uncommitted consuming has side effects, both of which
can break the exactly-once semantics when replaying. So I think there
should be no difference between rescaling the combined global snapshot and
the globally taken one, i.e. if the input partitions are not independent,
we are probably not able to rescale the source state in current Flink
either.

And @Thomas, I do agree that the operational burden is significantly
reduced, but I'm a little afraid that checkpointing the subgraphs
individually may bring most of the runtime overhead back again. Maybe we
can find a better way to implement this.

On Tue, Feb 8, 2022 at 5:11 AM Thomas Weise <t...@apache.org> wrote:

Hi,

Thanks for opening this discussion! The proposed enhancement would be
interesting for use cases in our infrastructure as well.

There are scenarios where it makes sense to have multiple disconnected
subgraphs in a single job because it can significantly reduce the
operational burden as well as the runtime overhead. Since we allow
subgraphs to recover independently, why not also allow them to make
progress independently? That would imply that checkpointing must succeed
for the non-affected subgraphs, since certain behavior is tied to
checkpoint completion, like Kafka offset commits, file output, etc.

As for source offset redistribution, the offset/position needs to be tied
to the splits, both in FLIP-27 sources and in legacy sources. (This applies
to the Kafka and Kinesis legacy sources as well as the FLIP-27 Kafka
source.) With the new source framework, it would be hard to implement a
source with correct behavior that does not track the position along with
the split.
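
As a rough illustration of what "position tied to the split" means, here is
a minimal sketch (hypothetical class and field names, not the actual FLIP-27
split classes): a split that carries its own consuming position, so its
snapshot is self-contained and can be reassigned to any reader on rescaling.

// Minimal sketch, hypothetical names: the split owns its position.
public final class PartitionSplitWithOffset {
    private final String topic;
    private final int partition;
    private long currentOffset; // advanced by the reader, snapshotted on checkpoint

    public PartitionSplitWithOffset(String topic, int partition, long startOffset) {
        this.topic = topic;
        this.partition = partition;
        this.currentOffset = startOffset;
    }

    // The reader moves the position forward as records are emitted.
    public void advanceTo(long offset) {
        this.currentOffset = offset;
    }

    // On checkpoint, this is all the state that needs to be stored for the split.
    public String snapshot() {
        return topic + "-" + partition + "@" + currentOffset;
    }
}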

In Yuan's example, is there a reason why CP8 could not be promoted to
CP10 by the coordinator for PR2 once it receives the notification that
CP10 did not complete? It appears that should be possible since in its
effect it should be no different than no data processed between CP8
and CP10?

Thanks,
Thomas

On Mon, Feb 7, 2022 at 2:36 AM Till Rohrmann <trohrm...@apache.org>
wrote:
Thanks for the clarification Yuan and Gen,

I agree that the checkpointing of the sources needs to support the
rescaling case, otherwise it does not work. Is there currently a source
implementation where this wouldn't work? For Kafka it should work because
we store the offset per assigned partition. For Kinesis it is probably the
same. For the FileSource we store the set of unread input splits in the
source coordinator and the state of the assigned splits in the sources.
This should probably also work since new splits are only handed out to
running tasks.
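
A tiny sketch of the state layout described above, under the assumption that
the coordinator only tracks unassigned splits while each reader snapshots the
progress of its assigned splits (illustrative names, not the actual Flink
source API):

import java.util.ArrayDeque;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public final class SourceStateSketch {

    // Hypothetical per-split progress, snapshotted by the reader that owns the split.
    static final class AssignedSplit {
        final String splitId;
        final long position;

        AssignedSplit(String splitId, long position) {
            this.splitId = splitId;
            this.position = position;
        }

        @Override
        public String toString() {
            return splitId + "@" + position;
        }
    }

    public static void main(String[] args) {
        // State snapshotted by the enumerator / source coordinator:
        // only the splits that have not been handed out yet.
        Queue<String> unassignedSplits = new ArrayDeque<>(List.of("file-3", "file-4"));

        // State snapshotted by each reader subtask: its assigned splits plus positions.
        Map<Integer, List<AssignedSplit>> readerState = Map.of(
                0, List.of(new AssignedSplit("file-1", 4096L)),
                1, List.of(new AssignedSplit("file-2", 1024L)));

        // On rescaling, per-split entries can be redistributed freely because each
        // AssignedSplit is self-contained; the coordinator state is unaffected.
        System.out.println(unassignedSplits + " " + readerState);
    }
}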

Cheers,
Till

On Mon, Feb 7, 2022 at 10:29 AM Yuan Mei <yuanmei.w...@gmail.com>
wrote:
Hey Till,

Why rescaling is a problem for pipelined regions / independent execution
subgraphs:

Take a simplified example:
job graph: source (2 instances) -> sink (2 instances)
execution graph:
source (1/2)  -> sink (1/2)   [pipelined region 1]
source (2/2)  -> sink (2/2)   [pipelined region 2]

Let's assume checkpoints are still triggered globally, meaning different
pipelined regions share the global checkpoint id (PR1 CP1 matches with
PR2 CP1).

Now let's assume PR1 completes CP10 and PR2 completes CP8.

Let's say we want to rescale to parallelism 3 due to increased input.
- Notice that we cannot simply rescale based on the latest completed
checkpoint (CP8), because PR1 has already output data (between CP8 and
CP10) externally.
- Can we take CP10 from PR1 and CP8 from PR2? I think it depends on how
the source's offset redistribution is implemented.
    The answer is yes if we treat each input partition as independent from
the others, *but I am not sure whether we can make that assumption*.
If not, the rescaling cannot happen until PR1 and PR2 are aligned on their
checkpoints.
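
For illustration only, a minimal sketch of the composition above
(hypothetical names, not an existing Flink class): take the latest completed
checkpoint per pipelined region and restore each region from its own
snapshot, which is only safe if each input partition's state is independent.

import java.util.Map;

public final class ComposedRestorePoint {
    public static void main(String[] args) {
        // Latest completed checkpoint id per pipelined region in the example above.
        Map<String, Long> latestCompletedPerRegion = Map.of("PR1", 10L, "PR2", 8L);

        // The composed restore point is not a single global checkpoint id:
        // PR1 restores its source split from CP10 state, PR2 restores its
        // split from CP8 state. Rescaling to parallelism 3 then redistributes
        // the two splits (each carrying its own offset) over three readers.
        latestCompletedPerRegion.forEach((region, cp) ->
                System.out.println(region + " restores from CP" + cp));
    }
}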
Best
-Yuan







On Mon, Feb 7, 2022 at 4:17 PM Till Rohrmann <trohrm...@apache.org> wrote:
Hi everyone,

Yuan and Gen, could you elaborate why rescaling is a problem if we say
that separate pipelined regions can take checkpoints independently?
Conceptually, I somehow think that a pipelined region that has failed and
cannot create a new checkpoint is more or less the same as a pipelined
region that didn't get new input, or a very very slow pipelined region
which couldn't read new records since the last checkpoint (assuming that
the checkpoint coordinator can create a global checkpoint by combining
individual checkpoints, e.g. taking the last completed checkpoint from
each pipelined region). If this comparison is correct, then this would
mean that we have rescaling problems under the latter two cases.

Cheers,
Till

On Mon, Feb 7, 2022 at 8:55 AM Gen Luo <luogen...@gmail.com>
wrote:
Hi Gyula,

Thanks for sharing the idea. As Yuan mentioned, I think we can discuss
this within two scopes. One is the job subgraph, the other is the
execution subgraph, which I suppose is the same as the PipelineRegion.

An idea is to individually checkpoint the PipelineRegions, for recovery
within a single run.

Flink now supports PipelineRegion-based failover, using a subset of a
global checkpoint snapshot. The checkpoint barriers are spread within a
PipelineRegion, so the checkpointing of individual PipelineRegions is
actually independent. Since the PipelineRegions are fixed within a single
run of a job, we can individually checkpoint separated PipelineRegions,
regardless of the status of the other PipelineRegions, and use a snapshot
of the failing region to recover, instead of a subset of a global
snapshot. This can support separated job subgraphs as well, since they
will also be separated into different PipelineRegions. I think this can
fulfill your needs.

In fact the individual snapshots of all PipelineRegions can form a global
snapshot, and the alignment of snapshots of individual regions is not
necessary. But rescaling this global snapshot can be potentially complex.
I think it's better to use the individual snapshots within a single run,
and take a global checkpoint/savepoint before restarting the job, whether
rescaling it or not.

A major issue of this plan is that it breaks the checkpoint mechanism of
Flink. As far as I know, even in approximate recovery, the snapshot used
to recover a single task is still part of a global snapshot. To implement
the individual checkpointing of PipelineRegions, there may need to be a
checkpoint coordinator for each PipelineRegion, plus a new global
checkpoint coordinator. When the scale goes up, there can be many
individual regions, which can be a big burden on the job manager. The
meaning of the checkpoint id will also change, which can affect many
aspects. There can be lots of work and risks, and the risks still exist
if we only individually checkpoint separated job subgraphs, since the
mechanism is still broken. If that is what you need, maybe separating
them into different jobs is an easier and better choice, as Caizhi and
Yuan mentioned.
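
To make the structural change concrete, here is a rough sketch of what
per-region coordination might look like (purely hypothetical names, not
existing Flink classes); the number of such coordinators and the per-region
checkpoint id sequences are exactly what adds overhead and changes the
meaning of the checkpoint id:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one checkpoint coordinator per PipelineRegion, each
// with its own checkpoint id sequence, plus a way to compose the latest
// completed per-region snapshots into a "global" view.
public final class PerRegionCheckpointingSketch {

    static final class RegionCoordinator {
        private long nextCheckpointId = 1;
        private long latestCompletedId = 0;

        long triggerCheckpoint() {
            return nextCheckpointId++;
        }

        void acknowledgeCompleted(long checkpointId) {
            latestCompletedId = Math.max(latestCompletedId, checkpointId);
        }

        long latestCompletedId() {
            return latestCompletedId;
        }
    }

    private final Map<String, RegionCoordinator> coordinators = new HashMap<>();

    RegionCoordinator coordinatorFor(String regionId) {
        return coordinators.computeIfAbsent(regionId, id -> new RegionCoordinator());
    }

    // A "global" snapshot is composed from the latest completed checkpoint of
    // every region; the ids are no longer aligned across regions.
    Map<String, Long> composeGlobalSnapshot() {
        Map<String, Long> composed = new HashMap<>();
        coordinators.forEach((region, c) -> composed.put(region, c.latestCompletedId()));
        return composed;
    }
}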

On Mon, Feb 7, 2022 at 11:39 AM Yuan Mei <yuanmei.w...@gmail.com> wrote:
Hey Gyula,

That's a very interesting idea. The discussion about `Individual` vs
`Global` checkpoints was raised before, but the main concern was from two
aspects:

- Non-deterministic replaying may lead to an inconsistent view of the
checkpoint
- It is not easy to form a clear cut of past and future, and hence no
clear cut of the point from which the source should replay.

Starting from independent subgraphs as you proposed may be a good starting
point. However, when we talk about a subgraph, do we mean a job subgraph
(each vertex is one or more operators) or an execution subgraph (each
vertex is a task instance)?

If it is a job subgraph, then indeed, why not separate it into multiple
jobs as Caizhi mentioned?
If it is an execution subgraph, then it is difficult to handle rescaling
due to inconsistent views of checkpoints between tasks of the same
operator.

`Individual/Subgraph Checkpointing` is definitely an interesting direction
to think of, and I'd love to hear more from you!

Best,

Yuan







On Mon, Feb 7, 2022 at 10:16 AM Caizhi Weng <tsreape...@gmail.com> wrote:
Hi Gyula!

Thanks for raising this discussion. I agree that this will be an
interesting feature, but I actually have some doubts about the motivation
and use case. If there are multiple individual subgraphs in the same job,
why not just distribute them across multiple jobs, so that each job
contains only one individual graph and can fail without disturbing the
others?

On Mon, Feb 7, 2022 at 05:22, Gyula Fóra <gyf...@apache.org> wrote:

Hi all!

At the moment checkpointing only works for healthy jobs with all running
(or some finished) tasks. This sounds reasonable in most cases but there
are a few applications where it would make sense to checkpoint failing
jobs as well.

Due to how the checkpointing mechanism works, subgraphs that have a
failing task cannot be checkpointed without violating the exactly-once
semantics. However, if the job has multiple independent subgraphs (that
are not connected to each other), even if one subgraph is failing, the
other, completely running one could still be checkpointed. In these cases
the tasks of the failing subgraph could simply inherit the last successful
checkpoint metadata (from before they started failing). This logic would
produce a consistent checkpoint.
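
To sketch the idea (hypothetical names, not a proposal for a concrete Flink
API): when building checkpoint N, tasks of a failing subgraph contribute the
state they acknowledged for their last successful checkpoint, while healthy
tasks contribute fresh state.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the "inherit the last successful checkpoint metadata"
// idea; task ids and state handles are plain strings for illustration.
public final class InheritedCheckpointSketch {

    // Last acknowledged state handle per task, kept across checkpoints.
    private final Map<String, String> lastSuccessfulState = new HashMap<>();

    // Builds one checkpoint: fresh state for tasks that acknowledged this round,
    // inherited state for tasks of the failing subgraph.
    Map<String, String> buildCheckpoint(Map<String, String> freshAcks) {
        Map<String, String> checkpoint = new HashMap<>(lastSuccessfulState);
        checkpoint.putAll(freshAcks);           // healthy tasks overwrite with new state
        lastSuccessfulState.putAll(freshAcks);  // remember for the next checkpoint
        return checkpoint;                      // failing tasks keep their old entries
    }
}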

The job as a whole could now make stateful progress even if some subgraphs
are constantly failing. This can be very valuable if for some reason the
job has a larger number of independent subgraphs that are expected to fail
every once in a while, or if some subgraphs can have longer downtimes that
would now cause the whole job to stall.

What do you think?

Cheers,
Gyula

