Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Gen Luo
Hi Chesnay and Piotr, I have seen some jobs with tens of independent vertices that process data for the same business. The sub-jobs should be started or stopped together. Splitting them into separate jobs means the user has to manage them separately. But in fact the jobs were running in per-job

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread 丛鹏
Hi guys, if I understand it correctly, will only some checkpoints be recovered when there is an error in a Flink batch job? Piotr Nowojski wrote on Tue, Feb 8, 2022 at 19:05: > Hi, > > I second Chesnay's comment and would like to better understand the > motivation behind this. At the surface it sounds to me

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Piotr Nowojski
Hi, I second Chesnay's comment and would like to better understand the motivation behind this. At the surface it sounds to me like this might require quite a bit of work for a very narrow use case. At the same time I have a feeling that, Yuan, you are mixing this feature request (checkpointing

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Chesnay Schepler
Could someone expand on these operational issues you're facing when achieving this via separate jobs? I feel like we're skipping a step, arguing about solutions without even having discussed the underlying problem. On 08/02/2022 11:25, Gen Luo wrote: Hi, @Yuan Do you mean that there should

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Gen Luo
Hi, @Yuan Do you mean that there should be no shared state between source subtasks? Sharing state between checkpoints of a specific subtask should be fine. Sharing state between subtasks of a task can be an issue, no matter whether it's a source. That's also what I was afraid of in the previous

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-08 Thread Yuan Mei
Hey Folks, Thanks for the discussion! *Motivation and use cases* I think the motivation and use cases are very clear and I do not have doubts on this part. A typical use case is ETL with two-phase commit, where hundreds of partitions can be blocked by a single straggler (a single task's checkpoint
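
A minimal Java sketch of the gating effect described above, with illustrative names rather than Flink's actual sink API: in a two-phase-commit sink the commit phase can only start once every task has acknowledged the checkpoint, so the slowest acknowledgement gates all partitions.

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative only: each future completes when one task acknowledges the
// checkpoint; allOf() completes only after the slowest acknowledgement, so
// a single straggler delays the commit for every partition.
public class TwoPhaseCommitGate {
    static CompletableFuture<Void> commitWhenAllAcked(
            List<CompletableFuture<Void>> perTaskAcks) {
        return CompletableFuture
                .allOf(perTaskAcks.toArray(new CompletableFuture[0]))
                .thenRun(() -> System.out.println("commit all partitions"));
    }
}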

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Gen Luo
Hi, I'm thinking about Yuan's case. Let's assume that the case is running in current Flink:
1. CP8 finishes
2. For some reason, PR2 stops consuming records from the source (but is not stuck), and PR1 continues consuming new records.
3. CP9 and CP10 finish
4. PR2 starts to consume quickly to catch
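
To make that timeline concrete, here is a small Java sketch under the assumption that each pipelined region tracks its own latest completed checkpoint (all names hypothetical); a restore point that is consistent across regions is then the minimum over regions, CP8 here even though PR1 reached CP10.

import java.util.HashMap;
import java.util.Map;

// Hypothetical per-region bookkeeping for the scenario above: PR1 keeps
// acknowledging CP9 and CP10 while PR2 stalls after CP8.
public class PerRegionProgress {
    public static void main(String[] args) {
        Map<String, Long> lastCompleted = new HashMap<>();
        lastCompleted.put("PR1", 8L);  // step 1: CP8 finishes for both regions
        lastCompleted.put("PR2", 8L);
        lastCompleted.put("PR1", 10L); // step 3: PR1 alone completes CP9 and CP10

        // A restore that is consistent across regions must take the minimum.
        long restorePoint = lastCompleted.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .getAsLong();
        System.out.println("global restore point = CP" + restorePoint); // CP8
    }
}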

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Thomas Weise
Hi, Thanks for opening this discussion! The proposed enhancement would be interesting for use cases in our infrastructure as well. There are scenarios where it makes sense to have multiple disconnected subgraphs in a single job because it can significantly reduce the operational burden as well

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Till Rohrmann
Thanks for the clarification Yuan and Gen, I agree that the checkpointing of the sources needs to support the rescaling case, otherwise it does not work. Is there currently a source implementation where this wouldn't work? For Kafka it should work because we store the offset per assigned
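
A sketch of the property Till refers to, with illustrative names rather than Flink's actual KafkaSource internals: because the checkpointed state is a map from partition to offset with no cross-partition sharing, any subset of entries can be reassigned to any subtask after a rescale.

import java.util.HashMap;
import java.util.Map;

// Illustrative per-partition offset state; each map entry is self-contained,
// so rescaling only redistributes entries between subtasks.
public class PerPartitionOffsets {
    private final Map<Integer, Long> offsets = new HashMap<>();

    void recordOffset(int partition, long offset) {
        offsets.put(partition, offset);
    }

    // The snapshot is just a copy of the map; nothing is shared across
    // partitions, which is what makes independent redistribution safe.
    Map<Integer, Long> snapshot() {
        return new HashMap<>(offsets);
    }

    public static void main(String[] args) {
        PerPartitionOffsets state = new PerPartitionOffsets();
        state.recordOffset(0, 42L);
        state.recordOffset(1, 7L);
        System.out.println(state.snapshot()); // {0=42, 1=7}
    }
}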

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Yuan Mei
Hey Till, > Why rescaling is a problem for pipelined regions/independent execution subgraphs: Take a simplified example:
job graph: source (2 instances) -> sink (2 instances)
execution graph:
source (1/2) -> sink (1/2) [pipelined region 1]
source (2/2) -> sink (2/2) [pipelined region 2]
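
A small Java illustration of where this goes wrong (region names hypothetical): the two old regions took their latest snapshots at different times, so when the job is rescaled to three regions there is no well-defined, mutually consistent state to split among them.

import java.util.Map;

// Hypothetical: each region snapshotted independently, so the old regions
// are frozen at different logical points in time.
public class RegionRescale {
    public static void main(String[] args) {
        Map<String, Long> oldRegions = Map.of(
                "source(1/2)->sink(1/2)", 10L, // last snapshot at CP10
                "source(2/2)->sink(2/2)", 8L); // last snapshot at CP8

        String[] newRegions = {
                "source(1/3)->sink(1/3)",
                "source(2/3)->sink(2/3)",
                "source(3/3)->sink(3/3)"};

        // No 1:1 mapping exists from old regions to new ones, and the old
        // snapshots do not come from the same logical point in time.
        System.out.println(oldRegions.size() + " old regions vs "
                + newRegions.length + " new regions");
    }
}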

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Gen Luo
Hi Till, I agree that a failing task is much like a very slow or deadlocked task from the checkpointing perspective. The main difference is whether a checkpoint of the region the task is in can be triggered. Triggering a checkpoint on a failing region makes no sense since the checkpoint should be discarded right

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-07 Thread Till Rohrmann
Hi everyone, Yuan and Gen, could you elaborate on why rescaling is a problem if we say that separate pipelined regions can take checkpoints independently? Conceptually, I somehow think that a pipelined region that has failed and cannot create a new checkpoint is more or less the same as a pipelined

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-06 Thread Gen Luo
Hi Gyula, Thanks for sharing the idea. As Yuan mentioned, I think we can discuss this within two scopes. One is the job subgraph, the other is the execution subgraph, which I suppose is the same as a PipelineRegion. An idea is to individually checkpoint the PipelineRegions, for recovering in a
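
A minimal sketch of that idea with hypothetical interfaces (Flink's real CheckpointCoordinator is job-scoped): a coordinator that triggers snapshots per PipelineRegion and skips regions that are currently failing.

import java.util.List;

// Hypothetical per-region triggering: a failing region neither blocks the
// healthy regions nor produces a snapshot that would be discarded anyway
// when the region restarts.
public class PerRegionCheckpointing {

    interface PipelineRegion {
        boolean isHealthy();
        void triggerCheckpoint(long checkpointId);
    }

    private long nextCheckpointId = 1;

    void triggerAll(List<PipelineRegion> regions) {
        long id = nextCheckpointId++;
        for (PipelineRegion region : regions) {
            if (region.isHealthy()) {
                region.triggerCheckpoint(id);
            }
        }
    }
}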

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-06 Thread Yuan Mei
Hey Gyula, That's a very interesting idea. The discussion about the `Individual` vs `Global` checkpoint was raised before, but the main concern was from two aspects:
- Non-deterministic replaying may lead to an inconsistent view of a checkpoint
- It is not easy to form a clear cut of past and

Re: [DISCUSS] Checkpointing (partially) failing jobs

2022-02-06 Thread Caizhi Weng
Hi Gyula! Thanks for raising this discussion. I agree that this will be an interesting feature but I actually have some doubts about the motivation and use case. If there are multiple individual subgraphs in the same job, why not just distribute them to multiple jobs so that each job contains

[DISCUSS] Checkpointing (partially) failing jobs

2022-02-06 Thread Gyula Fóra
Hi all! At the moment checkpointing only works for healthy jobs with all running (or some finished) tasks. This sounds reasonable in most cases but there are a few applications where it would make sense to checkpoint failing jobs as well. Due to how the checkpointing mechanism works, subgraphs