Hi Gyula!
Thanks for raising this discussion. I agree that this would be an
interesting feature, but I actually have some doubts about the motivation
and use case. If there are multiple individual subgraphs in the same job,
why not just split them into multiple jobs so that each job contains
only
Hey Gyula,
That's a very interesting idea. The discussion about `Individual` vs
`Global` checkpoints was raised before, but the main concerns were
twofold:
- Non-deterministic replaying may lead to an inconsistent view of a checkpoint
- It is not easy to form a clear cut between past and future
Hi Gyula,
Thanks for sharing the idea. As Yuan mentioned, I think we can discuss this
within two scopes. One is the job subgraph, the other is the execution
subgraph, which I suppose is the same as PipelineRegion.
One idea is to checkpoint the PipelineRegions individually, for recovery
in a
Hi everyone,
Yuan and Gen, could you elaborate on why rescaling is a problem if we say
that separate pipelined regions can take checkpoints independently?
Conceptually, I somehow think that a pipelined region that has failed and
cannot create a new checkpoint is more or less the same as a pipelined
region
Hi Till,
I agree that a failing task is much like a very slow or deadlocked task
from the checkpointing perspective. The main difference is whether a
checkpoint of the region the task is in can be triggered. Triggering a
checkpoint on a failing region makes no sense, since the checkpoint would
be discarded right away
Hey Till,
> Why rescaling is a problem for pipelined regions/independent execution
subgraphs:
Take a simplified example:
job graph: source (2 instances) -> sink (2 instances)
execution graph:
source (1/2) -> sink (1/2) [pipelined region 1]
source (2/2) -> sink (2/2) [pipelined region 2]
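The rescaling problem this example points at can be sketched in plain Python. This is an illustration only, not Flink code; the snapshot layout, the names, and the consistency check are assumptions made for the sketch:

```python
# Plain-Python illustration of the rescaling problem with per-region
# checkpoints. Assumption: each pipelined region snapshots at its own
# pace, so the latest snapshots can come from different checkpoint ids.

latest_snapshots = {
    "region-1": {"checkpoint_id": 10, "source_offset": 1000},
    "region-2": {"checkpoint_id": 8, "source_offset": 700},
}

def restore(snapshots, new_parallelism):
    """Restore from per-region snapshots.

    Restoring at the same parallelism is fine: each region simply loads
    its own snapshot. Rescaling would have to re-split state across
    regions, mixing snapshots taken at different points in time, so
    there is no globally consistent cut to restore from.
    """
    checkpoint_ids = sorted({s["checkpoint_id"] for s in snapshots.values()})
    if new_parallelism != len(snapshots) and len(checkpoint_ids) > 1:
        raise ValueError(
            "cannot rescale across snapshots from different checkpoints: "
            f"{checkpoint_ids}"
        )
    return checkpoint_ids

restore(latest_snapshots, 2)  # ok: parallelism unchanged
# restore(latest_snapshots, 3) would raise: no consistent cut exists
```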
Thanks for the clarification Yuan and Gen,
I agree that the checkpointing of the sources needs to support the
rescaling case, otherwise it does not work. Is there currently a source
implementation where this wouldn't work? For Kafka it should work because
we store the offset per assigned partition
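Till's point about Kafka can be made concrete with a small sketch: because the unit of source state is the partition, not the subtask, the offset map can be re-split across any parallelism on restore. The round-robin assignment below is a stand-in for illustration, not Flink's actual assignment logic:

```python
# Sketch: a Kafka-style source checkpoints a partition -> offset map,
# which can be redistributed to any number of subtasks on restore.

checkpointed_offsets = {"topic-0": 1000, "topic-1": 700, "topic-2": 950}

def redistribute(offsets, parallelism):
    """Assign partitions (with their offsets) round-robin to subtasks."""
    subtasks = [dict() for _ in range(parallelism)]
    for i, partition in enumerate(sorted(offsets)):
        subtasks[i % parallelism][partition] = offsets[partition]
    return subtasks

# Restoring at parallelism 2 and then at 3 loses or duplicates no
# offset, because state is keyed by partition rather than by subtask.
two_subtasks = redistribute(checkpointed_offsets, 2)
three_subtasks = redistribute(checkpointed_offsets, 3)
```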
Hi,
Thanks for opening this discussion! The proposed enhancement would be
interesting for use cases in our infrastructure as well.
There are scenarios where it makes sense to have multiple disconnected
subgraphs in a single job because it can significantly reduce the
operational burden as well as
Hi,
I'm thinking about Yuan's case. Let's assume that the case is running in
current Flink:
1. CP8 finishes
2. For some reason, PR2 stops consuming records from the source (but is not
stuck), and PR1 continues consuming new records.
3. CP9 and CP10 finish
4. PR2 starts to consume quickly to catch
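The timeline above can be simulated in a few lines of plain Python (an illustration, not Flink code; the record counts are made up) to show what today's global checkpoints capture for a stalled region:

```python
# Minimal timeline sketch: under global checkpoints, every completed
# checkpoint snapshots all regions, including one that made no progress.

completed = []                     # (checkpoint_id, snapshot of all regions)
progress = {"PR1": 0, "PR2": 0}    # records consumed per pipelined region

def complete_checkpoint(cp_id):
    # A global checkpoint records every region's current state.
    completed.append((cp_id, dict(progress)))

complete_checkpoint(8)     # 1. CP8 finishes
progress["PR1"] += 100     # 2. PR2 stalls while PR1 keeps consuming
complete_checkpoint(9)     # 3. CP9 finishes ...
progress["PR1"] += 100
complete_checkpoint(10)    #    ... and CP10 finishes
progress["PR2"] += 300     # 4. PR2 catches up only after CP10

# CP10 still carries PR2's pre-catch-up state: restoring from CP10
# rewinds PR2 to before step 4, even if only PR1 failed.
```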
Hey Folks,
Thanks for the discussion!
*Motivation and use cases*
I think the motivation and use cases are very clear and I do not have
doubts on this part.
A typical use case is ETL with two-phase-commit, hundreds of partitions can
be blocked by a single straggler (a single task's checkpoint aborti
Hi,
@Yuan
Do you mean that there should be no shared state between source subtasks?
Sharing state between checkpoints of a specific subtask should be fine.
Sharing state between subtasks of a task can be an issue, no matter whether
it's a source. That's also what I was afraid of in the previous r
Could someone expand on these operational issues you're facing when
achieving this via separate jobs?
I feel like we're skipping a step, arguing about solutions without even
having discussed the underlying problem.
On 08/02/2022 11:25, Gen Luo wrote:
Hi,
I second Chesnay's comment and would like to better understand the
motivation behind this. At the surface it sounds to me like this might
require quite a bit of work for a very narrow use case.
At the same time I have a feeling that Yuan, you are mixing this feature
request (checkpointing sub
Hi guys, if I understand it correctly, will only some checkpoints be
recovered when there is an error in a Flink batch job?
Piotr Nowojski wrote on Tue, Feb 8, 2022, at 19:05:
Hi Chesnay and Piotr,
I have seen some jobs with tens of independent vertices that process data
for the same business. The sub-jobs should be started or stopped together.
Splitting them into separate jobs means the user has to manage them
separately. But in fact the jobs were running in per-job mode