Re: instable checkpointing after migration to flink 1.8 (production issue)

2019-07-18 Thread Congxian Qiu
Hi, Bekir

First, the e2e time for a subtask is $ack_time_received_in_JM -
$trigger_time_in_JM. A checkpoint includes several steps on the task side:
1) receiving the first barrier; 2) barrier alignment (for exactly-once); 3)
the sync part of the operator snapshot; 4) the async part of the operator
snapshot. The images you shared yesterday showed that the sync part took too
long; now both the sync part and the async part take a long time, and the e2e
time is much longer than sync_time + async_time.
1. You can check whether your job has backpressure problems (backpressure
can make the barriers flow too slowly to the downstream tasks). If it does,
you had better solve that first.
2. If there is no backpressure problem, you can check the `Alignment
Duration` to see whether barrier alignment took too long.
3. For the sync part, you can check the disk performance (if there is no
metric for it, you can look at the `sar` log on your machine).
4. For the async part, you can check the network performance (or any
client-side network flow control).
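
To make the breakdown above concrete, here is a minimal Python sketch that
splits a subtask's e2e time into the reported phases and maps the largest one
to the matching check from the list. The class, function names, and phase
labels are hypothetical and not part of Flink; the durations have to be copied
by hand from the checkpoint detail page. It also assumes the phases run one
after another, so whatever the sync, async, and alignment durations do not
cover is treated as time spent waiting for the first barrier or delivering the
ack. The alignment duration for subtask 14 was not reported, so 0 is assumed.

```python
from dataclasses import dataclass

MS_PER_MIN = 60_000


@dataclass
class SubtaskCheckpoint:
    """Durations for one subtask, in milliseconds (hypothetical container)."""
    sync_ms: int
    async_ms: int
    alignment_ms: int
    end_to_end_ms: int


def unaccounted_ms(s: SubtaskCheckpoint) -> int:
    """Part of the e2e time not covered by sync, async, or alignment."""
    return s.end_to_end_ms - (s.sync_ms + s.async_ms + s.alignment_ms)


# Which check from the list applies to which phase.
CHECKS = {
    "unaccounted": "check for backpressure (barriers reach the task late)",
    "alignment": "check the Alignment Duration",
    "sync": "check disk performance (e.g. the sar log)",
    "async": "check network performance or client flow control",
}


def dominant_phase(s: SubtaskCheckpoint) -> str:
    """Return the label of the phase that takes the most time."""
    phases = {
        "unaccounted": unaccounted_ms(s),
        "alignment": s.alignment_ms,
        "sync": s.sync_ms,
        "async": s.async_ms,
    }
    return max(phases, key=phases.get)


if __name__ == "__main__":
    # Subtask 14 from the report: sync 4 min, async 3 min, e2e 53 min.
    # Alignment was not reported; 0 is an assumption for illustration.
    subtask_14 = SubtaskCheckpoint(
        sync_ms=4 * MS_PER_MIN,
        async_ms=3 * MS_PER_MIN,
        alignment_ms=0,
        end_to_end_ms=53 * MS_PER_MIN,
    )
    phase = dominant_phase(subtask_14)
    print(unaccounted_ms(subtask_14) // MS_PER_MIN, "min unaccounted")
    print(phase, "->", CHECKS[phase])
```

With the numbers reported for subtask 14, 53 - (4 + 3) = 46 minutes are not
explained by the snapshot itself, which is why backpressure is the first thing
to rule out.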

Hope this helps.

Best,
Congxian


Bekir Oguz wrote on Thu, Jul 18, 2019 at 6:05 PM:

> Hi Congxian,
> Starting from this morning we have more issues with checkpointing in
> production. What we see is that the sync and async durations for some
> subtasks are very long, but what is strange is that the sum of the sync and
> async durations is much less than the total end-to-end duration. Please
> check the following snapshot:
>
>
> For example, for subtask 14: sync duration is 4 mins, async duration is 3
> mins, and end-to-end duration is 53 mins!!!
> We have a very long timeout value (1 hour) for checkpointing, but many
> checkpoints are still failing; some subtasks cannot finish checkpointing
> within 1 hour.
>
> We really appreciate your help here, this is a critical production problem
> for us at the moment.
>
> Regards,
> Bekir
>
>
> On 17 Jul 2019, at 17:46, Bekir Oguz  wrote:
>
>
> And I also extracted events fr
>
>
>


Re: instable checkpointing after migration to flink 1.8 (production issue)

2019-07-18 Thread Bekir Oguz
Hi Congxian,
Starting from this morning we have more issues with checkpointing in
production. What we see is that the sync and async durations for some subtasks
are very long, but what is strange is that the sum of the sync and async
durations is much less than the total end-to-end duration. Please check the
following snapshot:



For example, for subtask 14: sync duration is 4 mins, async duration is 3
mins, and end-to-end duration is 53 mins!!!
We have a very long timeout value (1 hour) for checkpointing, but many
checkpoints are still failing; some subtasks cannot finish checkpointing
within 1 hour.

We really appreciate your help here, this is a critical production problem for 
us at the moment.

Regards,
Bekir


> On 17 Jul 2019, at 17:46, Bekir Oguz  wrote:
> 
> 
> And I also extracted events fr