Re: What is checkpoint start delay?

2021-01-19 Thread Piotr Nowojski
Hey Rex,

What do you mean by "Start Delay" when recovering from a checkpoint? Did
you mean when taking a checkpoint? If so:

1. https://www.google.com/search?q=flink+checkpoint+start+delay
2. top 3 result (at least for me)
https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
> Start Delay: The time it took for the first checkpoint barrier to reach
this subtasks since the checkpoint barrier has been created.

3. https://www.google.com/search?q=flink+checkpoint+barrier
4. top 2 result (at least for me)
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html#barriers
> A core element in Flink’s distributed snapshotting are the stream
barriers. These barriers are injected into the data stream and flow with
the records as part of the data stream.

Long start delay or alignment time means checkpoint barriers are
propagating slowly through the job graph, usually a symptom of a
back-pressure. It's best to solve the back-pressure problem, via optimising
your job or scaling it up.

Alternatively you could use unaligned checkpoints [1], at a cost of larger
checkpoint size and higher IO usage. Note here that if you are using Flink
1.12.x, I would refrain from using unaligned checkpoints on the production
because of some bugs [2] that we are fixing right now. On Flink 1.11.x it
should be fine.

Cheers,
Piotrek

[1]
https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html
[2] https://issues.apache.org/jira/browse/FLINK-20654



pon., 18 sty 2021 o 21:32 Rex Fenley  napisał(a):

> Hello,
>
> When we are recovering on a checkpoint it will take multiple minutes. The
> time is usually taken by "Start Delay". What is Start Delay and how can we
> optimize for it?
>
> Thanks!
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com  |  BLOG   |
>  FOLLOW US   |  LIKE US
> 
>


Re: What is checkpoint start delay?

2021-01-19 Thread Rex Fenley
Thanks for the input.

This seems odd though, if start delay is the same as alignment then (1) why
is it only ever prominent when right after recovering from a checkpoint?
(2) Why is the first checkpoint during the recovery process 10x as long as
every other checkpoint? Something else must be going on that's in addition
to the normal alignment process.

On Tue, Jan 19, 2021 at 8:14 AM Piotr Nowojski  wrote:

> Hey Rex,
>
> What do you mean by "Start Delay" when recovering from a checkpoint? Did
> you mean when taking a checkpoint? If so:
>
> 1. https://www.google.com/search?q=flink+checkpoint+start+delay
> 2. top 3 result (at least for me)
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
> > Start Delay: The time it took for the first checkpoint barrier to reach
> this subtasks since the checkpoint barrier has been created.
>
> 3. https://www.google.com/search?q=flink+checkpoint+barrier
> 4. top 2 result (at least for me)
> https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html#barriers
> > A core element in Flink’s distributed snapshotting are the stream
> barriers. These barriers are injected into the data stream and flow with
> the records as part of the data stream.
>
> Long start delay or alignment time means checkpoint barriers are
> propagating slowly through the job graph, usually a symptom of a
> back-pressure. It's best to solve the back-pressure problem, via optimising
> your job or scaling it up.
>
> Alternatively you could use unaligned checkpoints [1], at a cost of larger
> checkpoint size and higher IO usage. Note here that if you are using Flink
> 1.12.x, I would refrain from using unaligned checkpoints on the production
> because of some bugs [2] that we are fixing right now. On Flink 1.11.x it
> should be fine.
>
> Cheers,
> Piotrek
>
> [1]
> https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html
> [2] https://issues.apache.org/jira/browse/FLINK-20654
>
>
>
> pon., 18 sty 2021 o 21:32 Rex Fenley  napisał(a):
>
>> Hello,
>>
>> When we are recovering on a checkpoint it will take multiple minutes. The
>> time is usually taken by "Start Delay". What is Start Delay and how can we
>> optimize for it?
>>
>> Thanks!
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com  |  BLOG 
>>  |  FOLLOW US   |  LIKE US
>> 
>>
>

-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com  |  BLOG 
 |  FOLLOW
US   |  LIKE US



Re: What is checkpoint start delay?

2021-01-19 Thread Piotr Nowojski
Hi Rex,

start delay is not the same as the alignment time. Start delay is the time
between creation of the checkpoint barrier and the time a task/subtask sees
a first checkpoint barrier from any of its inputs. Alignment time is the
time between receiving the first checkpoint barrier on a given subtask and
the last one. In other words,

start of the checkpoint TS (on JobManager) + start delay on subtask = start
of the checkpoint TS (on TaskManager)
start of the checkpoint TS (on TaskManager) + alignment time on subtask =
end of the checkpoint TS (on TaskManager)

Maybe something in your job must ramp up and record throughput is slower
during this time, causing higher back pressure, which in turns is causing
longer checkpointing time for the first checkpoint after recovery. Maybe
RocksDB is needs to load it's state from disks.

Piotrek

wt., 19 sty 2021 o 20:11 Rex Fenley  napisał(a):

> Thanks for the input.
>
> This seems odd though, if start delay is the same as alignment then (1)
> why is it only ever prominent when right after recovering from a
> checkpoint? (2) Why is the first checkpoint during the recovery process 10x
> as long as every other checkpoint? Something else must be going on that's
> in addition to the normal alignment process.
>
> On Tue, Jan 19, 2021 at 8:14 AM Piotr Nowojski 
> wrote:
>
>> Hey Rex,
>>
>> What do you mean by "Start Delay" when recovering from a checkpoint? Did
>> you mean when taking a checkpoint? If so:
>>
>> 1. https://www.google.com/search?q=flink+checkpoint+start+delay
>> 2. top 3 result (at least for me)
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>> > Start Delay: The time it took for the first checkpoint barrier to reach
>> this subtasks since the checkpoint barrier has been created.
>>
>> 3. https://www.google.com/search?q=flink+checkpoint+barrier
>> 4. top 2 result (at least for me)
>> https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html#barriers
>> > A core element in Flink’s distributed snapshotting are the stream
>> barriers. These barriers are injected into the data stream and flow with
>> the records as part of the data stream.
>>
>> Long start delay or alignment time means checkpoint barriers are
>> propagating slowly through the job graph, usually a symptom of a
>> back-pressure. It's best to solve the back-pressure problem, via optimising
>> your job or scaling it up.
>>
>> Alternatively you could use unaligned checkpoints [1], at a cost of
>> larger checkpoint size and higher IO usage. Note here that if you are using
>> Flink 1.12.x, I would refrain from using unaligned checkpoints on the
>> production because of some bugs [2] that we are fixing right now. On Flink
>> 1.11.x it should be fine.
>>
>> Cheers,
>> Piotrek
>>
>> [1]
>> https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html
>> [2] https://issues.apache.org/jira/browse/FLINK-20654
>>
>>
>>
>> pon., 18 sty 2021 o 21:32 Rex Fenley  napisał(a):
>>
>>> Hello,
>>>
>>> When we are recovering on a checkpoint it will take multiple minutes.
>>> The time is usually taken by "Start Delay". What is Start Delay and how can
>>> we optimize for it?
>>>
>>> Thanks!
>>>
>>> --
>>>
>>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>>
>>>
>>> Remind.com  |  BLOG 
>>>  |  FOLLOW US   |  LIKE US
>>> 
>>>
>>
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com  |  BLOG   |
>  FOLLOW US   |  LIKE US
> 
>


Re: What is checkpoint start delay?

2021-01-19 Thread Rex Fenley
Ok, this makes sense. I'm guessing loading state from S3 into RocksDB is a
large contributor to start delay then.

Thanks!

On Tue, Jan 19, 2021 at 12:16 PM Piotr Nowojski 
wrote:

> Hi Rex,
>
> start delay is not the same as the alignment time. Start delay is the time
> between creation of the checkpoint barrier and the time a task/subtask sees
> a first checkpoint barrier from any of its inputs. Alignment time is the
> time between receiving the first checkpoint barrier on a given subtask and
> the last one. In other words,
>
> start of the checkpoint TS (on JobManager) + start delay on subtask =
> start of the checkpoint TS (on TaskManager)
> start of the checkpoint TS (on TaskManager) + alignment time on subtask =
> end of the checkpoint TS (on TaskManager)
>
> Maybe something in your job must ramp up and record throughput is slower
> during this time, causing higher back pressure, which in turns is causing
> longer checkpointing time for the first checkpoint after recovery. Maybe
> RocksDB is needs to load it's state from disks.
>
> Piotrek
>
> wt., 19 sty 2021 o 20:11 Rex Fenley  napisał(a):
>
>> Thanks for the input.
>>
>> This seems odd though, if start delay is the same as alignment then (1)
>> why is it only ever prominent when right after recovering from a
>> checkpoint? (2) Why is the first checkpoint during the recovery process 10x
>> as long as every other checkpoint? Something else must be going on that's
>> in addition to the normal alignment process.
>>
>> On Tue, Jan 19, 2021 at 8:14 AM Piotr Nowojski 
>> wrote:
>>
>>> Hey Rex,
>>>
>>> What do you mean by "Start Delay" when recovering from a checkpoint? Did
>>> you mean when taking a checkpoint? If so:
>>>
>>> 1. https://www.google.com/search?q=flink+checkpoint+start+delay
>>> 2. top 3 result (at least for me)
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/checkpoint_monitoring.html
>>> > Start Delay: The time it took for the first checkpoint barrier to
>>> reach this subtasks since the checkpoint barrier has been created.
>>>
>>> 3. https://www.google.com/search?q=flink+checkpoint+barrier
>>> 4. top 2 result (at least for me)
>>> https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html#barriers
>>> > A core element in Flink’s distributed snapshotting are the stream
>>> barriers. These barriers are injected into the data stream and flow with
>>> the records as part of the data stream.
>>>
>>> Long start delay or alignment time means checkpoint barriers are
>>> propagating slowly through the job graph, usually a symptom of a
>>> back-pressure. It's best to solve the back-pressure problem, via optimising
>>> your job or scaling it up.
>>>
>>> Alternatively you could use unaligned checkpoints [1], at a cost of
>>> larger checkpoint size and higher IO usage. Note here that if you are using
>>> Flink 1.12.x, I would refrain from using unaligned checkpoints on the
>>> production because of some bugs [2] that we are fixing right now. On Flink
>>> 1.11.x it should be fine.
>>>
>>> Cheers,
>>> Piotrek
>>>
>>> [1]
>>> https://flink.apache.org/2020/10/15/from-aligned-to-unaligned-checkpoints-part-1.html
>>> [2] https://issues.apache.org/jira/browse/FLINK-20654
>>>
>>>
>>>
>>> pon., 18 sty 2021 o 21:32 Rex Fenley  napisał(a):
>>>
 Hello,

 When we are recovering on a checkpoint it will take multiple minutes.
 The time is usually taken by "Start Delay". What is Start Delay and how can
 we optimize for it?

 Thanks!

 --

 Rex Fenley  |  Software Engineer - Mobile and Backend


 Remind.com  |  BLOG 
  |  FOLLOW US   |  LIKE US
 

>>>
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com  |  BLOG 
>>  |  FOLLOW US   |  LIKE US
>> 
>>
>

-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com  |  BLOG 
 |  FOLLOW
US   |  LIKE US