Re: Checkpointing is not performing well

Ravi Bhushan Ratnakar Tue, 10 Sep 2019 10:47:12 -0700

@Rohan - I am streaming data to a Kafka sink after applying business logic.
For checkpoints, I am using S3 as the distributed file system. For local
recovery, I am using an IOPS-optimized EBS volume.


@Vijay - I forgot to mention that the incoming data volume is ~10 to 21 GB
per minute of compressed (lz4) Avro messages. Generally, 90% of correlated
events arrive within 5 seconds, and for the other 10% the correlation window
extends to 65 minutes. Due to this business requirement, the state size keeps
growing for 65 minutes; after that it becomes more or less stable. As the
state grows to around 350 GB at peak load, the checkpoint is not able to
complete within 1 minute. I want to checkpoint as frequently as possible,
ideally every 5 seconds.
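
For a rough sense of scale, the figures above translate into a required
checkpoint write rate. The sketch below is plain arithmetic, not a
measurement. It assumes that every checkpoint is a full snapshot of the whole
state, that the state is spread evenly over the 200 task managers mentioned
in the original post, and that units are decimal (1 GB = 1000 MB). The class
and method names are made up for illustration.

```java
// Back-of-the-envelope estimate of the write rate needed to finish a full
// checkpoint within a target duration. All names here are illustrative.
final class CheckpointThroughput {

    /** Aggregate write rate in GB/s needed to persist stateGb within seconds. */
    static double requiredGbPerSecond(double stateGb, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return stateGb / seconds;
    }

    /** Per-task-manager rate in MB/s, assuming evenly distributed state. */
    static double requiredMbPerSecondPerTaskManager(
            double stateGb, double seconds, int taskManagers) {
        if (taskManagers <= 0) {
            throw new IllegalArgumentException("taskManagers must be positive");
        }
        return requiredGbPerSecond(stateGb, seconds) * 1000.0 / taskManagers;
    }

    public static void main(String[] args) {
        double peakStateGb = 350.0; // peak state size reported in this message
        int taskManagers = 200;     // task manager count from the original post
        for (double seconds : new double[] {5.0, 60.0}) {
            System.out.printf(
                    "%.0f s target: %.2f GB/s total, %.1f MB/s per task manager%n",
                    seconds,
                    requiredGbPerSecond(peakStateGb, seconds),
                    requiredMbPerSecondPerTaskManager(peakStateGb, seconds, taskManagers));
        }
    }
}
```

Under these assumptions, a 5-second target needs about 70 GB/s in aggregate
(about 350 MB/s per task manager), versus about 5.8 GB/s (about 29 MB/s per
task manager) for a 60-second target. Incremental checkpoints would change
this picture, since only part of the state is written each time.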

Thanks,
Ravi


On Tue 10 Sep, 2019, 11:37 Vijay Bhaskar, <bhaskar.eba...@gmail.com> wrote:

> To me, the task count seems huge for the mentioned resource count. To rule
> out a problem with the state backend, can you start by writing the sink
> data to a no-op sink, i.e., a sink that ignores the data, and check whether
> you can run it for a longer duration without any issue? You can then keep
> decreasing the task manager count until you find a decent count without
> any side effects. Use that value as the task manager count and then add
> your state backend. You can try RocksDB first. With a reduced task manager
> count you might get good results.
>
> Regards
> Bhaskar
>
> On Sun, Sep 8, 2019 at 10:15 AM Rohan Thimmappa <rohan.thimma...@gmail.com>
> wrote:
>
>> Ravi, have you looked at the I/O operations (IOPS) rate of the disk? You
>> can monitor the IOPS performance and tune it according to your workload.
>> This helped us in our project when we hit a wall after tuning pretty much
>> all the parameters.
>>
>> Rohan
>>
>>
>> ------------------------------
>> *From:* Ravi Bhushan Ratnakar <ravibhushanratna...@gmail.com>
>> *Sent:* Saturday, September 7, 2019 5:38 PM
>> *To:* Rafi Aroch
>> *Cc:* user
>> *Subject:* Re: Checkpointing is not performing well
>>
>> Hi Rafi,
>>
>> Thank you for your quick response.
>>
>> I have tested with the RocksDB state backend. RocksDB required
>> significantly more task managers to perform well, compared to the
>> filesystem state backend. The problem here is that the checkpoint process
>> is not fast enough to complete.
>>
>> Our requirement is to checkpoint as frequently as possible, e.g. every 5
>> seconds, to flush the output to the output sink. As the incoming data rate
>> is high, the checkpoint is not able to complete quickly. If I increase the
>> checkpoint interval, the state size grows much more between checkpoints
>> and hence checkpointing takes much longer to complete. I also tried
>> AT_LEAST_ONCE mode, but it does not improve much. Adding more task
>> managers to increase parallelism also does not improve the checkpointing
>> performance.
>>
>> Is it possible to checkpoint as often as every 5 seconds with such a high
>> input volume?
>>
>> Regards,
>> Ravi
>>
>> On Sat 7 Sep, 2019, 22:25 Rafi Aroch, <rafi.ar...@gmail.com> wrote:
>>
>>> Hi Ravi,
>>>
>>> Consider moving to the RocksDB state backend, where you can enable
>>> incremental checkpointing. This will keep your checkpoint size pretty
>>> much constant even as your state grows larger.
>>>
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/state_backends.html#the-rocksdbstatebackend
>>>
>>>
>>> Thanks,
>>> Rafi
>>>
>>> On Sat, Sep 7, 2019, 17:47 Ravi Bhushan Ratnakar <
>>> ravibhushanratna...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I am writing a streaming application using Flink 1.9. The application
>>>> consumes data from a Kinesis stream, with Avro payloads. It uses a
>>>> KeyedProcessFunction to execute business logic per correlation id, using
>>>> the event time characteristic, with the configuration below:
>>>> StateBackend - filesystem with S3 storage
>>>> registerEventTimeTimer time for each key - currentWatermark + 15 seconds
>>>> checkpoint interval - 1 min
>>>> minPauseBetweenCheckpoints - 1 min
>>>> checkpoint timeout - 10 mins
>>>>
>>>> incoming data rate from Kinesis - ~10 to 21 GB/min
>>>>
>>>> Number of task managers - 200 (r4.2xlarge -> 8 vCPU, 61 GB)
>>>>
>>>> The first 2-4 checkpoints complete within 1 min, when the state size is
>>>> usually 50 GB. As the state size grows beyond 50 GB, checkpointing
>>>> starts taking more than 1 min, increases up to 10 mins, and then the
>>>> checkpoint fails. The moment a checkpoint starts taking more than 1 min
>>>> to complete, the application starts processing slowly and its output
>>>> starts lagging.
>>>>
>>>> Any suggestions to fine-tune checkpoint performance would be highly
>>>> appreciated.
>>>>
>>>> Regards,
>>>> Ravi
>>>>
>>>
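
The checkpoint settings quoted in this thread (1 min interval, 1 min minimum
pause, 10 min timeout), combined with the RocksDB backend and incremental
checkpointing suggested by Rafi, could be wired up roughly as below. This is
a minimal sketch, assuming Flink 1.9 with the flink-statebackend-rocksdb
dependency on the classpath. The class name and the S3 path are hypothetical,
and the actual pipeline is left out.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative setup only; not the configuration actually used in this thread.
final class CheckpointSetupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical checkpoint location. The second constructor argument
        // turns on incremental checkpointing for the RocksDB backend.
        StateBackend backend =
                new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true);
        env.setStateBackend(backend);

        // Checkpoint every 60 s. AT_LEAST_ONCE is the alternative mode that
        // was also tried in this thread.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Leave at least 60 s between the end of one checkpoint and the
        // start of the next.
        checkpointConfig.setMinPauseBetweenCheckpoints(60_000L);
        // Abort a checkpoint that has not completed within 10 minutes.
        checkpointConfig.setCheckpointTimeout(600_000L);

        // Build the Kinesis -> KeyedProcessFunction -> Kafka pipeline here,
        // then call env.execute(...).
    }
}
```

Whether incremental checkpoints bring the checkpoint time down far enough for
this workload would still need to be measured.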
