Hi Bekir,

Another user reported checkpointing issues with Flink 1.8.0 [1].
These seem to be resolved with Flink 1.8.1.

Hope this helps,
Fabian

[1]
https://lists.apache.org/thread.html/991fe3b09fd6a052ff52e5f7d9cdd9418545e68b02e23493097d9bc4@%3Cuser.flink.apache.org%3E

On Wed, Jul 17, 2019 at 09:16, Congxian Qiu <qcx978132...@gmail.com> wrote:

> Hi Bekir
>
> First of all, I think something is wrong here: the state sizes are almost
> the same, but the durations differ so much.
>
> A checkpoint with the RocksDBStateBackend first dumps the SST files, then
> copies the needed SST files to the remote storage (if you enable
> incremental checkpoints, SST files that are already on the remote storage
> are not uploaded again), and then completes the checkpoint. Can you check
> the network bandwidth usage during the checkpoint?
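>
> In case you want to double-check the setting: with the Java API the flag
> is the second constructor argument of the state backend. A minimal sketch
> (Flink 1.8 Java API, placeholder S3 path, untested):
>
>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>     // true enables incremental checkpoints: only SST files that are not yet
>     // on the remote storage are uploaded during a checkpoint
>     env.setStateBackend(new RocksDBStateBackend("s3://<bucket>/checkpoints", true));
>
> (The same can also be set with state.backend.incremental: true in
> flink-conf.yaml.)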
>
> Best,
> Congxian
>
>
> On Tue, Jul 16, 2019 at 22:45, Bekir Oguz <bekir.o...@persgroep.net> wrote:
>
>> Hi all,
>> We have a Flink job with user state, checkpointing with the RocksDB state
>> backend, which stores the checkpoints externally in AWS S3.
>> After migrating our cluster from 1.6 to 1.8, we occasionally see that
>> some slots do not acknowledge the checkpoints quickly enough. As an
>> example: all slots acknowledge within 30-50 seconds, except for a single
>> slot that takes 15 minutes to acknowledge. Checkpoint sizes are similar
>> to each other, around 200-400 MB.
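>>
>> For reference, the relevant part of our setup looks roughly like this
>> (bucket name simplified, only the checkpoint-related lines shown):
>>
>>     import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
>>     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>>
>>     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>     // RocksDB state backend with checkpoints stored in S3
>>     env.setStateBackend(new RocksDBStateBackend("s3://<our-bucket>/checkpoints"));
>>     env.enableCheckpointing(5 * 60 * 1000); // 5 minute checkpoint interval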
>>
>> We did not experience this weird behaviour in Flink 1.6. We have a
>> 5-minute checkpoint interval, and this happens sometimes once an hour,
>> sometimes more often, but not on every checkpoint request. Please see the
>> screenshot below.
>>
>> Another point: for the faulty slots, the duration is consistently 15
>> minutes and some seconds, and we could not find out where this 15-minute
>> response time comes from. Each time it is a different task manager, not
>> always the same one.
>>
>> Are you aware of any other users having similar issues with the new
>> version, and is there a suggested bug fix or solution?
>>
>> Thanks in advance,
>> Bekir Oguz
>>
>
