Hi Dongwon,

There are currently no metrics for the async work-queue size (you should be
able to see the queue stats with debug logs enabled though [1]). As far as
I can tell from looking at the code, the async operator is able to
checkpoint even if the work-queue is exhausted.

Arvid can you please validate the above? (the checkpoints not being blocked
by the work queue part)

[1]
https://github.com/apache/flink/blob/release-1.14.0/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/async/queue/OrderedStreamElementQueue.java#L109

Best,
D.

On Sun, Nov 7, 2021 at 10:41 AM Dongwon Kim <eastcirc...@gmail.com> wrote:

> Hi community,
>
> While using Flink's async i/o for interacting with an external system, I
> got the following exception:
>
> 2021-11-06 10:38:35,270 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 54 (type=CHECKPOINT) @ 1636162715262 for job 
> f168a44ea33198cd71783824d49f9554.
> 2021-11-06 10:38:47,031 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
> checkpoint 54 for job f168a44ea33198cd71783824d49f9554 (11930992707 bytes, 
> checkpointDuration=11722 ms, finalizationTime=47 ms).
> 2021-11-06 10:58:35,270 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 55 (type=CHECKPOINT) @ 1636163915262 for job 
> f168a44ea33198cd71783824d49f9554.
> 2021-11-06 11:08:35,271 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
> 55 of job f168a44ea33198cd71783824d49f9554 expired before completing.
> 2021-11-06 11:08:35,287 INFO  org.apache.flink.runtime.jobmaster.JobMaster    
>              [] - Trying to recover from a global failure.
>
>
> - FYI, I'm using 1.14.0 and enabled unaligned checkpointing and buffer
> debloating
> - the 55th ckpt failed to complete within 10 mins (which is the value of
> execution.checkpointing.timeout)
> - the below graph shows that backpressure skyrocketed around the time the
> 55th ckpt began
> [image: image.png]
>
> What I suspect is the capacity of the asynchronous operation because
> limiting the value can cause back-pressure once the capacity is exhausted
> [1].
>
> Although I could increase the value, I want to monitor the current
> in-flight async i/o requests like the above back-pressure graph on Grafana.
> [2] does not introduce any system metric specific to async i/o.
>
> Best,
>
> Dongwon
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api
> [2]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#system-metrics
>
>
>

Reply via email to