Hi Sushant

What confuses me is why the source task cannot complete its checkpoint within 
3 minutes [1]. If no sub-task has ever completed the checkpoint, then even the 
source task cannot complete it, although a source task does not need to buffer 
any data. From what I see, it might be stuck acquiring the lock that is held by 
the stream task's main thread while it processes elements [2]. Could you use 
jstack to capture a thread dump of your Java process, so we can see what is 
happening when the checkpoint fails?
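
As a rough illustration of what to look for in such a dump, here is a minimal 
stand-alone sketch. It is not Flink code: the class name, the thread names and 
the lock object are hypothetical stand-ins for the task main thread and the 
checkpoint trigger. It uses the JDK's ThreadMXBean to list threads that are 
BLOCKED on a monitor together with the thread that owns it, which is the same 
kind of information jstack prints for a running JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo, not Flink code: one thread holds a lock (like the task
// main thread processing elements) while another waits for that same lock
// (like the checkpoint trigger).
class LockContentionDump {

    /** Returns one "waiter blocked by owner" line per thread BLOCKED on a monitor. */
    static String describeBlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(true, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                sb.append(info.getThreadName())
                  .append(" blocked by ")
                  .append(info.getLockOwnerName())
                  .append('\n');
            }
        }
        return sb.toString();
    }

    /** Starts the two stand-in threads, takes a dump, then lets both finish. */
    static String runDemo() {
        final Object checkpointLock = new Object(); // stand-in for the checkpoint lock
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread processing = new Thread(() -> {
            synchronized (checkpointLock) {
                lockHeld.countDown();
                try {
                    release.await(); // simulates a long-running element
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-task-main-thread");

        Thread checkpoint = new Thread(() -> {
            synchronized (checkpointLock) {
                // a checkpoint would be taken here once the lock is free
            }
        }, "demo-checkpoint-trigger");

        try {
            processing.start();
            lockHeld.await();
            checkpoint.start();
            // Wait until the second thread is really waiting on the monitor.
            while (checkpoint.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            String report = describeBlockedThreads();
            release.countDown();
            processing.join();
            checkpoint.join();
            return report;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(runDemo());
    }
}
```

In a real dump from your job, the thing to check would be whether the thread 
that triggers the checkpoint is BLOCKED, and which thread owns the lock it is 
waiting for.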

[1] 
https://github.com/sushantbprise/flink-dashboard/blob/master/failed-checkpointing/state2.png
[2] 
https://github.com/apache/flink/blob/ccc7eb431477059b32fb924104c17af953620c74/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L758

Best
Yun Tang
________________________________
From: Sushant Sawant <sushantsawant7...@gmail.com>
Sent: Tuesday, August 27, 2019 15:01
To: user <user@flink.apache.org>
Subject: Re: checkpoint failure suddenly even state size less than 1 mb

Hi team,
Could anyone help or offer suggestions? We have now stopped all input to Kafka, 
so there is no processing and nothing reaches the sinks, yet checkpointing is 
still failing.
Is it the case that once a checkpoint fails, checkpoints keep failing until the 
job is restarted?

Help appreciated.

Thanks & Regards,
Sushant Sawant

On 23 Aug 2019 12:56 p.m., "Sushant Sawant" 
<sushantsawant7...@gmail.com> wrote:
Hi all,
I'm facing two issues, which I believe are related:
1. The Kafka source shows high back pressure.
2. Checkpoints suddenly fail for an entire day, until the job is restarted.

My job does the following:
a. Reads from Kafka
b. Makes Async I/O calls to an external system
c. Writes to Cassandra and Elasticsearch

Checkpointing uses the file system.
This Flink job has been proven under high load,
at around 5000/sec throughput.
But recently we scaled down the parallelism, since there wasn't any load in 
production, and these issues started.

Please find the status shown by the Flink dashboard.
This GitHub folder contains images from when there was high back pressure and 
checkpoint failure:
https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
and the "everything is fine" images taken after the restart are in this folder:
https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing

--
Could anyone point me in the right direction as to what might have gone wrong, 
or how to troubleshoot this?


Thanks & Regards,
Sushant Sawant
