Checkpoints suddenly failing even though state size is only around 10 MB

Sushant Sawant Fri, 23 Aug 2019 00:27:26 -0700

Hi all,
I'm facing two issues, which I believe are related:
1. The Kafka source shows high back pressure.
2. Checkpoints suddenly start failing and keep failing for the entire day, until the job is restarted.


My job does the following:
a. Reads from Kafka
b. Makes async I/O calls to an external system
c. Writes to Cassandra and Elasticsearch

Checkpointing uses the file system.
This Flink job has been proven under high load,
at a throughput of around 5000/sec.
But we recently scaled down the parallelism, since there wasn't any load in
production, and that is when these issues started.
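
One way to sanity-check whether the reduced parallelism alone could explain the back pressure is a rough capacity estimate for the async stage. The sketch below applies Little's law (sustainable throughput = requests in flight / latency per request) to a stage that allows a fixed number of in-flight requests per parallel subtask. The parallelism, capacity, latency, and input-rate values are hypothetical placeholders, not measurements from this job, and the class and method names are made up for illustration.

```java
// Rough capacity estimate for an async I/O stage, based on Little's law.
// All numbers used in main() are hypothetical, not taken from the real job.
final class AsyncCapacityEstimate {

    // Upper bound on records/sec the async stage can sustain:
    // (parallelism * in-flight requests per subtask) / latency in seconds.
    static double maxThroughputPerSec(int parallelism,
                                      int capacityPerSubtask,
                                      double latencyMillis) {
        if (parallelism <= 0 || capacityPerSubtask <= 0 || latencyMillis <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        return parallelism * capacityPerSubtask * 1000.0 / latencyMillis;
    }

    // True when the input rate exceeds what the async stage can absorb,
    // in which case the stage would push back on the upstream source.
    static boolean fallsBehind(double inputRatePerSec,
                               int parallelism,
                               int capacityPerSubtask,
                               double latencyMillis) {
        return inputRatePerSec
                > maxThroughputPerSec(parallelism, capacityPerSubtask, latencyMillis);
    }

    public static void main(String[] args) {
        double inputRate = 5000.0;   // hypothetical records/sec
        int capacity = 100;          // hypothetical in-flight requests per subtask
        double latencyMillis = 100;  // hypothetical external-system latency

        for (int parallelism : new int[] {8, 2}) {
            System.out.println("parallelism=" + parallelism
                    + " max/sec=" + maxThroughputPerSec(parallelism, capacity, latencyMillis)
                    + " fallsBehind=" + fallsBehind(inputRate, parallelism, capacity, latencyMillis));
        }
    }
}
```

With these made-up numbers the stage keeps up at the higher parallelism and falls behind at the lower one. Whether anything similar applies to the real job depends on its actual async capacity setting and on the latency of the external system.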

The status shown by the Flink dashboard is captured in the screenshots linked below.
This GitHub folder contains images taken while there was high back pressure and
checkpoints were failing:
https://github.com/sushantbprise/flink-dashboard/tree/master/failed-checkpointing
After a restart everything is fine, as shown by the images in this folder:
https://github.com/sushantbprise/flink-dashboard/tree/master/working-checkpointing

--
Could anyone point me in the right direction as to what might have gone wrong,
or how to troubleshoot this?


Thanks & Regards,
Sushant Sawant
