PartitionNotFoundException when restarting from checkpoint

Seth Wiesman Fri, 09 Mar 2018 08:53:58 -0800

Hi,

We are running Flink 1.4.0 on YARN on EC2 instances, with RocksDB and 
incremental checkpointing. Last night a job failed and became stuck in a 
restart cycle with a PartitionNotFoundException. We tried restoring from the 
checkpoint on a fresh Flink session, with no luck. Looking through the logs, 
we can see that the specified partition is never registered with the 
ResultPartitionManager.


My questions are:

1)       Are partitions part of state, or are they ephemeral to the job?

2)       If they are not part of state, where would the task managers be 
getting that partition ID to begin with?

3)       Right now we are logging everything under 
org.apache.flink.runtime.io.network. Is there anywhere else to look?

Thank you,

Seth Wiesman | Software Engineer
4 World Trade Center, 46th Floor, New York, NY 10007
swies...@mediamath.com<mailto:fl...@mediamath.com>
