When this happens, it appears that one of the workers fails while the rest of the
workers continue to run. How can I configure the app so that it recovers
completely from the last successful checkpoint when this happens?
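For concreteness, the checkpoint and restart configuration in question looks roughly like this. This is only a minimal sketch assuming the standard Flink 1.5 DataStream API inside the job's main method; the values mirror what is described in the quoted message below, and the snippet is not the actual job code.

  import org.apache.flink.api.common.restartstrategy.RestartStrategies;
  import org.apache.flink.api.common.time.Time;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

  // Checkpoint every 2 minutes with a 2-minute timeout (values from the quoted message).
  env.enableCheckpointing(2 * 60 * 1000);
  env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 1000);

  // Restart strategy as currently reported by the job: fixed delay of 0 ms and
  // 2147483647 (Integer.MAX_VALUE) attempts. On each restart attempt Flink
  // restores state from the last successful checkpoint.
  env.setRestartStrategy(
          RestartStrategies.fixedDelayRestart(Integer.MAX_VALUE, Time.milliseconds(0)));

The question is essentially whether changing these settings (for example a longer checkpoint timeout or a non-zero restart delay) is what is needed for the whole job, not just a single worker, to fall back to the last successful checkpoint.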
--- Original Message ---
On Monday, December 3, 2018 11:02 AM, Flink Developer wrote:
> I have a Flink app on 1.5.2 which sources data from a Kafka topic (400
> partitions) and runs with a parallelism of 400. The sink is a bucketing sink
> to S3, with RocksDB as the state backend. The checkpoint interval is 2 minutes
> and the checkpoint timeout is 2 minutes. Checkpoint size is a few MB. After
> running for a few days, I see:
>
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover strategy - falling back to global restart
> java.lang.ClassCastException: com.amazonaws.services.s3.model.AmazonS3Exception cannot be cast to com.amazonaws.AmazonClientException
>     at org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
>     at org.apache.flink.util.SerializedThrowable
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>
> What causes the exception and why is the Flink job unable to recover? It
> states that it is falling back to global restart. How can this be configured
> to recover properly? Is the checkpoint interval/timeout too low? The Flink
> job's configuration shows "Restart with fixed delay (0ms), #2147483647
> restart attempts".
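For context, the overall shape of the job described above is roughly the following. Again, this is only a sketch: the class name, topic name, bucket paths, Kafka properties, and the choice of the Kafka 0.11 consumer are placeholders and assumptions, not the original code.

  import java.util.Properties;

  import org.apache.flink.api.common.serialization.SimpleStringSchema;
  import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
  import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

  public class KafkaToS3Job {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
          env.setParallelism(400);                        // matches the 400 Kafka partitions
          env.enableCheckpointing(2 * 60 * 1000);         // 2-minute checkpoint interval
          env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 1000);

          // RocksDB state backend with checkpoints on a durable filesystem (path is a placeholder).
          env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints"));

          Properties props = new Properties();
          props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
          props.setProperty("group.id", "my-consumer-group");    // placeholder

          // Source: 400-partition Kafka topic.
          DataStream<String> stream = env.addSource(
                  new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props));

          // Sink: bucketing sink writing to S3 (path is a placeholder).
          stream.addSink(new BucketingSink<String>("s3://my-bucket/output"));

          env.execute("kafka-to-s3");
      }
  }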