Re: Flink Exception - AmazonS3Exception and ExecutionGraph - Error in failover strategy

2018-12-04 Thread Flink Developer
When this happens, it appears that one worker fails while the rest of the 
workers continue to run. How can I configure the app to recover itself 
completely from the last successful checkpoint when this happens?
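For what it's worth, one thing I have been looking at is giving the restart strategy a non-zero delay and retaining checkpoint metadata, so the job can back off before retrying and can be resubmitted from the last successful checkpoint if the automatic restarts keep failing. A minimal sketch of the relevant flink-conf.yaml keys (Flink 1.5; the attempt count, delay, and bucket path are illustrative values, not recommendations):

```yaml
# Illustrative flink-conf.yaml fragment (Flink 1.5 keys; values are examples only)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10   # finite, instead of Integer.MAX_VALUE
restart-strategy.fixed-delay.delay: 30 s    # back off instead of retrying at 0 ms

# Persist checkpoint metadata so the job can be resumed manually from the
# last successful checkpoint if the automatic restart loop fails
state.checkpoints.dir: s3://my-bucket/flink-checkpoints   # hypothetical bucket
```

With a 0 ms delay, a transient S3 failure can be hit again immediately on every restart attempt, which may be why the job never recovers.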

‐‐‐ Original Message ‐‐‐
On Monday, December 3, 2018 11:02 AM, Flink Developer 
 wrote:

> I have a Flink app on 1.5.2 which sources data from a Kafka topic (400 
> partitions) and runs with a parallelism of 400. The sink is a bucketing sink to 
> S3, with the RocksDB state backend. The checkpoint interval is 2 min and the 
> checkpoint timeout is 2 min. Checkpoint size is a few MB. After running for a 
> few days, I see:
>
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover 
> strategy - falling back to global restart
> java.lang.ClassCastException: 
> com.amazonaws.services.s3.model.AmazonS3Exception cannot be cast to 
> com.amazonaws.AmazonClientException
> at 
> org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
> at org.apache.flink.util.SerializedThrowable
> at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>
> What causes the exception, and why is the Flink job unable to recover? It 
> states it is falling back to global restart. How can this be configured to 
> recover properly? Is the checkpoint interval/timeout too low? The Flink job's 
> configuration shows restart with fixed delay (0 ms), #2147483647 restart 
> attempts.

Flink Exception - AmazonS3Exception and ExecutionGraph - Error in failover strategy

2018-12-03 Thread Flink Developer
I have a Flink app on 1.5.2 which sources data from a Kafka topic (400 
partitions) and runs with a parallelism of 400. The sink is a bucketing sink to 
S3, with the RocksDB state backend. The checkpoint interval is 2 min and the 
checkpoint timeout is 2 min. Checkpoint size is a few MB. After running for a 
few days, I see:

org.apache.flink.runtime.executiongraph.ExecutionGraph - Error in failover 
strategy - falling back to global restart
java.lang.ClassCastException: com.amazonaws.services.s3.model.AmazonS3Exception 
cannot be cast to com.amazonaws.AmazonClientException
at 
org.apache.hadoop.fs.s3a.AWSClientIOException.getCause(AWSClientIOException.java:42)
at org.apache.flink.util.SerializedThrowable
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatus()
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

What causes the exception, and why is the Flink job unable to recover? It 
states it is falling back to global restart. How can this be configured to 
recover properly? Is the checkpoint interval/timeout too low? The Flink job's 
configuration shows restart with fixed delay (0 ms), #2147483647 restart 
attempts.
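In case it helps, this is roughly how the job sets up checkpointing and the restart strategy. A minimal sketch against the Flink 1.5 DataStream API; the attempt count and delay are illustrative, not the values the job actually uses, and the real source/sink wiring is omitted:

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 2 minutes with a 2-minute timeout, as described above
        env.enableCheckpointing(120_000L);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointTimeout(120_000L);

        // Retain checkpoint metadata after failure/cancellation so the job can be
        // resubmitted from the last successful checkpoint
        checkpointConfig.enableExternalizedCheckpoints(
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Finite attempts with a non-zero delay, instead of the current
        // 0 ms delay with Integer.MAX_VALUE attempts (illustrative values)
        env.setRestartStrategy(
            RestartStrategies.fixedDelayRestart(10, Time.of(30, TimeUnit.SECONDS)));

        // ... Kafka source, transformations, and bucketing sink to S3 go here ...
    }
}
```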