Re: Job manager is failing to start with an S3 no key specified exception [1.7.2]

2019-12-10 Thread Andrey Zagrebin
Hi Harshith,

Could you share your full log files from the job master?
As I understand it, this stack trace already belongs to a failover attempt.
What was the original cause of the failover? Do you still have any other job
state in S3 for this cluster id `flink-2`?
Have you tried the latest version of Flink 1.9?

Best,
Andrey

On Mon, Dec 9, 2019 at 12:37 PM Kumar Bolar, Harshith wrote:

> Hi all,
>
>
>
> I'm running a standalone Flink cluster with Zookeeper and S3 for high
> availability storage. All of a sudden, the job managers started failing
> with an S3 `UnrecoverableS3OperationException` error. Here is the full
> error trace -
>
>
>
> ```
>
> java.lang.RuntimeException:
> org.apache.flink.runtime.client.JobExecutionException: Could not set up
> JobManager
>
> at
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
>
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>
> at
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could
> not set up JobManager
>
> at
> org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
>
> at
> org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
>
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
>
> at
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
>
> ... 7 more
>
> Caused by:
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$UnrecoverableS3OperationException:
> org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The specified key does not exist. (Service: Amazon S3; Status Code: 404;
> Error Code: NoSuchKey; Request ID: 1769066EBD605AB5; S3 Extended Request
> ID:
> K8jjbsE4DPAsZJDVJKBq3Nh0E0o+feafefavbvbaae+nbUTphHHw73/eafafefa+dsVMR0=),
> S3 Extended Request ID:
> lklalkioe+eae2234+nbUTphHHw73/gVSclc1o1YH7M0MeNjmXl+dsVMR0= (Path:
> s3://abc-staging/flink/jobmanagerha/flink-2/blob/job_3e16166a1122885eb6e9b2437929b266/blob_p-3b687174148e9e1dd951f2a9fbec83f4fcd5281e-b85417f69b354c83b270bf01dcf389e0)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.lambda$openStream$1(PrestoS3FileSystem.java:908)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:893)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:878)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.seekStream(PrestoS3FileSystem.java:871)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.lambda$read$0(PrestoS3FileSystem.java:810)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
>
> at
> org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.read(PrestoS3FileSystem.java:809)
>
> ... 10 more
>
> Caused by:
> org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The specified key does not exist. (Service: Amazon S3; Status Code: 404;
> Error Code: NoSuchKey; Request ID: 1769066EBaD6aefB5; S3 Extended Request
> ID: fealloga+4rVwsF+nbUTphHHw73/gVSclc1o1YH7M0MeNjmXl+dsVMR0=), S3 Extended
> Request ID:
> K8jjbsE4DPAsZJDVJKBq3Nh0E0o+4rVwsF+nbUTphHHweafga/lc1o1YH7M0MeNjmXl+dsVMR0=
>
> at
> org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
>
> at
> org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
>
> at
> 

Job manager is failing to start with an S3 no key specified exception [1.7.2]

2019-12-09 Thread Kumar Bolar, Harshith
Hi all,

I'm running a standalone Flink cluster with Zookeeper and S3 for high 
availability storage. All of a sudden, the job managers started failing with an 
S3 `UnrecoverableS3OperationException` error. Here is the full error trace -

```
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
    at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
    at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
    ... 7 more
Caused by: org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$UnrecoverableS3OperationException: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 1769066EBD605AB5; S3 Extended Request ID: K8jjbsE4DPAsZJDVJKBq3Nh0E0o+feafefavbvbaae+nbUTphHHw73/eafafefa+dsVMR0=), S3 Extended Request ID: lklalkioe+eae2234+nbUTphHHw73/gVSclc1o1YH7M0MeNjmXl+dsVMR0= (Path: s3://abc-staging/flink/jobmanagerha/flink-2/blob/job_3e16166a1122885eb6e9b2437929b266/blob_p-3b687174148e9e1dd951f2a9fbec83f4fcd5281e-b85417f69b354c83b270bf01dcf389e0)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.lambda$openStream$1(PrestoS3FileSystem.java:908)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:893)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.openStream(PrestoS3FileSystem.java:878)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.seekStream(PrestoS3FileSystem.java:871)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.lambda$read$0(PrestoS3FileSystem.java:810)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
    at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.PrestoS3FileSystem$PrestoS3InputStream.read(PrestoS3FileSystem.java:809)
    ... 10 more
Caused by: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 1769066EBaD6aefB5; S3 Extended Request ID: fealloga+4rVwsF+nbUTphHHw73/gVSclc1o1YH7M0MeNjmXl+dsVMR0=), S3 Extended Request ID: K8jjbsE4DPAsZJDVJKBq3Nh0E0o+4rVwsF+nbUTphHHweafga/lc1o1YH7M0MeNjmXl+dsVMR0=
    at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
    at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
    at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
    ... 30 more
```

I could fix this by changing the `high-availability.cluster-id` property (which 
is currently set to `flink-2`) but with that I would lose all the existing jobs 
and state. Is there any way I can tell Flink to ignore this particular key in 
S3 and start the job managers?

Thanks,
Harshith
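
For anyone hitting the same trace, it may help to pull the missing object's coordinates out of the `(Path: s3://...)` part of the exception before digging through the bucket. The following is only a rough sketch, not something from Flink itself: the function name `parse_missing_blob` and the `HaBlobPath` type are made up for illustration, and the layout `<cluster-id>/blob/job_<job-id>/blob_...` is inferred solely from the path in the trace above, so it may not hold for other versions or setups.

```python
import re
from typing import NamedTuple, Optional


class HaBlobPath(NamedTuple):
    """Hypothetical container for the pieces of the missing S3 object path."""
    bucket: str
    object_key: str   # full key inside the bucket
    cluster_id: str   # assumed to be the directory right before "/blob/"
    job_id: str       # hex id taken from the "job_<id>" directory
    blob_name: str    # final "blob_..." path component


# Matches "(Path: s3://bucket/key)"; \s* tolerates the line wrap after "Path:".
_PATH_RE = re.compile(r"\(Path:\s*s3://(?P<bucket>[^/\s]+)/(?P<key>\S+?)\)")

# Layout assumed from the trace: .../<cluster-id>/blob/job_<job-id>/blob_...
_LAYOUT_RE = re.compile(
    r"(?:^|/)(?P<cluster>[^/]+)/blob/job_(?P<job>[0-9a-f]+)/(?P<blob>blob_[^/]+)$"
)


def parse_missing_blob(trace: str) -> Optional[HaBlobPath]:
    """Return the bucket, key, cluster id and job id named in the trace, if any."""
    path = _PATH_RE.search(trace)
    if path is None:
        return None
    layout = _LAYOUT_RE.search(path.group("key"))
    if layout is None:
        return None
    return HaBlobPath(
        bucket=path.group("bucket"),
        object_key=path.group("key"),
        cluster_id=layout.group("cluster"),
        job_id=layout.group("job"),
        blob_name=layout.group("blob"),
    )
```

With the bucket and key in hand, one can check whether the object (or anything else under the same `job_...` prefix) still exists, for example with `aws s3 ls`, which would also help answer the question about remaining job state for this cluster id.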