[GitHub] [flink] rmetzger edited a comment on pull request #18692: [FLINK-26015] Fixes object store bug

2022-02-10 Thread GitBox


rmetzger edited a comment on pull request #18692:
URL: https://github.com/apache/flink/pull/18692#issuecomment-1034859586


   Sadly, the JobResultStore (JRS) still doesn't work on K8s when using a MinIO S3 implementation:
   ```
   2022-02-10 12:20:23,679 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Starting the resource manager.
   2022-02-10 12:20:23,765 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Start SessionDispatcherLeaderProcess.
   2022-02-10 12:20:25,060 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
   2022-02-10 12:20:25,164 INFO  org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
   2022-02-10 12:20:25,255 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal error occurred in the cluster entrypoint.
   java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not retrieve JobResults of globally-terminated jobs from JobResultStore
       at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_322]
       at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) [?:1.8.0_322]
       at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606) [?:1.8.0_322]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_322]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_322]
       at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
   Caused by: org.apache.flink.util.FlinkRuntimeException: Could not retrieve JobResults of globally-terminated jobs from JobResultStore
       at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:186) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:178) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_322]
       ... 3 more
   Caused by: java.io.FileNotFoundException: No such file or directory: s3://xxx-eu-west-1-dev-store/myorg/myscope/3d78a6e7-4c88-4e6f-8e59-4fb4b6dd6319-test-job-name-a/ha/job-result-store/default
       at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2344) ~[?:?]
       at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2226) ~[?:?]
       at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2160) ~[?:?]
       at org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:1961) ~[?:?]
       at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$9(S3AFileSystem.java:1940) ~[?:?]
       at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109) ~[?:?]
       at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:1940) ~[?:?]
       at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.listStatus(HadoopFileSystem.java:170) ~[?:?]
       at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.listStatus(PluginFileSystemFactory.java:141) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:158) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:184) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:178) ~[flink-dist-1.15-jrs-fix.jar:1.15-jrs-fix]
       at java.util.concurrent.CompletableFuture$
