[ 
https://issues.apache.org/jira/browse/FLINK-24789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439211#comment-17439211
 ] 

Chesnay Schepler edited comment on FLINK-24789 at 11/5/21, 12:42 PM:
---------------------------------------------------------------------

Note that some changes were recently made to the CheckpointsCleaner in 
{{-FLINK-24789-}} FLINK-23647.


was (Author: zentol):
Note that recently some changes were made to the CheckpointCleaner in 
FLINK-24789 FLINK-23647.

> IllegalStateException with CheckpointCleaner being closed already
> -----------------------------------------------------------------
>
>                 Key: FLINK-24789
>                 URL: https://issues.apache.org/jira/browse/FLINK-24789
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Matthias
>            Assignee: David Morávek
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.15.0, 1.14.1
>
>         Attachments: 
> logs-ci_build-test_ci_build_finegrained_resource_management-1635785399.zip
>
>
> We experienced a failure of {{OperatorCoordinatorSchedulerTest}} in our VVP 
> fork of Flink. The {{finegrained_resource_management}} test run failed with 
> a non-zero exit code:
> {code}
> Nov 01 17:19:12 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on 
> project flink-runtime: There are test failures.
> Nov 01 17:19:12 [ERROR] 
> Nov 01 17:19:12 [ERROR] Please refer to 
> /__w/1/s/flink-runtime/target/surefire-reports for the individual test 
> results.
> Nov 01 17:19:12 [ERROR] Please refer to dump files (if any exist) 
> [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
> Nov 01 17:19:12 [ERROR] ExecutionException The forked VM terminated without 
> properly saying goodbye. VM crash or System.exit called?
> Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
> -Dmvn.forkNumber=2 -XX:+UseG1GC -jar 
> /__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar 
> /__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2 
> surefire6448660128033443499tmp surefire_4131168043975619749001tmp
> Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
> Nov 01 17:19:12 [ERROR] Process Exit Code: 239
> Nov 01 17:19:12 [ERROR] Crashed tests:
> Nov 01 17:19:12 [ERROR] 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
> Nov 01 17:19:12 [ERROR] 
> org.apache.maven.surefire.booter.SurefireBooterForkException: 
> ExecutionException The forked VM terminated without properly saying goodbye. 
> VM crash or System.exit called?
> Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && 
> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
> -Dmvn.forkNumber=2 -XX:+UseG1GC -jar 
> /__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar 
> /__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2 
> surefire6448660128033443499tmp surefire_4131168043975619749001tmp
> Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
> Nov 01 17:19:12 [ERROR] Process Exit Code: 239
> Nov 01 17:19:12 [ERROR] Crashed tests:
> Nov 01 17:19:12 [ERROR] 
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
> Nov 01 17:19:12 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
> Nov 01 17:19:12 [ERROR] at 
> org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:457)
> {code}
> It looks like {{testSnapshotAsyncFailureFailsCheckpoint}} caused the crash: 
> the test itself finished successfully, but a fatal error occurred while the 
> cluster was shutting down:
> {code}
> 17:07:27,264 [    Checkpoint Timer] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler              [] - FATAL: 
> Thread 'Checkpoint Timer' produced an uncaught exception. Stopping the 
> process...
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: java.lang.IllegalStateException: 
> CheckpointsCleaner has already been closed
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:626)
>  ~[classes/:?]
>         at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
>  ~[?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
>  ~[?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  [?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) 
> [?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:814)
>  [?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>  [?:1.8.0_292]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [?:1.8.0_292]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> [?:1.8.0_292]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  [?:1.8.0_292]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  [?:1.8.0_292]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  [?:1.8.0_292]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  [?:1.8.0_292]
>         at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.IllegalStateException: CheckpointsCleaner has already been closed
>         at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>  ~[?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>  ~[?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838) 
> ~[?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>  ~[?:1.8.0_292]
>         ... 8 more
> Caused by: java.lang.IllegalStateException: CheckpointsCleaner has already 
> been closed
>         at 
> org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) 
> ~[flink-core-1.14-stream-SNAPSHOT.jar:1.14-stream-SNAPSHOT]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointsCleaner.incrementNumberOfCheckpointsToClean(CheckpointsCleaner.java:105)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanup(CheckpointsCleaner.java:87)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanCheckpoint(CheckpointsCleaner.java:62)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.dispose(PendingCheckpoint.java:573)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:551)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1939)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1926)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:910)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:875)
>  ~[classes/:?]
>         at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$6(CheckpointCoordinator.java:614)
>  ~[classes/:?]
>         at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) 
> ~[?:1.8.0_292]
>         at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
>  ~[?:1.8.0_292]
>         ... 8 more
> {code}
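
The trace above suggests that a cleanup request reached the cleaner after it had been closed, and that the resulting IllegalStateException was thrown inside a CompletableFuture callback. The following is a minimal sketch of that failure mode, not Flink's implementation: the names SketchCleaner, lateCleanupFailure and rootCause are hypothetical, and the only assumption taken from the trace is that the cleaner rejects new work with a state check once it is closed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Illustrative only: a hypothetical stand-in, not Flink's CheckpointsCleaner. */
public class ClosedCleanerSketch {

    /** Hypothetical cleaner that refuses new cleanup work after close(). */
    static final class SketchCleaner {
        private final Object lock = new Object();
        private boolean closed = false;

        void clean(Runnable cleanupAction) {
            synchronized (lock) {
                // Assumed to mirror the state check seen in the trace.
                if (closed) {
                    throw new IllegalStateException("cleaner has already been closed");
                }
            }
            cleanupAction.run();
        }

        void close() {
            synchronized (lock) {
                closed = true;
            }
        }
    }

    /** Walks the cause chain down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /**
     * Simulates a failure handler that runs after shutdown: the handler tries
     * to clean up, but the cleaner is already closed.
     * Returns the root-cause message, or null if nothing failed.
     */
    static String lateCleanupFailure() {
        SketchCleaner cleaner = new SketchCleaner();
        cleaner.close(); // shutdown happens first

        CompletableFuture<Void> trigger = new CompletableFuture<>();
        trigger.completeExceptionally(new RuntimeException("trigger failed"));

        // The handler itself throws, so the dependent future fails with a
        // CompletionException that wraps the IllegalStateException.
        CompletableFuture<Void> handled =
                trigger.handle(
                        (ignored, failure) -> {
                            cleaner.clean(() -> {});
                            return null;
                        });
        try {
            handled.join();
            return null;
        } catch (CompletionException e) {
            return rootCause(e).getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(lateCleanupFailure());
    }
}
```

In this sketch the exception stays inside the returned future. In the report above it surfaced on the 'Checkpoint Timer' thread as an uncaught exception, which is what triggered the fatal exit handler and the exit code 239.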



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
