[ https://issues.apache.org/jira/browse/FLINK-24789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439211#comment-17439211 ]
Chesnay Schepler edited comment on FLINK-24789 at 11/5/21, 12:42 PM:
---------------------------------------------------------------------

Note that recently some changes were made to the CheckpointCleaner in FLINK-24789 FLINK-23647.

was (Author: zentol):
    Note that recently some changes were made to the CheckpointCleaner in FLINK-24789.

> IllegalStateException with CheckpointCleaner being closed already
> -----------------------------------------------------------------
>
>                 Key: FLINK-24789
>                 URL: https://issues.apache.org/jira/browse/FLINK-24789
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Matthias
>            Assignee: David Morávek
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.15.0, 1.14.1
>
>         Attachments: logs-ci_build-test_ci_build_finegrained_resource_management-1635785399.zip
>
>
> We experienced a failure of {{OperatorCoordinatorSchedulerTest}} in our VVP
> Fork of Flink. The {{finegrained_resource_management}} test run failed with
> a non-zero exit code:
> {code}
> Nov 01 17:19:12 [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on project flink-runtime: There are test failures.
> Nov 01 17:19:12 [ERROR]
> Nov 01 17:19:12 [ERROR] Please refer to /__w/1/s/flink-runtime/target/surefire-reports for the individual test results.
> Nov 01 17:19:12 [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
> Nov 01 17:19:12 [ERROR] ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=2 -XX:+UseG1GC -jar /__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar /__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2 surefire6448660128033443499tmp surefire_4131168043975619749001tmp
> Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
> Nov 01 17:19:12 [ERROR] Process Exit Code: 239
> Nov 01 17:19:12 [ERROR] Crashed tests:
> Nov 01 17:19:12 [ERROR] org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
> Nov 01 17:19:12 [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
> Nov 01 17:19:12 [ERROR] Command was /bin/sh -c cd /__w/1/s/flink-runtime && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=2 -XX:+UseG1GC -jar /__w/1/s/flink-runtime/target/surefire/surefirebooter6007815607334336440.jar /__w/1/s/flink-runtime/target/surefire 2021-11-01T16-51-51_363-jvmRun2 surefire6448660128033443499tmp surefire_4131168043975619749001tmp
> Nov 01 17:19:12 [ERROR] Error occurred in starting fork, check output in log
> Nov 01 17:19:12 [ERROR] Process Exit Code: 239
> Nov 01 17:19:12 [ERROR] Crashed tests:
> Nov 01 17:19:12 [ERROR] org.apache.flink.runtime.operators.coordination.OperatorCoordinatorSchedulerTest
> Nov 01 17:19:12 [ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
> Nov 01 17:19:12 [ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:457)
> {code}
> It looks like {{testSnapshotAsyncFailureFailsCheckpoint}} caused it: the test
> itself finished successfully, but a fatal error occurred when shutting down
> the cluster:
> {code}
> 17:07:27,264 [    Checkpoint Timer] ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'Checkpoint Timer' produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: java.lang.IllegalStateException: CheckpointsCleaner has already been closed
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$7(CheckpointCoordinator.java:626) ~[classes/:?]
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) [?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) [?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:814) [?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_292]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_292]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
> Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: CheckpointsCleaner has already been closed
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:838) ~[?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_292]
>     ... 8 more
> Caused by: java.lang.IllegalStateException: CheckpointsCleaner has already been closed
>     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-core-1.14-stream-SNAPSHOT.jar:1.14-stream-SNAPSHOT]
>     at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.incrementNumberOfCheckpointsToClean(CheckpointsCleaner.java:105) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanup(CheckpointsCleaner.java:87) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointsCleaner.cleanCheckpoint(CheckpointsCleaner.java:62) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.dispose(PendingCheckpoint.java:573) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abort(PendingCheckpoint.java:551) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1939) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1926) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:910) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.onTriggerFailure(CheckpointCoordinator.java:875) ~[classes/:?]
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$startTriggeringCheckpoint$6(CheckpointCoordinator.java:614) ~[classes/:?]
>     at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_292]
>     at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_292]
>     ... 8 more
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)