snuyanzin commented on PR #23914:
URL: https://github.com/apache/flink/pull/23914#issuecomment-1853554316

   @Jiabao-Sun , @XComp 
   It seems I found the reason.
   
   With JUnit 5.10.1 the test always fails, which makes the cause a bit clearer.
   
   There are 2 threads involved:
   1. JUnit 5 trying to delete the temp dir
   2. the snapshot cleanup, with this stack trace:
   ```
        at org.apache.flink.runtime.state.SnapshotDirectory.cleanup(SnapshotDirectory.java:93)
        at org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase$NativeRocksDBSnapshotResources.release(RocksDBSnapshotStrategyBase.java:384)
        at org.apache.flink.runtime.state.SnapshotStrategyRunner$1.cleanupProvidedResources(SnapshotStrategyRunner.java:97)
        at org.apache.flink.runtime.state.AsyncSnapshotCallable.cleanup(AsyncSnapshotCallable.java:163)
        at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:87)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:508)
        at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:54)
        at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
        at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   ```
   
   I suspect that JUnit 5 performs the directory removal in `AfterEach`, which makes it run concurrently with the checkpoint cleanup...
    
   Since JUnit 5.10.1 makes it fail even locally, as a workaround I replaced `TempDir` with
   ```scala
   val baseCheckpointPath = Files.createTempDirectory(getClass.getCanonicalName)
   Files.deleteIfExists(baseCheckpointPath)
   ```
   Locally it helps.
   
   It is now running on my CI to see whether it helps there as well:
   https://dev.azure.com/snuyanzin/flink/_build/results?buildId=2674&view=results
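
For illustration only, here is a rough Java sketch of the same idea: manage the checkpoint directory by hand instead of relying on `TempDir`. The class `ManualCheckpointDir` and both method names are hypothetical and not part of Flink or JUnit. `reserveUniquePath` mirrors the two calls in the snippet above. `deleteQuietly` is an extra assumption on my side, showing one way a final cleanup could tolerate entries that another thread (such as the snapshot cleanup) has already removed.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper: a manually managed checkpoint directory instead of @TempDir. */
final class ManualCheckpointDir {

    /** Creates a uniquely named temp directory and deletes it again, as in the snippet above. */
    static Path reserveUniquePath(String prefix) throws IOException {
        Path dir = Files.createTempDirectory(prefix);
        Files.deleteIfExists(dir);
        return dir;
    }

    /** Deletes a directory tree, ignoring entries that were already removed concurrently. */
    static void deleteQuietly(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.deleteIfExists(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc)
                    throws IOException {
                // The entry vanished between listing and visiting: nothing left to delete.
                if (exc instanceof NoSuchFileException) {
                    return FileVisitResult.CONTINUE;
                }
                throw exc;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null && !(exc instanceof NoSuchFileException)) {
                    throw exc;
                }
                Files.deleteIfExists(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path base = reserveUniquePath("checkpoint-test");
        Files.createDirectories(base.resolve("chk-1"));
        deleteQuietly(base);
        System.out.println("exists after cleanup: " + Files.exists(base));
    }
}
```

This only makes the deletion tolerant of missing entries. It does not remove the race itself, so the test would still need to call such a cleanup after the asynchronous checkpoint work has finished.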

