[ https://issues.apache.org/jira/browse/FLINK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grace Grimwood updated FLINK-35571:
-----------------------------------
    Attachment:     (was: 20240612_181148_mvn-clean-package_flink-runtime_also-make.log)

> ProfilingServiceTest.testRollingDeletion intermittently fails due to improper test isolation
> --------------------------------------------------------------------------------------------
>
> Key: FLINK-35571
> URL: https://issues.apache.org/jira/browse/FLINK-35571
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Environment: *Git revision:*
> {code:bash}
> $ git show
> commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
> {code}
> *Java info:*
> {code:bash}
> $ java -version
> openjdk version "17.0.11" 2024-04-16
> OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
> {code}
> {code:bash}
> $ sdk current
> Using:
> java: 17.0.11-tem
> maven: 3.8.6
> scala: 2.12.19
> {code}
> *OS info:*
> {code:bash}
> $ uname -av
> Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
> {code}
> *Hardware info:*
> {code:bash}
> $ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
> hw.memsize: 34359738368
> machdep.cpu.core_count: 12
> machdep.cpu.brand_string: Apple M2 Pro
> {code}
> Reporter: Grace Grimwood
> Priority: Major
>
> *Symptom:*
> The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the following error:
> {code:java}
> [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 s <<< FAILURE! -- in org.apache.flink.runtime.util.profiler.ProfilingServiceTest
> [ERROR] org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion -- Time elapsed: 9.264 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
>     at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>     at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>     at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
>     at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
>     at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>     at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
>     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
>     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> The number of extra files found varies from failure to failure.
>
> *Cause:*
> Many of the tests in *{{ProfilingServiceTest}}* rely on a specific configuration of the *{{ProfilingService}}* instance, but *{{ProfilingService.getInstance}}* does not check whether an existing instance's config matches the provided config before returning it. Because of this, and because JUnit does not guarantee a specific ordering of tests (unless they are specifically annotated), it is possible for these tests to receive an instance that does not behave in the expected way and therefore fail.
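The cause described above can be reproduced in miniature. The sketch below is a hypothetical stand-in, not Flink's actual ProfilingService: the Service class, its getInstance(String) signature, the close() method, and the /tmp paths are all assumptions made for illustration. It only shows how a singleton accessor that ignores its argument once an instance exists hands a stale configuration to a later caller.

```java
/** Illustrative stand-in only; this is NOT Flink's ProfilingService. */
public class SingletonConfigSketch {

    /** Hypothetical service whose only setting is a result directory. */
    static final class Service {
        private static Service instance; // shared by every caller in the JVM
        private final String resultDir;

        private Service(String resultDir) {
            this.resultDir = resultDir;
        }

        /** Mirrors the reported behaviour: an existing instance is returned as-is. */
        static synchronized Service getInstance(String resultDir) {
            if (instance == null) {
                instance = new Service(resultDir);
            }
            return instance; // resultDir is ignored when an instance already exists
        }

        /** Clears the shared instance, as a test teardown might. */
        static synchronized void close() {
            instance = null;
        }

        String getResultDir() {
            return resultDir;
        }
    }

    public static void main(String[] args) {
        // An earlier test creates the service with a default directory and never closes it.
        Service.getInstance("/tmp");

        // A later test asks for its own directory but receives the stale instance.
        Service stale = Service.getInstance("/tmp/junit-example");
        System.out.println(stale.getResultDir()); // prints "/tmp"

        // Closing first gives the later test the configuration it asked for.
        Service.close();
        Service fresh = Service.getInstance("/tmp/junit-example");
        System.out.println(fresh.getResultDir()); // prints "/tmp/junit-example"
    }
}
```

In this toy model, either closing any leftover instance before the first test runs or having the accessor compare configs would avoid the stale instance; which of those is appropriate for Flink is not something the sketch can decide.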
>
> *Analysis:*
> In troubleshooting the test failure, we tried adding an extra assertion to *{{ProfilingServiceTest.setUp}}* to validate that the directories being written to were correct:
> {code:java}
> Assertions.assertEquals(tempDir.toString(), profilingService.getProfilingResultDir());
> {code}
> That assertion produced the following failure:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
> {code}
> This failure shows that the *{{ProfilingService}}* returned by *{{ProfilingService.getInstance}}* in the setup is not using the correct directory, and therefore cannot be the correct instance for this test class because it has the wrong config.
>
> This is because the static method *{{ProfilingService.getInstance}}* attempts to reuse any existing instance of *{{ProfilingService}}* before it creates a new one, and disregards any differences in config in doing so. If another test instantiates a *{{ProfilingService}}* with different config first and does not close it, that previous instance will be provided to *{{ProfilingServiceTest}}* rather than the new instance those tests seem to expect. This only happens with the first test run in this class, as the teardown method run after every test explicitly closes the existing *{{ProfilingService}}* instance.
>
> Specifically, in the case of the test failures I have observed, it seems that if *{{ProfilingServiceTest.testRollingDeletion}}* is run _before_ any other *{{ProfilingServiceTest}}* tests but _after_ the test methods in *{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a *{{TaskExecutor}}* via a *{{MiniCluster}}*), it will fail. From what I've been able to gather, *{{TaskExecutor}}* calls *{{ProfilingService.getInstance}}* with default config and holds on to that instance internally, but doesn't attempt to close that *{{ProfilingService}}* instance when the *{{TaskExecutor}}* instance is itself closed. This means that instance is sometimes still around when *{{ProfilingServiceTest.setUp}}* is run, so it gets passed to *{{ProfilingServiceTest.testRollingDeletion}}*, at which point that test fails because it incorrectly assumes that it has a new *{{ProfilingService}}* instance with a clean directory configured.
>
> Logs are attached, produced with the following command:
> {code:bash}
> mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl :flink-runtime
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)