[ 
https://issues.apache.org/jira/browse/FLINK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grace Grimwood updated FLINK-35571:
-----------------------------------
    Attachment:     (was: 
20240612_181148_mvn-clean-package_flink-runtime_also-make.log)

> ProfilingServiceTest.testRollingDeletion intermittently fails due to improper 
> test isolation
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35571
>                 URL: https://issues.apache.org/jira/browse/FLINK-35571
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>         Environment: *Git revision:*
> {code:bash}
> $ git show
> commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
> {code}
> *Java info:*
> {code:bash}
> $ java -version
> openjdk version "17.0.11" 2024-04-16
> OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
> {code}
> {code:bash}
> $ sdk current
> Using:
> java: 17.0.11-tem
> maven: 3.8.6
> scala: 2.12.19
> {code}
> *OS info:*
> {code:bash}
> $ uname -av
> Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 
> PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
> {code}
> *Hardware info:*
> {code:bash}
> $ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 
> 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
> hw.memsize: 34359738368
> machdep.cpu.core_count: 12
> machdep.cpu.brand_string: Apple M2 Pro
> {code}
>            Reporter: Grace Grimwood
>            Priority: Major
>
> *Symptom:*
> The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the 
> following error:
> {code:java}
> [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 
> s <<< FAILURE! -- in 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest
> [ERROR] 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion
>  -- Time elapsed: 9.264 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
>       at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>       at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>       at 
> org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>       at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>       at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>       at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
>       at 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
>       at 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>       at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
>       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
>       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> The number of extra files found varies from failure to failure.
> *Cause:*
> Many of the tests in *{{ProfilingServiceTest}}* rely on a specific 
> configuration of the *{{ProfilingService}}* instance, but 
> *{{ProfilingService.getInstance}}* does not check whether an existing 
> instance's config matches the provided config before returning it. Because of 
> this, and because JUnit does not guarantee a specific ordering of tests 
> (unless they are specifically annotated), it is possible for these tests to 
> receive an instance that does not behave in the expected way and therefore 
> fail.
> *Analysis:*
> In troubleshooting the test failure, we tried adding an extra assertion to 
> *{{ProfilingServiceTest.setUp}}* to validate the directories being written to 
> were correct:
> {code:java}
> Assertions.assertEquals(tempDir.toString(), 
> profilingService.getProfilingResultDir());
> {code}
> That assert produced the following failure:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: 
> </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> 
> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
> {code}
> This failure shows that the *{{ProfilingService}}* returned by 
> *{{ProfilingService.getInstance}}* in the setup is not using the correct 
> directory, and therefore cannot be the correct instance for this test class 
> because it has the wrong config.
> This is because the static method *{{ProfilingService.getInstance}}* attempts 
> to reuse any existing instance of *{{ProfilingService}}* before it creates a 
> new one and disregards any differences in config in doing so, which means 
> that if another test instantiates a *{{ProfilingService}}* with different 
> config first and does not close it, that previous instance will be provided 
> to *{{ProfilingServiceTest}}* rather than the new instance those tests seem 
> to expect. This only happens with the first test run in this class, as the 
> teardown method run after every test explicitly closes the existing 
> *{{ProfilingService}}* instance.
> Specifically in the case of the test failures I have observed, it seems that 
> if *{{ProfilingServiceTest.testRollingDeletion}}* is run _before_ any other 
> *{{ProfilingServiceTest}}* tests but _after_ the test methods in 
> *{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a 
> *{{TaskExecutor}}* via a {*}{{MiniCluster}}{*}), it will fail. From what I've 
> been able to gather, *{{TaskExecutor}}* calls 
> *{{ProfilingService.getInstance}}* with default config, and holds on to that 
> instance internally but doesn't attempt to close that *{{ProfilingService}}* 
> instance when the *{{TaskExecutor}}* instance is itself closed. This means 
> that instance is sometimes still around when *{{ProfilingServiceTest.setUp}}* 
> is run, so it gets passed to *{{ProfilingServiceTest.testRollingDeletion}}* 
> at which point that test will fail as it incorrectly assumes that it has a 
> new *{{ProfilingService}}* instance with a clean directory configured.
> .
> Logs are attached, produced with the following command:
> {code:bash}
> mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl 
> :flink-runtime
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to