Grace Grimwood created FLINK-35571:
--------------------------------------
Summary: ProfilingServiceTest.testRollingDeletion intermittently
fails due to improper test isolation
Key: FLINK-35571
URL: https://issues.apache.org/jira/browse/FLINK-35571
Project: Flink
Issue Type: Bug
Components: Tests
Environment: *Git revision:*
{code:bash}
$ git show
commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
{code}
*Java info:*
{code:bash}
$ java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
{code}
{code:bash}
$ sdk current
Using:
java: 17.0.11-tem
maven: 3.8.6
scala: 2.12.19
{code}
*OS info:*
{code:bash}
$ uname -av
Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
{code}
*Hardware info:*
{code:bash}
$ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
hw.memsize: 34359738368
machdep.cpu.core_count: 12
machdep.cpu.brand_string: Apple M2 Pro
{code}
Reporter: Grace Grimwood
Attachments:
20240612_181148_mvn-clean-package_flink-runtime_also-make.log
*Symptom:*
The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the
following error:
{code:java}
[ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 s <<< FAILURE! -- in org.apache.flink.runtime.util.profiler.ProfilingServiceTest
[ERROR] org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion -- Time elapsed: 9.264 s <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
    at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
    at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
    at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
    at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
    at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
    at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
    at org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
    at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
{code}
The number of extra files found varies from failure to failure.
*Cause:*
Many of the tests in *{{ProfilingServiceTest}}* rely on a specific
configuration of the *{{ProfilingService}}* instance, but
*{{ProfilingService.getInstance}}* does not check whether an existing
instance's config matches the provided config before returning it. Because of
this, and because JUnit does not guarantee a specific ordering of tests (unless
they are specifically annotated), it is possible for these tests to receive an
instance that does not behave in the expected way and therefore fail.
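To make the pattern concrete, here is a minimal, self-contained sketch of a singleton accessor that ignores the requested config once an instance exists. All names in it ({{SketchService}}, {{SingletonConfigSketch}}, the directory strings) are hypothetical stand-ins chosen for illustration; this is not the actual *{{ProfilingService}}* code, only the shape of the behaviour described above.

```java
import java.util.Objects;

// Hypothetical stand-in for a process-wide singleton service; not Flink's actual code.
final class SketchService implements AutoCloseable {
    private static SketchService instance;

    private final String resultDir;

    private SketchService(String resultDir) {
        this.resultDir = Objects.requireNonNull(resultDir);
    }

    // Returns the existing instance if there is one, without comparing
    // the requested directory against the one already configured.
    static synchronized SketchService getInstance(String resultDir) {
        if (instance == null) {
            instance = new SketchService(resultDir);
        }
        return instance;
    }

    String getResultDir() {
        return resultDir;
    }

    // Clears the singleton so that the next getInstance call builds a new one.
    @Override
    public void close() {
        synchronized (SketchService.class) {
            if (instance == this) {
                instance = null;
            }
        }
    }
}

class SingletonConfigSketch {
    public static void main(String[] args) {
        // First caller configures the service with a default-like directory.
        SketchService first = SketchService.getInstance("/tmp");
        // Second caller asks for its own directory but receives the first instance.
        SketchService second = SketchService.getInstance("/tmp/junit-test-dir");
        System.out.println(first == second);       // prints true
        System.out.println(second.getResultDir()); // prints /tmp
        first.close();
    }
}
```

Whichever caller runs first decides the config that every later caller sees, which is why the outcome depends on test ordering.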
*Analysis:*
While troubleshooting the test failure, we added an extra assertion to
*{{ProfilingServiceTest.setUp}}* to validate that the directory being written
to was the expected one:
{code:java}
Assertions.assertEquals(
        tempDir.toString(), profilingService.getProfilingResultDir());
{code}
That assertion produced the following failure:
{code:java}
org.opentest4j.AssertionFailedError: expected: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
{code}
This failure shows that the *{{ProfilingService}}* returned by
*{{ProfilingService.getInstance}}* in the setup is not using the correct
directory, and therefore cannot be the correct instance for this test class
because it has the wrong config.
This happens because the static method *{{ProfilingService.getInstance}}*
reuses any existing instance of *{{ProfilingService}}* before creating a new
one, and disregards any differences in config when doing so. If another test
instantiates a *{{ProfilingService}}* with a different config first and does
not close it, that earlier instance is handed to *{{ProfilingServiceTest}}*
instead of the new instance those tests seem to expect. This only affects the
first test run in this class, because the teardown method run after every test
explicitly closes the existing *{{ProfilingService}}* instance.
Specifically, in the test failures I have observed, it seems that
*{{ProfilingServiceTest.testRollingDeletion}}* fails if it is run _before_ any
other *{{ProfilingServiceTest}}* tests but _after_ the test methods in
*{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a
*{{TaskExecutor}}* via a *{{MiniCluster}}*). From what I've been able to
gather, *{{TaskExecutor}}* calls *{{ProfilingService.getInstance}}* with the
default config and holds on to that instance internally, but does not attempt
to close it when the *{{TaskExecutor}}* itself is closed. That instance is
therefore sometimes still around when *{{ProfilingServiceTest.setUp}}* runs and
is passed to *{{ProfilingServiceTest.testRollingDeletion}}*, which then fails
because it incorrectly assumes that it has a new *{{ProfilingService}}*
instance with a clean directory configured.
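The lifecycle described above can be sketched in isolation as follows. This is only an illustration under assumptions: {{SharedService}}, {{LeakyOwner}}, {{LifecycleSketch}} and the directory strings are hypothetical names, with {{LeakyOwner}} playing the role of a component that acquires the singleton and never closes it, and {{runTest}} mimicking a setUp / test body / tearDown cycle. It is not a reproduction of the Flink classes involved.

```java
// Hypothetical names throughout; a sketch of the lifecycle, not Flink's classes.
final class SharedService implements AutoCloseable {
    private static SharedService instance;

    final String resultDir;

    private SharedService(String resultDir) {
        this.resultDir = resultDir;
    }

    // Reuses any live instance and ignores the requested directory in that case.
    static synchronized SharedService getInstance(String resultDir) {
        if (instance == null) {
            instance = new SharedService(resultDir);
        }
        return instance;
    }

    // Clears the singleton so that the next caller gets a new instance.
    @Override
    public void close() {
        synchronized (SharedService.class) {
            if (instance == this) {
                instance = null;
            }
        }
    }
}

// Acquires the service with a default-like directory and never closes it.
final class LeakyOwner implements AutoCloseable {
    private final SharedService service = SharedService.getInstance("/tmp");

    @Override
    public void close() {
        // The service is deliberately left open here to model the leak.
    }
}

class LifecycleSketch {
    // Mimics one test: setUp acquires the service, the body reads the
    // directory it was given, and tearDown closes the service.
    static String runTest(String tempDir) {
        SharedService service = SharedService.getInstance(tempDir);
        try {
            return service.resultDir;
        } finally {
            service.close();
        }
    }

    public static void main(String[] args) {
        new LeakyOwner().close();                    // leaves an instance behind
        System.out.println(runTest("/tmp/junit-1")); // prints /tmp
        System.out.println(runTest("/tmp/junit-2")); // prints /tmp/junit-2
    }
}
```

Only the first simulated test sees the stale directory; its teardown closes the leaked instance, so every later test gets the config it asked for. That matches the observation that the failure depends on which test in the class happens to run first.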
Logs are attached, produced with the following command:
{code:bash}
mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl :flink-runtime
{code}