[ 
https://issues.apache.org/jira/browse/FLINK-35571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854602#comment-17854602
 ] 

Sam Barker commented on FLINK-35571:
------------------------------------

The simplest possible fix for this is to add 

{{ProfilingService.getInstance(new Configuration()).close();}}
to setUp in ProfilingServiceTest. To ensure that the test runs with the 
configuration it requests. However what I'm not clear on is wether that has any 
knock on implications for any other tests running in parallel. Another possible 
test only fix is to expose an additional factory method that the test can use 
to construct its own profiling service. 

However is fixing this in the test just papering over a larger issue? 

As [~gracegrimwood] highlights there is an instance of the ProfilingService 
being used in a TaskExecutor via MiniCluster but not closed by either of those. 
Which leads me to wonder if MiniCluster should be triggering the closing of the 
ProfilingService? Is the lifecycle of the MiniCluster instance sufficiently 
wide to say that that a profilingService instance shouldn't outlast it?  

Other possible fixes would be to stop the ProfilingService being a global 
singleton and have it track a Map of config to instance (I think that works 
fine with having a global singleton for the AsyncProfiler instance). Though 
looking at the production code I don't think there is any need for instances 
with distinct config. 

Silently ignoring the config passed into getInstance seems a bit a of a smell 
to me. I'd personally be inclined to throw and exception when asking for an 
instance that has different configuration to the current instance. However I 
don't have a good enough mental model of how this is used to determine if that 
is really appropriate.

> ProfilingServiceTest.testRollingDeletion intermittently fails due to improper 
> test isolation
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35571
>                 URL: https://issues.apache.org/jira/browse/FLINK-35571
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>         Environment: *Git revision:*
> {code:bash}
> $ git show
> commit b8d527166e095653ae3ff5c0431bf27297efe229 (HEAD -> master)
> {code}
> *Java info:*
> {code:bash}
> $ java -version
> openjdk version "17.0.11" 2024-04-16
> OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)
> {code}
> {code:bash}
> $ sdk current
> Using:
> java: 17.0.11-tem
> maven: 3.8.6
> scala: 2.12.19
> {code}
> *OS info:*
> {code:bash}
> $ uname -av
> Darwin MacBook-Pro 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 
> PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 arm64
> {code}
> *Hardware info:*
> {code:bash}
> $ sysctl -a | grep -e 'machdep\.cpu\.brand_string\:' -e 
> 'machdep\.cpu\.core_count\:' -e 'hw\.memsize\:'
> hw.memsize: 34359738368
> machdep.cpu.core_count: 12
> machdep.cpu.brand_string: Apple M2 Pro
> {code}
>            Reporter: Grace Grimwood
>            Priority: Major
>         Attachments: 20240612_181148_mvn-clean-package_flink-runtime.log
>
>
> *Symptom:*
> The test *{{ProfilingServiceTest.testRollingDeletion}}* fails with the 
> following error:
> {code:java}
> [ERROR] Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.32 
> s <<< FAILURE! -- in 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest
> [ERROR] 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion
>  -- Time elapsed: 9.264 s <<< FAILURE!
> org.opentest4j.AssertionFailedError: expected: <3> but was: <6>
>       at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>       at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>       at 
> org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
>       at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
>       at 
> org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
>       at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
>       at 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest.verifyRollingDeletionWorks(ProfilingServiceTest.java:175)
>       at 
> org.apache.flink.runtime.util.profiler.ProfilingServiceTest.testRollingDeletion(ProfilingServiceTest.java:117)
>       at java.base/java.lang.reflect.Method.invoke(Method.java:568)
>       at 
> java.base/java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:194)
>       at 
> java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
>       at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
>       at 
> java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
>       at 
> java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
> {code}
> The number of extra files found varies from failure to failure.
> *Cause:*
> Many of the tests in *{{ProfilingServiceTest}}* rely on a specific 
> configuration of the *{{ProfilingService}}* instance, but 
> *{{ProfilingService.getInstance}}* does not check whether an existing 
> instance's config matches the provided config before returning it. Because of 
> this, and because JUnit does not guarantee a specific ordering of tests 
> (unless they are specifically annotated), it is possible for these tests to 
> receive an instance that does not behave in the expected way and therefore 
> fail.
> *Analysis:*
> In troubleshooting the test failure, we tried adding an extra assertion to 
> *{{ProfilingServiceTest.setUp}}* to validate the directories being written to 
> were correct:
> {code:java}
> Assertions.assertEquals(tempDir.toString(), 
> profilingService.getProfilingResultDir());
> {code}
> That assert produced the following failure:
> {code:java}
> org.opentest4j.AssertionFailedError: expected: 
> </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/junit9871405123519368112> 
> but was: </var/folders/sh/5vx5kpkd5dn_pfdptn1s9rvc0000gn/T/>
> {code}
> This failure shows that the *{{ProfilingService}}* returned by 
> *{{ProfilingService.getInstance}}* in the setup is not using the correct 
> directory, and therefore cannot be the correct instance for this test class 
> because it has the wrong config.
> This is because the static method *{{ProfilingService.getInstance}}* attempts 
> to reuse any existing instance of *{{ProfilingService}}* before it creates a 
> new one and disregards any differences in config in doing so, which means 
> that if another test instantiates a *{{ProfilingService}}* with different 
> config first and does not close it, that previous instance will be provided 
> to *{{ProfilingServiceTest}}* rather than the new instance those tests seem 
> to expect. This only happens with the first test run in this class, as the 
> teardown method run after every test explicitly closes the existing 
> *{{ProfilingService}}* instance.
> Specifically in the case of the test failures I have observed, it seems that 
> if *{{ProfilingServiceTest.testRollingDeletion}}* is run _before_ any other 
> *{{ProfilingServiceTest}}* tests but _after_ the test methods in 
> *{{JobIntermediateDatasetReuseTest}}* (or any other tests that create a 
> *{{TaskExecutor}}* via a {*}{{MiniCluster}}{*}), it will fail. From what I've 
> been able to gather, *{{TaskExecutor}}* calls 
> *{{ProfilingService.getInstance}}* with default config, and holds on to that 
> instance internally but doesn't attempt to close that *{{ProfilingService}}* 
> instance when the *{{TaskExecutor}}* instance is itself closed. This means 
> that instance is sometimes still around when *{{ProfilingServiceTest.setUp}}* 
> is run, so it gets passed to *{{ProfilingServiceTest.testRollingDeletion}}* 
> at which point that test will fail as it incorrectly assumes that it has a 
> new *{{ProfilingService}}* instance with a clean directory configured.
> .
> Logs are attached, produced with the following command:
> {code:bash}
> mvn clean package -Denforcer.skip -Dcheckstyle.skip -Drat.skip=true -pl 
> :flink-runtime
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to