[ https://issues.apache.org/jira/browse/FLINK-17721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112726#comment-17112726 ]

Xintong Song commented on FLINK-17721:
--------------------------------------

The test failure is caused by {{cleanupDirectoryWithRetry}} not deleting the 
files if the deadline has already been reached.

Deleting files from a remote file system might fail due to, e.g., network 
problems. Therefore, {{cleanupDirectoryWithRetry}} keeps trying to delete the 
files until it succeeds or a deadline is reached. The deadline is defined in 
{{setup}}, before all test cases are executed. If executing the tests takes so 
long that the deadline is reached before {{cleanupDirectoryWithRetry}} is 
called, the files will never be deleted, leading to the reported failure.
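
To illustrate the problem, here is a minimal sketch of the failing pattern. The names and signatures are approximate, not the exact Flink code; the point is the shared deadline computed once in {{setup}}:

{code}
import static org.junit.Assert.assertFalse;

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import org.junit.BeforeClass;

import java.util.concurrent.TimeUnit;

public class SharedDeadlineSketch {

    // Shared deadline, computed once in setup() before any test case runs.
    private static long deadlineNanos;

    @BeforeClass
    public static void setup() {
        deadlineNanos = System.nanoTime() + TimeUnit.SECONDS.toNanos(30);
        // ... create the test directory on the remote FS ...
    }

    static void cleanupDirectoryWithRetry(FileSystem fs, Path path) throws Exception {
        // Retry deletion until the path is gone or the *shared* deadline passes.
        while (fs.exists(path) && System.nanoTime() < deadlineNanos) {
            fs.delete(path, true); // recursive delete; may not take effect immediately
            Thread.sleep(50L);
        }
        // If the tests have already consumed the whole deadline, the loop body
        // never runs, no deletion is attempted, and this assertion fails.
        assertFalse(fs.exists(path));
    }
}
{code}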

To fix the problem, we have several options.

# The easiest way is to make sure {{cleanupDirectoryWithRetry}} tries to delete 
the files at least once, no matter whether the deadline has been reached. The 
shortcoming is that we may still suffer from deletion failures due to network 
issues or asynchronism (which FLINK-13483 tries to solve). So practically, this 
test case might still be unstable, just less unstable than now.
# We can also set a separate timeout or retry limit for 
{{cleanupDirectoryWithRetry}}. This decouples the handling of each potential 
deletion failure from the overall test execution time.
# The cleanest solution, IMO, is to get rid of the shared deadline. Currently, 
the deadline is used in {{checkPathExistence}} and {{cleanupDirectoryWithRetry}} 
as a timeout for verifying the FS status. However, it does not make sense to me 
that all the checks in the class share a common deadline. We should have an 
independent timeout for each check (see the sketch after this list).
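
To make option 3 concrete, a minimal sketch (assumed names, not an actual patch) where each call owns its timeout instead of sharing one deadline from {{setup}}. Note that the do-while loop also guarantees at least one deletion attempt, which covers option 1 as well:

{code}
import static org.junit.Assert.fail;

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.time.Duration;

public class PerCheckTimeoutSketch {

    static void cleanupDirectoryWithRetry(FileSystem fs, Path path, Duration timeout)
            throws Exception {
        // Each call computes its own deadline from an independent timeout.
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        do {
            // At least one deletion attempt happens even if the timeout has
            // effectively elapsed already (this also covers option 1).
            fs.delete(path, true);
            if (!fs.exists(path)) {
                return;
            }
            Thread.sleep(50L);
        } while (System.nanoTime() < deadlineNanos);
        fail("Directory " + path + " still exists after " + timeout);
    }
}
{code}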

I'm in favor of option 3. My only concern is that this change might affect all 
subclasses of {{AbstractHadoopFileSystemITTest}}. Currently, Flink integrates 
file systems via its plugin mechanism, which means users may provide their own 
custom FS plugins. If a user has extended {{AbstractHadoopFileSystemITTest}} to 
test a custom plugin, our change might lead to compatibility problems. So the 
question is: *do we want to maintain backwards compatibility for the testing 
base class of the FS framework?* From my side, I would not consider 
{{AbstractHadoopFileSystemITTest}} a public interface, and thus would be ok with 
any modification to it. But I think it would be good to collect more opinions.

> AbstractHadoopFileSystemITTest .cleanupDirectoryWithRetry fails with 
> AssertionError 
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-17721
>                 URL: https://issues.apache.org/jira/browse/FLINK-17721
>             Project: Flink
>          Issue Type: Bug
>          Components: FileSystems, Tests
>            Reporter: Robert Metzger
>            Assignee: Xintong Song
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> CI: 
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1343&view=logs&j=961f8f81-6b52-53df-09f6-7291a2e4af6a&t=2f99feaa-7a9b-5916-4c1c-5e61f395079e
> {code}
> [ERROR] Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 34.079 s <<< FAILURE! - in 
> org.apache.flink.fs.s3hadoop.HadoopS3FileSystemITCase
> [ERROR] org.apache.flink.fs.s3hadoop.HadoopS3FileSystemITCase  Time elapsed: 
> 21.334 s  <<< FAILURE!
> java.lang.AssertionError
>       at org.junit.Assert.fail(Assert.java:86)
>       at org.junit.Assert.assertTrue(Assert.java:41)
>       at org.junit.Assert.assertFalse(Assert.java:64)
>       at org.junit.Assert.assertFalse(Assert.java:74)
>       at 
> org.apache.flink.runtime.fs.hdfs.AbstractHadoopFileSystemITTest.cleanupDirectoryWithRetry(AbstractHadoopFileSystemITTest.java:162)
>       at 
> org.apache.flink.runtime.fs.hdfs.AbstractHadoopFileSystemITTest.teardown(AbstractHadoopFileSystemITTest.java:149)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
>       at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>       at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>       at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}



