[ https://issues.apache.org/jira/browse/HDFS-16213?focusedWorklogId=646682&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-646682 ]
ASF GitHub Bot logged work on HDFS-16213:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Sep/21 07:29
            Start Date: 05/Sep/21 07:29
    Worklog Time Spent: 10m
      Work Description: virajjasani opened a new pull request #3386:
URL: https://github.com/apache/hadoop/pull/3386

### Description of PR
TestFsDatasetImpl#testDnRestartWithHardLink is flaky:
```
[ERROR] testDnRestartWithHardLink(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl)  Time elapsed: 7.768 s  <<< FAILURE!
java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl.testDnRestartWithHardLink(TestFsDatasetImpl.java:1344)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
```

### How was this patch tested?
Unit testing. The current flaky behaviour is easy to reproduce by running the test code twice as part of the same test.
The resolution is to disable the detection and deletion of duplicate finalized replicas by the BlockPoolSlice instance. When the Datanode comes up, the BPServiceActors handshake with the Namenode and try to initialize the block pool; in the process, they try to get the VolumeMap using the BlockPoolSlice instance. Reading replicas from the cache fails, so the thread adds Finalized and RBW replicas to the addReplicaThreadPool fork-join pool in order to build the map. This process also tries to identify any duplicate replicas.

For this particular test, the process can sometimes detect a duplicate replica on /data2 while processing the finalized replica of /data1. Hence, before we can confirm that newReplicaInfo.getBlockURI() exists, the finalized replica on /data2 might get deleted (a rare, flaky case). The thread that identifies and deletes the duplicate finalized replica is unlikely to be faster than the main thread, but the possibility cannot be ruled out. Hence, we disable adding Finalized and RBW replicas to addReplicaThreadPool in BlockPoolSlice here and re-enable it only after we confirm the existence of newReplicaInfo on "/data2" ARCHIVE storage.
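The ordering problem described above can be illustrated with a small, self-contained sketch. It is not the Hadoop code: `DuplicateReplicaSketch`, `buildVolumeMap`, `setScanEnabled` and the `blk_1` paths are hypothetical names chosen for illustration, the "disk" is an in-memory map, and the rule for which duplicate is deleted (the later path in sorted order) is a simplifying assumption. The unlucky interleaving is forced deterministically by waiting for the background scan before the check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the duplicate handling described above; not Hadoop code.
class DuplicateReplicaSketch {
    // Simulated disk contents: replica path -> block id.
    private final Map<String, Long> onDisk = new ConcurrentHashMap<>();
    // Hypothetical switch that models "disable adding replicas to the pool".
    private volatile boolean scanEnabled = true;

    void putReplica(String path, long blockId) { onDisk.put(path, blockId); }
    boolean exists(String path) { return onDisk.containsKey(path); }
    void setScanEnabled(boolean enabled) { scanEnabled = enabled; }

    // Builds blockId -> path. When a block id is seen a second time, that copy is
    // deleted (assumption: the first path in sorted order is the one kept).
    Map<Long, String> buildVolumeMap() {
        Map<Long, String> volumeMap = new HashMap<>();
        if (!scanEnabled) {
            return volumeMap; // scan disabled: nothing on disk is touched
        }
        for (String path : new TreeSet<>(onDisk.keySet())) {
            Long blockId = onDisk.get(path);
            if (blockId == null) {
                continue; // removed concurrently
            }
            if (volumeMap.putIfAbsent(blockId, path) != null) {
                onDisk.remove(path); // duplicate replica: delete it
            }
        }
        return volumeMap;
    }

    // Returns whether the /data2 replica is still present when the "test" checks it.
    static boolean replicaVisibleAtCheck(boolean disableScanFirst) {
        DuplicateReplicaSketch slice = new DuplicateReplicaSketch();
        slice.putReplica("/data1/blk_1", 1L);
        slice.putReplica("/data2/blk_1", 1L); // same block on the second volume
        if (disableScanFirst) {
            slice.setScanEnabled(false);
        }
        // Force the rare ordering: the background scan finishes before the check.
        CompletableFuture.runAsync(slice::buildVolumeMap).join();
        boolean visible = slice.exists("/data2/blk_1");
        slice.setScanEnabled(true); // re-enable only after the check
        return visible;
    }

    public static void main(String[] args) {
        System.out.println("scan enabled before check:  " + replicaVisibleAtCheck(false));
        System.out.println("scan disabled before check: " + replicaVisibleAtCheck(true));
    }
}
```

With the scan left enabled, the check can observe a missing replica, which corresponds to the failed `assertTrue` in the stack trace. With the scan disabled until after the check, the replica is always present when it is examined.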
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

            Worklog Id:     (was: 646682)
    Remaining Estimate: 0h
            Time Spent: 10m

> Flaky test TestFsDatasetImpl#testDnRestartWithHardLink
> ------------------------------------------------------
>
>                 Key: HDFS-16213
>                 URL: https://issues.apache.org/jira/browse/HDFS-16213
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Failure case:
> [here|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3359/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]
> {code:java}
> [ERROR] testDnRestartWithHardLink(org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl)  Time elapsed: 7.768 s  <<< FAILURE!
> java.lang.AssertionError
>   at org.junit.Assert.fail(Assert.java:87)
>   at org.junit.Assert.assertTrue(Assert.java:42)
>   at org.junit.Assert.assertTrue(Assert.java:53)
>   at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl.testDnRestartWithHardLink(TestFsDatasetImpl.java:1344)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>   at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:748)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org