[ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413600#comment-16413600
 ] 

liaoyuxiangqin commented on HDFS-13179:
---------------------------------------

[~gabor.bota] [~bharatviswa] [~elgoiri]. I think the failed reason is the 
correct replica recovered from lazypersist was replaced by incorrect replica 
read from cache when restart datanode. The comparative analysis  the successful 
and the failed test logs  as follows:
{quote}success:
 15:05:43,550 INFO BlockPoolSlice - Successfully read replica from cache file : 
/tmp/run_tha_testU50PcZ/target/test/data/dfs/data/data1/current/BP-1756767659-172.17.0.1-1517497537101/current/replicas
 15:05:43,568 INFO FsDatasetImpl - Recovered 1 replicas from 
/tmp/run_tha_testU50PcZ/target/test/data/dfs/data/data2/current/BP-1756767659-172.17.0.1-1517497537101/current/lazypersist
{quote}
{quote}failed:
 15:07:17,309 INFO FsDatasetImpl - Recovered 1 replicas from 
/tmp/run_tha_testvb5u64/target/test/data/dfs/data/data2/current/BP-381858266-172.17.0.1-1517497631100/current/lazypersist
 15:07:17,310 INFO FsDatasetImpl - Time to add replicas to map for block pool 
BP-381858266-172.17.0.1-1517497631100 on volume 
/tmp/run_tha_testvb5u64/target/test/data/dfs/data/data2: 2ms
 15:07:17,311 INFO BlockPoolSlice - Successfully read replica from cache file : 
/tmp/run_tha_testvb5u64/target/test/data/dfs/data/data1/current/BP-381858266-172.17.0.1-1517497631100/current/replicas
{quote}
>From above test logs we can find that the read replica from cache file execute 
>in the front of recover replaca from lazypersist when success, but the execute 
>sequence is in opposite when failed.The reason lead to this issue is block 
>pool getAllVolumesMap() async by dispatch thread per volume when datanode 
>restart, and the thread process speed may different in every time, so cannot 
>guarantee which vloume finish first.

At last I have a question about What is the effect of cache file? And I think 
the transient Storage not need to write replica to cache file when shutdown 
volume.
 Thinks!

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13179
>                 URL: https://issues.apache.org/jira/browse/HDFS-13179
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.0.0
>            Reporter: Gabor Bota
>            Priority: Critical
>         Attachments: test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
>     try {
>       ensureFileReplicasOnStorageType(path1, DEFAULT);
>     }catch (TimeoutException t){
>       LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>       ensureFileReplicasOnStorageType(path1, RAM_DISK);
>     }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =============================
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =============================
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to