subject:"\[jira\] \[Updated\] \(HDFS\-13179\) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently"

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-06-05 Thread Stephen O'Donnell (Jira)



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-13179:
-
Attachment: HDFS-13179-branch-3.0.003.patch

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, 
> HDFS-13179-branch-3.0.003.patch, HDFS-13179.001.patch, HDFS-13179.002.patch, 
> HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-31 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-13179:
---
Fix Version/s: 2.10.1

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
> Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, 
> HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-31 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-13179:
---
Fix Version/s: 3.2.2
   3.1.4
   3.0.4

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, 
> HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-28 Thread Ahmed Hussein (Jira)



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated HDFS-13179:
-
Attachment: HDFS-13179-branch-2.10.003.patch

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-13179-branch-2.10.003.patch, HDFS-13179.001.patch, 
> HDFS-13179.002.patch, HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-28 Thread Jira



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated HDFS-13179:
---
Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-13179.001.patch, HDFS-13179.002.patch, 
> HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-27 Thread Ahmed Hussein (Jira)



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated HDFS-13179:
-
Attachment: HDFS-13179.003.patch

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Attachments: HDFS-13179.001.patch, HDFS-13179.002.patch, 
> HDFS-13179.003.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-23 Thread Ahmed Hussein (Jira)



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated HDFS-13179:
-
Attachment: HDFS-13179.002.patch

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Attachments: HDFS-13179.001.patch, HDFS-13179.002.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-22 Thread Ahmed Hussein (Jira)



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated HDFS-13179:
-
Attachment: HDFS-13179.001.patch

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Attachments: HDFS-13179.001.patch, test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2020-01-22 Thread Ahmed Hussein (Jira)



 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Hussein updated HDFS-13179:
-
Status: Patch Available  (was: Open)

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Assignee: Ahmed Hussein
>Priority: Critical
> Attachments: test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2018-03-12 Thread Gabor Bota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Bota updated HDFS-13179:
--
Attachment: test runs.zip

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Priority: Critical
> Attachments: test runs.zip
>
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2018-02-22 Thread Gabor Bota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Bota updated HDFS-13179:
--
Component/s: fs

> TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails 
> intermittently
> --
>
> Key: HDFS-13179
> URL: https://issues.apache.org/jira/browse/HDFS-13179
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 3.0.0
>Reporter: Gabor Bota
>Priority: Critical
>
> The error caused by TimeoutException because the test is waiting to ensure 
> that the file is replicated to DISK storage but the replication can't be 
> finished to DISK during the 30s timeout in ensureFileReplicasOnStorageType(), 
> but the file is still on RAM_DISK - so there is no data loss.
> Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
> fixes the flakiness. 
> {code:java}
> try {
>   ensureFileReplicasOnStorageType(path1, DEFAULT);
> }catch (TimeoutException t){
>   LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
> RAM_DISK");
>   ensureFileReplicasOnStorageType(path1, RAM_DISK);
> }
>   }
> {code}
> Some thoughts:
> * Successful and failed tests run similar to the point when datanode 
> restarts. Restart line is the following in the log: LazyPersistTestCase - 
> Restarting the DataNode
> * There is a line which only occurs in the failed test: *addStoredBlock: 
> Redundant addStoredBlock request received for blk_1073741825_1001 on node 
> 127.0.0.1:49455 size 5242880*
> * This redundant request at BlockManager#addStoredBlock could be the main 
> reason for the test fail. Something wrong with the gen stamp? Corrupt 
> replicas? 
> =
> Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
> 1000 runs, 34 failures (3.4% fail)
> Failure rate analysis:
> TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%
> 33 failures caused by: {noformat}
> java.util.concurrent.TimeoutException: Timed out waiting for condition. 
> Thread diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 
> on 39589" 
> {noformat}
> 1 failure caused by: {noformat}
> java.net.BindException: Problem binding to [localhost:56729] 
> java.net.BindException: Address already in use; For more details see: 
> http://wiki.apache.org/hadoop/BindException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
>  Caused by: java.net.BindException: Address already in use at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
> {noformat}
> =
> Example stacktrace:
> {noformat}
> Timed out waiting for condition. Thread diagnostics:
> Timestamp: 2017-11-01 10:36:49,499
> "Thread-1" prio=5 tid=13 runnable
> java.lang.Thread.State: RUNNABLE
> at java.lang.Thread.dumpThreads(Native Method)
> at java.lang.Thread.getAllStackTraces(Thread.java:1610)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
> at 
> org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
> at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2018-02-22 Thread Gabor Bota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Bota updated HDFS-13179:
--
Description: 
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Some thoughts:
* Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
* There is a line which only occours in the failed test: *addStoredBlock: 
Redundant addStoredBlock request received for blk_1073741825_1001 on node 
127.0.0.1:49455 size 5242880*
* This redundant request at BlockManager#addStoredBlock could be the main 
reason for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by: {noformat}
java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread 
diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589" 
{noformat}

1 failure caused by: {noformat}
java.net.BindException: Problem binding to [localhost:56729] 
java.net.BindException: Address already in use; For more details see: 
http://wiki.apache.org/hadoop/BindException at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
 Caused by: java.net.BindException: Address already in use at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
{noformat}

=

Example stacktrace:
{noformat}
Timed out waiting for condition. Thread diagnostics:
Timestamp: 2017-11-01 10:36:49,499
"Thread-1" prio=5 tid=13 runnable
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.dumpThreads(Native Method)
at java.lang.Thread.getAllStackTraces(Thread.java:1610)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
{noformat}

  was:
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
There is a line in the failed test: *addStoredBlock: Redundant addStoredBlock 
request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
This redundant request at BlockManager#addStoredBlock could be the main reason 
for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2018-02-22 Thread Gabor Bota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Bota updated HDFS-13179:
--
Description: 
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Some thoughts:
* Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
* There is a line which only occurs in the failed test: *addStoredBlock: 
Redundant addStoredBlock request received for blk_1073741825_1001 on node 
127.0.0.1:49455 size 5242880*
* This redundant request at BlockManager#addStoredBlock could be the main 
reason for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by: {noformat}
java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread 
diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589" 
{noformat}

1 failure caused by: {noformat}
java.net.BindException: Problem binding to [localhost:56729] 
java.net.BindException: Address already in use; For more details see: 
http://wiki.apache.org/hadoop/BindException at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
 Caused by: java.net.BindException: Address already in use at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
{noformat}

=

Example stacktrace:
{noformat}
Timed out waiting for condition. Thread diagnostics:
Timestamp: 2017-11-01 10:36:49,499
"Thread-1" prio=5 tid=13 runnable
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.dumpThreads(Native Method)
at java.lang.Thread.getAllStackTraces(Thread.java:1610)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
{noformat}

  was:
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Some thoughts:
* Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
* There is a line which only occours in the failed test: *addStoredBlock: 
Redundant addStoredBlock request received for blk_1073741825_1001 on node 
127.0.0.1:49455 size 5242880*
* This redundant request at BlockManager#addStoredBlock could be the main 
reason for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2018-02-22 Thread Gabor Bota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Bota updated HDFS-13179:
--
Description: 
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
There is a line in the failed test: *addStoredBlock: Redundant addStoredBlock 
request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
This redundant request at BlockManager#addStoredBlock could be the main reason 
for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by: {noformat}
java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread 
diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589" 
{noformat}

1 failure caused by: {noformat}
java.net.BindException: Problem binding to [localhost:56729] 
java.net.BindException: Address already in use; For more details see: 
http://wiki.apache.org/hadoop/BindException at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
 Caused by: java.net.BindException: Address already in use at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
{noformat}

=

Example stacktrace:
{noformat}
Timed out waiting for condition. Thread diagnostics:
Timestamp: 2017-11-01 10:36:49,499
"Thread-1" prio=5 tid=13 runnable
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.dumpThreads(Native Method)
at java.lang.Thread.getAllStackTraces(Thread.java:1610)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
{noformat}

  was:
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
There is a line in the failed test: *addStoredBlock: Redundant addStoredBlock 
request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
This redundant request at BlockManager#addStoredBlock could be the main reason 
for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by:

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

2018-02-22 Thread Gabor Bota (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Bota updated HDFS-13179:
--
Description: 
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
There is a line in the failed test: *addStoredBlock: Redundant addStoredBlock 
request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
This redundant request at BlockManager#addStoredBlock could be the main reason 
for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by: {noformat}
java.util.concurrent.TimeoutException: Timed out waiting for condition. Thread 
diagnostics: Timestamp: 2018-01-05 11:50:34,964 "IPC Server handler 6 on 39589" 
{noformat}

1 failure caused by: {noformat}
java.net.BindException: Problem binding to [localhost:56729] 
java.net.BindException: Address already in use; For more details see: 
http://wiki.apache.org/hadoop/BindException at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
 Caused by: java.net.BindException: Address already in use at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:49)
{noformat}

Example stacktrace:
{noformat}
Timed out waiting for condition. Thread diagnostics:
Timestamp: 2017-11-01 10:36:49,499
"Thread-1" prio=5 tid=13 runnable
java.lang.Thread.State: RUNNABLE
at java.lang.Thread.dumpThreads(Native Method)
at java.lang.Thread.getAllStackTraces(Thread.java:1610)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDump(TimedOutTestsListener.java:87)
at 
org.apache.hadoop.test.TimedOutTestsListener.buildThreadDiagnosticString(TimedOutTestsListener.java:73)
at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:369)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.LazyPersistTestCase.ensureFileReplicasOnStorageType(LazyPersistTestCase.java:140)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas(TestLazyPersistReplicaRecovery.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
...
{noformat}

  was:
The error caused by TimeoutException because the test is waiting to ensure that 
the file is replicated to DISK storage but the replication can't be finished to 
DISK during the 30s timeout in ensureFileReplicasOnStorageType(), but the file 
is still on RAM_DISK - so there is no data loss.

Adding the following to TestLazyPersistReplicaRecovery.java:56 essentially 
fixes the flakiness. 
{code:java}
try {
  ensureFileReplicasOnStorageType(path1, DEFAULT);
}catch (TimeoutException t){
  LOG.warn("We got \"" + t.getMessage() + "\" so trying to find data on 
RAM_DISK");
  ensureFileReplicasOnStorageType(path1, RAM_DISK);
}
  }
{code}

Successful and failed tests run similar to the point when datanode restarts. 
Restart line is the following in the log: LazyPersistTestCase - Restarting the 
DataNode
There is a line in the failed test: addStoredBlock: *Redundant addStoredBlock 
request received for blk_1073741825_1001 on node 127.0.0.1:49455 size 5242880*
This redundant request at BlockManager#addStoredBlock could be the main reason 
for the test fail. Something wrong with the gen stamp? Corrupt replicas? 

=

Current fail ratio based on my test of TestLazyPersistReplicaRecovery: 
1000 runs, 34 failures (3.4% fail)

Failure rate analysis:
TestLazyPersistReplicaRecovery.testDnRestartWithSavedReplicas: 3.4%

33 failures caused by: {noformat}

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

[jira] [Updated] (HDFS-13179) TestLazyPersistReplicaRecovery#testDnRestartWithSavedReplicas fails intermittently

15 matches

Site Navigation

Mail list logo

Footer information