[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2017-03-09 Thread Siva Teja Patibandla (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903901#comment-15903901
 ] 

Siva Teja Patibandla commented on HDFS-4937:


Hi Kihwal, was the v3 patch tested? It seems the whole chooseRandom() function 
was rewritten in later releases, so the fix may not have gotten much test 
mileage, and I am not sure whether I should use it.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 2.8.0, 2.7.3, 3.0.0-alpha1
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v1.patch, 
> HDFS-4937.v3.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. 
> This is because the cached cluster size is used in the termination condition 
> check of the loop. This usually happens when a block with a high replication 
> factor is being processed. Since replicas/rack is also calculated beforehand, 
> no node choice may satisfy the goodness criteria if the refresh removed racks. 
> All nodes will end up in the excluded list, but the size will still be less 
> than the cached cluster size, so it will loop infinitely. This was observed 
> in a production environment.
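
The failure mode described above can be modeled with a small, self-contained sketch. This is not the Hadoop code and not the attached patch: the class name {{ChooseRandomSketch}}, the method {{runLoop}}, the node names, and the iteration cap are all hypothetical, introduced only for illustration. The sketch assumes every candidate fails the goodness check, so each pick only grows the excluded set. With a stale (cached) cluster size as the bound, the excluded set can never reach it once nodes have been removed; re-reading the live size lets the loop stop.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class ChooseRandomSketch {

    /**
     * Simulates a selection loop in which no candidate passes the goodness
     * check. Returns the number of iterations used, or -1 if the loop was
     * still running at maxIterations (a stand-in for "would spin forever").
     * All names here are hypothetical; this is a model, not Hadoop code.
     */
    public static int runLoop(List<String> liveNodes, int cachedClusterSize,
                              boolean refreshBound, int maxIterations) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int iterations = 0;
        while (true) {
            // Stale bound: the size captured before the refresh removed nodes.
            // Refreshed bound: the size re-read from the live node list.
            int bound = refreshBound ? liveNodes.size() : cachedClusterSize;
            if (excluded.size() >= bound) {
                return iterations; // every candidate has been tried
            }
            if (iterations >= maxIterations) {
                return -1; // demo-only guard; the modeled loop has no such cap
            }
            if (!liveNodes.isEmpty()) {
                String candidate = liveNodes.get(random.nextInt(liveNodes.size()));
                // The candidate is rejected (for example, its rack already
                // holds the maximum replicas), so it is only marked excluded.
                excluded.add(candidate);
            }
            iterations++;
        }
    }

    public static void main(String[] args) {
        // Three nodes remain after the refresh; ten were counted before it.
        List<String> live = Arrays.asList("dn1", "dn2", "dn3");
        int cached = 10;
        System.out.println("stale bound:     " + runLoop(live, cached, false, 10_000));
        System.out.println("refreshed bound: " + runLoop(live, cached, true, 10_000));
    }
}
```

In this model the stale-bound run only ends because of the demo guard, while the refreshed-bound run ends as soon as all remaining nodes are excluded.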



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991615#comment-14991615
 ] 

Yi Liu commented on HDFS-4937:
--

It's correct this time. The logic of the current patch is straightforward.

+1 for the {{v1}} patch, thanks Kihwal.



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991640#comment-14991640
 ] 

Brahma Reddy Battula commented on HDFS-4937:


Yes, the v1 patch LGTM, +1 (non-binding).



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991888#comment-14991888
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2512 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2512/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992009#comment-14992009
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #641 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/641/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992023#comment-14992023
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1365 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1365/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992312#comment-14992312
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #631 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/631/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992122#comment-14992122
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2571 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2571/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992412#comment-14992412
 ] 

Hudson commented on HDFS-4937:
--

ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #574 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/574/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991846#comment-14991846
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8760 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8760/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989859#comment-14989859
 ] 

Kihwal Lee commented on HDFS-4937:
--

First of all, the precommit build ran 4,075 test cases, so I think it ran all 
of them this time.

The test failures are not related to the patch. I've rerun the failed tests and 
only {{TestSeveralNameNodes}} was failing occasionally. It was timing out 
waiting for a thread to finish writing. This test has been failing in other 
precommit builds as well. When I increased the timeout, it passed 100% of the 
time. I will file a jira for this.

{panel}
---
 T E S T S
---
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.298 sec - in org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.295 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 157.484 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestLeaseRecovery2
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.445 sec - in org.apache.hadoop.hdfs.TestLeaseRecovery2
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 98.315 sec - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestCrcCorruption
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.387 sec - in org.apache.hadoop.hdfs.TestCrcCorruption
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.775 sec - in org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
{panel}



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-04 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989891#comment-14989891
 ] 

Kihwal Lee commented on HDFS-4937:
--

bq. I will file a jira for this.
HDFS-9376



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988819#comment-14988819
 ] 

Hadoop QA commented on HDFS-4937:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 32s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 18s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 22s {color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk cannot run convertXmlToText from findbugs {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 31s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 46s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 31s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 30s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 85m 6s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 78m 34s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 25s {color} | {color:red} Patch generated 56 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 189m 15s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160 |
|   | hadoop.hdfs.TestCrcCorruption |
|   | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes |
|   | hadoop.hdfs.TestLeaseRecovery2 |
|   | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
|   | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
|   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| JDK v1.7.0_79 Failed junit tests | hadoop.hdfs.server.namenode.ha.TestDNFencing |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160 |
|   | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
|   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.hdfs.TestEncryptionZones |
\\
\\
|| Subsystem || Report/Notes 

[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-03 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987384#comment-14987384
 ] 

Kihwal Lee commented on HDFS-4937:
--

The test failures are definitely related. When I run {{TestReplicationPolicy}}, 
different cases fail depending on test ordering. One failure might be affecting 
other cases.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch, 
> HDFS-4937.v3.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the 
> network topology is updated. If the refresh happens at the right moment, the 
> replication monitor thread may get stuck in the while loop of 
> {{chooseRandom()}}. This is because the cached cluster size is used in the 
> termination condition of the loop. This usually happens when a block with a 
> high replication factor is being processed. Since replicas/rack is also 
> calculated beforehand, no node choice may satisfy the goodness criteria if 
> the refresh removed racks. All nodes will end up in the excluded list, but 
> its size will still be less than the cached cluster size, so the loop never 
> terminates. This was observed in a production environment.
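
The failure mode in the description above can be reproduced in miniature. The sketch below is not the HDFS code: {{StaleCountLoopDemo}}, {{run}} and the integer node IDs are hypothetical stand-ins, no node ever passes the goodness check, and an iteration cap replaces the real hang so the program terminates.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Toy model of a selection loop that trusts a node count cached before the topology shrank. */
class StaleCountLoopDemo {

  /**
   * @return the number of loop iterations executed, or -1 if the safety cap
   *         was hit (the stand-in for "loops forever").
   */
  static int run(int nodesWhenCounted, int nodesAfterRefresh, int maxIterations) {
    Random random = new Random(42);
    Set<Integer> excludedNodes = new HashSet<>();

    // The count is taken once, before the loop, and never refreshed.
    int numOfAvailableNodes = nodesWhenCounted;
    // Assumption: the node-list refresh lands right after the count was cached.
    int liveNodes = nodesAfterRefresh;
    // One replica is still wanted; no node is "good" in this model.
    int numOfReplicas = 1;

    int iterations = 0;
    while (numOfReplicas > 0 && numOfAvailableNodes > 0) {
      if (iterations++ >= maxIterations) {
        return -1;
      }
      int chosenNode = random.nextInt(liveNodes); // only live nodes can be picked
      if (excludedNodes.add(chosenNode)) {        // was not in the excluded set
        numOfAvailableNodes--;
      }
    }
    return iterations;
  }

  public static void main(String[] args) {
    System.out.println("no refresh:    " + run(10, 10, 100_000));
    System.out.println("10 -> 4 nodes: " + run(10, 4, 100_000)); // prints -1
  }
}
```

With 10 nodes counted and only 4 left after the refresh, at most 4 nodes can ever enter the excluded set, so the cached counter bottoms out at 6 and the loop condition never becomes false.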



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985997#comment-14985997
 ] 

Hadoop QA commented on HDFS-4937:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 49s {color} | {color:red} hadoop-hdfs-project/hadoop-hdfs in trunk cannot run convertXmlToText from findbugs {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 45s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 4s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 46s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 50m 21s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 49m 46s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_79. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 20s {color} | {color:red} Patch generated 58 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 119m 34s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithNodeGroup |
|   | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |
|   | hadoop.hdfs.server.blockmanagement.TestBlockManager |
|   | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain |
|   | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
|   | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad |
| JDK v1.7.0_79 Failed junit tests | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithNodeGroup |
|   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
|   | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |
|   | hadoop.hdfs.TestDecommission |
|   | 

[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-11-02 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985507#comment-14985507
 ] 

Kihwal Lee commented on HDFS-4937:
--

So sorry about the spectacular 118 test failures! The patch should have 
refreshed the count with an empty exclude node set to obtain the correct count. 
It looks like a few of the previously failing test cases now pass with the 
change. Let's see if the precommit agrees.
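
One way to read the point about the exclude set, sketched with plain integers. This is an illustration and not the v3 patch: {{RefreshCountDemo}} and both method names are hypothetical, and it assumes that {{countNumOfAvailableNodes(scope, excludedNodes)}} returns the live nodes in scope minus the excluded ones.

```java
/** Compares two ways of refreshing the counter used by the loop's early-exit check. */
class RefreshCountDemo {

  /** Early-exit check when the refreshed count already omits the excluded nodes. */
  static boolean breaksWithExcludedSet(int liveNodes, int excluded) {
    int refreshCounter = liveNodes - excluded;
    return refreshCounter <= excluded;
  }

  /** Early-exit check when the count is refreshed with an empty exclude set. */
  static boolean breaksWithEmptyExcludeSet(int liveNodes, int excluded) {
    int refreshCounter = liveNodes;
    return refreshCounter <= excluded;
  }

  public static void main(String[] args) {
    // Six live nodes, three of them already tried.
    System.out.println(breaksWithExcludedSet(6, 3));     // prints true
    System.out.println(breaksWithEmptyExcludeSet(6, 3)); // prints false
  }
}
```

Under that assumption the first variant gives up once half of the live nodes have been tried, while the second gives up only when every live node is in the excluded set.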

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch, 
> HDFS-4937.v3.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983864#comment-14983864
 ] 

Brahma Reddy Battula commented on HDFS-4937:


I don't know if I can give a -1. But shall we revert this? A lot of tests are 
broken because of it.

{code}
662 int refreshCounter = numOfAvailableNodes;
...
671 while(numOfReplicas > 0 && numOfAvailableNodes > 0) {
672   DatanodeDescriptor chosenNode = chooseDataNode(scope);
673   if (excludedNodes.add(chosenNode)) { //was not in the excluded list
674 if (LOG.isDebugEnabled()) {
675   builder.append("\nNode ").append(NodeBase.getPath(chosenNode)).append(" [");
676 }
677 numOfAvailableNodes--;
678 DatanodeStorageInfo storage = null;
679 if (isGoodDatanode(chosenNode, maxNodesPerRack, considerLoad,
...
711   }
712   // Refresh the node count. If the live node count became smaller,
713   // but it is not reflected in this loop, it may loop forever in case
714   // the replicas/rack cannot be satisfied.
715   if (--refreshCounter == 0) {
716 refreshCounter = clusterMap.countNumOfAvailableNodes(scope,
717 excludedNodes);
718 // It has already gone through enough number of nodes.
719 if (refreshCounter <= excludedNodes.size()) {
720   break;
721 }
722   }
723 }
{code}

On line 672, {{chooseDataNode(scope)}} picks a node at random. If 
{{chosenNode}} is already in the excluded set, execution skips the block that 
starts at line 674, but {{refreshCounter}} is still decremented at line 715.
If we are unlucky and {{chooseDataNode(scope)}} returns an already excluded 
node too many times, we reach line 716 and break at line 720.
We then end up choosing fewer than {{numOfReplicas}} nodes, even though enough 
suitable nodes were available.
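
The sequence described above can be replayed deterministically. The sketch below is a simplified model of the quoted loop, not the HDFS code: {{PrematureBreakDemo}} and {{place}} are hypothetical names, the random choice is replaced by a scripted array of picks, every node counts as good, and the recount is modeled as live nodes minus excluded nodes.

```java
import java.util.HashSet;
import java.util.Set;

/** Replays the quoted loop with a scripted sequence of "random" picks. */
class PrematureBreakDemo {

  /** @return how many replicas were placed before the loop exited. */
  static int place(int liveNodes, int numOfReplicas, int[] scriptedPicks) {
    Set<Integer> excludedNodes = new HashSet<>();
    int numOfAvailableNodes = liveNodes;
    int refreshCounter = numOfAvailableNodes;
    int placed = 0;
    int next = 0;

    while (numOfReplicas > 0 && numOfAvailableNodes > 0
        && next < scriptedPicks.length) {
      int chosenNode = scriptedPicks[next++]; // stands in for chooseDataNode(scope)
      if (excludedNodes.add(chosenNode)) {    // was not in the excluded set
        numOfAvailableNodes--;
        numOfReplicas--;                      // every node is "good" in this model
        placed++;
      }
      // Decremented on every iteration, including repeated picks.
      if (--refreshCounter == 0) {
        // Assumed model of countNumOfAvailableNodes(scope, excludedNodes).
        refreshCounter = liveNodes - excludedNodes.size();
        if (refreshCounter <= excludedNodes.size()) {
          break;
        }
      }
    }
    return placed;
  }

  public static void main(String[] args) {
    // Three good nodes, three replicas wanted, node 0 is picked twice.
    System.out.println(place(3, 3, new int[] {0, 0, 1, 2})); // prints 2
  }
}
```

With picks 0, 0, 1 the counter reaches zero after the third iteration, the recount yields 1 against an excluded set of size 2, and the loop breaks with node 2 never tried.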

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983924#comment-14983924
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #612 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/612/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983923#comment-14983923
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #624 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/624/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983954#comment-14983954
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1347 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1347/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983973#comment-14983973
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2554 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2554/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983986#comment-14983986
 ] 

Hudson commented on HDFS-4937:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #560 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/560/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983905#comment-14983905
 ] 

Yi Liu commented on HDFS-4937:
--

Reverted from trunk, branch-2, and branch-2.7. Thanks, Brahma.

I thought the tests had passed, but the Jenkins report did not actually 
include the test results.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983907#comment-14983907
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8738 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8738/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983906#comment-14983906
 ] 

Yi Liu commented on HDFS-4937:
--

I did consider the situation you mentioned, but I thought that in a real 
environment the NN could find other racks/DNs once it had gone through enough 
nodes. I missed the fact that many tests contain only a few available DNs, so 
{{refreshCounter <= excludedNodes.size()}} will be true. This can also happen 
in a real environment if the total number of DNs is small. The patch is 
therefore not correct for these cases, so I have reverted it.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-31 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983996#comment-14983996
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2497 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2497/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 
7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Vinayakumar B (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983824#comment-14983824
 ] 

Vinayakumar B commented on HDFS-4937:
-

Hi [~aw], any idea why the tests did not run in the last precommit for the 
patch here (see the 
[above comment|https://issues.apache.org/jira/browse/HDFS-4937?focusedCommentId=14981649=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981649])?

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982643#comment-14982643
 ] 

Kihwal Lee commented on HDFS-4937:
--

Also committed to branch-2.7.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.8.0
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982633#comment-14982633
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-trunk-Commit #8730 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/8730/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982676#comment-14982676
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2549 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2549/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java


> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982818#comment-14982818
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #619 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/619/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982791#comment-14982791
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2493 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2493/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983125#comment-14983125
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #556 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/556/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982867#comment-14982867
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #1342 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/1342/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982973#comment-14982973
 ] 

Hudson commented on HDFS-4937:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #607 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/607/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 
43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-29 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981132#comment-14981132
 ] 

Kihwal Lee commented on HDFS-4937:
--

The failed test cases pass when run locally. 
{noformat}
---
 T E S T S
---
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 146.108 sec - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.379 sec - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.582 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 104.369 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints

Results :

Tests run: 54, Failures: 0, Errors: 0, Skipped: 0
{noformat}
Also, there is actually no new findbugs issue.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981649#comment-14981649
 ] 

Hadoop QA commented on HDFS-4937:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 6s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 0s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 21s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 4s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 58s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 0s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 0s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 0s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 34s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 34s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 31s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green} 0m 8s {color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 0s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 0s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 13s {color} | {color:red} Patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 45s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-10-30 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12769642/HDFS-4937.v2.patch |
| JIRA Issue | HDFS-4937 |
| Optional Tests |  asflicense  shellcheck  javac  javadoc  mvninstall  unit  findbugs  checkstyle  compile  |
| uname | Linux 933890a322fa 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/patchprocess/apache-yetus-e77b1ce/precommit/personality/hadoop.sh |
| git revision | trunk / e5b1733 |
| Default Java | 1.7.0_79 |
| Multi-JDK versions |  /usr/lib/jvm/java-8-oracle:1.8.0_60 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_79 |
| shellcheck | v0.4.1 |
| JDK v1.7.0_79  Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13285/testReport/ |
| asflicense | https://builds.apache.org/job/PreCommit-HDFS-Build/13285/artifact/patchprocess/patch-asflicense-problems.txt |
| 

[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981789#comment-14981789
 ] 

Yi Liu commented on HDFS-4937:
--

+1, thanks Kihwal.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975240#comment-14975240
 ] 

Hadoop QA commented on HDFS-4937:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |  16m 36s | Findbugs (version ) appears to be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any new or modified tests.  Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac |   8m  2s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |  10m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 37s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install |   1m 39s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 36s | The patch built with eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   2m 35s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native |   3m 13s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests |  50m 48s | Tests failed in hadoop-hdfs. |
| | |  95m 11s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
|   | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints |
|   | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
|   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12768765/HDFS-4937.v1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 2f1eb2b |
| Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/console |


This message was automatically generated.



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969656#comment-14969656
 ] 

Hadoop QA commented on HDFS-4937:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / aea26bf |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13130/console |


This message was automatically generated.

> ReplicationMonitor can infinite-loop in 
> BlockPlacementPolicyDefault#chooseRandom()
> --
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.4-alpha, 0.23.8
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: BB2015-05-TBR
> Attachments: HDFS-4937.patch


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-05-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524647#comment-14524647
 ] 

Hadoop QA commented on HDFS-4937:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/10547/console |


This message was automatically generated.

 ReplicationMonitor can infinite-loop in 
 BlockPlacementPolicyDefault#chooseRandom()
 --

 Key: HDFS-4937
 URL: https://issues.apache.org/jira/browse/HDFS-4937
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.0.4-alpha, 0.23.8
Reporter: Kihwal Lee
Assignee: Kihwal Lee
 Attachments: HDFS-4937.patch




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2015-05-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524682#comment-14524682
 ] 

Hadoop QA commented on HDFS-4937:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / f1a152c |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/10556/console |


This message was automatically generated.



[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2013-08-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726787#comment-13726787
 ] 

Hadoop QA commented on HDFS-4937:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4754//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4754//console

This message is automatically generated.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2013-07-01 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696797#comment-13696797
 ] 

Uma Maheswara Rao G commented on HDFS-4937:
---

Hi Kihwal, you said in your comment that the operator added a large number of 
new nodes, right? Even then it was not able to choose at least from them?




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2013-07-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696922#comment-13696922
 ] 

Kihwal Lee commented on HDFS-4937:
--

bq. Even then it was not able to choose at least from them?

It couldn't pick enough nodes because the max replicas/rack was already 
calculated. I think it worked fine for the majority of blocks with 3 replicas, 
since the cluster had more than 3 racks even after the refresh. The issue was 
with blocks with many more replicas. But picking enough nodes is just one 
condition. The other checks for the exhaustion of candidate nodes. It would 
have bailed out of the while loop if the cached cluster size had been updated 
inside the loop.

To avoid frequent cluster-size refreshes for this rare condition, we can make it 
update the cached value after {{dfs.replication.max}} iterations, within which 
most blocks should find all the nodes they need. If the NN hits this issue, it 
will loop {{dfs.replication.max}} times and break out. I prefer this over adding 
locking, which would slow down the normal cases.
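
The proposal above could look roughly like the following toy model. This is a sketch under stated assumptions, not the attached patch: the class and parameter names are hypothetical, {{liveNodes.size()}} stands in for re-reading the cluster size from the network topology, {{maxReplication}} stands in for {{dfs.replication.max}}, and every pick is assumed to fail the goodness check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class BoundedRefreshDemo {
    /**
     * Toy model (hypothetical, not the HDFS source): the cached cluster size
     * is re-read once every maxReplication iterations instead of on every
     * pass. Returns the number of iterations the loop ran.
     */
    public static int chooseRandom(List<String> liveNodes, int staleClusterSize,
                                   int replicasNeeded, int maxReplication) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int cachedClusterSize = staleClusterSize;
        int iterations = 0;
        while (replicasNeeded > 0 && excluded.size() < cachedClusterSize) {
            iterations++;
            if (iterations % maxReplication == 0) {
                // Periodic refresh: the live list stands in for the topology.
                cachedClusterSize = liveNodes.size();
            }
            String candidate = liveNodes.get(random.nextInt(liveNodes.size()));
            // Assumption: the candidate fails the goodness check and is
            // only added to the excluded set.
            excluded.add(candidate);
        }
        return iterations;
    }

    public static void main(String[] args) {
        List<String> liveNodes = Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4");
        // The cached size (20) predates the removal of 15 nodes; the refresh
        // corrects it so the exhaustion check can end the loop.
        System.out.println("iterations: " + chooseRandom(liveNodes, 20, 10, 512));
    }
}
```

The trade-off in this sketch matches the comment: the normal path pays nothing extra until the refresh interval is reached, and a stuck loop is bounded by roughly that interval rather than running forever.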




[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()

2013-06-25 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693424#comment-13693424
 ] 

Kihwal Lee commented on HDFS-4937:
--

This can mostly be avoided by decommissioning nodes in smaller batches, which 
is the recommended practice.  But in this particular case, the operator added 
a large number of new nodes and decommissioned old nodes.
