[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903901#comment-15903901 ]

Siva Teja Patibandla commented on HDFS-4937:

Hi Kihwal, was the v3 patch tested? It seems the whole chooseRandom() function was rewritten in later releases, so the fix may not have gotten much test mileage, and I am not sure whether I should use it or not.

> ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
>
> Key: HDFS-4937
> URL: https://issues.apache.org/jira/browse/HDFS-4937
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.4-alpha, 0.23.8
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Labels: BB2015-05-TBR
> Fix For: 2.8.0, 2.7.3, 3.0.0-alpha1
> Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v1.patch, HDFS-4937.v3.patch
>
> When a large number of nodes are removed by refreshing node lists, the
> network topology is updated. If the refresh happens at the right moment, the
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}.
> This is because the cached cluster size is used in the termination condition
> check of the loop. This usually happens when a block with a high replication
> factor is being processed. Since replicas/rack is also calculated beforehand,
> no node choice may satisfy the goodness criteria if the refresh removed racks.
> All nodes will end up in the excluded list, but its size will still be less
> than the cached cluster size, so it will loop infinitely. This was observed
> in a production environment.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
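The failure mode in the issue description quoted above can be illustrated with a small, self-contained Java sketch. It is not the Hadoop code and not the committed patch: the class ChooseRandomSketch, its nested Topology class, and both methods are hypothetical names made up for illustration, and the "goodness" check is reduced to "every candidate is rejected and excluded". The first method tests the excluded-set size against a cluster size that was cached before the refresh; the second tracks how many live candidates remain. An iteration cap stands in for "loops forever" so the demo can finish.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only; these names are hypothetical and are not Hadoop classes.
class ChooseRandomSketch {

    /** Hypothetical stand-in for the cluster topology. */
    static final class Topology {
        private final List<String> nodes = new ArrayList<>();

        Topology(int nodeCount) {
            for (int i = 0; i < nodeCount; i++) {
                nodes.add("node-" + i);
            }
        }

        int size() {
            return nodes.size();
        }

        /** Simulates a node-list refresh that removes nodes from the topology. */
        void removeLast(int count) {
            for (int i = 0; i < count && !nodes.isEmpty(); i++) {
                nodes.remove(nodes.size() - 1);
            }
        }

        /** Picks a random live node; the topology must not be empty. */
        String pick(Random random) {
            return nodes.get(random.nextInt(nodes.size()));
        }
    }

    /**
     * Loop whose exit test uses a cluster size cached before the refresh.
     * Every candidate is rejected and excluded. Returns false when the
     * iteration cap is hit, which stands in for an infinite loop.
     */
    static boolean terminatesWithCachedSize(int cachedSize, Topology topo, int maxIterations) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int iterations = 0;
        while (excluded.size() < cachedSize) {
            if (iterations++ >= maxIterations) {
                return false; // excluded can never grow past the live node count
            }
            excluded.add(topo.pick(random));
        }
        return true;
    }

    /**
     * Same loop, but the exit test counts live candidates that are not yet
     * excluded, so it ends once every live node has been rejected.
     */
    static boolean terminatesWithLiveCount(Topology topo, int maxIterations) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int available = topo.size();
        int iterations = 0;
        while (available > 0) {
            if (iterations++ >= maxIterations) {
                return false;
            }
            if (excluded.add(topo.pick(random))) {
                available--; // a newly excluded node is one fewer candidate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Topology topo = new Topology(10);
        int cachedSize = topo.size(); // cached before the refresh
        topo.removeLast(6);           // refresh removes most of the nodes
        System.out.println("cached size terminates: "
            + terminatesWithCachedSize(cachedSize, topo, 100_000));
        System.out.println("live count terminates: "
            + terminatesWithLiveCount(topo, 100_000));
    }
}
```

The sketch only shows why a stale size in the exit test cannot be reached once nodes disappear; how the actual patch bounds the loop should be checked against the committed change to BlockPlacementPolicyDefault.java.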
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991615#comment-14991615 ]

Yi Liu commented on HDFS-4937:

This time it's correct. The logic of the current patch is straightforward. +1 for the {{v1}} patch, thanks Kihwal.
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991640#comment-14991640 ]

Brahma Reddy Battula commented on HDFS-4937:

Yes, the v1 patch LGTM, +1 (non-binding).
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991888#comment-14991888 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Hdfs-trunk #2512 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2512/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992009#comment-14992009 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #641 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/641/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992023#comment-14992023 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Yarn-trunk #1365 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1365/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992312#comment-14992312 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #631 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/631/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992122#comment-14992122 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2571 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2571/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992412#comment-14992412 ]

Hudson commented on HDFS-4937:

ABORTED: Integrated in Hadoop-Hdfs-trunk-Java8 #574 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/574/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991846#comment-14991846 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-trunk-Commit #8760 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8760/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev ff47f35deed14ba6463cba76f0e6a6c15abb3eca)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989859#comment-14989859 ]

Kihwal Lee commented on HDFS-4937:

First of all, the precommit build ran 4,075 test cases, so I think it ran all of them this time. The test failures are not related to the patch. I've rerun the failed tests and only {{TestSeveralNameNodes}} was failing occasionally. It was timing out waiting for a thread to finish writing. This test has been failing in other precommit builds as well. When I increased the timeout, it passed 100% of the time. I will file a jira for this.

{panel}
--- T E S T S ---
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.298 sec - in org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.295 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 157.484 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestLeaseRecovery2
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.445 sec - in org.apache.hadoop.hdfs.TestLeaseRecovery2
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 98.315 sec - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestCrcCorruption
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.387 sec - in org.apache.hadoop.hdfs.TestCrcCorruption
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.775 sec - in org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
{panel}
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989891#comment-14989891 ]

Kihwal Lee commented on HDFS-4937:

bq. I will file a jira for this.

HDFS-9376
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14988819#comment-14988819 ]

Hadoop QA commented on HDFS-4937:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 3m 32s | trunk passed |
| +1 | compile | 0m 46s | trunk passed with JDK v1.8.0_60 |
| +1 | compile | 0m 40s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 20s | trunk passed |
| +1 | mvneclipse | 0m 18s | trunk passed |
| -1 | findbugs | 2m 22s | hadoop-hdfs-project/hadoop-hdfs in trunk cannot run convertXmlToText from findbugs |
| +1 | javadoc | 1m 32s | trunk passed with JDK v1.8.0_60 |
| +1 | javadoc | 2m 31s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 0m 46s | the patch passed with JDK v1.8.0_60 |
| +1 | javac | 0m 46s | the patch passed |
| +1 | compile | 0m 44s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 0m 44s | the patch passed |
| +1 | checkstyle | 0m 19s | the patch passed |
| +1 | mvneclipse | 0m 18s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 2m 30s | the patch passed |
| +1 | javadoc | 1m 31s | the patch passed with JDK v1.8.0_60 |
| +1 | javadoc | 2m 30s | the patch passed with JDK v1.7.0_79 |
| -1 | unit | 85m 6s | hadoop-hdfs in the patch failed with JDK v1.8.0_60. |
| -1 | unit | 78m 34s | hadoop-hdfs in the patch failed with JDK v1.7.0_79. |
| -1 | asflicense | 0m 25s | Patch generated 56 ASF License warnings. |
| | | 189m 15s | |

|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160 |
| | hadoop.hdfs.TestCrcCorruption |
| | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes |
| | hadoop.hdfs.TestLeaseRecovery2 |
| | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
| | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| JDK v1.7.0_79 Failed junit tests | hadoop.hdfs.server.namenode.ha.TestDNFencing |
| | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160 |
| | hadoop.hdfs.security.TestDelegationTokenForProxyUser |
| | hadoop.hdfs.server.datanode.TestDirectoryScanner |
| | hadoop.hdfs.TestEncryptionZones |

|| Subsystem || Report/Notes
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987384#comment-14987384 ]

Kihwal Lee commented on HDFS-4937:

The test failures are definitely related. When I run {{TestReplicationPolicy}}, different cases fail depending on test ordering. One failure might be affecting other cases.
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985997#comment-14985997 ]

Hadoop QA commented on HDFS-4937:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 6s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 3m 10s | trunk passed |
| +1 | compile | 0m 38s | trunk passed with JDK v1.8.0_60 |
| +1 | compile | 0m 31s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 16s | trunk passed |
| +1 | mvneclipse | 0m 14s | trunk passed |
| -1 | findbugs | 1m 49s | hadoop-hdfs-project/hadoop-hdfs in trunk cannot run convertXmlToText from findbugs |
| +1 | javadoc | 1m 8s | trunk passed with JDK v1.8.0_60 |
| +1 | javadoc | 1m 45s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 0m 37s | the patch passed |
| +1 | compile | 0m 32s | the patch passed with JDK v1.8.0_60 |
| +1 | javac | 0m 32s | the patch passed |
| +1 | compile | 0m 30s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 0m 30s | the patch passed |
| +1 | checkstyle | 0m 15s | the patch passed |
| +1 | mvneclipse | 0m 13s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 2m 0s | the patch passed |
| +1 | javadoc | 1m 4s | the patch passed with JDK v1.8.0_60 |
| +1 | javadoc | 1m 46s | the patch passed with JDK v1.7.0_79 |
| -1 | unit | 50m 21s | hadoop-hdfs in the patch failed with JDK v1.8.0_60. |
| -1 | unit | 49m 46s | hadoop-hdfs in the patch failed with JDK v1.7.0_79. |
| -1 | asflicense | 0m 20s | Patch generated 58 ASF License warnings. |
| | | 119m 34s | |

|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithNodeGroup |
| | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |
| | hadoop.hdfs.server.blockmanagement.TestBlockManager |
| | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain |
| | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes |
| | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyConsiderLoad |
| JDK v1.7.0_79 Failed junit tests | hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithNodeGroup |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
| | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy |
| | hadoop.hdfs.TestDecommission |
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985507#comment-14985507 ]

Kihwal Lee commented on HDFS-4937:

So sorry about the spectacular 118 test failures! The patch should have refreshed the count with an empty exclude node set to obtain the correct count. It looks like a few of the failed test cases pass with the change. Let's see if the precommit agrees.

> ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
>
>                 Key: HDFS-4937
>                 URL: https://issues.apache.org/jira/browse/HDFS-4937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>              Labels: BB2015-05-TBR
>             Fix For: 3.0.0, 2.7.2
>
>         Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch, HDFS-4937.v3.patch
>
> When a large number of nodes are removed by refreshing node lists, the
> network topology is updated. If the refresh happens at the right moment, the
> replication monitor thread may get stuck in the while loop of {{chooseRandom()}}.
> This is because the cached cluster size is used in the terminal condition
> check of the loop. This usually happens when a block with a high replication
> factor is being processed. Since replicas/rack is also calculated beforehand,
> no node choice may satisfy the goodness criteria if refreshing removed racks.
> All nodes will end up in the excluded list, but the size will still be less
> than the cached cluster size, so it will loop infinitely. This was observed
> in a production environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
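The failure mode in the issue description can be reproduced in isolation. The following is a minimal sketch, not Hadoop code: the class {{StaleCountSketch}}, its methods, and the iteration cap are all hypothetical, and no node is ever treated as "good", so the only way out of each loop is the size check. It contrasts a loop bounded by a stale cached cluster size with one that recounts the live, non-excluded nodes on every pass.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class StaleCountSketch {

  /** Counts live nodes that are not yet in the excluded set. */
  static int countAvailable(List<String> liveNodes, Set<String> excluded) {
    int n = 0;
    for (String node : liveNodes) {
      if (!excluded.contains(node)) {
        n++;
      }
    }
    return n;
  }

  /**
   * Loop bounded by a cluster size cached before the topology shrank.
   * Returns the iteration count, or -1 if the cap was hit (a stand-in
   * for "loops forever").
   */
  static int pickWithCachedCount(List<String> liveNodes, int cachedClusterSize,
                                 int maxIterations, Random rnd) {
    Set<String> excluded = new HashSet<>();
    int iterations = 0;
    while (excluded.size() < cachedClusterSize) {
      if (liveNodes.isEmpty() || iterations >= maxIterations) {
        return -1;
      }
      iterations++;
      // Every pick is rejected in this sketch, so it only grows the excluded set.
      excluded.add(liveNodes.get(rnd.nextInt(liveNodes.size())));
    }
    return iterations;
  }

  /** Same loop, but the exit check recounts the nodes that are still available. */
  static int pickWithRefreshedCount(List<String> liveNodes, int maxIterations,
                                    Random rnd) {
    Set<String> excluded = new HashSet<>();
    int iterations = 0;
    while (countAvailable(liveNodes, excluded) > 0) {
      if (iterations >= maxIterations) {
        return -1;
      }
      iterations++;
      excluded.add(liveNodes.get(rnd.nextInt(liveNodes.size())));
    }
    return iterations;
  }
}
```

With three live nodes and a cached size of 10, the excluded set can never grow past three entries, so {{pickWithCachedCount}} can only leave through the cap, whereas {{pickWithRefreshedCount}} stops once all three nodes have been excluded.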
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983864#comment-14983864 ]

Brahma Reddy Battula commented on HDFS-4937:

I don't know if I can give a -1. But shall we revert this? A lot of tests are broken because of it.
{code}
662 int refreshCounter = numOfAvailableNodes;
...
671 while(numOfReplicas > 0 && numOfAvailableNodes > 0) {
672   DatanodeDescriptor chosenNode = chooseDataNode(scope);
673   if (excludedNodes.add(chosenNode)) { //was not in the excluded list
674     if (LOG.isDebugEnabled()) {
675       builder.append("\nNode ").append(NodeBase.getPath(chosenNode)).append(" [");
676     }
677     numOfAvailableNodes--;
678     DatanodeStorageInfo storage = null;
679     if (isGoodDatanode(chosenNode, maxNodesPerRack, considerLoad,
...
711   }
712   // Refresh the node count. If the live node count became smaller,
713   // but it is not reflected in this loop, it may loop forever in case
714   // the replicas/rack cannot be satisfied.
715   if (--refreshCounter == 0) {
716     refreshCounter = clusterMap.countNumOfAvailableNodes(scope,
717         excludedNodes);
718     // It has already gone through enough number of nodes.
719     if (refreshCounter <= excludedNodes.size()) {
720       break;
721     }
722   }
723 }
{code}
On line 672, {{chooseDataNode(scope)}} is random. If {{chosenNode}} happens to be an already excluded node, execution does not reach line 674, but {{refreshCounter}} is still decremented. If we are out of luck and {{chooseDataNode(scope)}} returns an already excluded node too many times, we reach line 716 and break at line 720. We then end up choosing fewer than {{numOfReplicas}} nodes, when in fact we could have chosen enough.
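The early exit described above can be replayed deterministically. This is a simplified sketch under stated assumptions, not the reverted patch itself: {{RefreshCounterSketch}} and {{choose}} are hypothetical names, every node counts as a good target, and the random {{chooseDataNode(scope)}} call is replaced by a caller-supplied sequence of indices so that the unlucky case can be forced.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; it mirrors the shape of the quoted loop.
final class RefreshCounterSketch {

  /** Counts nodes that are not yet in the excluded set. */
  static int countAvailable(List<String> nodes, Set<String> excluded) {
    int n = 0;
    for (String node : nodes) {
      if (!excluded.contains(node)) {
        n++;
      }
    }
    return n;
  }

  /**
   * Returns how many nodes were chosen. The picks iterator stands in for the
   * random node choice and must supply enough indices for the run.
   */
  static int choose(List<String> nodes, int numOfReplicas, Iterator<Integer> picks) {
    Set<String> excluded = new HashSet<>();
    int numOfAvailableNodes = nodes.size();
    int refreshCounter = numOfAvailableNodes;
    int chosen = 0;
    while (numOfReplicas > 0 && numOfAvailableNodes > 0) {
      int index = picks.next();
      if (excluded.add(nodes.get(index))) { // was not in the excluded set
        numOfAvailableNodes--;
        numOfReplicas--; // assumption: every new node is a good target
        chosen++;
      }
      // Decremented on every pass, including passes that re-picked an excluded node.
      if (--refreshCounter == 0) {
        refreshCounter = countAvailable(nodes, excluded);
        if (refreshCounter <= excluded.size()) {
          break;
        }
      }
    }
    return chosen;
  }
}
```

With nodes a, b, c and three replicas requested, the pick sequence 0, 1, 1 selects a and b, wastes the third pass on b, and then breaks because one available node is not more than two excluded ones. Only two replicas are placed although c was still free. The sequence 0, 1, 2 places all three.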
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983924#comment-14983924 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #612 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/612/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983923#comment-14983923 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #624 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/624/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983954#comment-14983954 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Yarn-trunk #1347 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1347/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983973#comment-14983973 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2554 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2554/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983986#comment-14983986 ]

Hudson commented on HDFS-4937:

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #560 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/560/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983905#comment-14983905 ]

Yi Liu commented on HDFS-4937:

Reverted from trunk, branch-2, and branch-2.7. Thanks, Brahma. I thought the tests had passed, but the Jenkins run did not actually include the test results.
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983907#comment-14983907 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-trunk-Commit #8738 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8738/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983906#comment-14983906 ]

Yi Liu commented on HDFS-4937:

I did consider the situation you mentioned, but I thought that in a real environment the NN could find other racks/DNs once it had gone through enough nodes. I missed the fact that many tests contain only a few available DNs, in which case {{refreshCounter <= excludedNodes.size()}} will be true. This may also happen in a real environment if the total number of DNs is small. So the patch is not correct for these cases, and I am reverting it.
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983996#comment-14983996 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Hdfs-trunk #2497 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2497/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983824#comment-14983824 ]

Vinayakumar B commented on HDFS-4937:

Hi [~aw], any idea why the tests did not run in the last precommit for the patch here ([above comment|https://issues.apache.org/jira/browse/HDFS-4937?focusedCommentId=14981649=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14981649])?
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982643#comment-14982643 ]

Kihwal Lee commented on HDFS-4937:

Also committed to branch-2.7.

>             Fix For: 3.0.0, 2.8.0
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982633#comment-14982633 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-trunk-Commit #8730 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8730/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt

>             Fix For: 3.0.0, 2.8.0
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982676#comment-14982676 ]

Hudson commented on HDFS-4937:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #2549 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2549/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982818#comment-14982818 ] Hudson commented on HDFS-4937: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #619 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/619/]) HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982791#comment-14982791 ] Hudson commented on HDFS-4937: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2493 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2493/]) HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983125#comment-14983125 ] Hudson commented on HDFS-4937: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #556 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/556/]) HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982867#comment-14982867 ] Hudson commented on HDFS-4937: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1342 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1342/]) HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982973#comment-14982973 ] Hudson commented on HDFS-4937: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #607 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/607/]) HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f) * hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt * hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981132#comment-14981132 ] Kihwal Lee commented on HDFS-4937: -- The failed test cases pass when run locally.
{noformat}
--- T E S T S ---
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Tests run: 36, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 146.108 sec - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.379 sec - in org.apache.hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.582 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 104.369 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
Results :
Tests run: 54, Failures: 0, Errors: 0, Skipped: 0
{noformat}
Also, there actually is no new findbugs issue.
> ReplicationMonitor can infinite-loop in > BlockPlacementPolicyDefault#chooseRandom() > -- > > Key: HDFS-4937 > URL: https://issues.apache.org/jira/browse/HDFS-4937 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.4-alpha, 0.23.8 >Reporter: Kihwal Lee >Assignee: Kihwal Lee > Labels: BB2015-05-TBR > Attachments: HDFS-4937.patch, HDFS-4937.v1.patch > > > When a large number of nodes are removed by refreshing node lists, the > network topology is updated. If the refresh happens at the right moment, the > replication monitor thread may stuck in the while loop of {{chooseRandom()}}. > This is because the cached cluster size is used in the terminal condition > check of the loop. This usually happens when a block with a high replication > factor is being processed. Since replicas/rack is also calculated beforehand, > no node choice may satisfy the goodness criteria if refreshing removed racks. > All nodes will end up in the excluded list, but the size will still be less > than the cached cluster size, so it will loop infinitely. This was observed > in a production environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981649#comment-14981649 ] Hadoop QA commented on HDFS-4937: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | reexec | 0m 6s | docker + precommit patch detected. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | mvninstall | 3m 0s | trunk passed |
| {color:green}+1{color} | compile | 4m 21s | trunk passed with JDK v1.8.0_60 |
| {color:green}+1{color} | compile | 4m 4s | trunk passed with JDK v1.7.0_79 |
| {color:green}+1{color} | checkstyle | 0m 58s | trunk passed |
| {color:green}+1{color} | mvneclipse | 0m 0s | trunk passed |
| {color:green}+1{color} | findbugs | 0m 0s | trunk passed |
| {color:green}+1{color} | javadoc | 0m 0s | trunk passed |
| {color:green}+1{color} | javadoc | 0m 0s | trunk passed |
| {color:green}+1{color} | mvninstall | 0m 0s | the patch passed |
| {color:green}+1{color} | compile | 4m 34s | the patch passed with JDK v1.8.0_60 |
| {color:green}+1{color} | javac | 4m 34s | the patch passed |
| {color:green}+1{color} | compile | 4m 31s | the patch passed with JDK v1.7.0_79 |
| {color:green}+1{color} | javac | 4m 31s | the patch passed |
| {color:green}+1{color} | checkstyle | 1m 0s | the patch passed |
| {color:green}+1{color} | mvneclipse | 0m 0s | the patch passed |
| {color:green}+1{color} | shellcheck | 0m 8s | There were no new shellcheck issues. |
| {color:green}+1{color} | whitespace | 0m 0s | Patch has no whitespace issues. |
| {color:green}+1{color} | findbugs | 0m 0s | the patch passed |
| {color:green}+1{color} | javadoc | 0m 0s | the patch passed |
| {color:green}+1{color} | javadoc | 0m 0s | the patch passed |
| {color:red}-1{color} | asflicense | 0m 13s | Patch generated 1 ASF License warnings. |
| | | 23m 45s | |
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-10-30 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12769642/HDFS-4937.v2.patch |
| JIRA Issue | HDFS-4937 |
| Optional Tests | asflicense shellcheck javac javadoc mvninstall unit findbugs checkstyle compile |
| uname | Linux 933890a322fa 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/patchprocess/apache-yetus-e77b1ce/precommit/personality/hadoop.sh |
| git revision | trunk / e5b1733 |
| Default Java | 1.7.0_79 |
| Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_60 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_79 |
| shellcheck | v0.4.1 |
| JDK v1.7.0_79 Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13285/testReport/ |
| asflicense | https://builds.apache.org/job/PreCommit-HDFS-Build/13285/artifact/patchprocess/patch-asflicense-problems.txt |
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981789#comment-14981789 ] Yi Liu commented on HDFS-4937: -- +1, thanks Kihwal. > ReplicationMonitor can infinite-loop in > BlockPlacementPolicyDefault#chooseRandom() > -- > > Key: HDFS-4937 > URL: https://issues.apache.org/jira/browse/HDFS-4937 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.4-alpha, 0.23.8 >Reporter: Kihwal Lee >Assignee: Kihwal Lee > Labels: BB2015-05-TBR > Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch > > > When a large number of nodes are removed by refreshing node lists, the > network topology is updated. If the refresh happens at the right moment, the > replication monitor thread may stuck in the while loop of {{chooseRandom()}}. > This is because the cached cluster size is used in the terminal condition > check of the loop. This usually happens when a block with a high replication > factor is being processed. Since replicas/rack is also calculated beforehand, > no node choice may satisfy the goodness criteria if refreshing removed racks. > All nodes will end up in the excluded list, but the size will still be less > than the cached cluster size, so it will loop infinitely. This was observed > in a production environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975240#comment-14975240 ] Hadoop QA commented on HDFS-4937: -
| (x) *{color:red}-1 overall{color}* |
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch | 16m 36s | Findbugs (version ) appears to be broken on trunk. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 2s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 37s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 39s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. |
| {color:red}-1{color} | findbugs | 2m 35s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | native | 3m 13s | Pre-build of native portion |
| {color:red}-1{color} | hdfs tests | 50m 48s | Tests failed in hadoop-hdfs. |
| | | 95m 11s | |
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs |
| Failed unit tests | hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap |
| | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints |
| | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots |
| | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12768765/HDFS-4937.v1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 2f1eb2b |
| Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html |
| hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/artifact/patchprocess/testrun_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13202/console |
This message was automatically generated.
> ReplicationMonitor can infinite-loop in > BlockPlacementPolicyDefault#chooseRandom() > -- > > Key: HDFS-4937 > URL: https://issues.apache.org/jira/browse/HDFS-4937 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.4-alpha, 0.23.8 >Reporter: Kihwal Lee >Assignee: Kihwal Lee > Labels: BB2015-05-TBR > Attachments: HDFS-4937.patch, HDFS-4937.v1.patch > > > When a large number of nodes are removed by refreshing node lists, the > network topology is updated. If the refresh happens at the right moment, the > replication monitor thread may stuck in the while loop of {{chooseRandom()}}. > This is because the cached cluster size is used in the terminal condition > check of the loop.
This usually happens when a block with a high replication > factor is being processed. Since replicas/rack is also calculated beforehand, > no node choice may satisfy the goodness criteria if refreshing removed racks. > All nodes will end up in the excluded list, but the size will still be less > than the cached cluster size, so it will loop infinitely. This was observed > in a production environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969656#comment-14969656 ] Hadoop QA commented on HDFS-4937: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / aea26bf | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13130/console | This message was automatically generated. > ReplicationMonitor can infinite-loop in > BlockPlacementPolicyDefault#chooseRandom() > -- > > Key: HDFS-4937 > URL: https://issues.apache.org/jira/browse/HDFS-4937 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.0.4-alpha, 0.23.8 >Reporter: Kihwal Lee >Assignee: Kihwal Lee > Labels: BB2015-05-TBR > Attachments: HDFS-4937.patch > > > When a large number of nodes are removed by refreshing node lists, the > network topology is updated. If the refresh happens at the right moment, the > replication monitor thread may stuck in the while loop of {{chooseRandom()}}. > This is because the cached cluster size is used in the terminal condition > check of the loop. This usually happens when a block with a high replication > factor is being processed. Since replicas/rack is also calculated beforehand, > no node choice may satisfy the goodness criteria if refreshing removed racks. > All nodes will end up in the excluded list, but the size will still be less > than the cached cluster size, so it will loop infinitely. This was observed > in a production environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524647#comment-14524647 ] Hadoop QA commented on HDFS-4937: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / f1a152c | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/10547/console | This message was automatically generated. ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom() -- Key: HDFS-4937 URL: https://issues.apache.org/jira/browse/HDFS-4937 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.4-alpha, 0.23.8 Reporter: Kihwal Lee Assignee: Kihwal Lee Attachments: HDFS-4937.patch When a large number of nodes are removed by refreshing node lists, the network topology is updated. If the refresh happens at the right moment, the replication monitor thread may stuck in the while loop of {{chooseRandom()}}. This is because the cached cluster size is used in the terminal condition check of the loop. This usually happens when a block with a high replication factor is being processed. Since replicas/rack is also calculated beforehand, no node choice may satisfy the goodness criteria if refreshing removed racks. All nodes will end up in the excluded list, but the size will still be less than the cached cluster size, so it will loop infinitely. This was observed in a production environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524682#comment-14524682 ] Hadoop QA commented on HDFS-4937: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / f1a152c | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/10556/console | This message was automatically generated. ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom() -- Key: HDFS-4937 URL: https://issues.apache.org/jira/browse/HDFS-4937 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.4-alpha, 0.23.8 Reporter: Kihwal Lee Assignee: Kihwal Lee Attachments: HDFS-4937.patch When a large number of nodes are removed by refreshing node lists, the network topology is updated. If the refresh happens at the right moment, the replication monitor thread may stuck in the while loop of {{chooseRandom()}}. This is because the cached cluster size is used in the terminal condition check of the loop. This usually happens when a block with a high replication factor is being processed. Since replicas/rack is also calculated beforehand, no node choice may satisfy the goodness criteria if refreshing removed racks. All nodes will end up in the excluded list, but the size will still be less than the cached cluster size, so it will loop infinitely. This was observed in a production environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13726787#comment-13726787 ] Hadoop QA commented on HDFS-4937: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4754//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4754//console This message is automatically generated. ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom() -- Key: HDFS-4937 URL: https://issues.apache.org/jira/browse/HDFS-4937 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.0.4-alpha, 0.23.8 Reporter: Kihwal Lee Attachments: HDFS-4937.patch When a large number of nodes are removed by refreshing node lists, the network topology is updated. 
If the refresh happens at the right moment, the replication monitor thread may stuck in the while loop of {{chooseRandom()}}. This is because the cached cluster size is used in the terminal condition check of the loop. This usually happens when a block with a high replication factor is being processed. Since replicas/rack is also calculated beforehand, no node choice may satisfy the goodness criteria if refreshing removed racks. All nodes will end up in the excluded list, but the size will still be less than the cached cluster size, so it will loop infinitely. This was observed in a production environment. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696797#comment-13696797 ]

Uma Maheswara Rao G commented on HDFS-4937:
-------------------------------------------

Hi Kihwal, you said in the comment that the operator added a large number of new nodes, right? Even then it was not able to choose at least from them?

ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
--
Key: HDFS-4937
URL: https://issues.apache.org/jira/browse/HDFS-4937
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.0.4-alpha, 0.23.8
Reporter: Kihwal Lee
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696922#comment-13696922 ]

Kihwal Lee commented on HDFS-4937:
----------------------------------

bq. Even then it was not able to choose at least from them?

It couldn't pick enough nodes because the max replicas/rack was already calculated. I think it worked fine for the majority of blocks with 3 replicas, since the cluster had more than 3 racks even after the refresh. The issue was with blocks with many more replicas.

But picking enough nodes is just one condition. The other checks for the exhaustion of candidate nodes. It would have bailed out of the while loop if the cached cluster size had been updated inside the loop. To avoid frequent cluster-size refreshes for this rare condition, we can make it update the cached value after {{dfs.replication.max}} iterations, within which most blocks should find all they need. If the NN hits this issue, it will loop {{dfs.replication.max}} times and break out. I prefer this over adding locking, which will slow down normal cases.

ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
--
Key: HDFS-4937
URL: https://issues.apache.org/jira/browse/HDFS-4937
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.0.4-alpha, 0.23.8
Reporter: Kihwal Lee
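One reading of the mitigation proposed in this comment, sketched under assumptions: the cached cluster size is re-read only every `refreshInterval` iterations (the comment suggests {{dfs.replication.max}} for that interval), so the normal path pays nothing extra and the stuck path exits after one interval. The class and parameter names are hypothetical, the random pick is replaced by a deterministic one, and this is not the attached patch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of "refresh the cached size every N iterations"; not the HDFS-4937 patch. */
class PeriodicSizeRefreshDemo {

    /**
     * Returns the number of iterations before the loop exits, or -1 if the
     * safety cap is reached. liveNodes and refreshInterval must be greater than zero.
     */
    static int iterationsUntilExit(int staleClusterSize, int liveNodes,
                                   int refreshInterval, int safetyCap) {
        Set<Integer> excluded = new HashSet<>();
        int cachedClusterSize = staleClusterSize;
        int iterations = 0;
        while (excluded.size() < cachedClusterSize) {
            if (iterations >= safetyCap) {
                return -1;
            }
            iterations++;
            // Assumption: every pick is rejected and ends up in the excluded set.
            excluded.add(iterations % liveNodes);
            // Re-read the live cluster size only once per interval, so the
            // common case does no extra work.
            if (iterations % refreshInterval == 0) {
                cachedClusterSize = liveNodes;
            }
        }
        return iterations;
    }
}
```

With a stale size of 100, 40 live nodes, and an interval of 512, the excluded set fills by iteration 40, the cached value is corrected at iteration 512, and the loop exits there, which matches the "loop {{dfs.replication.max}} times and break out" behaviour described in the comment.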
[jira] [Commented] (HDFS-4937) ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693424#comment-13693424 ]

Kihwal Lee commented on HDFS-4937:
----------------------------------

This can mostly be avoided by decommissioning nodes in smaller batches, which is the recommended practice. But in this particular case, the operator added a large number of new nodes and decommissioned the old nodes.

ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
--
Key: HDFS-4937
URL: https://issues.apache.org/jira/browse/HDFS-4937
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.0.4-alpha, 0.23.8
Reporter: Kihwal Lee