[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-04-05 Thread Lei (Eddy) Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lei (Eddy) Xu updated HADOOP-15317:
---
Fix Version/s: (was: 3.0.2)
   3.0.3

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Fix For: 2.10.0, 2.8.4, 3.2.0, 3.1.1, 2.9.2, 3.0.3
>
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch, 
> HADOOP-15317.06.patch, Screen Shot 2018-03-28 at 7.23.32 PM.png
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-04-02 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.9.2
   3.1.1
   3.0.2
   3.2.0
   2.8.4
   2.10.0
   Status: Resolved  (was: Patch Available)

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Fix For: 2.10.0, 2.8.4, 3.2.0, 3.0.2, 3.1.1, 2.9.2
>
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch, 
> HADOOP-15317.06.patch, Screen Shot 2018-03-28 at 7.23.32 PM.png
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-29 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.06.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch, 
> HADOOP-15317.06.patch, Screen Shot 2018-03-28 at 7.23.32 PM.png
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-28 Thread Ajay Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Kumar updated HADOOP-15317:

Attachment: Screen Shot 2018-03-28 at 7.23.32 PM.png

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch, Screen 
> Shot 2018-03-28 at 7.23.32 PM.png
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-27 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.05.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-27 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: (was: HADOOP-15317.05.patch)

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-27 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.05.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch, HADOOP-15317.05.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-27 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.04.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch, HADOOP-15317.04.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-23 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.03.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch, 
> HADOOP-15317.03.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-16 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.02.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch, HADOOP-15317.02.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-14 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Status: Patch Available  (was: Open)

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15317) Improve NetworkTopology chooseRandom's loop

2018-03-14 Thread Xiao Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Chen updated HADOOP-15317:
---
Attachment: HADOOP-15317.01.patch

> Improve NetworkTopology chooseRandom's loop
> ---
>
> Key: HADOOP-15317
> URL: https://issues.apache.org/jira/browse/HADOOP-15317
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Major
> Attachments: HADOOP-15317.01.patch
>
>
> Recently we found a postmortem case where the ANN seems to be in an infinite 
> loop. From the logs it seems it just went through a rolling restart, and DNs 
> are getting registered.
> Later the NN become unresponsive, and from the stacktrace it's inside a 
> do-while loop inside {{NetworkTopology#chooseRandom}} - part of what's done 
> in HDFS-10320.
> Going through the code and logs I'm not able to come up with any theory 
> (thought about incorrect locking, or the Node object being modified outside 
> of NetworkTopology, both seem impossible) why this is happening, but we 
> should eliminate this loop.
> stacktrace:
> {noformat}
>  Stack:
> java.util.HashMap.hash(HashMap.java:338)
> java.util.HashMap.containsKey(HashMap.java:595)
> java.util.HashSet.contains(HashSet.java:203)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:786)
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:732)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:757)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:692)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:666)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:573)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:461)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:368)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:243)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:115)
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4AdditionalDatanode(BlockManager.java:1596)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3599)
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:717)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org