[ https://issues.apache.org/jira/browse/HDFS-14181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731011#comment-16731011 ]
Sihai Ke edited comment on HDFS-14181 at 12/30/18 4:25 PM: ----------------------------------------------------------- hi [~hexiaoqiao], thanks for your comments, I think why this bug is not triggered is as you said, this method is called only be the method mentioned by you && the scope is *NodeBase.ROOT*, and it doesn't have prod affection, but does that mean it is not a bug ? let's clarify the logic: # variable *availableNodes* is calculated by below code (focus on else situation), and will be used in fucntion *chooseRandom* {code:java} final int availableNodes; if (excludedScope == null) { availableNodes = countNumOfAvailableNodes(scope, excludedNodes); } else { availableNodes = countNumOfAvailableNodes("~" + excludedScope, excludedNodes); } LOG.debug("Choosing random from {} available nodes on node {}," + " scope={}, excludedScope={}, excludeNodes={}. numOfDatanodes={}.", availableNodes, innerNode, scope, excludedScope, excludedNodes, numOfDatanodes); Node ret = null; if (availableNodes > 0) { ret = chooseRandom(innerNode, node, excludedNodes, numOfDatanodes, availableNodes); } LOG.debug("chooseRandom returning {}", ret); return ret; } {code} 2. In function chooseRandom, *availableNodes stands for the nodes under parentNode that could be choosen, i think there is a logic gap between these two function, ,* *a. Scope is substring of excludedScope.* *b.* ~excludedScope *equals to* nodes under Scope and not in excludedScope *??? (I think here is the bug.)* *c.* #b is true when scope is root. and the UT is one case will trigger this bug. and one thing important is it will throw exception, and means failed to choose node for data. {code:java} /** * Randomly choose one node under <i>parentNode</i>, considering the exclude * nodes and scope. Should be called with {@link #netlock}'s readlock held. * * @param parentNode the parent node * @param excludedScopeNode the node corresponding to the exclude scope. * @param excludedNodes a collection of nodes to be excluded from * @param totalInScopeNodes total number of nodes under parentNode, excluding * the excludedScopeNode * @param availableNodes number of available nodes under parentNode that * could be chosen, excluding excludedNodes * @return the chosen node, or null if none can be chosen */ private Node chooseRandom(final InnerNode parentNode, final Node excludedScopeNode, final Collection<Node> excludedNodes, final int totalInScopeNodes, final int availableNodes) { Preconditions.checkArgument( totalInScopeNodes >= availableNodes && availableNodes > 0, String .format("%d should >= %d, and both should be positive.", totalInScopeNodes, availableNodes)); if (excludedNodes == null || excludedNodes.isEmpty()) { // if there are no excludedNodes, randomly choose a node final int index = r.nextInt(totalInScopeNodes); return parentNode.getLeaf(index, excludedScopeNode); } {code} 3. For changing the method from protect to private, could we refine the logic, my thought is this method is general, so when other class who extend this class could reuse the logic. was (Author: sihai): hi [~hexiaoqiao], thanks for your comments, I think why this bug is not triggered is as you said, this method is called only be the method mentioned by you && the scope is *NodeBase.ROOT*, and it doesn't have prod affection, but does that mean it is not a bug ? let's clarify the logic: # variable *availableNodes* is calculated by below code (focus on else situation), and will be used in fucntion *chooseRandom* {code:java} final int availableNodes; if (excludedScope == null) { availableNodes = countNumOfAvailableNodes(scope, excludedNodes); } else { availableNodes = countNumOfAvailableNodes("~" + excludedScope, excludedNodes); } LOG.debug("Choosing random from {} available nodes on node {}," + " scope={}, excludedScope={}, excludeNodes={}. numOfDatanodes={}.", availableNodes, innerNode, scope, excludedScope, excludedNodes, numOfDatanodes); Node ret = null; if (availableNodes > 0) { ret = chooseRandom(innerNode, node, excludedNodes, numOfDatanodes, availableNodes); } LOG.debug("chooseRandom returning {}", ret); return ret; } {code} 2. In function chooseRandom, *availableNodes stands for the nodes under parentNode that could be choosen, i think there is a logic gap between these two function, ,* *a. Scope is substring of excludedScope.* *b.* ~excludedScope *equals to* nodes under Scope and not in excludedScope *??? (I think here is the bug.)* *c.* #b is true when scope is root. and the UT is one case will trigger this bug. and one thing important is it will throw exception, and means failed to choose node for data. {code:java} /** * Randomly choose one node under <i>parentNode</i>, considering the exclude * nodes and scope. Should be called with {@link #netlock}'s readlock held. * * @param parentNode the parent node * @param excludedScopeNode the node corresponding to the exclude scope. * @param excludedNodes a collection of nodes to be excluded from * @param totalInScopeNodes total number of nodes under parentNode, excluding * the excludedScopeNode * @param availableNodes number of available nodes under parentNode that * could be chosen, excluding excludedNodes * @return the chosen node, or null if none can be chosen */ private Node chooseRandom(final InnerNode parentNode, final Node excludedScopeNode, final Collection<Node> excludedNodes, final int totalInScopeNodes, final int availableNodes) { Preconditions.checkArgument( totalInScopeNodes >= availableNodes && availableNodes > 0, String .format("%d should >= %d, and both should be positive.", totalInScopeNodes, availableNodes)); if (excludedNodes == null || excludedNodes.isEmpty()) { // if there are no excludedNodes, randomly choose a node final int index = r.nextInt(totalInScopeNodes); return parentNode.getLeaf(index, excludedScopeNode); } {code} 3. For changing the method from protect to private, could we refine the logic, I thought is this method is general, so when other class who extend this class could reuse this logic. > Suspect there is a bug in NetworkTopology.java chooseRandom function. > --------------------------------------------------------------------- > > Key: HDFS-14181 > URL: https://issues.apache.org/jira/browse/HDFS-14181 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, namenode > Affects Versions: 2.9.2 > Reporter: Sihai Ke > Priority: Major > Attachments: 0001-add-UT-for-NetworkTopology.patch, > image-2018-12-29-15-02-19-415.png > > > During reading the hadoop NetworkTopology.java, I suspect there is a bug in > function > chooseRandom (line 498, hadoop version 2.9.2-RC0), > I think there is a bug in{color:#f79232} code, ~excludedScope doesn't mean > availableNodes under Scope node, and I also add unit test for this and get an > exception.{color} > bug code in the else. > {code:java} > // code placeholder > if (excludedScope == null) { > availableNodes = countNumOfAvailableNodes(scope, excludedNodes); > } else { > availableNodes = > countNumOfAvailableNodes("~" + excludedScope, excludedNodes); > }{code} > Source code: > {code:java} > // code placeholder > protected Node chooseRandom(final String scope, String excludedScope, > final Collection<Node> excludedNodes) { > if (excludedScope != null) { > if (scope.startsWith(excludedScope)) { > return null; > } > if (!excludedScope.startsWith(scope)) { > excludedScope = null; > } > } > Node node = getNode(scope); > if (!(node instanceof InnerNode)) { > return excludedNodes != null && excludedNodes.contains(node) ? > null : node; > } > InnerNode innerNode = (InnerNode)node; > int numOfDatanodes = innerNode.getNumOfLeaves(); > if (excludedScope == null) { > node = null; > } else { > node = getNode(excludedScope); > if (!(node instanceof InnerNode)) { > numOfDatanodes -= 1; > } else { > numOfDatanodes -= ((InnerNode)node).getNumOfLeaves(); > } > } > if (numOfDatanodes <= 0) { > LOG.debug("Failed to find datanode (scope=\"{}\" excludedScope=\"{}\")." > + " numOfDatanodes={}", > scope, excludedScope, numOfDatanodes); > return null; > } > final int availableNodes; > if (excludedScope == null) { > availableNodes = countNumOfAvailableNodes(scope, excludedNodes); > } else { > availableNodes = > countNumOfAvailableNodes("~" + excludedScope, excludedNodes); > } > LOG.debug("Choosing random from {} available nodes on node {}," > + " scope={}, excludedScope={}, excludeNodes={}. numOfDatanodes={}.", > availableNodes, innerNode, scope, excludedScope, excludedNodes, > numOfDatanodes); > Node ret = null; > if (availableNodes > 0) { > ret = chooseRandom(innerNode, node, excludedNodes, numOfDatanodes, > availableNodes); > } > LOG.debug("chooseRandom returning {}", ret); > return ret; > } > {code} > > > Add Unit Test in TestClusterTopology.java, but get exception. > > {code:java} > // code placeholder > @Test > public void testChooseRandom1() { > // create the topology > NetworkTopology cluster = NetworkTopology.getInstance(new Configuration()); > NodeElement node1 = getNewNode("node1", "/a1/b1/c1"); > cluster.add(node1); > NodeElement node2 = getNewNode("node2", "/a1/b1/c1"); > cluster.add(node2); > NodeElement node3 = getNewNode("node3", "/a1/b1/c2"); > cluster.add(node3); > NodeElement node4 = getNewNode("node4", "/a1/b2/c3"); > cluster.add(node4); > Node node = cluster.chooseRandom("/a1/b1", "/a1/b1/c1", null); > assertSame(node.getName(), "node3"); > } > {code} > > Exception: > {code:java} > // code placeholder > java.lang.IllegalArgumentException: 1 should >= 2, and both should be > positive. > at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:567) > at > org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:544) > atorg.apache.hadoop.net.TestClusterTopology.testChooseRandom1(TestClusterTopology.java:198) > {code} > > {color:#f79232}!image-2018-12-29-15-02-19-415.png!{color} > > > [~vagarychen] this change is imported in PR HDFS-11577, could you help to > check whether this is a bug ? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org