[ 
https://issues.apache.org/jira/browse/HDFS-14181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731011#comment-16731011
 ] 

Sihai Ke edited comment on HDFS-14181 at 12/30/18 4:25 PM:
-----------------------------------------------------------

hi [~hexiaoqiao], thanks for your comments, I think why this bug is not 
triggered is as you said, this method is called only be the method mentioned by 
you && the scope is *NodeBase.ROOT*, and it doesn't have prod affection, but 
does that mean it is not a bug ? let's clarify the logic:
 # variable *availableNodes* is calculated by below code (focus on else 
situation), and will be used in fucntion *chooseRandom*

{code:java}
  final int availableNodes;
  if (excludedScope == null) {
    availableNodes = countNumOfAvailableNodes(scope, excludedNodes);
  } else {
    availableNodes =
        countNumOfAvailableNodes("~" + excludedScope, excludedNodes);
  }
  LOG.debug("Choosing random from {} available nodes on node {},"
      + " scope={}, excludedScope={}, excludeNodes={}. numOfDatanodes={}.",
      availableNodes, innerNode, scope, excludedScope, excludedNodes,
      numOfDatanodes);
  Node ret = null;
  if (availableNodes > 0) {
    ret = chooseRandom(innerNode, node, excludedNodes, numOfDatanodes,
        availableNodes);
  }
  LOG.debug("chooseRandom returning {}", ret);
  return ret;
}
{code}
 

    2. In function chooseRandom,  *availableNodes stands for the nodes under 
parentNode that could be choosen,  i think there is a logic gap between these 
two function, ,*

       *a.  Scope is substring of excludedScope.*

       *b.* ~excludedScope *equals to* nodes under Scope and not in 
excludedScope *??? (I think here is the bug.)*

       *c.* #b is true when scope is root.

and the UT is one case will trigger this bug. and one thing important is it 
will throw exception, and means failed to choose node for data.
{code:java}
/**
 * Randomly choose one node under <i>parentNode</i>, considering the exclude
 * nodes and scope. Should be called with {@link #netlock}'s readlock held.
 *
 * @param parentNode        the parent node
 * @param excludedScopeNode the node corresponding to the exclude scope.
 * @param excludedNodes     a collection of nodes to be excluded from
 * @param totalInScopeNodes total number of nodes under parentNode, excluding
 *                          the excludedScopeNode
 * @param availableNodes    number of available nodes under parentNode that
 *                          could be chosen, excluding excludedNodes
 * @return the chosen node, or null if none can be chosen
 */
private Node chooseRandom(final InnerNode parentNode,
    final Node excludedScopeNode, final Collection<Node> excludedNodes,
    final int totalInScopeNodes, final int availableNodes) {
  Preconditions.checkArgument(
      totalInScopeNodes >= availableNodes && availableNodes > 0, String
          .format("%d should >= %d, and both should be positive.",
              totalInScopeNodes, availableNodes));
  if (excludedNodes == null || excludedNodes.isEmpty()) {
    // if there are no excludedNodes, randomly choose a node
    final int index = r.nextInt(totalInScopeNodes);
    return parentNode.getLeaf(index, excludedScopeNode);
  }
{code}
 

3. For changing the method from protect to private, could we refine the logic, 
my thought is this method is general, so when other class who extend this class 
could reuse the logic.


was (Author: sihai):
hi [~hexiaoqiao], thanks for your comments, I think why this bug is not 
triggered is as you said, this method is called only be the method mentioned by 
you && the scope is *NodeBase.ROOT*, and it doesn't have prod affection, but 
does that mean it is not a bug ? let's clarify the logic:
 # variable *availableNodes* is calculated by below code (focus on else 
situation), and will be used in fucntion *chooseRandom*

{code:java}
  final int availableNodes;
  if (excludedScope == null) {
    availableNodes = countNumOfAvailableNodes(scope, excludedNodes);
  } else {
    availableNodes =
        countNumOfAvailableNodes("~" + excludedScope, excludedNodes);
  }
  LOG.debug("Choosing random from {} available nodes on node {},"
      + " scope={}, excludedScope={}, excludeNodes={}. numOfDatanodes={}.",
      availableNodes, innerNode, scope, excludedScope, excludedNodes,
      numOfDatanodes);
  Node ret = null;
  if (availableNodes > 0) {
    ret = chooseRandom(innerNode, node, excludedNodes, numOfDatanodes,
        availableNodes);
  }
  LOG.debug("chooseRandom returning {}", ret);
  return ret;
}
{code}
 

    2. In function chooseRandom,  *availableNodes stands for the nodes under 
parentNode that could be choosen,  i think there is a logic gap between these 
two function, ,*

       *a.  Scope is substring of excludedScope.*

       *b.* ~excludedScope *equals to* nodes under Scope and not in 
excludedScope *??? (I think here is the bug.)*

       *c.* #b is true when scope is root.

and the UT is one case will trigger this bug. and one thing important is it 
will throw exception, and means failed to choose node for data.
{code:java}
/**
 * Randomly choose one node under <i>parentNode</i>, considering the exclude
 * nodes and scope. Should be called with {@link #netlock}'s readlock held.
 *
 * @param parentNode        the parent node
 * @param excludedScopeNode the node corresponding to the exclude scope.
 * @param excludedNodes     a collection of nodes to be excluded from
 * @param totalInScopeNodes total number of nodes under parentNode, excluding
 *                          the excludedScopeNode
 * @param availableNodes    number of available nodes under parentNode that
 *                          could be chosen, excluding excludedNodes
 * @return the chosen node, or null if none can be chosen
 */
private Node chooseRandom(final InnerNode parentNode,
    final Node excludedScopeNode, final Collection<Node> excludedNodes,
    final int totalInScopeNodes, final int availableNodes) {
  Preconditions.checkArgument(
      totalInScopeNodes >= availableNodes && availableNodes > 0, String
          .format("%d should >= %d, and both should be positive.",
              totalInScopeNodes, availableNodes));
  if (excludedNodes == null || excludedNodes.isEmpty()) {
    // if there are no excludedNodes, randomly choose a node
    final int index = r.nextInt(totalInScopeNodes);
    return parentNode.getLeaf(index, excludedScopeNode);
  }
{code}
 

3. For changing the method from protect to private, could we refine the logic, 
I thought is this method is general, so when other class who extend this class 
could reuse this logic.

> Suspect there is a bug in NetworkTopology.java chooseRandom function.
> ---------------------------------------------------------------------
>
>                 Key: HDFS-14181
>                 URL: https://issues.apache.org/jira/browse/HDFS-14181
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs, namenode
>    Affects Versions: 2.9.2
>            Reporter: Sihai Ke
>            Priority: Major
>         Attachments: 0001-add-UT-for-NetworkTopology.patch, 
> image-2018-12-29-15-02-19-415.png
>
>
> During reading the hadoop NetworkTopology.java, I suspect there is a bug in 
> function 
> chooseRandom (line 498, hadoop version 2.9.2-RC0), 
>  I think there is a bug in{color:#f79232} code, ~excludedScope doesn't mean 
> availableNodes under Scope node, and I also add unit test for this and get an 
> exception.{color}
> bug code in the else.
> {code:java}
> // code placeholder
>  if (excludedScope == null) {
>     availableNodes = countNumOfAvailableNodes(scope, excludedNodes);
>   } else {
>     availableNodes =
>         countNumOfAvailableNodes("~" + excludedScope, excludedNodes);
>   }{code}
> Source code:
> {code:java}
> // code placeholder
> protected Node chooseRandom(final String scope, String excludedScope,
>     final Collection<Node> excludedNodes) {
>   if (excludedScope != null) {
>     if (scope.startsWith(excludedScope)) {
>       return null;
>     }
>     if (!excludedScope.startsWith(scope)) {
>       excludedScope = null;
>     }
>   }
>   Node node = getNode(scope);
>   if (!(node instanceof InnerNode)) {
>     return excludedNodes != null && excludedNodes.contains(node) ?
>         null : node;
>   }
>   InnerNode innerNode = (InnerNode)node;
>   int numOfDatanodes = innerNode.getNumOfLeaves();
>   if (excludedScope == null) {
>     node = null;
>   } else {
>     node = getNode(excludedScope);
>     if (!(node instanceof InnerNode)) {
>       numOfDatanodes -= 1;
>     } else {
>       numOfDatanodes -= ((InnerNode)node).getNumOfLeaves();
>     }
>   }
>   if (numOfDatanodes <= 0) {
>     LOG.debug("Failed to find datanode (scope=\"{}\" excludedScope=\"{}\")."
>             + " numOfDatanodes={}",
>         scope, excludedScope, numOfDatanodes);
>     return null;
>   }
>   final int availableNodes;
>   if (excludedScope == null) {
>     availableNodes = countNumOfAvailableNodes(scope, excludedNodes);
>   } else {
>     availableNodes =
>         countNumOfAvailableNodes("~" + excludedScope, excludedNodes);
>   }
>   LOG.debug("Choosing random from {} available nodes on node {},"
>       + " scope={}, excludedScope={}, excludeNodes={}. numOfDatanodes={}.",
>       availableNodes, innerNode, scope, excludedScope, excludedNodes,
>       numOfDatanodes);
>   Node ret = null;
>   if (availableNodes > 0) {
>     ret = chooseRandom(innerNode, node, excludedNodes, numOfDatanodes,
>         availableNodes);
>   }
>   LOG.debug("chooseRandom returning {}", ret);
>   return ret;
> }
> {code}
>  
>  
> Add Unit Test in TestClusterTopology.java, but get exception.
>  
> {code:java}
> // code placeholder
> @Test
> public void testChooseRandom1() {
>   // create the topology
>   NetworkTopology cluster = NetworkTopology.getInstance(new Configuration());
>   NodeElement node1 = getNewNode("node1", "/a1/b1/c1");
>   cluster.add(node1);
>   NodeElement node2 = getNewNode("node2", "/a1/b1/c1");
>   cluster.add(node2);
>   NodeElement node3 = getNewNode("node3", "/a1/b1/c2");
>   cluster.add(node3);
>   NodeElement node4 = getNewNode("node4", "/a1/b2/c3");
>   cluster.add(node4);
>   Node node = cluster.chooseRandom("/a1/b1", "/a1/b1/c1", null);
>   assertSame(node.getName(), "node3");
> }
> {code}
>  
> Exception:
> {code:java}
> // code placeholder
> java.lang.IllegalArgumentException: 1 should >= 2, and both should be 
> positive. 
> at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) 
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:567) 
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:544) 
> atorg.apache.hadoop.net.TestClusterTopology.testChooseRandom1(TestClusterTopology.java:198)
> {code}
>  
> {color:#f79232}!image-2018-12-29-15-02-19-415.png!{color}
>  
>  
> [~vagarychen] this change is imported in PR HDFS-11577, could you help to 
> check whether this is a bug ?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to