[ 
https://issues.apache.org/jira/browse/HAMA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653645#comment-13653645
 ] 

MaoYuan Xian commented on HAMA-756:
-----------------------------------

I understand that the call to "FileStatus[] status = fs.listStatus(partitionDir);" 
is there to avoid the race condition.
However, the call to "peer.getNumPeers()" should also be placed between the two 
calls to peer.sync().
We hit this problem when some fast tasks had already completed while some slow 
tasks had only reached a point before the call to peer.sync(). When these slow 
tasks call peer.sync(), the getAllPeerNames method of ZooKeeperSyncClientImpl is 
eventually triggered, and its call to "byte[] data = 
zk.getData(constructKey(taskId.getJobID(), "peers", s),  this, null);" fails, 
raising the exception "All peer names could not be retrieved!".

As for the 2nd issue, 

 if (assignedID == peer.getNumPeers())
        assignedID = assignedID - 1;

can solve some of the problems, but not all. For example:

  // Assume desiredNum=8, peer.getNumPeers()=6
  for (FileStatus statu : status) {
      // Consider the case partitionID=7
      int partitionID = Integer
          .parseInt(statu.getPath().getName().split("[-]")[1]);
      int denom = desiredNum / peer.getNumPeers();  // denom = 8/6 = 1
      int assignedID = partitionID;                 // assignedID = 7
      if (denom > 1) {                              // denom is 1, so this block is skipped
        assignedID = partitionID / denom;
      }

      // 7 != 6, so assignedID is not adjusted here
      if (assignedID == peer.getNumPeers())
        assignedID = assignedID - 1;

      // TODO set replica factor to 1.
      // TODO and check whether we can write to specific DataNode.
      // assignedID is 7, but peer.getPeerIndex() can only be 0~5,
      // so no peer does the merge work for part-7
      if (assignedID == peer.getPeerIndex()) {
        ...
      }
  }
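
To make the walk-through above easy to check, here is a minimal standalone 
sketch that copies only the assignment arithmetic into a plain Java class, 
without any Hama dependency. The class and method names 
(PartitionAssignmentCheck, assignedId, orphaned, assignedIdModulo) are made up 
for illustration and are not part of Hama. The modulo variant is only one 
possible way to keep every partition inside the valid peer index range; it is 
an assumption, not the fix that was actually applied.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentCheck {

    // Mirrors the assignment arithmetic quoted above for one partition ID.
    static int assignedId(int partitionId, int desiredNum, int numPeers) {
        int denom = desiredNum / numPeers;   // integer division, e.g. 8 / 6 = 1
        int assigned = partitionId;
        if (denom > 1) {
            assigned = partitionId / denom;
        }
        if (assigned == numPeers) {
            assigned = assigned - 1;
        }
        return assigned;
    }

    // Lists the partitions whose assigned ID falls outside the valid
    // peer index range 0..numPeers-1, i.e. partitions nobody would merge.
    static List<Integer> orphaned(int desiredNum, int numPeers) {
        List<Integer> result = new ArrayList<>();
        for (int p = 0; p < desiredNum; p++) {
            if (assignedId(p, desiredNum, numPeers) >= numPeers) {
                result.add(p);
            }
        }
        return result;
    }

    // Hypothetical alternative (an assumption, not taken from Hama):
    // wrapping with modulo always yields an index in 0..numPeers-1.
    static int assignedIdModulo(int partitionId, int numPeers) {
        return partitionId % numPeers;
    }

    public static void main(String[] args) {
        // desiredNum=8, numPeers=6: part-7 is left without a handler.
        System.out.println(orphaned(8, 6));          // prints [7]
        System.out.println(assignedIdModulo(7, 6));  // prints 1
    }
}
```

With desiredNum=8 and 6 peers, part-6 is caught by the "== getNumPeers()" 
adjustment and goes to peer 5, but part-7 is not, which matches the case 
described above.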

                
> Timing issue and file merging algorithm in PartitioningRunner make job fail
> ---------------------------------------------------------------------------
>
>                 Key: HAMA-756
>                 URL: https://issues.apache.org/jira/browse/HAMA-756
>             Project: Hama
>          Issue Type: Bug
>            Reporter: MaoYuan Xian
>            Assignee: Edward J. Yoon
>
> There are two major problems in the bsp method of PartitioningRunner that may 
> make the partitioning fail:
> 1. The call to peer.getNumPeers() may trigger a timing issue. In the special 
> situation where some tasks have completed the bsp call while others have only 
> just entered the "for (FileStatus statu : status)" loop, the remaining tasks' 
> calls to peer.getNumPeers() trigger the problem.
> 2. The algorithm for merging the sequence files is flawed: e.g. when 
> desiredNum is 8 and the number of partitioning tasks (peer.getNumPeers()) is 
> 6, no peer is assigned to merge the part-7 directory into a file.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
