A client may fail during block recovery even if its request to recover a block 
succeeds
---------------------------------------------------------------------------------------

                 Key: HDFS-2639
                 URL: https://issues.apache.org/jira/browse/HDFS-2639
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs client
    Affects Versions: 1.0.0
            Reporter: Eli Collins


The client gets stuck in the following loop if an RPC it issued to recover a 
block times out:

{noformat}
DataStreamer#run
1.  processDatanodeError
2.     DN#recoverBlock
3.        DN#syncBlock
4.           NN#nextGenerationStamp
5.  sleep 1s
6.  goto 1
{noformat}

Once we've timed out once at step 2 and looped back, step 2 throws an IOE because 
the block is already being recovered, and step 4 throws an IOE because the block's 
GS is now out of date (the previous, timed-out request obtained a new GS and 
updated the block). Eventually the client reaches max retries, considers all DNs 
bad, and close throws an IOE.
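
To make the failure mode concrete, here is a minimal, self-contained simulation 
of the loop above. The class and variable names are illustrative, not the actual 
DataStreamer/DataNode code: the first recoverBlock call completes its work 
server-side but "times out" client-side, the retry hits the "already being 
recovered" IOE, and every attempt after that hits the stale-GS IOE.

{noformat}
import java.io.IOException;

public class BlockRecoveryLoopSim {
    static final int MAX_RETRIES = 3;
    static long clientGS = 100;              // GS the client believes the block has
    static long serverGS = 100;              // GS on the NN/DN side
    static boolean recoveryInFlight = false;

    // Stands in for DN#recoverBlock -> DN#syncBlock -> NN#nextGenerationStamp.
    static void recoverBlock(long gs) throws IOException {
        if (recoveryInFlight) {
            recoveryInFlight = false;        // the earlier recovery eventually finishes
            throw new IOException("block is already being recovered");     // step 2's IOE
        }
        if (gs != serverGS) {
            throw new IOException("GS " + gs + " out of date, server has " + serverGS); // step 4's IOE
        }
        recoveryInFlight = true;
        serverGS++;                          // NN#nextGenerationStamp succeeded server-side...
        throw new IOException("RPC timed out");  // ...but the client never sees the reply
    }

    public static void main(String[] args) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                recoverBlock(clientGS);
                System.out.println("recovered");
                return;
            } catch (IOException e) {
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(1000);          // step 5: sleep 1s, goto 1
            }
        }
        System.out.println("max retries reached: all DNs considered bad, close() will throw");
    }
}
{noformat}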

The client should be able to succeed if one of its requests to recover the 
block succeeded. It should still fail if another client (eg HBase via 
recoverLease or the NN via releaseLease) successfully recovered the block. One 
way to handle this would be to not time out the request to recover the block. 
Another would be to make a subsequent call to recoverBlock succeed, eg by 
updating the block's generation stamp to the latest value that was set by the 
same client in the previous request (ie a client can recover over itself but 
not over another client); see the sketch below.
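
A rough sketch of that second idea, assuming a hypothetical server-side 
registry that remembers which client last bumped a block's GS. RecoveryRegistry 
and lastRecoverer are illustrative names, not existing HDFS APIs:

{noformat}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class RecoveryRegistry {
    private final Map<String, Long> latestGS = new HashMap<>();        // blockId -> current GS
    private final Map<String, String> lastRecoverer = new HashMap<>(); // blockId -> last recovering client

    // Returns the new GS; throws only if a *different* client already recovered the block.
    public synchronized long recoverBlock(String blockId, long clientGS, String clientName)
            throws IOException {
        long current = latestGS.getOrDefault(blockId, clientGS);
        if (clientGS != current && !clientName.equals(lastRecoverer.get(blockId))) {
            // Stale GS produced by someone else (eg HBase via recoverLease, or
            // the NN via lease release): the client really has lost the block.
            throw new IOException("block " + blockId + " was recovered by "
                + lastRecoverer.get(blockId));
        }
        // Either the GS is current, or the stale GS came from this client's own
        // earlier, timed-out attempt: recover over ourselves.
        long next = current + 1;             // stands in for NN#nextGenerationStamp
        latestGS.put(blockId, next);
        lastRecoverer.put(blockId, clientName);
        return next;
    }

    public static void main(String[] args) throws IOException {
        RecoveryRegistry reg = new RecoveryRegistry();
        long gs = reg.recoverBlock("blk_1", 100, "clientA");    // succeeds server-side, "times out" client-side
        System.out.println("server-side GS after timed-out attempt: " + gs);
        // The retry carries the stale GS 100 but still succeeds: same client.
        System.out.println("retry by same client: " + reg.recoverBlock("blk_1", 100, "clientA"));
        try {
            reg.recoverBlock("blk_1", 100, "clientB");          // stale GS from a different client
        } catch (IOException e) {
            System.out.println("other client correctly rejected: " + e.getMessage());
        }
    }
}
{noformat}

The key point of the sketch is that a stale GS is only fatal when the last 
recovery was performed by a different client; a client's own timed-out attempt 
no longer wedges its retries.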
