A client may fail during block recovery even if its request to recover a block
succeeds
---------------------------------------------------------------------------------------
Key: HDFS-2639
URL: https://issues.apache.org/jira/browse/HDFS-2639
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs client
Affects Versions: 1.0.0
Reporter: Eli Collins
The client gets stuck in the following loop if an RPC it issued to recover a
block times out:
{noformat}
DataStreamer#run
1. processDatanodeError
2. DN#recoverBlock
3. DN#syncBlock
4. NN#nextGenerationStamp
5. sleep 1s
6. goto 1
{noformat}
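A minimal, hypothetical sketch of this failure mode (illustrative class and method names, not the real DataStreamer/DataNode code): the first recoverBlock call succeeds on the DN side and bumps the generation stamp, but the reply is lost to a client-side timeout, so every retry carries the stale GS and fails until max retries is reached.

```java
import java.io.IOException;

// Illustrative simulation of the loop above; not actual HDFS code.
class RecoveryLoopSketch {
    static long nnGenerationStamp = 1;       // NN's latest GS for the block
    static boolean recoveryInProgress = false;

    // DN#recoverBlock: rejects a concurrent recovery or a stale GS,
    // otherwise obtains a new generation stamp (NN#nextGenerationStamp).
    static long recoverBlock(long clientGs) throws IOException {
        if (recoveryInProgress) {
            throw new IOException("Block is already being recovered");
        }
        if (clientGs < nnGenerationStamp) {
            throw new IOException("Generation stamp " + clientGs + " is out of date");
        }
        recoveryInProgress = true;
        return ++nnGenerationStamp;          // NN#nextGenerationStamp
    }

    // Client-side retry loop: every retry carries the stale GS and fails.
    static int attemptsUntilGiveUp(long staleGs, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                recoverBlock(staleGs);       // step 2 of the loop above
                return attempt;
            } catch (IOException e) {
                // steps 5-6: sleep 1s, goto 1 (sleep elided in this sketch)
            }
        }
        return -1;                           // max retries: all DNs considered bad
    }

    // First request succeeds server-side but "times out" on the client,
    // which then retries with its stale GS.
    static int runScenario(int maxRetries) {
        long staleGs = nnGenerationStamp;
        try {
            recoverBlock(staleGs);           // DN succeeds, GS is bumped
        } catch (IOException cannotHappenHere) { }
        return attemptsUntilGiveUp(staleGs, maxRetries);
    }
}
```

In this sketch the client never reaches a successful return: the recovery it itself started blocks all of its own retries.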
Once we've timed out at step 2 and looped, step 2 throws an IOE because the
block is already being recovered, and step 4 throws an IOE because the block's
GS is now out of date (the previous, timed-out request obtained a new GS and
updated the block). Eventually the client reaches its max retries, considers
all DNs bad, and close throws an IOE.
The client should be able to succeed if one of its requests to recover the
block succeeded. It should still fail if another client (e.g. HBase via
recoverLease, or the NN via releaseLease) successfully recovered the block. One
way to handle this would be to not time out the request to recover the block.
Another would be to make a subsequent call to recoverBlock succeed, e.g. by
updating the block's generation stamp to the latest value obtained by the same
client in the previous request (i.e. a client can recover over itself but not
over another client).
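A hypothetical sketch of the second proposal (names are illustrative, not the real API): the DN remembers which client owns the in-flight recovery, so a retry from the same client is treated as idempotent and handed the GS its earlier, timed-out request already obtained, while a different client still fails.

```java
import java.io.IOException;

// Illustrative sketch of "recover over itself"; not actual HDFS code.
class IdempotentRecoverySketch {
    static long nnGenerationStamp = 1;
    static String recoveringClient = null;   // owner of the in-flight recovery

    static long recoverBlock(String clientId, long clientGs) throws IOException {
        if (recoveringClient != null && !recoveringClient.equals(clientId)) {
            // A different client (e.g. HBase via recoverLease) recovered
            // the block: this client must still fail.
            throw new IOException("Block recovered by " + recoveringClient);
        }
        if (recoveringClient == null) {
            nnGenerationStamp++;             // NN#nextGenerationStamp
            recoveringClient = clientId;
        }
        // Same client retrying after a timeout: accept its stale GS and
        // return the GS its previous request already obtained.
        return nnGenerationStamp;
    }

    // Convenience wrapper so callers can test without a checked exception.
    static long tryRecover(String clientId, long clientGs) {
        try {
            return recoverBlock(clientId, clientGs);
        } catch (IOException e) {
            return -1;
        }
    }
}
```

With this behavior the retry loop in the description terminates: the same client's second call returns the up-to-date GS instead of throwing.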