Stale connection makes node miss append
---------------------------------------

                 Key: HDFS-1224
                 URL: https://issues.apache.org/jira/browse/HDFS-1224
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Thanh Do


- Summary: If a datanode crashes and restarts, a subsequent append can miss it because the primary datanode reuses a stale IPC connection to it.
 
- Setup:
+ available datanodes = 3
+ replica = 3 
+ disks / datanode = 1
+ failures = 1
+ failure type = crash
+ when/where failure happens = after the first append succeeds
 
- Details:
Each datanode maintains a pool of IPC connections. Whenever it needs to make an IPC call, it first looks in the pool: if there is no connection to the target, a new one is created and added to the pool; otherwise the existing connection is reused.
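
For illustration, here is a minimal get-or-create pool in the same spirit. The names and structure are illustrative, not the actual org.apache.hadoop.ipc.Client code; the point is only that a cached connection is handed back even if the remote datanode has crashed and restarted since the connection was made, and the staleness surfaces only when the next call on it fails.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

// Illustrative get-or-create connection pool (not Hadoop's real IPC client).
public class ConnectionPool {
    private final Map<InetSocketAddress, Socket> pool = new HashMap<>();

    // Returns the cached socket if one exists, otherwise dials a new one.
    // A cached socket is returned even if the remote peer has since crashed
    // and restarted -- the caller only finds out when a call on it fails.
    public synchronized Socket getConnection(InetSocketAddress addr) throws IOException {
        Socket s = pool.get(addr);
        if (s == null) {
            s = new Socket(addr.getAddress(), addr.getPort());
            pool.put(addr, s);
        }
        return s;
    }

    // A caller would need something like this to recover from a stale entry,
    // but the code path described in this report never evicts and retries.
    public synchronized void evict(InetSocketAddress addr) {
        Socket s = pool.remove(addr);
        if (s != null) {
            try { s.close(); } catch (IOException ignored) { }
        }
    }
}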
Suppose the append pipeline contains dn1, dn2, and dn3, with dn1 as the primary. After the client successfully appends to block X, dn2 crashes and restarts. The client then writes a new block Y to dn1, dn2, and dn3, and the write succeeds. Next, the client starts appending to block Y by first calling dn1.recoverBlock(). Dn1 creates a proxy for each datanode in the pipeline (in order to make RPC calls such as getMetadataInfo() or updateBlock()). However, because dn2 has just crashed and restarted, the connection to dn2 cached in dn1's pool is stale. Dn1 uses this stale connection to call dn2 and gets an exception. As a result, the append is made only to dn1 and dn3, even though dn2 is alive and the write of block Y to dn2 was successful.
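
The sketch below shows the recovery step as described above. DatanodeProxy, the proxy map, and getMetadataInfo() are stand-ins rather than Hadoop's actual InterDatanodeProtocol API; it only illustrates how one exception on a stale cached connection silently drops a live datanode (dn2) from the recovered pipeline.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the primary datanode's block-recovery loop.
public class RecoverBlockSketch {

    interface DatanodeProxy {
        void getMetadataInfo(String blockId) throws IOException;
    }

    // dn1 contacts every datanode in the pipeline through proxies backed by
    // its cached connection pool. A proxy whose cached connection is stale
    // (dn2 after its restart) throws, and that datanode is dropped from the
    // recovered pipeline even though it is alive and holds a valid replica.
    static List<String> recoverBlock(String blockId,
                                     Map<String, DatanodeProxy> proxies) {
        List<String> survivors = new ArrayList<>();
        for (Map.Entry<String, DatanodeProxy> e : proxies.entrySet()) {
            try {
                e.getValue().getMetadataInfo(blockId); // fails on a stale connection
                survivors.add(e.getKey());
            } catch (IOException ex) {
                // dn2 lands here: stale connection => exception => excluded,
                // so the append proceeds with dn1 and dn3 only.
            }
        }
        return survivors;
    }
}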

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)
