Stale connection makes node miss append
---------------------------------------
Key: HDFS-1224
URL: https://issues.apache.org/jira/browse/HDFS-1224
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Thanh Do

- Summary: if a datanode crashes and restarts, it may miss an append.

- Setup:
+ available datanodes = 3
+ replica = 3
+ disks / datanode = 1
+ failures = 1
+ failure type = crash
+ when/where failure happens = after the first append succeeds

- Details:
Each datanode maintains a pool of IPC connections. Whenever it wants to make an IPC call, it first looks in the pool: if the connection is not there, it is created and put into the pool; otherwise the existing connection is reused.

Suppose the append pipeline contains dn1, dn2, and dn3, with dn1 as the primary. After the client appends to block X successfully, dn2 crashes and restarts. The client then writes a new block Y to dn1, dn2, and dn3, and the write succeeds. Next, the client starts appending to block Y. It first calls dn1.recoverBlock(). Dn1 creates a proxy for each datanode in the pipeline (in order to make RPC calls such as getMetadataInfo() or updateBlock()). However, because dn2 has just crashed and restarted, the connection to dn2 in dn1's pool has become stale. Dn1 uses this stale connection to call dn2 and gets an exception. As a result, the append is made only to dn1 and dn3, even though dn2 is alive and the write of block Y to dn2 was successful.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and Haryadi Gunawi (hary...@eecs.berkeley.edu)
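
To make the failure pattern concrete, here is a minimal Java sketch of the scenario, assuming hypothetical ConnectionPool and DatanodeProxy classes that stand in for the datanode's IPC connection cache and inter-datanode RPC proxies; the names, addresses, and getMetadataInfo() stub below are illustrative only and are not the actual Hadoop classes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleConnectionSketch {

    // Fake proxy to a remote datanode; becomes stale when the peer restarts.
    static class DatanodeProxy {
        final String addr;
        boolean stale = false;              // set when the remote side restarts

        DatanodeProxy(String addr) { this.addr = addr; }

        void getMetadataInfo() throws IOException {
            if (stale) {
                throw new IOException("Connection reset: " + addr + " restarted");
            }
        }
    }

    // Per-datanode pool of IPC connections, keyed by remote address.
    // A cached entry is never revalidated, so a peer restart leaves a
    // stale connection in the pool.
    static class ConnectionPool {
        private final Map<String, DatanodeProxy> pool = new HashMap<>();

        DatanodeProxy getProxy(String addr) {
            return pool.computeIfAbsent(addr, DatanodeProxy::new);
        }
    }

    public static void main(String[] args) {
        ConnectionPool dn1Pool = new ConnectionPool();

        // Block X pipeline: dn1 (primary) talks to dn2 and dn3, caching connections.
        List<String> pipeline = List.of("dn2:50020", "dn3:50020");
        for (String dn : pipeline) {
            dn1Pool.getProxy(dn);            // connections cached in dn1's pool
        }

        // dn2 crashes and restarts: dn1's cached connection to it is now stale.
        dn1Pool.getProxy("dn2:50020").stale = true;

        // recoverBlock for block Y: dn1 contacts each datanode in the pipeline.
        // The stale connection to dn2 throws, so dn2 is dropped from the append
        // pipeline even though it is alive and holds a good replica of block Y.
        List<String> survivors = new ArrayList<>();
        for (String dn : pipeline) {
            try {
                dn1Pool.getProxy(dn).getMetadataInfo();
                survivors.add(dn);
            } catch (IOException e) {
                System.out.println("Excluding " + dn + ": " + e.getMessage());
            }
        }
        System.out.println("Append pipeline after recovery: dn1 + " + survivors);
    }
}

Running the sketch prints that dn2 is excluded while dn3 survives, mirroring how the append ends up on dn1 and dn3 only; reconnecting (or evicting the cached entry) on failure instead of trusting the pooled connection would avoid the dropped replica.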