[ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445385#comment-13445385 ]
Todd Lipcon commented on HDFS-3875:
-----------------------------------

Here's the recovery from the perspective of the NN:

{code}
2012-08-28 19:16:33,532 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(block=BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581786, newGenerationStamp=140581806, newLength=44281856, newNodes=[172.29.97.219:50010], clientNam
2012-08-28 19:16:33,597 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581786) successfully to BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581806
{code}

Here's the recovery from the perspective of the middle node:

{code}
2012-08-28 19:16:33,531 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering replica ReplicaBeingWritten, blk_2632740624757457378_140581786, RBW
  getNumBytes()     = 44867072
  getBytesOnDisk()  = 44867072
  getVisibleLength()= 44281856
  getVolume()       = /data/2/dfs/dn/current
  getBlockFile()    = /data/2/dfs/dn/current/BP-1507505631-172.29.97.196-1337120439433/current/rbw/blk_2632740624757457378
  bytesAcked=44281856
  bytesOnDisk=44867072
{code}

and then the later checksum exception from the block scanner:

{code}
2012-08-28 19:23:59,275 WARN org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Second Verification failed for BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581806
org.apache.hadoop.fs.ChecksumException: Checksum failed at 44217344
{code}

Interestingly, the offset of the checksum exception noticed by the block scanner is less than the "acked length" seen at recovery time. On the node in question I see a fair number of odd errors (page allocation failures, etc.) in the kernel log, so my guess is that the machine is borked and was silently corrupting memory in the middle of the pipeline.
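To make the relationship between those numbers concrete, here is a minimal sketch (not Hadoop code; the class and variable names are illustrative). It assumes the usual HDFS default of 512 bytes per checksum chunk, and plugs in the offsets and lengths from the logs above to show that the failed chunk sits entirely inside the region the client believed was acked:

{code}
// Sketch: relate the scanner's ChecksumException offset to the
// replica state seen at recovery time. Values come from the logs above;
// the 512-byte chunk size is an assumption (the common HDFS default).
public class AckedLengthCheck {
    static final long BYTES_PER_CHECKSUM = 512L;

    public static void main(String[] args) {
        long bytesAcked    = 44281856L; // acked by the pipeline at recovery time
        long bytesOnDisk   = 44867072L; // written to disk but not all acked
        long failureOffset = 44217344L; // offset from the ChecksumException

        // Checksum failures are reported at chunk boundaries.
        long failedChunk    = failureOffset / BYTES_PER_CHECKSUM;
        long lastAckedChunk = (bytesAcked - 1) / BYTES_PER_CHECKSUM;

        // The corrupt chunk lies inside the acked region, so data the
        // client considered durably written is in fact unreadable.
        System.out.println(failedChunk <= lastAckedChunk);  // prints true
        System.out.println(bytesAcked - failureOffset);     // prints 64512
    }
}
{code}

So roughly 64 KB of acked data (126 chunks) falls at or beyond the corrupt offset, which is why the corruption is not just in the unacked tail that recovery would have truncated anyway.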
Hence, because the recovery kicked out the wrong node, it ended up persisting a corrupt version of the block instead of a good one.

> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node, hdfs client
>    Affects Versions: 2.2.0-alpha
>            Reporter: Todd Lipcon
>
> We saw this issue with one block in a large test cluster. The client is storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it received from the first node. We don't know if the client sent a bad checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it threw an exception. The pipeline started up again with only one replica (the first node in the pipeline).
> - this replica was later determined to be corrupt by the block scanner, and unrecoverable since it is the only replica
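The failure mode quoted above can be modeled as a toy simulation (not HDFS code; all names here are illustrative). The point it captures is that the node which *detects* a checksum mismatch is the one ejected from the pipeline, even when the corruption originated upstream, so the surviving replica can be the corrupt one:

{code}
import java.util.ArrayList;
import java.util.List;

// Toy model of the eviction behavior described in this issue.
public class PipelineEvictionSketch {
    record Node(String name, boolean dataCorrupt) {}

    // A downstream node verifies checksums on data it receives. If the
    // upstream copy is corrupt, the downstream node throws, and the current
    // behavior removes *that* node from the pipeline.
    static List<Node> recover(List<Node> pipeline) {
        List<Node> survivors = new ArrayList<>();
        for (int i = 0; i < pipeline.size(); i++) {
            boolean detectsError = i > 0 && pipeline.get(i - 1).dataCorrupt();
            if (!detectsError) survivors.add(pipeline.get(i));
        }
        return survivors;
    }

    public static void main(String[] args) {
        // dn1 silently corrupted the data (e.g. bad RAM); dn2's checksum
        // verification fails, so dn2 is the node that gets kicked out.
        List<Node> pipeline = List.of(new Node("dn1", true), new Node("dn2", false));
        System.out.println(recover(pipeline)); // only dn1 survives, holding the corrupt replica
    }
}
{code}

A recovery that instead re-verified checksums on each surviving replica before committing the new generation stamp would have preferred dn2's copy (or at least flagged dn1's) in this scenario.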