[ https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662144#comment-13662144 ]

Todd Lipcon commented on HDFS-3875:
-----------------------------------

Sorry it took me some time to get to this. A couple of small questions below:

{code}
+              // Wait until the responder sends back the response
+              // and interrupt this thread.
+              Thread.sleep(3000);
{code}
Can you explain this sleep here a little further? Is the assumption that the 
responder will come back and interrupt the streamer? Why do we need to wait 
instead of just bailing out immediately with the IOE? Will this cause a 
3-second delay in re-establishing the pipeline?
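For context, here is a self-contained sketch of the handshake I understand the patch to be attempting (class and method names are mine, not from the patch): the streamer parks in a bounded sleep so the responder thread can interrupt it and deliver the real terminating exception, instead of the streamer racing ahead with a generic IOE.

```java
// Hypothetical reduction of the streamer/responder interplay -- not HDFS code.
public class SleepVsInterrupt {
  // Returns true if the sleeper was interrupted before the timeout elapsed.
  public static boolean sleepUntilInterrupted(long timeoutMs) {
    try {
      // Stand-in for the patch's Thread.sleep(3000): open a window in
      // which the responder can interrupt this thread.
      Thread.sleep(timeoutMs);
    } catch (InterruptedException ie) {
      // The responder got here first; bail out with its diagnosis.
      return true;
    }
    // Nobody interrupted us: the full delay was paid before giving up.
    return false;
  }

  public static void main(String[] args) throws Exception {
    Thread streamer = new Thread(() ->
        System.out.println(sleepUntilInterrupted(3000) ? "interrupted" : "timed out"));
    streamer.start();
    Thread.sleep(100);     // the "responder" finishes its work quickly...
    streamer.interrupt();  // ...and interrupts the streamer
    streamer.join();       // prints "interrupted"
  }
}
```

Note the worst case in this sketch: if the responder never interrupts, the streamer eats the full timeout, which is exactly the 3-second pipeline-recovery delay asked about above.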

----
{code}
+        // If the mirror has reported that it received a corrupt packet,
+        // do self-destruct to mark myself bad, instead of making the 
+        // mirror node bad. The mirror is guaranteed to be good without
+        // corrupt data on disk.
{code}

What if the issue is on the receiving NIC of the downstream node? In this case, 
it would be kept around in the next pipeline and likely cause an exception 
again, right?
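To make the concern concrete, here is a hypothetical reduction of the blame decision as I read the patch (names and signature are illustrative, not from the code):

```java
// Hypothetical sketch of which pipeline node gets excluded on an error.
public class PipelineBlame {
  /**
   * @param myIndex this datanode's position in the write pipeline
   * @param mirrorReportedChecksumError true if the downstream (mirror)
   *        node reported a corrupt packet
   * @return pipeline index of the node to exclude when rebuilding
   */
  public static int nodeToExclude(int myIndex,
                                  boolean mirrorReportedChecksumError) {
    if (mirrorReportedChecksumError) {
      // Patch behavior: self-destruct, on the theory that the mirror's
      // on-disk data is clean. But if the corruption happened on the
      // mirror's receiving NIC, this keeps the faulty mirror in the
      // next pipeline, where it will likely fail again.
      return myIndex;
    }
    // Ordinary downstream failure: drop the mirror as before.
    return myIndex + 1;
  }
}
```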

----
{code}
+      // corrupt the date for testing.
{code}
typo: date
                
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
>                 Key: HDFS-3875
>                 URL: https://issues.apache.org/jira/browse/HDFS-3875
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: hdfs-3875.branch-0.23.no.test.patch.txt, 
> hdfs-3875.branch-0.23.patch.txt, hdfs-3875.branch-0.23.patch.txt, 
> hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.patch.txt, 
> hdfs-3875.patch.txt, hdfs-3875.trunk.no.test.patch.txt, 
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt, 
> hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt, 
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is 
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it 
> received from the first node. We don't know if the client sent a bad 
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it 
> threw an exception. The pipeline started up again with only one replica (the 
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and 
> unrecoverable since it is the only replica
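The quoted failure mode boils down to a per-chunk CRC mismatch at the second datanode (HDFS checksums each 512-byte chunk with CRC32 by default). A minimal sketch of the detection, with illustrative names:

```java
import java.util.zip.CRC32;

public class ChecksumDemo {
  // CRC32 over one chunk, as the client and each datanode compute it.
  public static long crc(byte[] data) {
    CRC32 c = new CRC32();
    c.update(data, 0, data.length);
    return c.getValue();
  }

  public static void main(String[] args) {
    byte[] chunk = new byte[512];           // one checksum chunk
    java.util.Arrays.fill(chunk, (byte) 7);
    long sent = crc(chunk);                 // checksum the client sends

    chunk[100] ^= 0x01;                     // single bit flip in transit
    long received = crc(chunk);             // recomputed by node 2

    // This mismatch is what node 2 throws on. With replication 2 and
    // node 2 ejected, node 1's possibly-corrupt copy is the only replica.
    System.out.println(sent != received);   // prints "true"
  }
}
```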
