[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2014-08-24 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108798#comment-14108798
 ] 

Yongjun Zhang commented on HDFS-3875:
-

Hi [~kihwal], I filed HDFS-6937 to track a similar issue I'm seeing, so we 
can continue the discussion there. Thanks.




> Issue handling checksum errors in write pipeline
> 
>
> Key: HDFS-3875
> URL: https://issues.apache.org/jira/browse/HDFS-3875
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs-client
>Affects Versions: 2.0.2-alpha
>Reporter: Todd Lipcon
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 3.0.0, 2.1.0-beta, 0.23.8
>
> Attachments: hdfs-3875-wip.patch, 
> hdfs-3875.branch-0.23.no.test.patch.txt, hdfs-3875.branch-0.23.patch.txt, 
> hdfs-3875.branch-0.23.patch.txt, hdfs-3875.branch-0.23.with.test.patch.txt, 
> hdfs-3875.branch-2.patch.txt, hdfs-3875.patch.txt, hdfs-3875.patch.txt, 
> hdfs-3875.patch.txt, hdfs-3875.trunk.no.test.patch.txt, 
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt, 
> hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt, 
> hdfs-3875.trunk.with.test.patch.txt
>
>
> We saw this issue with one block in a large test cluster. The client is 
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it 
> received from the first node. We don't know if the client sent a bad 
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it 
> threw an exception. The pipeline started up again with only one replica (the 
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and 
> unrecoverable since it is the only replica





[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2014-08-22 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107107#comment-14107107
 ] 

Yongjun Zhang commented on HDFS-3875:
-

Hi [~kihwal],

Thanks for your earlier work on this issue. We are seeing a similar problem 
even though we have this patch. One question about the patch:

Assume we have a pipeline of three DNs: DN1, DN2, and DN3. DN3 detects a 
checksum error and reports back to DN2. DN2 decides to truncate its replica 
to the acknowledged size by calling {{static private void truncateBlock(File 
blockFile, File metaFile, ...)}}, which reads the data from the local replica 
file, calculates the checksum for the length being truncated to, and writes 
the checksum back to the meta file.

My question is, when writing the checksum back to the meta file, this method 
doesn't check the newly computed checksum against one computed earlier to see 
if they match. However, DN3 does check its computed checksum against the 
checksum sent from upstream in the pipeline when reporting the checksum 
mismatch. If DN2 gets something wrong in the truncateBlock method (say, for 
some reason the existing data is corrupted), then DN2 ends up with an 
incorrect checksum and is not aware of it. Later, when we try to recover the 
pipeline and use the DN2 replica as the source, any new DN that receives data 
from DN2 will always find a checksum error.
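
To make the concern concrete, here is a minimal standalone sketch of the 
truncate-and-recompute step, assuming 512-byte checksum chunks and CRC32; the 
names and meta-file layout are illustrative, not the actual FsDatasetImpl code:

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.zip.CRC32;

// Hedged sketch of truncateBlock(): truncate the replica to newLen and
// recompute the checksum of the last (now partial) chunk from the data
// already on disk. Chunk size, header size, and layout are assumptions.
class TruncateSketch {
  static final int BYTES_PER_CHECKSUM = 512; // assumed chunk size
  static final int CHECKSUM_SIZE = 4;        // CRC32 stored as 4 bytes
  static final int META_HEADER_SIZE = 7;     // assumed meta-file header

  static void truncateBlock(RandomAccessFile blockFile,
                            RandomAccessFile metaFile,
                            long newLen) throws IOException {
    long fullChunks = newLen / BYTES_PER_CHECKSUM;
    int lastChunkLen = (int) (newLen % BYTES_PER_CHECKSUM);
    byte[] sum = null;

    if (lastChunkLen > 0) {
      // Re-read the tail of the replica and recompute its checksum.
      // Nothing compares this result against the checksum already in the
      // meta file (the gap described above), so if the on-disk data is
      // corrupt, the new checksum will faithfully match the corrupt bytes.
      byte[] buf = new byte[lastChunkLen];
      blockFile.seek(fullChunks * BYTES_PER_CHECKSUM);
      blockFile.readFully(buf);
      CRC32 crc = new CRC32();
      crc.update(buf, 0, lastChunkLen);
      sum = new byte[CHECKSUM_SIZE];
      long v = crc.getValue();
      for (int i = 0; i < CHECKSUM_SIZE; i++) {
        sum[i] = (byte) (v >>> ((CHECKSUM_SIZE - 1 - i) * 8));
      }
    }

    blockFile.setLength(newLen);
    metaFile.setLength(META_HEADER_SIZE + fullChunks * CHECKSUM_SIZE
        + (sum != null ? CHECKSUM_SIZE : 0));
    if (sum != null) {
      metaFile.seek(META_HEADER_SIZE + fullChunks * CHECKSUM_SIZE);
      metaFile.write(sum);
    }
  }
}
{code}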

This is my speculation so far. Do you think this is a possibility? 

Thanks a lot.





[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664125#comment-13664125
 ] 

Hudson commented on HDFS-3875:
--

Integrated in Hadoop-Mapreduce-trunk #1433 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1433/])
HDFS-3875. Issue handling checksum errors in write pipeline. Contributed by 
Kihwal Lee. (Revision 1484808)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1484808
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664074#comment-13664074
 ] 

Hudson commented on HDFS-3875:
--

Integrated in Hadoop-Hdfs-trunk #1406 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1406/])
HDFS-3875. Issue handling checksum errors in write pipeline. Contributed by 
Kihwal Lee. (Revision 1484808)

 Result = FAILURE
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1484808
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664056#comment-13664056
 ] 

Hudson commented on HDFS-3875:
--

Integrated in Hadoop-Hdfs-0.23-Build #615 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/615/])
HDFS-3875. Issue handling checksum errors in write pipeline. Contributed by 
Kihwal Lee. (Revision 1484811)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1484811
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/FSDataset.java
* /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663996#comment-13663996
 ] 

Hudson commented on HDFS-3875:
--

Integrated in Hadoop-Yarn-trunk #217 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/217/])
HDFS-3875. Issue handling checksum errors in write pipeline. Contributed by 
Kihwal Lee. (Revision 1484808)

 Result = FAILURE
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1484808
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-21 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662982#comment-13662982
 ] 

Hudson commented on HDFS-3875:
--

Integrated in Hadoop-trunk-Commit #3771 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/3771/])
HDFS-3875. Issue handling checksum errors in write pipeline. Contributed by 
Kihwal Lee. (Revision 1484808)

 Result = SUCCESS
kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1484808
Files : 
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClientFaultInjector.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestCrcCorruption.java




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662436#comment-13662436
 ] 

Todd Lipcon commented on HDFS-3875:
---

OK, thanks for the explanations. +1 from me.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662358#comment-13662358
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12583880/hdfs-3875.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4417//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4417//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662214#comment-13662214
 ] 

Kihwal Lee commented on HDFS-3875:
--

bq. Can you explain this sleep here a little further? The assumption is that 
the responder will come back and interrupt the streamer? Why do we need to wait 
instead of just bailing out immediately with the IOE? Will this cause a 
3-second delay in re-establishing the pipeline again?

This gives the responder time to send the checksum error back upstream, so that 
the upstream node can blow up and exclude itself from the pipeline. This may 
not always be ideal since there can be many different failure modes, but if 
anything needs to be eliminated without knowing the cause, the source seems to 
be a better candidate than the sink, which actually verifies the checksum.

Unless there is a network issue in sending ACKs, the responder will immediately 
terminate and interrupt the main writer thread, so the thread won't stay up. 
Even if the thread stays up for some reason, recoverRbw() during pipeline 
recovery will interrupt it, so there won't be a 3-second delay.
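
As a small sketch of that wait-for-interrupt pattern (the method name is 
assumed; simplified from the actual DFSOutputStream change):

{code}
// Hedged sketch: on a locally detected checksum error, give the responder
// up to 3 seconds to relay the downstream CHECKSUM_ERROR ack and interrupt
// this thread; only if nothing interrupts us is the IOE surfaced.
private void waitForResponder(IOException ioe) throws IOException {
  try {
    Thread.sleep(3000);
  } catch (InterruptedException ie) {
    // Interrupted by the responder (or by recoverRbw() during pipeline
    // recovery): proceed straight to pipeline recovery without delay.
    return;
  }
  // Not interrupted within 3 seconds: bail out with the original error.
  throw ioe;
}
{code}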

If the last node in a pipeline has a faulty NIC, the two upstream nodes will be 
eliminated (in the 3-replica case) and, after a new DN is added to the end of 
the pipeline, the faulty node will be removed. Issues on intermediate nodes 
will be handled in fewer iterations. The worst case is when the data is 
corrupt in DFSOutputStream itself, which will only be detected after hitting 
the maximum number of retries; there is no recovery in that case.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662144#comment-13662144
 ] 

Todd Lipcon commented on HDFS-3875:
---

Sorry it took me some time to get to this. A couple of small questions below:

{code}
+  // Wait until the responder sends back the response
+  // and interrupt this thread.
+  Thread.sleep(3000);
{code}
Can you explain this sleep here a little further? The assumption is that the 
responder will come back and interrupt the streamer? Why do we need to wait 
instead of just bailing out immediately with the IOE? Will this cause a 
3-second delay in re-establishing the pipeline again?


{code}
+// If the mirror has reported that it received a corrupt packet,
+// do self-destruct to mark myself bad, instead of making the 
+// mirror node bad. The mirror is guaranteed to be good without
+// corrupt data on disk.
{code}

What if the issue is on the receiving NIC of the downstream node? In this case, 
it would be kept around in the next pipeline and likely cause an exception 
again, right?


{code}
+  // corrupt the date for testing.
{code}
typo: date



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662120#comment-13662120
 ] 

Suresh Srinivas commented on HDFS-3875:
---

+1 for the patch






[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662098#comment-13662098
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12583835/hdfs-3875.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4416//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4416//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-20 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662016#comment-13662016
 ] 

Kihwal Lee commented on HDFS-3875:
--

Thanks for the review, Suresh. The latest patch addresses all review comments.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-17 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661161#comment-13661161
 ] 

Suresh Srinivas commented on HDFS-3875:
---

[~kihwal] the new solution looks much better. Nice work!

Some minor comments. +1 with those addressed:
# DFSOutputStream.java
#* Initialize lastAckedSeqnoBeforeFailure to an appropriate value; 
lastAckedSeqNo is initialized to -1.
#* Change the info log to a warn? Also, print "Already retried 5 times" 
instead of "Already tried 5 times", given that total attempts are 6 and 
retries are 5.
# DFSClientFaultInjector#uncorruptPacket() - does it need to throw IOException?




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-17 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661050#comment-13661050
 ] 

Suresh Srinivas commented on HDFS-3875:
---

Sorry, I have been meaning to look at this but have not been able to spend the 
time. I will review before the end of the day.






[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661044#comment-13661044
 ] 

Thomas Graves commented on HDFS-3875:
-

Suresh, Todd, any comments on the latest patch? I am hoping to get this 
committed soon for 0.23.8.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-05-15 Thread Lohit Vijayarenu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659229#comment-13659229
 ] 

Lohit Vijayarenu commented on HDFS-3875:


Could this be targeted for the 2.0.5 release? We are seeing this exact same 
issue on one of our clusters, running the hadoop-2.0.3-alpha release.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-03-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595228#comment-13595228
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12572398/hdfs-3875.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 tests included appear to have a timeout.{color}

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/4044//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4044//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-03-06 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13595134#comment-13595134
 ] 

Kihwal Lee commented on HDFS-3875:
--

The new patch forces datanodes to truncate the block being recovered to the 
acked length. Since nodes in the middle of a write pipeline do not perform 
checksum verification, and write data to disk before getting an ack back from 
downstream, the unacked portion can contain corrupt data. If the last node 
simply disappears before reporting a checksum error upstream, the current 
pipeline recovery mechanism can overlook the corruption in the written data.

Since this truncation discards the potentially corrupt portion of the block, 
we do not need any explicit checksum re-verification on a checksum error.

Another new feature in the latest patch is to terminate the HDFS client when 
pipeline recovery is attempted more than 5 times while writing the same data 
packet; this likely indicates the source data is corrupt. In a very small 
cluster, clients may run out of datanodes and fail before retrying 5 times.
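
As a rough sketch of the client-side give-up logic just described (the field 
and method names are assumptions based on this discussion, not necessarily the 
actual DFSOutputStream members):

{code}
// Hedged sketch: if pipeline recovery happens again without any new packet
// having been acked, the client is resending the same packet; after 5 such
// recoveries, assume the source data is corrupt and stop retrying.
private long lastAckedSeqnoBeforeFailure = -1;
private int pipelineRecoveryCount = 0;
private static final int MAX_RECOVERIES_PER_PACKET = 5; // assumed limit

private void checkRepeatedFailure(long lastAckedSeqno) throws IOException {
  if (lastAckedSeqno == lastAckedSeqnoBeforeFailure) {
    // No progress since the previous failure: same packet again.
    if (++pipelineRecoveryCount > MAX_RECOVERIES_PER_PACKET) {
      throw new IOException("Already retried " + MAX_RECOVERIES_PER_PACKET
          + " times for the same packet; source data is likely corrupt");
    }
  } else {
    // Progress was made; reset the counter for the new packet.
    lastAckedSeqnoBeforeFailure = lastAckedSeqno;
    pipelineRecoveryCount = 1;
  }
}
{code}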



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-03-05 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593583#comment-13593583
 ] 

Kihwal Lee commented on HDFS-3875:
--

Sorry for getting back to this so late, and thank you Suresh for the feedback.

It made me think more about the non-leaf nodes in a pipeline. If the leaf node 
disappears from the pipeline before reporting a checksum error and 
recoverRbw() is done, we can end up with a latent checksum error in the block, 
because datanodes won't discard already-written data on pipeline recovery. It 
looks like we have to make recoverRbw() truncate blocks to the acked size to 
be really safe.

Also, the client should give up after a certain number of pipeline 
reconstructions for the same block.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-14 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553370#comment-13553370
 ] 

Suresh Srinivas commented on HDFS-3875:
---

Had an offline conversation with Kihwal. Here is one of the above scenarios in 
more detail (thanks Kihwal for explaining the current behavior).

Client(not corrupt) d1(not corrupt) d2(not corrupt) d3(corrupt), where d3 for 
some reason sees only corrupt data.
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2. The packet 
is not written to the disk on d3.
* d2 does not verify the checksum, so its status is SUCCESS, but it receives 
CHECKSUM_ERROR and shuts down.
* d1 does not verify the checksum. Its status is {SUCCESS, MIRROR_ERROR}.
* The client establishes the pipeline with d1 and d3 and sends the packet again.
* d3 detects the corruption again and reports a CHECKSUM_ERROR ACK to d1. The 
packet is not written to the disk on d3.
* d1 does not verify the checksum, so its status is SUCCESS, but it receives 
CHECKSUM_ERROR and shuts down.
* The client establishes the pipeline with only d3 and sends the packet again.
* d3 detects the corruption again and reports a CHECKSUM_ERROR ACK. The packet 
is not written to the disk on d3.
* The client fails to write the packet and abandons writing the file?

The current behavior repeatedly keeps the node that sees the corruption (or is 
corrupting the data) in the recovered pipeline (d3 above), while the nodes that 
did not see corruption get dropped. Having a datanode verify the checksum 
itself when a downstream datanode reports a checksum error should avoid this: 
the recovered pipeline would then extend exactly up to the point of corruption 
(a sketch of this check follows).

Kihwal, add comments if I missed something.
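
A minimal sketch of that check, assuming an illustrative verifyLocalChunks() 
helper; these are not the real BlockReceiver names:

{code}
import java.io.IOException;

// Illustrative sketch of the proposed check, not the actual BlockReceiver.
class MirrorErrorCheckSketch {
  enum Status { SUCCESS, ERROR_CHECKSUM }

  // Assumed helper: throws IOException if the locally stored chunks fail CRC.
  void verifyLocalChunks() throws IOException { }

  // Decide what to ack upstream when the mirror reports a checksum error.
  Status statusToAck(Status mirrorStatus) {
    if (mirrorStatus != Status.ERROR_CHECKSUM) {
      return Status.SUCCESS;
    }
    try {
      verifyLocalChunks();
      return Status.SUCCESS;        // our copy is clean: the corruption is
                                    // downstream, so keep this node
    } catch (IOException e) {
      return Status.ERROR_CHECKSUM; // corruption is at or before this node
    }
  }
}
{code}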



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-14 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553193#comment-13553193
 ] 

Suresh Srinivas commented on HDFS-3875:
---

Kihwal, here is how I understand the new behavior. Correct me if I am wrong. In 
the following scenarios, the client is writing in a pipeline to datanodes d1, 
d2 and d3. At each point in the pipeline the data is marked as corrupt or not.

client(not corrupt) d1(not corrupt) d2(not corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, so its status is SUCCESS, but it receives 
CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.
Only d1 is considered a valid copy, even though d2 may not be corrupt.

client(not corrupt) d1(not corrupt) d2(corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, so its status is SUCCESS, but it receives 
CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.
Only d1 is considered a valid copy.

client(not corrupt) d1(corrupt) d2(corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, so its status is SUCCESS, but it receives 
CHECKSUM_ERROR and shuts down
* _d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR._
d1 is still considered a valid copy. Is this correct?

client(corrupt) d1(corrupt) d2(corrupt) d3(corrupt)
* d3 detects the corruption and reports a CHECKSUM_ERROR ACK to d2
* d2 does not verify the checksum, so its status is SUCCESS, but it receives 
CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.
d1 is still considered a valid copy.

In all the above cases, whether a node detects the checksum error itself or a 
downstream node detects it, the result appears the same to the upstream nodes 
(as a mirror error). Is that what you intended?




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-14 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553148#comment-13553148
 ] 

Suresh Srinivas commented on HDFS-3875:
---

Arun, I am removing the Blocker priority based on the input from Todd and 
Nicholas in the above comments. However, I have started reviewing this, so 
let's try to get it into 2.0.3-alpha.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-08 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547634#comment-13547634
 ] 

Arun C Murthy commented on HDFS-3875:
-

Thanks [~sureshms]! I'm looking to wrap up 2.0.3-alpha asap.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-08 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547627#comment-13547627
 ] 

Suresh Srinivas commented on HDFS-3875:
---

Arun, I will review this in the next two days.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-08 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547616#comment-13547616
 ] 

Arun C Murthy commented on HDFS-3875:
-

Kihwal? Todd?



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2013-01-03 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543140#comment-13543140
 ] 

Arun C Murthy commented on HDFS-3875:
-

Any update on this? Thanks.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528417#comment-13528417
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12560280/hdfs-3875.trunk.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
  
org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3630//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3630//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-06 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526101#comment-13526101
 ] 

Kihwal Lee commented on HDFS-3875:
--

The test failures are not caused by this patch; see HDFS-4282 and HDFS-3806.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526068#comment-13526068
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12559661/hdfs-3875.trunk.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  
org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
  org.apache.hadoop.hdfs.server.namenode.TestEditLog

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3616//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3616//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-03 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508980#comment-13508980
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3875:
--

> The exception caught is not used below. ...

Let's also change "boolean checksumError" to "IOException checksumException" so 
that it can record the actual exception.  We should also log it.
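
A minimal sketch of that change, with illustrative names (this is not the 
actual patch):

{code}
import java.io.IOException;
import java.nio.ByteBuffer;

// Illustrative sketch: record the caught exception instead of a boolean flag.
class ChecksumErrorFieldSketch {
  private IOException checksumException = null; // was: boolean checksumError

  // Assumed stand-in for BlockReceiver.verifyChunks(): throws on mismatch.
  void verifyChunks(ByteBuffer data, ByteBuffer sums) throws IOException { }

  int receivePacket(ByteBuffer data, ByteBuffer sums) {
    try {
      verifyChunks(data, sums);
    } catch (IOException e) {
      checksumException = e; // keep the actual cause so it can be logged
      return -1;             // and rethrown by receiveBlock() later
    }
    return data.remaining(); // normal path
  }
}
{code}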



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-02 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508337#comment-13508337
 ] 

Suresh Srinivas commented on HDFS-3875:
---

It took me a lot of time to review this code. The BlockReceiver code is poorly 
documented; one of these days I will add some javadoc to make understanding and 
reviewing it easier :-).

Why do you have two variants of the patch, with and without tests?

Comments for the patch with no tests:
# The comment on #checkSumError could read: "Indicates a checksum error. When 
set, block receiving and writing is stopped." It is better to initialize it to 
false at its declaration than in the constructor.
# #shouldVerifyChecksum() - could we describe the conditions under which the 
checksum needs to be verified in the javadoc? Along the lines of: "Checksum is 
verified in the following cases - 1. if the datanode is the last one in the 
pipeline with no mirrorOut. 2. If the block is being written by another 
datanode for replication. 3. If checksum translation is needed." There is an 
equivalent comment where shouldVerifyChecksum() is presently called; that 
comment can be removed. (A sketch follows this list.)
# receivePacket() previously returned -1 when a block was completely written, 
or else the length of the packet received. Now it also returns -1 on a checksum 
error. It would be good to add a javadoc to this method describing when it 
returns -1.
# receivePacket() - do you think it is a good idea to print warn/info-level 
logs when returning -1 on a checksum error, i.e. when checksumError is set to 
true? This will help debugging these issues on each datanode in the pipeline 
using the logs. Given that these are rare errors, it should not take up too 
much log space.
# The comment "If there is a checksum error, responder will shut it down" - 
can you please clarify it to say "the responder will shut itself down and 
interrupt the receiver."
# In #enqueue() - why is the checksumError check inside the synchronized 
block? It can be outside, right?
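
For item 2, a sketch of what the javadoc and conditions could look like; the 
field names are illustrative stand-ins, not the actual BlockReceiver:

{code}
import java.io.OutputStream;

// Illustrative sketch of the javadoc asked for in item 2 above.
class ChecksumPolicySketch {
  OutputStream mirrorOut;           // null when this node is the pipeline tail
  boolean isDatanode;               // writer is another datanode (replication)
  boolean needsChecksumTranslation; // on-disk checksum type differs from wire

  /**
   * Checksum is verified in the following cases:
   * 1. the datanode is the last one in the pipeline with no mirrorOut,
   * 2. the block is being written by another datanode for replication,
   * 3. checksum translation is needed before writing to disk.
   */
  boolean shouldVerifyChecksum() {
    return mirrorOut == null || isDatanode || needsChecksumTranslation;
  }
}
{code}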




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508186#comment-13508186
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3875:
--

If you have already tested the patch well, I am okay with it.

Suresh, any comment?



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508138#comment-13508138
 ] 

Kihwal Lee commented on HDFS-3875:
--

Also, checksumError is set to true before returning, so receiveBlock() will 
actually end up throwing an exception. I experimented with receiveBlock() 
throwing an exception and also with it simply returning, and both worked fine. 
I thought you were referring to this.

To answer your question better, the occurrence of the checksum error is already 
logged, and the exception thrown in receiveBlock() will clearly show what 
happened. We could catch the IOException in receiveBlock(), check 
checksumError, and rethrow; a sketch follows.
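
A minimal sketch of that catch-check-rethrow, assuming an illustrative 
doReceive() stand-in for the receive loop:

{code}
import java.io.IOException;

// Illustrative sketch of catching, checking checksumError, and rethrowing.
class ReceiveBlockRethrowSketch {
  private volatile boolean checksumError;

  // Assumed stand-in for the packet-receiving loop.
  void doReceive() throws IOException { }

  void receiveBlock() throws IOException {
    try {
      doReceive();
    } catch (IOException e) {
      if (checksumError) {
        // surface the root cause instead of a generic pipeline failure
        throw new IOException("Shutting down writer due to a checksum error", e);
      }
      throw e;
    }
  }
}
{code}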



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508136#comment-13508136
 ] 

Kihwal Lee commented on HDFS-3875:
--

Whether it returns or throws an exception, the result is not very different: 
the packet responder will log the checksum error and initiate the shutdown. If 
an exception is thrown, the datanode ends up logging much more, with multiple 
stack traces. I thought clean termination of the writer (DataXceiver) thread 
was acceptable, since it is a controlled shutdown with a purpose and an 
expected outcome, rather than a panic shutdown. If you think throwing an 
exception makes more sense, I will update the patch.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508134#comment-13508134
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3875:
--

Hi Kihwal,

In a client write pipeline, only the last datanode verifies the checksum. If 
there is a checksum error, we don't know what went wrong: it could be that one 
of the datanodes is faulty or that a network path is faulty. So the client must 
stop, but cannot simply take out a datanode and continue. Do you agree?

In the patch, only the last datanode possibly reports a checksum error. If it 
does, all statuses in the ack become ERROR_CHECKSUM. The approach seems 
reasonable.

Some questions on the patch:
- receivePacket() returns -1 for a checksum error. Why not throw an exception? 
Returning -1 should mean a normal exit.
- The exception caught is not used below. Should it re-throw the exception?
{code}
+  if (shouldVerifyChecksum()) {
+try {
+  verifyChunks(dataBuf, checksumBuf);
+} catch (IOException e) {
+  // checksum error detected locally. there is no reason to continue.
+  if (responder != null) {
+((PacketResponder) responder.getRunnable()).enqueue(seqno,
+lastPacketInBlock, offsetInBlock,
+Status.ERROR_CHECKSUM);
+  }
+  // return without writing data.
+  checksumError = true;
+  return -1;
+}
{code}




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508057#comment-13508057
 ] 

Suresh Srinivas commented on HDFS-3875:
---

Kihwal, I will review the patch shortly and post comments by the evening.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508041#comment-13508041
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12555643/hdfs-3875.trunk.no.test.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-hdfs-project/hadoop-hdfs.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3587//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3587//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508014#comment-13508014
 ] 

Kihwal Lee commented on HDFS-3875:
--

There is something missing in the latest patch that was in the original, and 
the test failure is caused by it. I will post the updated patch in a moment.




[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508006#comment-13508006
 ] 

Hadoop QA commented on HDFS-3875:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12555626/hdfs-3875.trunk.with.test.patch.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-hdfs-project/hadoop-hdfs:

  org.apache.hadoop.hdfs.TestReplaceDatanodeOnFailure

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3585//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HDFS-Build/3585//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3585//console

This message is automatically generated.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-12-01 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507907#comment-13507907
 ] 

Kihwal Lee commented on HDFS-3875:
--

Again, the test is a bit invasive, as it requires modifying DFSOutputStream. 
Nevertheless, it can effectively emulate data corruption during transmission 
and verify that the patch works; a rough sketch of the kind of injection 
involved follows.
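
As a rough idea of what such injection could look like (a hypothetical helper, 
not the actual test code), one byte of the packet payload can be flipped after 
the client has already computed the checksums:

{code}
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical fault-injection hook, not the actual test code.
class CorruptionInjectorSketch {
  static final AtomicBoolean corruptNextPacket = new AtomicBoolean(false);

  // Called on the packet payload after checksums have been computed,
  // emulating corruption on the wire between two pipeline nodes.
  static void maybeCorrupt(ByteBuffer packetData) {
    if (corruptNextPacket.getAndSet(false) && packetData.remaining() > 0) {
      int pos = packetData.position();
      packetData.put(pos, (byte) (packetData.get(pos) ^ 0xff)); // flip bits
    }
  }
}
{code}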



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-30 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507845#comment-13507845
 ] 

Kihwal Lee commented on HDFS-3875:
--

I originally changed PipelineAck to return errors instead of making the writer 
terminate. It generally worked, but some corner cases required a significant 
change in DFSOutputStream. So I decided to simplify: on a checksum error the 
datanode does not send an ack back, and it terminates as it would on any other 
error. The local ack status is saved in each status-tracking packet (not the 
actual data packet), so that successful acks enqueued before the checksum error 
can still be sent; a sketch of such a queue follows.
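
A minimal sketch of such a status-carrying ack queue; the names are 
illustrative, and the real PacketResponder is considerably more involved:

{code}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch: each queued entry carries its own status, so acks
// for earlier packets are still sent before the responder shuts down.
class AckQueueSketch {
  enum Status { SUCCESS, ERROR_CHECKSUM }

  static final class Pending {
    final long seqno;
    final Status status;
    Pending(long seqno, Status status) { this.seqno = seqno; this.status = status; }
  }

  private final Queue<Pending> ackQueue = new ArrayDeque<>();

  synchronized void enqueue(long seqno, Status status) {
    ackQueue.add(new Pending(seqno, status));
  }

  // Responder loop body: drain in order, stop after sending the error ack.
  synchronized void sendAcks() {
    Pending p;
    while ((p = ackQueue.poll()) != null) {
      send(p);                                  // earlier SUCCESS acks go out
      if (p.status == Status.ERROR_CHECKSUM) {
        break;                                  // then terminate cleanly
      }
    }
  }

  void send(Pending p) { /* assumed: write the ack upstream */ }
}
{code}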



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-28 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506005#comment-13506005
 ] 

Kihwal Lee commented on HDFS-3875:
--

> Is the writer able to close the file successfully?
Yes. In one case the corrupt block ended up in the middle of a file. Of course 
all replicas were corrupt, so when the NN tried to raise the replication 
factor, all replicas got marked corrupt.

I will post my patch for review, not for the precommit build. I ran all tests 
and only testBlockCorruptionRecoveryPolicy2 failed, probably due to the change 
in how corruption recovery works; I haven't debugged it yet. Please take a look 
and see if the approach seems reasonable.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-27 Thread Tsz Wo (Nicholas), SZE (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505151#comment-13505151
 ] 

Tsz Wo (Nicholas), SZE commented on HDFS-3875:
--

> ... Nicholas, any comments on if this applies to old pipeline vs new pipeline?

Both the old and the new pipelines should have a similar problem: when machine 
A sends some data to machine B and it fails, it is generally impossible to 
detect whether A, B, or the network is faulty. Of course, we can detect it in 
some special cases, such as when one of the machines is dead.

> Potential blocker for 2.0.3-alpha.

I would say that this is not a blocker for 2.0.3-alpha since this is not a 
regression.



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-27 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505119#comment-13505119
 ] 

Suresh Srinivas commented on HDFS-3875:
---

bq. while it's a nasty corruption issue, I don't think it's anything new...
I think it's a good idea to keep this as a blocker even if this issue is not a
new one, given it is a corruption issue. Nicholas, any comments on whether this
applies to the old pipeline vs. the new pipeline?



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-27 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504816#comment-13504816
 ] 

Todd Lipcon commented on HDFS-3875:
---

Thanks for looking into this, Kihwal. Your analysis makes sense to me.

Arun - not sure this should be a blocker - while it's a nasty corruption issue, 
I don't think it's anything new. AFAIK the write pipeline has always had this 
issue since the ancient days, right? Or did the HDFS-265 rewrite end up 
changing the order of the ack send and the CRC verification?



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-27 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504625#comment-13504625
 ] 

Kihwal Lee commented on HDFS-3875:
--

The key is to prevent ACKing the corrupt packet. Data corruption can be avoided
with this alone. For better error recovery, datanodes should return a specific
error code to let the client know. {{Status}} already has ERROR_CHECKSUM
defined. I will have a patch ready soon.
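
As a rough sketch of that idea (the class and method names below are
illustrative only, not the actual BlockReceiver or PipelineAck code):

{code}
import java.util.zip.CRC32;

/**
 * Sketch: the tail node verifies the packet checksum BEFORE any ack is
 * sent, and answers ERROR_CHECKSUM instead of SUCCESS on a mismatch.
 */
class TailAckSketch {
  enum Status { SUCCESS, ERROR, ERROR_CHECKSUM }

  Status ackForPacket(byte[] chunk, long expectedCrc) {
    CRC32 crc = new CRC32();
    crc.update(chunk, 0, chunk.length);
    // Never ack SUCCESS for a corrupt packet; a distinct status code lets
    // upstream nodes and the client choose the right recovery action.
    return (crc.getValue() == expectedCrc) ? Status.SUCCESS
                                           : Status.ERROR_CHECKSUM;
  }
}
{code}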



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-26 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504260#comment-13504260
 ] 

Kihwal Lee commented on HDFS-3875:
--

I don't think calling reportBadBlocks() alone does any good. Without the client
knowing the details of a corruption, it won't be able to recover the block
properly. reportBadBlocks() during create is only useful when a corruption is
confined to one replica. If we get the in-line corruption detection and
recovery right, this call will not be needed during write operations.

If the meaning of response in the data packet transfer is to be extended to 
cover packet corruption,

* A tail node should not ACK until the checksum of a packet is verified.
Currently, an ack is enqueued before verifying the checksum, which in the case
of the tail node causes immediate transmission of ACK/SUCCESS.

* When the tail node is dropped from a pipeline, the other nodes should not
simply ack with success, since that would imply the checksum was okay on those
nodes.

* The portions that were ACK'ed with SUCCESS are guaranteed not to be corrupt.
To be precise, there can be corruption on disk due to local issues, but not in
the data each datanode received. I.e., any on-disk corruption must be an
isolated corruption, not a propagated one.

For the second point, we could have datanodes verify checksums when they lose
the mirror node or explicitly get ACK/CORRUPTION. But this can be simplified if
we can guarantee that no ACK/SUCCESS is sent back when a corruption is detected
in the packet or the mirror node is lost. We can just drop that portion of the
data by not ACKing the corrupt packet, or by sending ACK/CORRUPTION back for
it. I think the client will resend the un-ACK'ed packets in this case (see the
sketch below).

The worst case is rewriting some packets, but the advantage is simplicity and
avoiding a checksum verification pass over already-written data.
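
A minimal sketch of the client-side property this relies on (Packet and the
queue handling below are stand-ins for the real DFSPacket/DataStreamer code):

{code}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch: on pipeline failure, every packet that has not been acked is
 * moved from the ack queue back to the front of the data queue, oldest
 * first, and resent. Dropping (not acking) a corrupt packet therefore
 * costs only a rewrite of that packet.
 */
class ResendSketch {
  static class Packet { long seqno; byte[] payload; }

  final Deque<Packet> dataQueue = new ArrayDeque<>();
  final Deque<Packet> ackQueue = new ArrayDeque<>();

  void onPipelineFailure() {
    while (!ackQueue.isEmpty()) {
      // Pull newest-first from the ack queue and push to the front, so
      // the data queue ends up ordered oldest-first for the resend.
      dataQueue.addFirst(ackQueue.pollLast());
    }
  }
}
{code}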



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-26 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503889#comment-13503889
 ] 

Kihwal Lee commented on HDFS-3875:
--

bq. If the two checksum methods are different, datanodes would have
recalculated and written out the data along with their own checksum. Even if
the incoming data was corrupt, it would appear okay on disk on these nodes.

It appears the checksum verification is already done if 
{{needsChecksumTranslation}} is true. There is one less thing to worry about. 
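
A sketch of why translation implies verification (CRC32-to-Adler32 below is
only a stand-in for HDFS's client-to-disk checksum type mismatch; this is not
the real BlockReceiver code):

{code}
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

/**
 * Sketch: recomputing a different on-disk checksum type forces the
 * datanode to walk the received bytes anyway, so verifying the incoming
 * checksum first is essentially free.
 */
class ChecksumTranslationSketch {
  long translate(byte[] chunk, long clientCrc) throws IOException {
    CRC32 in = new CRC32();
    in.update(chunk, 0, chunk.length);
    if (in.getValue() != clientCrc) {
      // The packet was corrupted in flight; never persist a fresh
      // checksum computed over corrupt data.
      throw new IOException("checksum error in received packet");
    }
    Adler32 out = new Adler32();  // recompute in the on-disk type
    out.update(chunk, 0, chunk.length);
    return out.getValue();
  }
}
{code}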



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-11-26 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503875#comment-13503875
 ] 

Kihwal Lee commented on HDFS-3875:
--

This sounds like the symptom I mentioned in HDFS-3874. The tail node in a
three-node pipeline detected a corruption, but its report failed due to
HDFS-3874 and the node just went away. Since the last of the three in the
pipeline just disappeared, the corrupt packet was acked with {SUCCESS, SUCCESS,
FAIL}. So the pipeline recreated from the remaining two nodes ended up
containing the corrupt portion of the data.

bq. Depending on the above, it would report back the errorIndex appropriately 
to the client, so that the correct faulty node is removed from the pipeline.

* This should cover the cases where a particular datanode is corrupting data,
IF the client checksum method and the storage checksum method are identical.

* If the two checksum methods are different, datanodes would have recalculated
and written out the data along with their own checksum. Even if the incoming
data was corrupt, it would appear okay on disk on these nodes. The tail node
can detect the corruption, but if it somehow terminates or gets ignored, there
is no retrospective scan that will tell us the integrity of the stored block,
since the checksum may have been recreated to match the corrupted data. Maybe
we should force datanodes to verify checksums if the two checksum types are
different.

* Even if we don't have the above issue, special handling is needed for the
case where the client is corrupting data. After recreating a pipeline, the same
thing will repeat, since the client moves un-acked packets back to its data
queue and resends them. Fail after trying twice? Or maybe the client should do
a self integrity check of the packets in the ack queue if a corruption is
present on the first datanode (see the sketch after this list).

* How will it work with reportBadBlocks() being called by the last node in the
pipeline? The semantics of this method do not seem compatible with blocks that
are being actively written and could still be recovered by calling
recoverRbw().

* Given all these issues, simply failing/abandoning the block may be the
easiest way out without missing any other possible corner cases. This will be
even more convincing if we have any evidence showing that client-side
corruption is the most common cause.
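
A sketch of that self check (Packet and crcAtEnqueue are illustrative; the
real client keeps DFSPacket objects in DataStreamer's ack queue):

{code}
import java.util.zip.CRC32;

/**
 * Sketch: before resending after a pipeline failure, the client
 * re-verifies the checksums of the packets it is still holding, so a
 * client that is itself corrupting memory fails fast instead of
 * poisoning pipeline after pipeline.
 */
class ClientSelfCheckSketch {
  static class Packet { byte[] data; long crcAtEnqueue; }

  boolean packetsStillIntact(Iterable<Packet> ackQueue) {
    for (Packet p : ackQueue) {
      CRC32 crc = new CRC32();
      crc.update(p.data, 0, p.data.length);
      if (crc.getValue() != p.crcAtEnqueue) {
        return false;  // buffer corrupted on the client side: abandon block
      }
    }
    return true;
  }
}
{code}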



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-08-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445411#comment-13445411
 ] 

Todd Lipcon commented on HDFS-3875:
---

Just to brainstorm, here's one potential solution:
- if the tail node in the pipeline detects a checksum error, then it returns a 
special error code back up the pipeline indicating this (rather than just 
disconnecting)
- if a non-tail node receives this error code, then it immediately scans its 
own block on disk (from the beginning up through the last acked length). If it 
detects a corruption on its local copy, then it should assume that _it_ is the 
faulty one, rather than the downstream neighbor. If it detects no corruption, 
then the faulty node is either the downstream mirror or the network link 
between the two, and the current behavior is reasonable.

Depending on the above, it would report back the errorIndex appropriately to 
the client, so that the correct faulty node is removed from the pipeline.
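
A sketch of that blame assignment on a non-tail node (rescanReplica() is a
hypothetical local re-verification up to the last acked length, and the
errorIndex convention is loose; this is not the actual DataXceiver code):

{code}
import java.io.IOException;

/**
 * Sketch: on an ERROR_CHECKSUM from downstream, re-verify the local
 * on-disk copy. If it is corrupt, blame ourselves (or someone upstream);
 * otherwise blame the downstream node or the link to it.
 */
class BlameSketch {
  int myIndexInPipeline;

  int errorIndexOnDownstreamChecksumError() {
    boolean localCopyCorrupt;
    try {
      localCopyCorrupt = !rescanReplica();  // hypothetical local re-verify
    } catch (IOException e) {
      localCopyCorrupt = true;              // cannot prove the copy is good
    }
    return localCopyCorrupt ? myIndexInPipeline : myIndexInPipeline + 1;
  }

  boolean rescanReplica() throws IOException { return true; }  // stub
}
{code}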



[jira] [Commented] (HDFS-3875) Issue handling checksum errors in write pipeline

2012-08-30 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445385#comment-13445385
 ] 

Todd Lipcon commented on HDFS-3875:
---

Here's the recovery from the perspective of the NN:

{code}
2012-08-28 19:16:33,532 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
updatePipeline(block=BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581786,
 newGenerationStamp=140581806, newLength=44281856, 
newNodes=[172.29.97.219:50010], clientNam
2012-08-28 19:16:33,597 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
updatePipeline(BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581786)
 successfully to 
BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581806
{code}

Here's the recovery from the perspective of the middle node:

{code}
2012-08-28 19:16:33,531 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
replica ReplicaBeingWritten, blk_2632740624757457378_140581786, RBW
  getNumBytes() = 44867072
  getBytesOnDisk()  = 44867072
  getVisibleLength()= 44281856
  getVolume()   = /data/2/dfs/dn/current
  getBlockFile()= 
/data/2/dfs/dn/current/BP-1507505631-172.29.97.196-1337120439433/current/rbw/blk_2632740624757457378
  bytesAcked=44281856
  bytesOnDisk=44867072
{code}

and then the later checksum exception from the block scanner:

{code}
2012-08-28 19:23:59,275 WARN 
org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Second 
Verification failed for 
BP-1507505631-172.29.97.196-1337120439433:blk_2632740624757457378_140581806
org.apache.hadoop.fs.ChecksumException: Checksum failed at 44217344
{code}

Interestingly, the offset of the checksum exception noticed by the block
scanner (44217344) is below the "acked length" seen at recovery time
(bytesAcked=44281856), so the corruption sits inside the already-acknowledged
portion of the block.

On the node in question, I see a fair number of weird errors (page allocation
failures, etc.) in the kernel log. So my guess is that the machine is borked
and was silently corrupting memory in the middle of the pipeline. Hence,
because the recovery kicked out the wrong node, it ended up persisting a
corrupt version of the block instead of a good one.
