[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-20 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911526#comment-16911526
 ] 

Chen Zhang commented on HDFS-13709:
---

Got it, thanks [~jojochuang] for your explanation.

> Report bad block to NN when transfer block encounter EIO exception
> --
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Chen Zhang
>Assignee: Chen Zhang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-13709.002.patch, HDFS-13709.003.patch, 
> HDFS-13709.004.patch, HDFS-13709.005.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs on A's replica data, and someday B and C crash at the same time, the NN 
> will try to replicate the data from A but fail. The block is now corrupt, but 
> no one knows, because the NN thinks there is at least 1 healthy replica and 
> keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error. If the DN reports the bad block as soon as it gets an EIO, we can find 
> this case ASAP and try to avoid data loss.
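
The scenario in the description can be illustrated with a small model of the NameNode's view of one block. This is only a sketch: ReplicaTracker and its methods are hypothetical names invented for illustration and are not HDFS classes. It shows why an early bad-block report matters: without it, the NN keeps counting A's unreadable replica as healthy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the NameNode's bookkeeping for a single block.
// None of these names come from HDFS; they exist only for this illustration.
class ReplicaTracker {
  enum State { HEALTHY, CORRUPT }

  // DataNode name -> state of the replica it holds.
  private final Map<String, State> replicas = new HashMap<>();

  void addReplica(String dataNode) {
    replicas.put(dataNode, State.HEALTHY);
  }

  // A DataNode went down, so its replica is no longer available.
  void removeReplica(String dataNode) {
    replicas.remove(dataNode);
  }

  // The DataNode hit an EIO while reading the replica and reported it.
  void reportBadBlock(String dataNode) {
    if (replicas.containsKey(dataNode)) {
      replicas.put(dataNode, State.CORRUPT);
    }
  }

  // Number of replicas the NameNode still believes are usable.
  long healthyCount() {
    return replicas.values().stream().filter(s -> s == State.HEALTHY).count();
  }
}
```

In this model, if A reports its replica as bad while B and C are still alive, the healthy count drops to 2 and re-replication can start from a good copy. If A never reports, the count stays at 1 after B and C go down, even though no readable replica remains.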



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911504#comment-16911504
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


bq. In which case we need to backport the patch to branch-2? Usually the bugfix 
and some critical improvements?
At this point, most patches should go into branch-2 too, except for features not 
in Hadoop 2.x (erasure coding).
I would only put critical fixes into branch-2.8 though. It's a quite stable 
release, and the code has diverged quite a lot, so the effort is non-trivial.
bq. Some people open a new Jira to backport to branch-2, some update a new 
patch in the same Jira, which is better in the practice?
Either way. I think if the jira was initially in branch-3, but after a while 
people want to add it to branch-2, then it is better to use a new jira. If the 
patch applies without conflicts (or with only trivial ones), then the same jira 
can be used.




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911008#comment-16911008
 ] 

Chen Zhang commented on HDFS-13709:
---

Created a new Jira HDFS-14752 to track the branch-2 backport




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-19 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910906#comment-16910906
 ] 

Chen Zhang commented on HDFS-13709:
---

Thanks [~jojochuang] for reviewing this patch and merging it.

I'll provide a branch-2 patch later. By the way, I have a few questions about this:
 # In which cases do we need to backport a patch to branch-2? Usually bug fixes 
and some critical improvements?
 # Some people open a new Jira to backport to branch-2, while others upload a new 
patch to the same Jira. Which is better in practice?




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-19 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910744#comment-16910744
 ] 

Hudson commented on HDFS-13709:
---

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17149 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17149/])
HDFS-13709. Report bad block to NN when transfer block encounter EIO (weichiu: rev 360a96f342f3c8cb8246f011abb9bcb0b6ef3eaa)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/SimulatedFSDataset.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDiskError.java
* (add) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DiskFileCorruptException.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestReplication.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/VolumeScanner.java





[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-19 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910630#comment-16910630
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


+1




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-17 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909643#comment-16909643
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


Thanks [~zhangchen]
 Can you help verify that the failed tests are unrelated?
 Additionally, it would be great if you could add a few javadoc comments for the 
new handleBadBlock() method. Its logic can be a little convoluted given that 
there are two asynchronous threads involved (datanode and volume scanner). We 
definitely want to avoid a situation where the VolumeScanner finds a suspect and 
calls handleBadBlock(), and then the suspect is put back into the VolumeScanner's 
queue and gets scanned again and again non-stop.
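
One way to picture the re-scan concern is a suspect queue that refuses to re-enqueue a block it has seen recently. The sketch below is an assumption-laden illustration, not the actual VolumeScanner code: SuspectQueue, markSuspect(), and quietPeriodMs are hypothetical names, and block IDs are plain strings.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of a suspect-block queue with a quiet period.
// It is not the HDFS VolumeScanner; the names are invented for illustration.
class SuspectQueue {
  private final Queue<String> pending = new ArrayDeque<>();
  // Block ID -> time (ms) at which it was last accepted as a suspect.
  private final Map<String, Long> recentlySeen = new HashMap<>();
  private final long quietPeriodMs;

  SuspectQueue(long quietPeriodMs) {
    this.quietPeriodMs = quietPeriodMs;
  }

  // Returns true if the block was queued, false if it was accepted too
  // recently and is therefore ignored to avoid a scan/report loop.
  synchronized boolean markSuspect(String blockId, long nowMs) {
    Long last = recentlySeen.get(blockId);
    if (last != null && nowMs - last < quietPeriodMs) {
      return false;
    }
    recentlySeen.put(blockId, nowMs);
    pending.add(blockId);
    return true;
  }

  // Next suspect to scan, or null if the queue is empty.
  synchronized String poll() {
    return pending.poll();
  }
}
```

With a guard like this, a scanner thread that reports a block and is then handed the same block again inside the quiet period simply drops the second request. A production version would also need to expire old entries from the map.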

nit
{code:java}
assertTrue(replicaCount == 1);
{code}
better to use 
{code:java}
assertEquals("error message", 1, replicaCount);
 {code}




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-16 Thread Stephen O'Donnell (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909262#comment-16909262
 ] 

Stephen O'Donnell commented on HDFS-13709:
--

I think this change looks good now. The exception handling code is much tidier 
when passing the throwable to the constructor.

I can reuse this new method handleBadBlock() in HDFS-14706 once we get this one 
committed.




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-15 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907858#comment-16907858
 ] 

Hadoop QA commented on HDFS-13709:
--

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 22s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 19m 39s | trunk passed |
| +1 | compile | 1m 2s | trunk passed |
| +1 | checkstyle | 0m 48s | trunk passed |
| +1 | mvnsite | 1m 4s | trunk passed |
| +1 | shadedclient | 13m 25s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 2s | trunk passed |
| +1 | javadoc | 0m 53s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 2s | the patch passed |
| +1 | compile | 0m 59s | the patch passed |
| +1 | javac | 0m 59s | the patch passed |
| +1 | checkstyle | 0m 44s | the patch passed |
| +1 | mvnsite | 1m 5s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 53s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 9s | the patch passed |
| +1 | javadoc | 0m 49s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 82m 15s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 30s | The patch does not generate ASF License warnings. |
| | | 141m 34s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailureWithRandomECPolicy |
|   | hadoop.hdfs.server.datanode.TestDataNodeMetrics |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-13709 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12977666/HDFS-13709.004.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 095ec96cb126 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 167acd8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/27517/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/27517/testReport/ |
| Max. process+thread count | 3523 (vs. ulimit of 5500) |
| 

[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907779#comment-16907779
 ] 

Chen Zhang commented on HDFS-13709:
---

Uploaded patch v4 to fix the checkstyle and asflicense errors; it also fixes a failing unit test.




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-14 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907623#comment-16907623
 ] 

Hadoop QA commented on HDFS-13709:
--

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 3s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 17m 29s | trunk passed |
| +1 | compile | 0m 58s | trunk passed |
| +1 | checkstyle | 0m 42s | trunk passed |
| +1 | mvnsite | 0m 59s | trunk passed |
| +1 | shadedclient | 11m 54s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 53s | trunk passed |
| +1 | javadoc | 0m 48s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 55s | the patch passed |
| +1 | compile | 0m 51s | the patch passed |
| +1 | javac | 0m 51s | the patch passed |
| -0 | checkstyle | 0m 36s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 293 unchanged - 0 fixed = 294 total (was 293) |
| +1 | mvnsite | 0m 58s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 5s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 4s | the patch passed |
| +1 | javadoc | 0m 47s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 117m 36s | hadoop-hdfs in the patch failed. |
| -1 | asflicense | 0m 34s | The patch generated 1 ASF License warnings. |
| | | 171m 6s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReplication |
|   | hadoop.hdfs.server.datanode.TestDiskError |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl |
|   | hadoop.hdfs.server.datanode.TestDataNodeMetrics |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-13709 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12977635/HDFS-13709.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ccc6056910db 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 167acd8 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/27511/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | 

[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-14 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907505#comment-16907505
 ] 

Chen Zhang commented on HDFS-13709:
---

Thanks [~sodonnell] for your suggestion. I updated the code and uploaded patch v3.




[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-13 Thread Stephen O'Donnell (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906329#comment-16906329
 ] 

Stephen O'Donnell commented on HDFS-13709:
--

[~jojochuang] pointed me to this Jira as I am working on the related one in 
HDFS-14706. I just have one minor comment here. In your definition of the new 
exception class "DiskFileCorruptException", if you add a constructor like:

 
{code:java}
public DiskFileCorruptException(String msg, Throwable cause) {
  super(msg, cause);
}{code}
Then you can avoid having to adjust the stack trace etc when you create this 
exception, so you can change this:
{code:java}
+if (ioe.getMessage().startsWith(EIO_ERROR)) {
+  DiskFileCorruptException de = new DiskFileCorruptException("Original Exception : " + ioe);
+  de.initCause(ioe);
+  de.setStackTrace(ioe.getStackTrace());
+  throw de;
+}{code}
To just this:
{code:java}
if (ioe.getMessage().startsWith(EIO_ERROR)) {
  throw new DiskFileCorruptException("A disk IO error occurred", ioe);
}{code}
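
For reference, the suggestion above can be tried in isolation. The following is a minimal, self-contained sketch rather than the code from the patch: the DiskFileCorruptException here is a local stand-in with only the proposed constructor, the EioWrapping helper is hypothetical, and the value of EIO_ERROR is an assumption because the thread does not show the constant's definition.

```java
import java.io.IOException;

// Local stand-in for the exception class discussed above; the real class in
// the patch may have other constructors or a different superclass.
class DiskFileCorruptException extends IOException {
  DiskFileCorruptException(String msg, Throwable cause) {
    // Passing the cause to the superclass keeps the original stack trace
    // reachable through getCause(), so no initCause()/setStackTrace() is needed.
    super(msg, cause);
  }
}

// Hypothetical helper showing the simplified wrapping.
class EioWrapping {
  // Assumed message prefix for a disk-level EIO; not confirmed by this thread.
  static final String EIO_ERROR = "Input/output error";

  // Wraps EIO-style failures in DiskFileCorruptException and rethrows
  // every other IOException unchanged.
  static void rethrow(IOException ioe) throws IOException {
    String msg = ioe.getMessage();
    if (msg != null && msg.startsWith(EIO_ERROR)) {
      throw new DiskFileCorruptException("A disk IO error occurred", ioe);
    }
    throw ioe;
  }
}
```

The null check on the message is an addition in this sketch, since IOException.getMessage() can return null. A caller that catches DiskFileCorruptException can still log the original failure via getCause().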



[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-12 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905038#comment-16905038
 ] 

Hadoop QA commented on HDFS-13709:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
55s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  7s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 42s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 4 new + 293 unchanged - 0 fixed = 297 total (was 293) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 58s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}103m 41s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
35s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}163m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReplication |
|   | hadoop.hdfs.server.datanode.TestDiskError |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.datanode.TestLargeBlockReport |
|   | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-13709 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12977302/HDFS-13709.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 4d397d9473fc 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 8fbf8b2 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27476/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 

[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-12 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904918#comment-16904918
 ] 

Chen Zhang commented on HDFS-13709:
---

Thanks [~jojochuang] for mentioning me at HDFS-14706.
This Jira and HDFS-14706 both introduce reportBadBlock calls in different 
places, and I agree with you that we should reuse the bad-block handling logic.

I've added a method \{{handleBadBlock}} in DataNode to handle bad blocks, using 
the following logic:
 # If it's called by the scanner, always reportBadBlock to the NN.
 # If the exception comes from elsewhere (e.g. BlockSender), first identify 
whether it's a bad block according to the type of exception. If it is a bad 
block, try to markSuspectBlock if the blockScanner is enabled, or report to the 
NN if the scanner is disabled.
 # I left some specific logic in the 
\{{VolumeScanner#ScanResultHandler.handle()}} method; I think it is only 
related to the scanner, not to all situations.
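
The routing in the first two points can be sketched as a small decision 
function. This is only an illustration of the logic as described here, not the 
code in the patch: the class, enum and method names are hypothetical, and the 
EIO message prefix is an assumed value.

```java
import java.io.IOException;

/** Hypothetical sketch of the bad-block routing described above. */
final class BadBlockPolicy {
  enum Action { REPORT_TO_NN, MARK_SUSPECT, IGNORE }

  // Assumed marker for a disk-level I/O error; illustration only.
  static final String EIO_ERROR = "Input/output error";

  /** Returns true when the exception message looks like a disk EIO. */
  static boolean looksLikeDiskError(IOException e) {
    String msg = e.getMessage();
    return msg != null && msg.startsWith(EIO_ERROR);
  }

  /**
   * Decides what to do with a block that failed with exception e.
   *
   * @param fromScanner    true when the caller is the block scanner
   * @param scannerEnabled true when the block scanner is running
   */
  static Action decide(boolean fromScanner, boolean scannerEnabled,
      IOException e) {
    if (fromScanner) {
      // Point 1: the scanner always reports to the NN.
      return Action.REPORT_TO_NN;
    }
    if (!looksLikeDiskError(e)) {
      // Point 2: other callers act only on exceptions that indicate a bad block.
      return Action.IGNORE;
    }
    // Point 2, continued: let the scanner double-check if it is enabled,
    // otherwise report straight to the NN.
    return scannerEnabled ? Action.MARK_SUSPECT : Action.REPORT_TO_NN;
  }
}
```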







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-09 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904156#comment-16904156
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


The quick solution is to replace
{code}
reportBadBlock(bpos, b, "Can't replicate block " + b
+ " because the possible disk error: " + ie.getMessage());
{code}
with
{code}
blockScanner.markSuspectBlock()
{code}
but since your scanner is turned off, this is probably not going to affect you.







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-08-05 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899813#comment-16899813
 ] 

Chen Zhang commented on HDFS-13709:
---

Hi [~jojochuang], any suggestions to push this Jira forward?







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-30 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896336#comment-16896336
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


[~kihwal] any ideas about the checksum computation overhead? I think that's the 
biggest concern other than refactoring the code.







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-30 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896219#comment-16896219
 ] 

Chen Zhang commented on HDFS-13709:
---

Thanks [~jojochuang] for your detailed comments.

1. For your first comment:
{quote}I thought we already verify checksum during block transfer, but I was 
wrong. Here's the code in {{DataNode#transferBlock}}
{quote}
I've checked the code there in detail; the checksum verification is actually 
done by the BlockReceiver during block transfer:
{code:java}
// DataNode.java
blockSender = new BlockSender(b, 0, b.getNumBytes(),
    false, false, true, DataNode.this, null, cachingStrategy); {code}
The sixth parameter is true, which makes blockSender send the checksum to the 
peer. {{BlockReceiver#verifyChunks()}} calls {{reportRemoteBadBlock()}} when it 
detects a checksum error.
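
As a rough illustration of what receiver-side chunk verification means, here is 
a self-contained sketch that checks data against per-chunk checksums. It is not 
the HDFS implementation: the class and method names are hypothetical, and 
java.util.zip.CRC32 is used only as a convenient stand-in for whatever checksum 
type the cluster is configured with.

```java
import java.util.zip.CRC32;

/** Hypothetical sketch of per-chunk checksum verification on the receiver. */
final class ChunkChecksumSketch {

  /** Computes the CRC32 of data[off .. off+len). */
  static long checksumOf(byte[] data, int off, int len) {
    CRC32 crc = new CRC32();
    crc.update(data, off, len);
    return crc.getValue();
  }

  /** Sender side: one checksum per chunk of bytesPerChunk bytes. */
  static long[] checksums(byte[] data, int bytesPerChunk) {
    int chunks = (data.length + bytesPerChunk - 1) / bytesPerChunk;
    long[] sums = new long[chunks];
    for (int i = 0; i < chunks; i++) {
      int off = i * bytesPerChunk;
      int len = Math.min(bytesPerChunk, data.length - off);
      sums[i] = checksumOf(data, off, len);
    }
    return sums;
  }

  /**
   * Receiver side: returns the index of the first chunk whose recomputed
   * checksum differs from the one that was sent, or -1 if all chunks match.
   * Assumes sums holds one entry per chunk of data.
   */
  static int firstBadChunk(byte[] data, long[] sums, int bytesPerChunk) {
    for (int i = 0, off = 0; off < data.length; i++, off += bytesPerChunk) {
      int len = Math.min(bytesPerChunk, data.length - off);
      if (checksumOf(data, off, len) != sums[i]) {
        return i;  // this is where a receiver would report the bad block
      }
    }
    return -1;
  }
}
```

Note that a check like this can only run on bytes that were actually read and 
sent; it never gets a chance to fire when the read itself fails.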

But in this case, checksum verification won't help. EIO simply aborts the block 
transfer, and no one knows the replica is corrupt unless it is accessed by a 
client or the VolumeScanner.

 

2. For your second suggestion:
{quote}It would be great if we can consolidate the error handling to support 
both cases
{quote}
The two code paths differ slightly:
 * VolumeScanner reports a bad block for every {{IOException}} except 
{{FileNotFoundException}}, because it only scans the disk, so every IOException 
comes from disk I/O.
 * The DataTransfer thread should report a bad block only when the block access 
reports an EIO error, because an {{IOException}} is quite normal during network 
data transfer and its root cause is hard to identify.
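
To make the difference concrete, the two policies can be written as two 
predicates over the same exception. This is a sketch only: the class and method 
names are hypothetical, and the EIO message prefix is an assumed value rather 
than a constant quoted from the patch.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

/** Hypothetical side-by-side of the two reporting policies described above. */
final class BadBlockClassifier {
  // Assumed marker for a disk-level I/O error; illustration only.
  static final String EIO_ERROR = "Input/output error";

  /** Scanner policy: every IOException except FileNotFoundException. */
  static boolean scannerShouldReport(IOException e) {
    return !(e instanceof FileNotFoundException);
  }

  /** Transfer policy: only exceptions whose message indicates EIO. */
  static boolean transferShouldReport(IOException e) {
    String msg = e.getMessage();
    return msg != null && msg.startsWith(EIO_ERROR);
  }
}
```

A network timeout during transfer, for example 
new java.net.SocketTimeoutException("timed out"), satisfies the scanner 
predicate but not the transfer one, which is why the scanner's rule cannot be 
reused unchanged on the transfer path.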

I'd be glad to consolidate the error handling to support both, but I can't 
figure out a good way to do that. Do you have any ideas?

Thanks again.







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-29 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895552#comment-16895552
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


Additionally, while I think the patch is good, the block scanner (VolumeScanner 
class) uses BlockSender to detect checksum verification errors, so what you do 
duplicates the logic there.

{code:title=VolumeScanner#scanBlock()}
try {
  blockSender = new BlockSender(block, 0, -1,
  false, true, true, datanode, null,
  CachingStrategy.newDropBehind());
  throttler.setBandwidth(bytesPerSec);
  long bytesRead = blockSender.sendBlock(nullStream, null, throttler);
  resultHandler.handle(block, null);
  metrics.incrBlocksVerified();
  return bytesRead;
} catch (IOException e) {
  resultHandler.handle(block, e);
}
{code}
It would be great if we can consolidate the error handling to support both 
cases.







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-29 Thread Wei-Chiu Chuang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895550#comment-16895550
 ] 

Wei-Chiu Chuang commented on HDFS-13709:


[~zhangchen] what is the version of Hadoop you're using?

When a client receives a block, it verifies it using the checksum. If the 
verification fails, it reports to the NameNode, and the NameNode schedules a 
replacement block.

If a block is not being accessed by a client, then this can happen.

 

I thought we already verify checksum during block transfer, but I was wrong. 
Here's the code in {{DataNode#transferBlock}}
{code:java}
if (replicaNotExist || replicaStateNotFinalized) {
  String errStr = "Can't send invalid block " + block;
  LOG.info(errStr);
  bpos.trySendErrorReport(DatanodeProtocol.INVALID_BLOCK, errStr);
  return;
}
if (blockFileNotExist) {
  // Report back to NN bad block caused by non-existent block file.
  reportBadBlock(bpos, block, "Can't replicate block " + block
  + " because the block file doesn't exist, or is not accessible");
  return;
}
if (lengthTooShort) {
  // Check if NN recorded length matches on-disk length 
  // Shorter on-disk len indicates corruption so report NN the corrupt block
  reportBadBlock(bpos, block, "Can't replicate block " + block
  + " because on-disk length " + data.getLength(block) 
  + " is shorter than NameNode recorded length " + block.getNumBytes());
  return;
}
 {code}
We only report bad blocks when the block is missing or the length doesn't 
match. We don't verify the checksum, and I'm not sure why. Is there a concern 
about computation overhead?







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-17 Thread Erik Krogen (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887218#comment-16887218
 ] 

Erik Krogen commented on HDFS-13709:


I don't know much about this area, so I would prefer for someone else to take a 
look. Maybe [~kihwal] would be interested?







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-15 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885512#comment-16885512
 ] 

Hadoop QA commented on HDFS-13709:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 44s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 36s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 5 new + 278 unchanged - 0 fixed = 283 total (was 278) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 87m 30s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
33s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}140m 46s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
|   | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.web.TestWebHDFS |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=18.09.7 Server=18.09.7 Image:yetus/hadoop:bdbca0e |
| JIRA Issue | HDFS-13709 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12929708/HDFS-13709.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 033447b5bcf3 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 61bbdee |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_212 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27227/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/27227/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 

[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2019-07-15 Thread Chen Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885346#comment-16885346
 ] 

Chen Zhang commented on HDFS-13709:
---

[~linyiqun] [~jojochuang] [~xkrogen] can you help to review this issue?







[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

2018-07-01 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529018#comment-16529018
 ] 

genericqa commented on HDFS-13709:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
36s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  5s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  4s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 94m 54s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
31s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}158m 13s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestReencryptionWithKMS |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | HDFS-13709 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12929708/HDFS-13709.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 30f58f52840e 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / cdb0844 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/24528/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/24528/testReport/ |
| asflicense | 
https://builds.apache.org/job/PreCommit-HDFS-Build/24528/artifact/out/patch-asflicense-problems.txt
 |
| Max. process+thread count | 3076 (vs. ulimit of 1) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output |