[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 4:52 AM:
--

[~weichiu]

Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
[https://github.com/apache/orc/issues/1939]
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

The current version of hdfs is 3.1.0.
Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, ec, erasure-coding
> Affects Versions: 3.4.0
> Reporter: Toshihiko Uchida
> Assignee: Toshihiko Uchida
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
> Time Spent: 10h 20m
> Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode during each 
> reconstruction task.
> It verifies the correctness of outputs decoded from inputs as follows:
> 1. Decode one of the inputs using the outputs (together with other inputs);
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.
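
The verification steps quoted above can be illustrated with a minimal sketch.
It is not the HDFS implementation: to stay short it swaps Reed-Solomon (RS-6-3)
for a single XOR parity block, and the names xor_blocks, reconstruct and
verified_reconstruct are hypothetical. The idea is the same: after decoding the
missing block, re-decode one of the original inputs from the output and compare
it with what was actually read.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def reconstruct(stripe, missing):
    """Rebuild the block at index `missing` from all other blocks.

    With a single XOR parity, any one block equals the XOR of the rest.
    (Assumption: this stands in for the real Reed-Solomon decoder.)
    """
    return xor_blocks([b for i, b in enumerate(stripe) if i != missing])


def verified_reconstruct(stripe, missing, decode=reconstruct):
    """Reconstruct `missing`, then verify the result as in steps 1 and 2."""
    output = decode(stripe, missing)

    # Step 1: decode one surviving input again, this time using the output.
    check = next(i for i in range(len(stripe)) if i != missing)
    trial = list(stripe)
    trial[missing] = output
    redecoded = reconstruct(trial, check)

    # Step 2: compare the re-decoded input with the original input.
    if redecoded != stripe[check]:
        raise ValueError("reconstruction verification failed; retry the task")
    return output


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    stripe = data + [xor_blocks(data)]   # three data blocks plus one parity
    stripe[1] = None                     # pretend block 1 was lost
    print(verified_reconstruct(stripe, 1))
```

In this toy setting a wrong output always changes the re-decoded input, so the
check never misses; with Reed-Solomon the description above only claims that
the comparison fails "with high probability".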



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:53 AM:
--

Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I leave out the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!




[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:52 AM:
--

Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I leave out the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I leave out the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!




[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:51 AM:
--

Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I was wondering: if I cherry-pick your current code (GitHub pull request #2869),
can I leave out the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!




[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 3:50 AM:
--

Hello, our production data is also affected by this kind of EC storage data 
corruption. The problem is described at
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!


was (Author: ruilaing):
Hello, our online data also shows this kind of EC storage data corruption 
problem. The problem is described at
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!




[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang commented on HDFS-15759:
-

Hello, our online data also shows this kind of EC storage data corruption 
problem. The problem is described at
https://github.com/apache/orc/issues/1939
I would like to ask: if I cherry-pick your current code (GitHub pull request 
#2869), can I skip the patches that fix HDFS-14768, HDFS-15186 and HDFS-15240?
The current version of hdfs is 3.1.0.
Thank you!




[jira] [Updated] (HDFS-17529) RBF: Improve router state store cache entry deletion

2024-05-22 Thread Felix N (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix N updated HDFS-17529:
---
Summary: RBF: Improve router state store cache entry deletion  (was: 
Improve router state store cache entry deletion)

> RBF: Improve router state store cache entry deletion
> 
>
> Key: HDFS-17529
> URL: https://issues.apache.org/jira/browse/HDFS-17529
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs, rbf
> Reporter: Felix N
> Assignee: Felix N
> Priority: Major
> Labels: pull-request-available
>
> The current implementation of router state store updates is quite 
> inefficient, so much so that when routers were removed and a lot of 
> NameNodeMembership records were deleted in a short burst, the deletions 
> triggered router safemode in our cluster and caused a lot of trouble.
> This ticket aims to improve the deletion process for the ZK state store 
> implementation.
> See HDFS-17532 for the other half of this improvement.






[jira] [Commented] (HDFS-17528) FsImageValidation: set txid when saving a new image

2024-05-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848796#comment-17848796
 ] 

ASF GitHub Bot commented on HDFS-17528:
---

szetszwo commented on PR #6828:
URL: https://github.com/apache/hadoop/pull/6828#issuecomment-2126096755

   @vinayakumarb , thanks a lot for reviewing this!




> FsImageValidation: set txid when saving a new image
> ---
>
> Key: HDFS-17528
> URL: https://issues.apache.org/jira/browse/HDFS-17528
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: tools
> Reporter: Tsz-wo Sze
> Assignee: Tsz-wo Sze
> Priority: Major
> Labels: pull-request-available
>
> - When the fsimage is specified as a file and the FsImageValidation tool 
> saves a new image (for removing inaccessible inodes), the txid is not set.  
> As a result, the resulting image will have 0 as its txid.
> - When the fsimage is specified as a directory, the txid is set.  However, it 
> will hit an NPE since the NameNode metrics are uninitialized (although the 
> metrics are not used by FsImageValidation).






[jira] [Moved] (HDFS-17534) RBF: Support leader follower mode for multiple subclusters

2024-05-22 Thread Yuanbo Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanbo Liu moved HADOOP-19183 to HDFS-17534:


Component/s: rbf
 (was: RBF)
Key: HDFS-17534  (was: HADOOP-19183)
Project: Hadoop HDFS  (was: Hadoop Common)

> RBF: Support leader follower mode for multiple subclusters
> --
>
> Key: HDFS-17534
> URL: https://issues.apache.org/jira/browse/HDFS-17534
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Yuanbo Liu
> Priority: Major
>
> Currently there are five modes for multiple subclusters:
> HASH, LOCAL, RANDOM, HASH_ALL and SPACE.
> This ticket proposes a new mode called leader/follower mode. Routers try to 
> write to the leader subcluster as much as possible. When routers read data, 
> the leader subcluster is ranked first.
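
The proposed behaviour can be sketched roughly as follows. This is only an
illustration of the idea in the ticket, not router code: the function names
order_for_read and pick_for_write, the subcluster names, and the rule of falling
back to the first available follower when the leader cannot take a write are
all assumptions made for the example.

```python
def order_for_read(subclusters, leader):
    """Return the subclusters with the leader ranked first.

    sorted() is stable, so followers keep their original relative order.
    """
    return sorted(subclusters, key=lambda ns: ns != leader)


def pick_for_write(subclusters, leader, available):
    """Prefer the leader for writes.

    Assumption: if the leader is not available, fall back to the first
    available follower rather than failing the write.
    """
    for ns in order_for_read(subclusters, leader):
        if ns in available:
            return ns
    raise RuntimeError("no subcluster available for write")


if __name__ == "__main__":
    clusters = ["ns0", "ns1", "ns2"]
    print(order_for_read(clusters, leader="ns1"))
    print(pick_for_write(clusters, leader="ns1", available={"ns0", "ns2"}))
```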


