[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806 ]

ruiliang edited comment on HDFS-15759 at 5/23/24 4:52 AM:
----------------------------------------------------------

[~weichiu] Hello, our current production data also has this kind of EC storage data corruption problem; the issue is described at [https://github.com/apache/orc/issues/1939]. I was wondering: if I cherry-pick your current code (GitHub pull request #2869), can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240? Our current HDFS version is 3.1.0. Thank you!

was (Author: ruilaing):
Hello, our current production data also has this kind of EC storage data damage problem, about the problem description https://github.com/apache/orc/issues/1939 I was wondering if cherry picked your current code (GitHub pull request #2869), Can I skip patches related to HDFS-14768,HDFS-15186, and HDFS-15240? The current version of hdfs is 3.1.0. Thank you!

> EC: Verify EC reconstruction correctness on DataNode
> ----------------------------------------------------
>
>                 Key: HDFS-15759
>                 URL: https://issues.apache.org/jira/browse/HDFS-15759
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, ec, erasure-coding
>    Affects Versions: 3.4.0
>            Reporter: Toshihiko Uchida
>            Assignee: Toshihiko Uchida
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.3.1, 3.4.0, 3.2.3
>
>          Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768,
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and
> the corruption is neither detected nor auto-healed by HDFS. It is obviously
> hard for users to monitor data integrity by themselves, and even if they find
> corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and
> effective way to verify EC reconstruction correctness on DataNode at each
> reconstruction process.
> It verifies the correctness of outputs decoded from inputs as follows:
> 1. Decode an input with the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high
> probability. The task will then also fail and be retried by the NameNode.
> The next reconstruction will succeed if the condition that triggered the
> failure is gone.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
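The verification scheme described above can be sketched as follows. This is a minimal illustration in Python, not the actual HDFS (Java) implementation: it substitutes a single-parity XOR code for real RS-6-3, and all names are hypothetical, but the check has the same shape — after reconstructing a lost unit, decode a different, known-good unit from the reconstructed output and compare it with the original.

```python
# Illustrative sketch of the verify-after-reconstruct idea using a
# single-parity XOR code (one parity block p = d0 ^ d1 ^ ... ^ d5)
# instead of RS-6-3. Not the actual HDFS DataNode code.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def reconstruct_and_verify(data, parity, lost_idx):
    """Rebuild data[lost_idx] from the survivors, then verify the result
    by decoding another data unit from the reconstructed output and
    comparing it with the original (the step HDFS-15759 adds)."""
    survivors = [d for i, d in enumerate(data) if i != lost_idx]
    rebuilt = xor_blocks(*survivors, parity)      # reconstruction
    # Verification: decode a known-good unit using the rebuilt one.
    check_idx = 0 if lost_idx != 0 else 1
    others = [d for i, d in enumerate(data)
              if i not in (lost_idx, check_idx)]
    decoded_check = xor_blocks(rebuilt, *others, parity)
    return rebuilt, decoded_check == data[check_idx]

data = [bytes([i] * 4) for i in range(6)]         # d0..d5
parity = xor_blocks(*data)                        # p0
rebuilt, ok = reconstruct_and_verify(data, parity, lost_idx=1)
assert rebuilt == data[1] and ok                  # good rebuild verifies

# A corrupted reconstruction fails the comparison with certainty here
# (and with high probability under a real RS code):
bad = bytes(b ^ 0xFF for b in rebuilt)
others = [d for i, d in enumerate(data) if i not in (1, 0)]
assert xor_blocks(bad, *others, parity) != data[0]
```

As in the Jira description, a failed comparison would fail the reconstruction task so the NameNode can retry it, rather than silently writing a corrupt block.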
[jira] [Updated] (HDFS-17529) RBF: Improve router state store cache entry deletion
[ https://issues.apache.org/jira/browse/HDFS-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix N updated HDFS-17529:
---------------------------
    Summary: RBF: Improve router state store cache entry deletion  (was: Improve router state store cache entry deletion)

> RBF: Improve router state store cache entry deletion
> ----------------------------------------------------
>
>                 Key: HDFS-17529
>                 URL: https://issues.apache.org/jira/browse/HDFS-17529
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs, rbf
>            Reporter: Felix N
>            Assignee: Felix N
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation of router state store updates is quite
> inefficient, so much so that when routers were removed and a lot of
> NameNodeMembership records were deleted in a short burst, the deletions
> triggered router safemode in our cluster and caused a lot of trouble.
> This ticket aims to improve the deletion process for the ZK state store
> implementation.
> See HDFS-17532 for the other half of this improvement.
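The inefficiency described in the ticket is per-record deletion during a burst. One common remedy, sketched below with hypothetical names (this is not the actual RBF or ZooKeeper client code), is to group record deletions into batches so a burst costs far fewer round trips — e.g. one ZooKeeper multi-op transaction per batch instead of one call per znode.

```python
# Hypothetical batching sketch: delete state store records in fixed-size
# groups rather than one call per record. `delete_batch` stands in for
# whatever bulk primitive the backend offers (e.g. a ZK multi-op).

def batch(records, size):
    """Yield successive fixed-size batches from a list of record keys."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def delete_records(records, delete_batch, size=64):
    """Delete all records via the batched delete function; return the
    number of backend calls issued."""
    calls = 0
    for group in batch(list(records), size):
        delete_batch(group)   # one round trip per batch, not per record
        calls += 1
    return calls

deleted = []
calls = delete_records([f"membership-{i}" for i in range(200)],
                       deleted.extend, size=64)
assert calls == 4 and len(deleted) == 200   # 200 records in 4 calls
```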
[jira] [Commented] (HDFS-17528) FsImageValidation: set txid when saving a new image
[ https://issues.apache.org/jira/browse/HDFS-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848796#comment-17848796 ]

ASF GitHub Bot commented on HDFS-17528:
---------------------------------------

szetszwo commented on PR #6828:
URL: https://github.com/apache/hadoop/pull/6828#issuecomment-2126096755

   @vinayakumarb , thanks a lot for reviewing this!

> FsImageValidation: set txid when saving a new image
> ---------------------------------------------------
>
>                 Key: HDFS-17528
>                 URL: https://issues.apache.org/jira/browse/HDFS-17528
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: tools
>            Reporter: Tsz-wo Sze
>            Assignee: Tsz-wo Sze
>            Priority: Major
>              Labels: pull-request-available
>
> - When the fsimage is specified as a file and the FsImageValidation tool
> saves a new image (for removing inaccessible inodes), the txid is not set.
> The resulting image will then have 0 as its txid.
> - When the fsimage is specified as a directory, the txid is set. However, it
> will get an NPE since the NameNode metrics are uninitialized (although the
> metrics are not used by FsImageValidation).
[jira] [Moved] (HDFS-17534) RBF: Support leader follower mode for multiple subclusters
[ https://issues.apache.org/jira/browse/HDFS-17534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuanbo Liu moved HADOOP-19183 to HDFS-17534:
--------------------------------------------
    Component/s: rbf
                     (was: RBF)
            Key: HDFS-17534  (was: HADOOP-19183)
        Project: Hadoop HDFS  (was: Hadoop Common)

> RBF: Support leader follower mode for multiple subclusters
> ----------------------------------------------------------
>
>                 Key: HDFS-17534
>                 URL: https://issues.apache.org/jira/browse/HDFS-17534
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: rbf
>            Reporter: Yuanbo Liu
>            Priority: Major
>
> Currently there are five modes for multiple subclusters: HASH, LOCAL,
> RANDOM, HASH_ALL and SPACE.
> This proposes a new leader/follower mode: routers try to write to the
> leader subcluster as much as possible, and when routers read data, the
> leader subcluster is ranked first.
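The proposed ordering could look roughly like the sketch below. All names here are illustrative, not the actual RBF resolver classes: writes target the leader subcluster whenever it is available, and reads rank the leader first.

```python
# Hypothetical sketch of the proposed leader/follower subcluster order
# (illustrative names only, not the actual Router-Based Federation code).

from dataclasses import dataclass, field
from typing import List

@dataclass
class LeaderFollowerOrder:
    leader: str                              # subcluster that takes writes
    followers: List[str] = field(default_factory=list)

    def write_order(self, unavailable=()) -> List[str]:
        """Write to the leader whenever it is up; fall back to the
        followers only when the leader subcluster is unavailable."""
        order = [self.leader] + self.followers
        return [ns for ns in order if ns not in unavailable]

    def read_order(self) -> List[str]:
        """Reads put the leader subcluster in the first rank."""
        return [self.leader] + self.followers

order = LeaderFollowerOrder("ns-leader", ["ns-f1", "ns-f2"])
assert order.write_order() == ["ns-leader", "ns-f1", "ns-f2"]
assert order.write_order(unavailable={"ns-leader"}) == ["ns-f1", "ns-f2"]
assert order.read_order()[0] == "ns-leader"
```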