[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-06-07 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853065#comment-17853065
 ] 

ruiliang edited comment on HDFS-15759 at 6/7/24 7:55 AM:
-

When I verify a block that has repeatedly been reported as corrupt, it appears 
normal — is that expected?

The ByteBuffer backing array {{hb}} shows all zeros:
{code:java}
buffers = {ByteBuffer[5]@3270} 
 0 = {HeapByteBuffer@3430} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 1 = {HeapByteBuffer@3434} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 2 = {HeapByteBuffer@3438} "java.nio.HeapByteBuffer[pos=65536 lim=65536 
cap=65536]"
 3 = {HeapByteBuffer@3504} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3511} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]

buffers[this.dataBlkNum + ixx].equals(outputs[ixx]) == true ?

outputs = {ByteBuffer[2]@3271} 
 0 = {HeapByteBuffer@3455} "java.nio.HeapByteBuffer[pos=0 lim=65536 cap=65536]"
  hb = {byte[65536]@3459} [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, +65,436 more]{code}
Can this situation be judged as an anomaly?
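One thing worth noting about the comparison above (my own hedged observation, not taken from the HDFS code): {{java.nio.ByteBuffer.equals}} compares only the *remaining* bytes, i.e. position through limit. A fully-consumed buffer (pos == lim, as in the dump above) has zero remaining bytes and therefore compares equal to any other fully-consumed buffer, regardless of contents:

```java
import java.nio.ByteBuffer;

public class ByteBufferEqualsDemo {
    public static void main(String[] args) {
        // Two buffers with completely different contents.
        ByteBuffer a = ByteBuffer.allocate(4).put(new byte[] {1, 2, 3, 4});
        ByteBuffer b = ByteBuffer.allocate(4).put(new byte[] {9, 9, 9, 9});
        // After the writes, pos == lim for both, so zero bytes "remain".
        System.out.println(a.equals(b)); // true: equals() sees nothing to compare
        a.flip();
        b.flip();
        // After flip(), pos=0 lim=4: all four bytes are compared.
        System.out.println(a.equals(b)); // false
    }
}
```

So a `true` result from `equals()` on buffers whose position equals their limit does not by itself prove the contents match.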

 
{code:java}
hdfs debug verifyEC -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not 
available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}
 

Checking the ORC file:
{code:java}
Structure for skip_ip/_skip_file File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
        at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
        at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
        at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
        at org.apache.orc.tools.FileDump.main(FileDump.java:137)
        at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
        at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
        at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
        at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
        at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
        ... 6 more
{code}



[jira] [Comment Edited] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode

2024-05-22 Thread ruiliang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806
 ] 

ruiliang edited comment on HDFS-15759 at 5/23/24 4:52 AM:
--

[~weichiu]

Hello, our production data also suffers from this kind of EC storage corruption; 
the problem is described at
[https://github.com/apache/orc/issues/1939]
I was wondering: if I cherry-pick your current code (GitHub pull request #2869), 
can I skip the patches related to HDFS-14768, HDFS-15186, and HDFS-15240?

Our current HDFS version is 3.1.0.
Thank you!



> EC: Verify EC reconstruction correctness on DataNode
> 
>
> Key: HDFS-15759
> URL: https://issues.apache.org/jira/browse/HDFS-15759
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, ec, erasure-coding
>Affects Versions: 3.4.0
>Reporter: Toshihiko Uchida
>Assignee: Toshihiko Uchida
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.2.3
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, 
> HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and 
> the corruption is neither detected nor auto-healed by HDFS. It is obviously 
> hard for users to monitor data integrity by themselves, and even if they find 
> corrupted data, it is difficult or sometimes impossible to recover them.
> To prevent further data corruption issues, this feature proposes a simple and 
> effective way to verify EC reconstruction correctness on DataNode at each 
> reconstruction process.
> It verifies correctness of outputs decoded from inputs as follows:
> 1. Decode an input with the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs 
> [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from 
> [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high 
> probability.
> Then the task will also fail and be retried by NameNode.
> The next reconstruction will succeed if the condition that triggered the 
> failure is gone.
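The decode-and-compare verification quoted above can be illustrated with a toy single-parity (XOR) code standing in for RS-6-3. This is purely an illustrative sketch of the idea, not the HDFS implementation:

```java
public class DecodeVerifySketch {
    // Toy single-parity code: p = d0 ^ d1 ^ d2, so any one erased unit
    // can be decoded by XOR-ing all the remaining units together.
    static byte[] xor(byte[]... units) {
        byte[] out = new byte[units[0].length];
        for (byte[] u : units)
            for (int i = 0; i < u.length; i++) out[i] ^= u[i];
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2}, d1 = {3, 4}, d2 = {5, 6};
        // Reconstruction step: produce the parity output from the data inputs.
        byte[] p = xor(d0, d1, d2);
        // Verification step: decode one original input (d0) back from the
        // other inputs plus the freshly produced output, then compare it
        // with the original. A faulty reconstruction would make this differ.
        byte[] d0Check = xor(d1, d2, p);
        System.out.println(java.util.Arrays.equals(d0, d0Check)); // true
    }
}
```

The same shape applies to RS-6-3: after producing [d1, p1], decode d0 from [d1, d2, d3, d4, d5, p1] and compare it with the original d0.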



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


