[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2023-04-26 Thread Tao Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716656#comment-17716656
 ] 

Tao Li commented on HDFS-16422:
---

Hi [~cndaimin], have you tested the performance impact of this change? And 
have you used it in a production environment? Thanks.
 
 

> Fix thread safety of EC decoding during concurrent preads
> -
>
> Key: HDFS-16422
> URL: https://issues.apache.org/jira/browse/HDFS-16422
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: dfsclient, ec, erasure-coding
> Affects Versions: 3.3.0, 3.3.1
> Reporter: daimin
> Assignee: daimin
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.3, 3.3.3
>
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> Reading an erasure-coded file with missing replicas (internal blocks of a 
> block group) triggers online reconstruction: the client reads the dataUnits 
> part of the data and decodes it into the missing target data. Each 
> DFSStripedInputStream object has a RawErasureDecoder object, and when preads 
> are issued concurrently, RawErasureDecoder.decode is invoked concurrently 
> too. RawErasureDecoder.decode is not thread safe, so pread occasionally 
> returns wrong data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-04-14 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522275#comment-17522275
 ] 

Steve Loughran commented on HDFS-16422:
---

This patch is going into 3.3.3, for people who need it.




[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-04-11 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520357#comment-17520357
 ] 

Takanobu Asanuma commented on HDFS-16422:
-

[~cndaimin] Thanks for your confirmation!




[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-04-10 Thread daimin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520328#comment-17520328
 ] 

daimin commented on HDFS-16422:
---

[~tasanuma] I think NativeRSRawDecoder is thread safe after HDFS-16422, and it 
was not before.




[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-04-10 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520310#comment-17520310
 ] 

Takanobu Asanuma commented on HDFS-16422:
-

Hi [~cndaimin],

bq. In conclusion: RSRawDecoder seems to be thread safe, NativeRSRawDecoder is 
not thread safe, the read/write lock seems unable to protect the native 
decodeImpl method.

Do you mean that HDFS-16422 also made NativeRSRawDecoder thread safe? Or is 
NativeRSRawDecoder still not thread safe after HDFS-16422?




[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-03-23 Thread daimin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511224#comment-17511224
 ] 

daimin commented on HDFS-16422:
---

[~jingzhao] I tested this again. My test steps were:
 # Set up a cluster with 11 datanodes and write 4 EC RS-8-2 files: 1g, 2g, 4g, 
8g
 # Stop one datanode
 # Check the md5sum of these files through HDFS FUSE, which is a simple way to 
create concurrent preads (indirect IO on FUSE)

Here are the test results:
 * md5sum check before the datanode went down:

{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b  /mnt/fuse/1g
782173623681c129558c09e89251f46d  /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c  /mnt/fuse/4g
adb81da2c34161f249439597c515db1d  /mnt/fuse/8g
{quote}
 * md5sum after the datanode went down, with the native (ISA-L) decoder:

{quote}md5sum /mnt/fuse/*g
206288b264b92af42563a14a242aa629  /mnt/fuse/1g
bc86f9f549912d78c8b3d02ada5621a2  /mnt/fuse/2g
c201356b7437e6aac1b574ade08b6ccb  /mnt/fuse/4g
ef2e6f6b4b6ab96a24e5f734e93bacc3  /mnt/fuse/8g
{quote}
 * md5sum after the datanode went down, with the pure Java decoder:

{quote}md5sum /mnt/fuse/*g
5e6c32c0b572e2ff24fb14f93c4cc45b  /mnt/fuse/1g
782173623681c129558c09e89251f46d  /mnt/fuse/2g
e107f9a83a383b98aa23fdd3171b589c  /mnt/fuse/4g
adb81da2c34161f249439597c515db1d  /mnt/fuse/8g
{quote}
In conclusion: RSRawDecoder seems to be thread safe, NativeRSRawDecoder is not 
thread safe, and the read/write lock seems unable to protect the native 
decodeImpl method.

I also ran the md5sum check repeatedly on the same file with the native (ISA-L) 
decoder, and the result is different every time.
{quote}for i in {1..5}; do md5sum /mnt/fuse/1g; done
2e68ea6738dccb4f248df81b5c55d464  /mnt/fuse/1g
54944120797266fc4e26bd465ae5e67a  /mnt/fuse/1g
ef4d099269fb117e357015cf424723a9  /mnt/fuse/1g
6a40dbca2636ae796b6380385ddfbc83  /mnt/fuse/1g
126fc40073dcebb67d413de95571c08b  /mnt/fuse/1g
{quote}
IMO, HADOOP-15499 did improve the performance of the decoder, but it broke the 
correctness of the decode method when it is invoked concurrently. We should 
bring synchronized back, and I will submit a new PR later to do this work. 
Thanks again, [~jingzhao].
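
The check above (hash a file while many positional reads are in flight, then 
compare against a known-good hash) can also be sketched without FUSE. The 
following is only an illustration against a local file using java.nio, not the 
HDFS client API: the class name, the helper names, the chunk size and the 
thread count are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: local-file stand-in for "md5sum under concurrent preads". */
class ConcurrentPreadChecksum {

  /** Hex MD5 of a file whose chunks are fetched by concurrent positional reads. */
  static String md5ViaConcurrentPreads(Path file, int chunkSize, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long size = ch.size();
      List<Future<byte[]>> chunks = new ArrayList<>();
      for (long off = 0; off < size; off += chunkSize) {
        final long start = off;
        final int len = (int) Math.min(chunkSize, size - off);
        chunks.add(pool.submit(() -> {
          ByteBuffer buf = ByteBuffer.allocate(len);
          // Positional read: does not touch the channel's own position,
          // so many of these can run at once on the same channel.
          while (buf.hasRemaining()) {
            if (ch.read(buf, start + buf.position()) < 0) {
              throw new IOException("unexpected end of file");
            }
          }
          return buf.array();
        }));
      }
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (Future<byte[]> chunk : chunks) {
        md5.update(chunk.get()); // digest the chunks in file order
      }
      return hex(md5.digest());
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  /** Writes a temp file of pseudo-random bytes and compares both hashes. */
  static boolean matchesSequentialMd5(int sizeBytes) {
    try {
      byte[] data = new byte[sizeBytes];
      new Random(42).nextBytes(data);
      Path tmp = Files.createTempFile("pread-sketch", ".bin");
      try {
        Files.write(tmp, data);
        String expected = hex(MessageDigest.getInstance("MD5").digest(data));
        return expected.equals(md5ViaConcurrentPreads(tmp, 64 * 1024, 8));
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  static String hex(byte[] digest) {
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
```

On a healthy read path the two hashes should agree on every run; a hash that 
changes from run to run, as in the ISA-L results above, points at a race 
rather than at damaged data on disk.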




[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-03-16 Thread daimin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507969#comment-17507969
 ] 

daimin commented on HDFS-16422:
---

I tested both NativeRSRawDecoder and RSRawDecoder earlier, and neither of them 
seemed to be thread safe when decoding, so I simply added synchronized to the 
decode method.

In consideration of HADOOP-15499, I will do some more tests to find out what 
the original lock protection is missing.

Thanks for the reminder, [~jingzhao].
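
To make the "add synchronized to decode" idea concrete, here is a minimal 
sketch. ToyXorDecoder and its scratch buffer are hypothetical stand-ins written 
for this example; they are not Hadoop classes, and the real RawErasureDecoder 
API looks different. The point is only that one decoder instance with 
per-instance working state is shared by several reader threads, and that 
synchronized keeps their calls from interleaving on that state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: a shared decoder guarded by its object monitor. */
class SharedDecoderSketch {

  /** Hypothetical XOR decoder: rebuilds one missing unit from the survivors. */
  static final class ToyXorDecoder {
    // Per-instance scratch state; unsynchronized callers could interleave here.
    private final byte[] scratch;

    ToyXorDecoder(int unitSize) {
      this.scratch = new byte[unitSize];
    }

    // synchronized serializes callers on this decoder instance.
    synchronized byte[] decode(byte[][] survivors) {
      Arrays.fill(scratch, (byte) 0);
      for (byte[] unit : survivors) {
        for (int i = 0; i < scratch.length; i++) {
          scratch[i] ^= unit[i];
        }
      }
      return scratch.clone();
    }
  }

  /** Runs concurrent decodes on ONE decoder; true if every result is correct. */
  static boolean concurrentDecodesAreCorrect(int threads, int callsPerThread) {
    final int unitSize = 4096;
    final ToyXorDecoder decoder = new ToyXorDecoder(unitSize);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<Boolean>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final byte seed = (byte) (t + 1); // each thread decodes different data
        results.add(pool.submit(() -> {
          byte[] a = new byte[unitSize];
          byte[] b = new byte[unitSize];
          byte[] expected = new byte[unitSize];
          Arrays.fill(a, seed);
          Arrays.fill(b, (byte) 0x5A);
          Arrays.fill(expected, (byte) (seed ^ 0x5A));
          for (int c = 0; c < callsPerThread; c++) {
            if (!Arrays.equals(expected, decoder.decode(new byte[][] {a, b}))) {
              return false;
            }
          }
          return true;
        }));
      }
      for (Future<Boolean> r : results) {
        if (!r.get()) {
          return false;
        }
      }
      return true;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

The trade-off is the one HADOOP-15499 was concerned with: synchronized gives 
correctness by running one decode at a time per decoder, so concurrent preads 
on the same stream queue up behind each other during reconstruction.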

 




[jira] [Commented] (HDFS-16422) Fix thread safety of EC decoding during concurrent preads

2022-03-09 Thread Jing Zhao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503858#comment-17503858
 ] 

Jing Zhao commented on HDFS-16422:
--

Looks like we already have r/w lock protection in AbstractNativeRawDecoder and 
its subclasses (NativeRSRawDecoder and NativeXORRawDecoder). So does that mean 
the extra protection is only necessary for other decoder implementations (such 
as RSRawDecoder)? 

HADOOP-15499 used r/w lock to replace the original object monitor (i.e. 
synchronized) so as to improve performance. Now looks like we're adding 
"synchronized" back to the APIs defined in the parent class.

I guess instead of updating the decode APIs in RawErasureDecoder, we may want 
to only fix the subclasses without lock protection. What do you think, 
[~weichiu] [~cndaimin] ?
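
One general point about the two locking styles may help here. The sketch below 
assumes, purely for illustration, that decode calls would take the read side of 
a ReentrantReadWriteLock; this thread does not state how the lock is actually 
used, and none of the names below come from Hadoop. Under that assumption, two 
callers can be inside the guarded section at the same time, whereas an object 
monitor admits only one. A read lock therefore protects against a concurrent 
writer (for example a release), but not against two decodes overlapping.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Illustrative only: do two threads overlap inside the guarded section? */
class ReadLockOverlapSketch {

  static boolean twoThreadsOverlap(boolean useReadLock) {
    final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    final Object monitor = new Object();
    final CyclicBarrier bothInside = new CyclicBarrier(2);

    // Runs inside the guarded section: wait up to 1s for the other thread
    // to get inside as well. Success means both were inside simultaneously.
    final Callable<Boolean> meet = () -> {
      try {
        bothInside.await(1, TimeUnit.SECONDS);
        return true;
      } catch (TimeoutException | BrokenBarrierException e) {
        return false;
      }
    };

    final Callable<Boolean> guarded = () -> {
      if (useReadLock) {
        rw.readLock().lock(); // shared mode: several holders are allowed
        try {
          return meet.call();
        } finally {
          rw.readLock().unlock();
        }
      }
      synchronized (monitor) { // exclusive: one holder at a time
        return meet.call();
      }
    };

    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      Future<Boolean> first = pool.submit(guarded);
      Future<Boolean> second = pool.submit(guarded);
      return first.get() && second.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("read lock overlap:    " + twoThreadsOverlap(true));
    System.out.println("synchronized overlap: " + twoThreadsOverlap(false));
  }
}
```

Whether this is what happens in the native decoders is exactly the open 
question above; the sketch only shows that a read lock on its own does not 
serialize callers.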
