[ https://issues.apache.org/jira/browse/SPARK-34536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
haiyangyu updated SPARK-34536:
------------------------------

Summary: zstd-jni leads to reading less shuffle data
Key: SPARK-34536
URL: https://issues.apache.org/jira/browse/SPARK-34536
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0, 2.4.7
Reporter: haiyangyu
Priority: Major
Labels: data-loss
Attachments: image-2021-02-25-17-50-49-427.png, image-2021-02-25-17-51-49-998.png

Description:

h2. Background

I found a rare case in which some partitions read less shuffle data than expected when zstd compression is used.

h2. Detail

I saved both the normal shuffle data and the corrupted shuffle data, and found that the corrupted data was a prefix of the normal data. zstd-jni tag 1.3.3-2 has a bug that can make a reader consume only the head of the data and then exit as if it had finished normally.

ZstdInputStream in zstd-jni (tag 1.3.3-2) can return 0 from a read call. This violates the InputStream contract: read must not return 0 unless the requested length is 0. Spark wraps the ZstdInputStream in a BufferedInputStream, and when ZstdInputStream's read returns 0, BufferedInputStream treats the 0 as end-of-stream and stops reading, which leads to data loss.

zstd-jni issue: https://github.com/luben/zstd-jni/issues/159

zstd-jni fixed this problem in tag 1.4.4-3; the code is as follows:

!image-2021-02-25-17-50-49-427.png|width=544,height=232!

So I think it is necessary to upgrade the zstd-jni version to 1.4.4-3 in Spark 2.4, since Spark 2.4 is widely used in production.

The BufferedInputStream code is as follows:

!image-2021-02-25-17-51-49-998.png|width=539,height=310!
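The truncation mechanism described above can be reproduced without zstd at all. The sketch below (class and method names are hypothetical, for illustration only) wraps a stream that returns 0 mid-read, the same contract violation ZstdInputStream showed in zstd-jni 1.3.3-2, in a BufferedInputStream; BufferedInputStream's internal fill() does not update its buffer on a 0 return, so the next read reports -1 (EOF) and the tail of the data is silently lost:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ZeroReadDemo {
    // Hands out at most 5 bytes on the first call, then returns 0 (instead
    // of blocking or returning -1) even though len > 0 -- the same
    // misbehavior ZstdInputStream showed in zstd-jni 1.3.3-2.
    static class ZeroReturningStream extends InputStream {
        private final ByteArrayInputStream data =
                new ByteArrayInputStream("HelloWorld".getBytes());
        private boolean firstChunkServed = false;

        @Override
        public int read() throws IOException {
            return data.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (firstChunkServed) {
                return 0; // violates the InputStream contract (len > 0 here)
            }
            firstChunkServed = true;
            return data.read(b, off, Math.min(len, 5)); // delivers "Hello"
        }
    }

    static String readThroughBuffered() throws IOException {
        StringBuilder out = new StringBuilder();
        try (InputStream in = new BufferedInputStream(new ZeroReturningStream())) {
            // Keep the request smaller than BufferedInputStream's internal
            // 8192-byte buffer so reads go through fill(), the code path
            // that turns the bogus 0 into a premature EOF.
            byte[] buf = new byte[16];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.append(new String(buf, 0, n));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Only "Hello" comes back; "World" is lost, mirroring the truncated
        // shuffle partitions reported above.
        System.out.println(readThroughBuffered());
    }
}
```

Reading the faulty stream directly in a well-written loop would spin on the 0 rather than lose data; it is the BufferedInputStream wrapper that converts the contract violation into a silent prefix read, which is why the corrupted shuffle files are exact head portions of the normal ones.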
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org