[ https://issues.apache.org/jira/browse/SPARK-34536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
haiyangyu updated SPARK-34536:
------------------------------

Summary: zstd-jni leads to reading less shuffle data
Key: SPARK-34536
URL: https://issues.apache.org/jira/browse/SPARK-34536
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0, 2.4.7
Reporter: haiyangyu
Priority: Major
Labels: data-loss
Attachments: image-2021-02-25-17-50-49-427.png, image-2021-02-25-17-51-49-998.png

Description:

h2. Background

I found a rare case in which some partitions read less shuffle data than expected when zstd compression is used.

h2. Detail

I saved both the normal shuffle data and the corrupted shuffle data, and found that the corrupted data was a prefix of the normal data. zstd-jni tag 1.3.3-2 has a bug that can make a reader consume only the head of the data and then exit as if it had finished normally.

ZstdInputStream in zstd-jni (tag 1.3.3-2) can return 0 from a read call. This violates the InputStream contract: read must not return 0 unless the requested length is 0. Spark wraps the ZstdInputStream in a BufferedInputStream, and when ZstdInputStream's read returns 0, BufferedInputStream treats the 0 as end-of-stream and stops reading, which leads to data loss.

zstd-jni issue: https://github.com/luben/zstd-jni/issues/159

zstd-jni fixed this problem in tag 1.4.4-3; the code is as follows:

!image-2021-02-25-17-50-49-427.png|width=544,height=232!

So I think it is necessary to upgrade the zstd-jni version to 1.4.4-3 in Spark 2.4, since Spark 2.4 is widely used in production.

The BufferedInputStream code is as follows:

!image-2021-02-25-17-51-49-998.png|width=539,height=310!
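The truncation mechanism described above can be reproduced without zstd at all. The sketch below (class and method names are hypothetical, for illustration only) wraps a stream that returns 0 mid-read, the same contract violation ZstdInputStream showed in zstd-jni 1.3.3-2, in a BufferedInputStream; BufferedInputStream's internal fill() does not update its buffer on a 0 return, so the next read reports -1 (EOF) and the tail of the data is silently lost:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ZeroReadDemo {
    // Hands out at most 5 bytes on the first call, then returns 0 (instead
    // of blocking or returning -1) even though len > 0 -- the same
    // misbehavior ZstdInputStream showed in zstd-jni 1.3.3-2.
    static class ZeroReturningStream extends InputStream {
        private final ByteArrayInputStream data =
                new ByteArrayInputStream("HelloWorld".getBytes());
        private boolean firstChunkServed = false;

        @Override
        public int read() throws IOException {
            return data.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            if (firstChunkServed) {
                return 0; // violates the InputStream contract (len > 0 here)
            }
            firstChunkServed = true;
            return data.read(b, off, Math.min(len, 5)); // delivers "Hello"
        }
    }

    static String readThroughBuffered() throws IOException {
        StringBuilder out = new StringBuilder();
        try (InputStream in = new BufferedInputStream(new ZeroReturningStream())) {
            // Keep the request smaller than BufferedInputStream's internal
            // 8192-byte buffer so reads go through fill(), the code path
            // that turns the bogus 0 into a premature EOF.
            byte[] buf = new byte[16];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.append(new String(buf, 0, n));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Only "Hello" comes back; "World" is lost, mirroring the truncated
        // shuffle partitions reported above.
        System.out.println(readThroughBuffered());
    }
}
```

Reading the faulty stream directly in a well-written loop would spin on the 0 rather than lose data; it is the BufferedInputStream wrapper that converts the contract violation into a silent prefix read, which is why the corrupted shuffle files are exact head portions of the normal ones.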
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org