[jira] [Commented] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor

2021-06-25 Thread Chenren Shao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369411#comment-17369411
 ] 

Chenren Shao commented on HDFS-14099:
-

I have confirmed that this issue has been resolved. Thanks, both!

> Unknown frame descriptor when decompressing multiple frames in 
> ZStandardDecompressor
> 
>
> Key: HDFS-14099
> URL: https://issues.apache.org/jira/browse/HDFS-14099
> Project: Hadoop HDFS
> Issue Type: Bug
> Environment: Hadoop Version: hadoop-3.0.3
> Java Version: 1.8.0_144
> Reporter: xuzq
> Assignee: xuzq
> Priority: Major
> Attachments: HDFS-14099-trunk-001.patch, HDFS-14099-trunk-002.patch, 
> HDFS-14099-trunk-003.patch
>
>
> We need to use the ZSTD compression algorithm in Hadoop, so I wrote a simple 
> demo like this for testing.
> {code:java}
> // Copy the input into the compressed output stream. When the running total
> // reaches exactly 65536 * 8 bytes, finish the current zstd frame and flush.
> while ((size = fsDataInputStream.read(bufferV2)) > 0) {
>   countSize += size;
>   if (countSize == 65536 * 8) {
>     if (!isFinished) {
>       // finish a frame in zstd
>       cmpOut.finish();
>       isFinished = true;
>     }
>     fsDataOutputStream.flush();
>     fsDataOutputStream.hflush();
>   }
>   if (isFinished) {
>     LOG.info("Will resetState. N=" + n);
>     // reset the stream and write again
>     cmpOut.resetState();
>     isFinished = false;
>   }
>   cmpOut.write(bufferV2, 0, size);
>   bufferV2 = new byte[5 * 1024 * 1024];
>   n++;
> }
> {code}
>  
> Then I used "*hadoop fs -text*" to read this file, and it failed with the 
> error below.
> {code:java}
> Exception in thread "main" java.lang.InternalError: Unknown frame descriptor
>     at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
>     at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
>     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:98)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:66)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:127)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:101)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
> {code}
>  
> So I had to look at the code, including the JNI part, and then found this bug.
> The *ZSTD_initDStream(stream)* method may be called twice in the same *Frame*.
> The first call is in *ZStandardDecompressor.c*.
> {code:java}
> if (size == 0) {
>   (*env)->SetBooleanField(env, this, ZStandardDecompressor_finished, JNI_TRUE);
>   size_t result = dlsym_ZSTD_initDStream(stream);
>   if (dlsym_ZSTD_isError(result)) {
>     THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(result));
>     return (jint) 0;
>   }
> }
> {code}
> This call is correct, but *Finished* is never set back to false afterwards, 
> even if there is still data (a new frame) in *CompressedBuffer* or 
> *UserBuffer* that needs to be decompressed.
> The second call is made in *org.apache.hadoop.io.compress.DecompressorStream* 
> through *decompressor.reset()*, because *Finished* is always true after a 
> *Frame* has been decompressed.
> {code:java}
> if (decompressor.finished()) {
>   // First see if there was any leftover buffered input from previous
>   // stream; if not, attempt to refill buffer.  If refill -> EOF, we're
>   // all done; else reset, fix up input buffer, and get ready for next
>   // concatenated substream/"member".
>   int nRemaining = decompressor.getRemaining();
>   if (nRemaining == 0) {
>     int m = getCompressedData();
>     if (m == -1) {
>       // apparently the previous end-of-stream was also end-of-file:
>       // return success, as if we had never called getCompressedData()
>       eof = true;
>       return -1;
>     }
>     decompressor.reset();
>     decompressor.setInput(buffer, 0, m);
>     lastBytesSent = m;
>   } else {
>     // looks like it's a concatenated stream:  re
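
The double initialization described in the report can be illustrated with a toy model. The sketch below is not Hadoop or zstd code. It assumes a made-up frame format (one marker byte 0x5A, one length byte, then the payload) and a hypothetical class `ToyFrameDecoder` whose `init()` stands in for ZSTD_initDStream and whose `reset()` stands in for decompressor.reset(). The constructor flag `clearFinishedOnNewFrame` is also an assumption: it shows one plausible way to keep the flag consistent and is not necessarily what the attached patches do.

```java
class ToyFrameDecoder {
    static final byte MAGIC = 0x5A; // made-up frame marker, not the zstd magic number

    private static final int WANT_MAGIC = 0, WANT_LENGTH = 1, IN_PAYLOAD = 2;

    private final boolean clearFinishedOnNewFrame; // hypothetical switch: false keeps the stale flag
    private int state = WANT_MAGIC;
    private int payloadLeft = 0;
    private boolean finished = false;

    ToyFrameDecoder(boolean clearFinishedOnNewFrame) {
        this.clearFinishedOnNewFrame = clearFinishedOnNewFrame;
    }

    boolean finished() { return finished; }

    /** Stand-in for ZSTD_initDStream: the next byte must start a frame. */
    void init() { state = WANT_MAGIC; payloadLeft = 0; }

    /** Stand-in for decompressor.reset(): re-initialize and clear the flag. */
    void reset() { init(); finished = false; }

    /** Frame end, as in the native snippet: set finished, then initialize again. */
    private void endFrame() { finished = true; init(); }

    /** Consumes the whole chunk and returns the number of payload bytes decoded. */
    int feed(byte[] chunk) {
        int decoded = 0;
        for (byte b : chunk) {
            switch (state) {
                case WANT_MAGIC:
                    if (b != MAGIC) {
                        throw new IllegalStateException("Unknown frame descriptor");
                    }
                    if (clearFinishedOnNewFrame) {
                        finished = false; // a new frame has started
                    }
                    state = WANT_LENGTH;
                    break;
                case WANT_LENGTH:
                    payloadLeft = b;
                    if (payloadLeft == 0) endFrame(); else state = IN_PAYLOAD;
                    break;
                default: // IN_PAYLOAD
                    decoded++;
                    if (--payloadLeft == 0) endFrame();
            }
        }
        return decoded;
    }

    /** Toy stream loop: before each refill, reset if the decoder reports finished. */
    static int decodeAll(ToyFrameDecoder decoder, byte[][] chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            if (decoder.finished()) {
                decoder.reset();
            }
            total += decoder.feed(chunk);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[][] chunks = {
            {MAGIC, 2, 10, 11, MAGIC, 3, 20}, // all of frame 1, then the start of frame 2
            {21, 22}                          // the rest of frame 2
        };
        System.out.println(decodeAll(new ToyFrameDecoder(true), chunks)); // prints 5
        try {
            decodeAll(new ToyFrameDecoder(false), chunks);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Unknown frame descriptor"
        }
    }
}
```

With the stale flag, the loop resets the decoder in the middle of frame 2, so the next payload byte is read as a frame header and rejected. When every chunk ends exactly on a frame boundary, both variants decode all the data, which would be consistent with single-frame files reading without problems.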

[jira] [Commented] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor

2021-06-23 Thread Chenren Shao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368435#comment-17368435
 ] 

Chenren Shao commented on HDFS-14099:
-

I took a deeper look at HADOOP-17096 and found that its fix only affects 
compression. I am not sure how it could affect the decompression issue that I 
encountered here. 

[~xuzq_zander], when you ran your test, which patch did you use: 
[https://patch-diff.githubusercontent.com/raw/apache/hadoop/pull/441.patch]
or [^HDFS-14099-trunk-003.patch]? In my previous test, we used the latter and 
still got the following error:

```
java.lang.InternalError: Unknown frame descriptor
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
```


[jira] [Commented] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor

2021-06-23 Thread Chenren Shao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368231#comment-17368231
 ] 

Chenren Shao commented on HDFS-14099:
-

Thank you, [~xuzq_zander] and [~weichiu]. As you suspected, I tried the patch 
on top of 3.2.1, so it is very likely that HADOOP-17096 was the issue. I 
will apply the patch for HADOOP-17096 and try again.


[jira] [Comment Edited] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor

2021-06-09 Thread Chenren Shao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360122#comment-17360122
 ] 

Chenren Shao edited comment on HDFS-14099 at 6/9/21, 9:53 PM:
--

Hi, all. 

I found that Hadoop cannot process multi-frame zstd files. We applied this 
patch and were still unable to process such a file. The error message is the 
same as the one posted here.

I attached the problematic file 
[here|https://drive.google.com/file/d/12oGYQL63jmSBDwFi208jDNzFSP_CrraL/view?usp=sharing];
the issue can be reproduced by reading it via Spark. This file was created 
by essentially running `cat file1.zst file2.zst > output.zst`. You can run 
`zstd -d output.zst` to decompress it without any issue, but spark.read fails 
on it. Reading file1.zst or file2.zst individually with Spark works fine.
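
For anyone inspecting a concatenated file like this, here is a minimal sketch of where the second frame should begin. It relies on one fact from the Zstandard format: a standard frame starts with the magic number 0xFD2FB528, stored little-endian as the bytes 28 B5 2F FD. The class `ZstdMagicCheck` and its helpers `startsWithZstdFrame`, `fakeFrame`, and `concat` are names made up for illustration, not Hadoop or zstd APIs, and the payload bytes are placeholders rather than valid compressed data.

```java
import java.io.ByteArrayOutputStream;

class ZstdMagicCheck {
    // Zstandard frame magic number 0xFD2FB528 in on-disk (little-endian) byte order.
    static final byte[] ZSTD_MAGIC = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD};

    /** Returns true if the four bytes at 'offset' equal the zstd frame magic number. */
    static boolean startsWithZstdFrame(byte[] data, int offset) {
        if (offset < 0 || offset + ZSTD_MAGIC.length > data.length) {
            return false;
        }
        for (int i = 0; i < ZSTD_MAGIC.length; i++) {
            if (data[offset + i] != ZSTD_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a stand-in "file": the real magic number followed by placeholder bytes. */
    static byte[] fakeFrame(int payloadLength) {
        byte[] frame = new byte[ZSTD_MAGIC.length + payloadLength];
        System.arraycopy(ZSTD_MAGIC, 0, frame, 0, ZSTD_MAGIC.length);
        for (int i = ZSTD_MAGIC.length; i < frame.length; i++) {
            frame[i] = 0x11; // placeholder, not valid zstd data
        }
        return frame;
    }

    /** In-memory equivalent of `cat a b > out`. */
    static byte[] concat(byte[] a, byte[] b) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(a, 0, a.length);
        out.write(b, 0, b.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] file1 = fakeFrame(10);
        byte[] file2 = fakeFrame(20);
        byte[] output = concat(file1, file2);
        System.out.println(startsWithZstdFrame(output, 0));                // true
        System.out.println(startsWithZstdFrame(output, file1.length));     // true
        System.out.println(startsWithZstdFrame(output, file1.length - 1)); // false
    }
}
```

The second frame begins exactly at the length of the first file. A decoder that is re-initialized at any other offset will not find the magic number there, which, as far as I know, is the condition libzstd reports as "Unknown frame descriptor".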


was (Author: cshao239):
Hi, all. 

I found that Hadoop cannot process multi-frame zstd files. We applied this 
patch and were still unable to process such a file. The error message is the 
same as the one posted here.

I will try to attach the problematic file here; the issue can be reproduced 
by reading it via Spark. This file was created by essentially running `cat 
file1.zst file2.zst > output.zst`. You can run `zstd -d output.zst` to 
decompress it without any issue, but spark.read fails on it. Reading 
file1.zst or file2.zst individually with Spark works fine.


[jira] [Comment Edited] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor

2021-06-09 Thread Chenren Shao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360122#comment-17360122
 ] 

Chenren Shao edited comment on HDFS-14099 at 6/9/21, 2:35 PM:
--

Hi, all. 

I found that Hadoop cannot process multi-frame zstd files. We applied this 
patch and were still unable to process such a file. The error message is the 
same as the one posted here.

I will try to attach the problematic file here; the issue can be reproduced 
by reading it via Spark. This file was created by essentially running `cat 
file1.zst file2.zst > output.zst`. You can run `zstd -d output.zst` to 
decompress it without any issue, but spark.read fails on it. Reading 
file1.zst or file2.zst individually with Spark works fine.


was (Author: cshao239):
Hi, all. 

I found that Hadoop cannot process multi-frame files. We applied this patch 
and were still unable to process such a file. The error message is the same 
as the one posted here.

I will try to attach the problematic file here; the issue can be reproduced 
by reading it via Spark. This file was created by essentially running `cat 
file1.zst file2.zst > output.zst`. You can run `zstd -d output.zst` to 
decompress it without any issue, but spark.read fails on it. Reading 
file1.zst or file2.zst individually with Spark works fine.


[jira] [Commented] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor

2021-06-09 Thread Chenren Shao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360122#comment-17360122
 ] 

Chenren Shao commented on HDFS-14099:
-

Hi, all. 

I found that Hadoop cannot process multi-frame files. We applied this patch 
and were still unable to process such a file. The error message is the same 
as the one posted here.

I will try to attach the problematic file here; the issue can be reproduced 
by reading it via Spark. This file was created by essentially running `cat 
file1.zst file2.zst > output.zst`. You can run `zstd -d output.zst` to 
decompress it without any issue, but spark.read fails on it. Reading 
file1.zst or file2.zst individually with Spark works fine.
