[jira] [Commented] (HDFS-14099) Unknown frame descriptor when decompressing multiple frames in ZStandardDecompressor
[ https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369411#comment-17369411 ] Chenren Shao commented on HDFS-14099: - I have confirmed that this issue has been resolved. Thanks, both!

> Unknown frame descriptor when decompressing multiple frames in
> ZStandardDecompressor
>
> Key: HDFS-14099
> URL: https://issues.apache.org/jira/browse/HDFS-14099
> Project: Hadoop HDFS
> Issue Type: Bug
> Environment: Hadoop Version: hadoop-3.0.3
> Java Version: 1.8.0_144
> Reporter: xuzq
> Assignee: xuzq
> Priority: Major
> Attachments: HDFS-14099-trunk-001.patch, HDFS-14099-trunk-002.patch, HDFS-14099-trunk-003.patch
>
> We need to use the ZSTD compression algorithm in Hadoop, so I wrote a simple demo like this for testing:
> {code:java}
> while ((size = fsDataInputStream.read(bufferV2)) > 0) {
>     countSize += size;
>     if (countSize == 65536 * 8) {
>         if (!isFinished) {
>             // finish a frame in zstd
>             cmpOut.finish();
>             isFinished = true;
>         }
>         fsDataOutputStream.flush();
>         fsDataOutputStream.hflush();
>     }
>     if (isFinished) {
>         LOG.info("Will resetState. N=" + n);
>         // reset the stream and write again
>         cmpOut.resetState();
>         isFinished = false;
>     }
>     cmpOut.write(bufferV2, 0, size);
>     bufferV2 = new byte[5 * 1024 * 1024];
>     n++;
> }
> {code}
>
> Then I used "*hadoop fs -text*" to read this file, and it failed. The error is as below:
> {code:java}
> Exception in thread "main" java.lang.InternalError: Unknown frame descriptor
>     at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
>     at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
>     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
>     at java.io.InputStream.read(InputStream.java:101)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:98)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:66)
>     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:127)
>     at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:101)
>     at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:96)
>     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
>     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:303)
>     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:285)
>     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:269)
>     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
>     at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
>     at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>     at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
> {code}
>
> So I looked into the code, including the JNI, and found this bug: the *ZSTD_initDStream(stream)* method may be called twice for the same *Frame*.
> The first call is in *ZStandardDecompressor.c*:
> {code:java}
> if (size == 0) {
>     (*env)->SetBooleanField(env, this, ZStandardDecompressor_finished, JNI_TRUE);
>     size_t result = dlsym_ZSTD_initDStream(stream);
>     if (dlsym_ZSTD_isError(result)) {
>         THROW(env, "java/lang/InternalError", dlsym_ZSTD_getErrorName(result));
>         return (jint) 0;
>     }
> }
> {code}
> This call is correct, but *finished* is never set back to false, even when there is still data (a new frame) in the *CompressedBuffer* or *UserBuffer* that needs to be decompressed.
> The second call happens in *org.apache.hadoop.io.compress.DecompressorStream*, via *decompressor.reset()*, because *finished* is always true after a *Frame* has been decompressed:
> {code:java}
> if (decompressor.finished()) {
>     // First see if there was any leftover buffered input from previous
>     // stream; if not, attempt to refill buffer.  If refill -> EOF, we're
>     // all done; else reset, fix up input buffer, and get ready for next
>     // concatenated substream/"member".
>     int nRemaining = decompressor.getRemaining();
>     if (nRemaining == 0) {
>         int m = getCompressedData();
>         if (m == -1) {
>             // apparently the previous end-of-stream was also end-of-file:
>             // return success, as if we had never called getCompressedData()
>             eof = true;
>             return -1;
>         }
>         decompressor.reset();
>         decompressor.setInput(buffer, 0, m);
>         lastBytesSent = m;
>     } else {
>         // looks like it's a concatenated stream: re
> {code}
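The DecompressorStream logic quoted above — check finished(), then getRemaining(), then reset() and resend the leftover input — is the standard contract for handling concatenated compression "members". As a hedged illustration of that same contract, here is a self-contained sketch using the JDK's Deflater/Inflater (a stand-in codec, since the JDK has no built-in zstd support; concatenated deflate streams behave analogously to concatenated zstd frames):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ConcatFramesDemo {

    // Compress one complete "frame" (a full deflate stream) from the given bytes.
    static byte[] frame(byte[] data) {
        Deflater def = new Deflater();
        def.setInput(data);
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!def.finished()) {
            out.write(buf, 0, def.deflate(buf));
        }
        def.end();
        return out.toByteArray();
    }

    // Decode a concatenation of frames. When the inflater reports finished()
    // but input bytes remain, reset it and resend the leftover bytes -- the
    // same finished()/getRemaining()/reset() dance DecompressorStream does.
    static String decodeConcat(byte[] concat) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(concat);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (true) {
            int n = inf.inflate(buf);
            out.write(buf, 0, n);
            if (inf.finished()) {
                int remaining = inf.getRemaining();
                if (remaining == 0) {
                    break;  // end-of-stream was also end-of-input
                }
                // Leftover bytes are the start of the next frame.
                inf.reset();
                inf.setInput(concat, concat.length - remaining, remaining);
            }
        }
        inf.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] f1 = frame("hello ".getBytes(StandardCharsets.UTF_8));
        byte[] f2 = frame("world".getBytes(StandardCharsets.UTF_8));
        byte[] concat = new byte[f1.length + f2.length];
        System.arraycopy(f1, 0, concat, 0, f1.length);
        System.arraycopy(f2, 0, concat, f1.length, f2.length);
        System.out.println(decodeConcat(concat));  // prints "hello world"
    }
}
```

The bug described in this issue is precisely that the zstd JNI side never cleared *finished*, so this reset-and-resend path could re-enter decompression mid-frame.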
[ https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368435#comment-17368435 ] Chenren Shao commented on HDFS-14099: - Taking a deeper look at HADOOP-17096, I found that this fix only affects compression. I am not sure how it could impact the decompression issue that I encounter here. [~xuzq_zander], when you did your test, which patch did you use: [https://patch-diff.githubusercontent.com/raw/apache/hadoop/pull/441.patch] or [^HDFS-14099-trunk-003.patch]? In my previous test, we used the latter and still got the error:
{code:java}
java.lang.InternalError: Unknown frame descriptor
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zstd.ZStandardDecompressor.decompress(ZStandardDecompressor.java:181)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
{code}
[ https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368231#comment-17368231 ] Chenren Shao commented on HDFS-14099: - Thank you, [~xuzq_zander] and [~weichiu]. As you suspected, I tried the patch on top of 3.2.1, so it is very likely that HADOOP-17096 was the issue. I will apply the patch for HADOOP-17096 and try again.
[ https://issues.apache.org/jira/browse/HDFS-14099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360122#comment-17360122 ] Chenren Shao edited comment on HDFS-14099 at 6/9/21, 9:53 PM: - Hi, all. I found that Hadoop cannot process multi-frame zstd files; we applied this patch and were still not able to process them. The error message is the same as the one posted here. I attached the problematic file [here|https://drive.google.com/file/d/12oGYQL63jmSBDwFi208jDNzFSP_CrraL/view?usp=sharing], and the issue can be reproduced by reading it via Spark. The file was created by essentially running `cat file1.zst file2.zst > output.zst`. You can run `zstd -d output.zst` to decompress it without any issue, but spark.read fails on it. Spark reads of file1.zst and file2.zst individually have no problem.
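The "Unknown frame descriptor" error is thrown when the bytes handed to the decompressor do not begin with a valid zstd frame header (magic number 0xFD2FB528, stored little-endian as 28 B5 2F FD). A quick sanity check on a suspect file such as output.zst is to count occurrences of the frame magic. This is a hypothetical helper, not part of Hadoop, and the naive scan can over-count if the magic bytes happen to occur inside compressed payload, so treat the result as an upper bound:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ZstdFrameCounter {
    // zstd frame magic number 0xFD2FB528, little-endian on disk: 28 B5 2F FD
    static final byte[] MAGIC = {(byte) 0x28, (byte) 0xB5, (byte) 0x2F, (byte) 0xFD};

    // Naively count occurrences of the frame magic in the raw bytes.
    static int countFrames(byte[] data) {
        int count = 0;
        for (int i = 0; i + 4 <= data.length; i++) {
            if (data[i] == MAGIC[0] && data[i + 1] == MAGIC[1]
                    && data[i + 2] == MAGIC[2] && data[i + 3] == MAGIC[3]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("frame magics found: " + countFrames(data));
    }
}
```

On a file built with `cat file1.zst file2.zst > output.zst`, this should report at least 2, confirming the input is multi-frame and will exercise the reset path in ZStandardDecompressor.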