[ https://issues.apache.org/jira/browse/HADOOP-17096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Akira Ajisaka updated HADOOP-17096:
-----------------------------------
    Summary: Fix ZStandardCompressor input buffer offset  (was: ZStandardCompressor throws java.lang.InternalError: Error (generic))

Fix ZStandardCompressor input buffer offset
-------------------------------------------

                 Key: HADOOP-17096
                 URL: https://issues.apache.org/jira/browse/HADOOP-17096
             Project: Hadoop Common
          Issue Type: Bug
          Components: io
    Affects Versions: 3.2.1
         Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1 linking to libzstd 1.3.1. The bug is difficult to reproduce in an end-to-end environment (e.g. running an actual Hadoop job with zstd compression) because it is very sensitive to the exact input and output characteristics. I reproduced the bug by turning one of the existing unit tests into a crude fuzzer, but I'm not sure upstream will accept that patch, so I've attached it separately on this ticket.

Note that the existing unit test testCompressingWithOneByteOutputBuffer fails to reproduce this bug. This is because it uses the license file as input, and that file is too small: libzstd has internal buffering (in our environment it seems to be 128 kilobytes), while the license file is only 10 kilobytes. libzstd is therefore able to consume all the input and compress it in a single call, then return pieces of its internal buffer one byte at a time. Since all the input is consumed in a single call, uncompressedDirectBufOff and uncompressedDirectBufLen are both set to zero, and the bug does not reproduce.

            Reporter: Stephen Jung (Stripe)
            Assignee: Stephen Jung (Stripe)
            Priority: Major
              Labels: pull-request-available
             Fix For: 3.2.2, 3.3.1, 3.4.0
         Attachments: fuzztest.patch
          Time Spent: 10m
  Remaining Estimate: 0h

A bug in index handling causes ZStandardCompressor.c to pass a malformed ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" that gets thrown.
The crux of the issue is two variables, uncompressedDirectBufLen and uncompressedDirectBufOff. The Hadoop code counts uncompressedDirectBufOff from the start of uncompressedDirectBuf, but then counts uncompressedDirectBufLen from uncompressedDirectBufOff. libzstd, however, considers pos and size to both be counted from the start of the buffer. As a result, this line
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
causes a malformed buffer to be passed to libzstd, where pos > size. Here's a longer walkthrough of the bug in case the abstract explanation is unclear:
----
Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with five bytes of input. This results in uncompressedDirectBufOff=0 and uncompressedDirectBufLen=5
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).
Then we call compress(), which initializes a ZSTD_inBuffer
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
The definition of those libzstd structs is at https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261; note that we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The ZSTD_inBuffer gets passed to libzstd, compression happens, and so on. When libzstd returns from the compression function, it updates the ZSTD_inBuffer struct to indicate how many bytes were consumed
(https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
Note that pos is advanced, but size is unchanged.

Now, libzstd does not guarantee that the entire input will be compressed in a single call of the compression function.
(Some of the compression libraries used by Hadoop, such as snappy, _do_ provide this guarantee, but libzstd is not one of them.) So the Hadoop native code updates uncompressedDirectBufOff and uncompressedDirectBufLen using the updated ZSTD_inBuffer:
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228

Returning to our example, we started with 5 bytes of uncompressed input. Suppose libzstd compressed 4 of those bytes, leaving one unread. This results in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four bytes were consumed). The Hadoop native code then sets uncompressedDirectBufOff=4, but it also sets uncompressedDirectBufLen=1 (five minus four equals one).

Since some of the input was not consumed, we will eventually call compress() again. We then instantiate another ZSTD_inBuffer struct, this time with size=1 and pos=4. This is a bug: libzstd expects size and pos to both be counted from the start of the buffer, so pos > size is malformed. libzstd therefore returns an error
(https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932),
which gets escalated as a java.lang.InternalError.

I will be submitting a pull request on GitHub with a fix for this bug. The key is that the Hadoop code should handle offsets the same way libzstd does, i.e. uncompressedDirectBufLen should be counted from the start of uncompressedDirectBuf, not from uncompressedDirectBufOff.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org