[ https://issues.apache.org/jira/browse/HADOOP-17096?focusedWorklogId=509899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-509899 ]

ASF GitHub Bot logged work on HADOOP-17096:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Nov/20 19:38
            Start Date: 10/Nov/20 19:38
    Worklog Time Spent: 10m 
      Work Description: jojochuang merged pull request #2104:
URL: https://github.com/apache/hadoop/pull/2104

Issue Time Tracking
-------------------

            Worklog Id:     (was: 509899)
    Remaining Estimate: 0h
            Time Spent: 10m

> ZStandardCompressor throws java.lang.InternalError: Error (generic)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-17096
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17096
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 3.2.1
>         Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1 
> linked against libzstd 1.3.1. The bug is difficult to reproduce in an 
> end-to-end environment (e.g. running an actual Hadoop job with zstd 
> compression) because it is very sensitive to the exact input and output 
> characteristics. I reproduced the bug by turning one of the existing unit 
> tests into a crude fuzzer, but I'm not sure upstream will accept that 
> patch, so I've attached it separately on this ticket.
> Note that the existing unit test testCompressingWithOneByteOutputBuffer 
> fails to reproduce this bug. That test uses the license file as input, and 
> that file is too small: libzstd buffers input internally (in our 
> environment the buffer appears to be 128 kilobytes), and the license file 
> is only 10 kilobytes. libzstd is therefore able to consume all the input 
> and compress it in a single call, then return pieces of its internal buffer 
> one byte at a time. Since all the input is consumed in a single call, 
> uncompressedDirectBufOff and uncompressedDirectBufLen are both set to zero, 
> and the bug does not reproduce.
>            Reporter: Stephen Jung (Stripe)
>            Assignee: Stephen Jung (Stripe)
>            Priority: Major
>         Attachments: fuzztest.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> A bug in index handling causes ZStandardCompressor.c to pass a malformed 
> ZSTD_inBuffer to libzstd; libzstd then returns an "Error (generic)" that 
> gets thrown as a Java exception. The crux of the issue is two variables, 
> uncompressedDirectBufLen and uncompressedDirectBufOff. The Hadoop code 
> counts uncompressedDirectBufOff from the start of uncompressedDirectBuf, 
> then counts uncompressedDirectBufLen from uncompressedDirectBufOff. 
> libzstd, however, considers pos and size to both be counted from the start 
> of the buffer. As a result, this line 
> https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
>  causes a malformed buffer, with pos > size, to be passed to libzstd. 
> Here's a longer description of the bug in case this abstract explanation 
> is unclear:
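> For reference, here is the libzstd struct in question, copied from zstd.h 
> v1.3.1 with the comments abridged:
> 
>     typedef struct ZSTD_inBuffer_s {
>         const void* src;  /* start of input buffer */
>         size_t size;      /* size of input buffer */
>         size_t pos;       /* position where reading stopped; updated by 
>                            * libzstd; necessarily 0 <= pos <= size */
>     } ZSTD_inBuffer;
> 
> Both size and pos are relative to src, so a struct with pos > size can 
> never be valid.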
> ----
> Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData) with 
> five bytes of input. This results in uncompressedDirectBufOff=0 and 
> uncompressedDirectBufLen=5 
> (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).
> Then we call compress(), which initializes a ZSTD_inBuffer 
> (https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
>  Those libzstd structs are defined at 
> https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 (the 
> ZSTD_inBuffer definition is quoted above); note that we set 
> size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff. The 
> ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When libzstd 
> returns from the compression function, it updates the ZSTD_inBuffer struct to 
> indicate how many bytes were consumed 
> (https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
>  Note that pos is advanced, but size is unchanged.
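> To make the sequence concrete, the first compress() call amounts to the 
> following (a paraphrase with simplified names, not the exact JNI code):
> 
>     ZSTD_inBuffer input;
>     input.src  = uncompressedDirectBuf;
>     input.size = uncompressedDirectBufLen;  /* 5 */
>     input.pos  = uncompressedDirectBufOff;  /* 0 */
>     ZSTD_compressStream(stream, &output, &input);
>     /* On return, libzstd has advanced pos to the number of bytes it 
>      * consumed and left size untouched. */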
> Now, libzstd does not guarantee that the entire input will be compressed in 
> a single call of the compression function. (Some of the compression 
> libraries used by Hadoop, such as Snappy, _do_ provide this guarantee, but 
> libzstd is not one of them.) So the Hadoop native code updates 
> uncompressedDirectBufOff and uncompressedDirectBufLen using the updated 
> ZSTD_inBuffer: 
> https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
> Now, returning to our example, we started with 5 bytes of uncompressed input. 
> Suppose libzstd compressed 4 of those bytes, leaving one unread. This would 
> result in a ZSTD_inBuffer struct with size=5 (unchanged) and pos=4 (four 
> bytes were consumed). The Hadoop native code would then set 
> uncompressedDirectBufOff=4, but it would also set uncompressedDirectBufLen=1 
> (five minus four equals one).
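> Paraphrased in the same simplified notation, the update on those two lines 
> amounts to:
> 
>     /* off becomes start-relative (libzstd's pos), but len is then 
>      * recomputed relative to off: */
>     uncompressedDirectBufOff = input.pos;               /* 4 */
>     uncompressedDirectBufLen = input.size - input.pos;  /* 5 - 4 = 1 */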
> Since some of the input was not consumed, we will eventually call 
> compress() again. Then we instantiate another ZSTD_inBuffer struct, this 
> time with size=1 and pos=4. This is a bug: libzstd expects size and pos to 
> both be counted from the start of the buffer, so pos > size is malformed. 
> libzstd therefore returns an error 
> https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932
>  which gets escalated as a java.lang.InternalError.
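> Again in simplified notation, the second call builds the struct from those 
> stale fields:
> 
>     input.size = uncompressedDirectBufLen;  /* 1 */
>     input.pos  = uncompressedDirectBufOff;  /* 4, so pos > size */
>     ZSTD_compressStream(stream, &output, &input);
>     /* libzstd's sanity check rejects pos > size with "Error (generic)", 
>      * which the JNI wrapper rethrows as java.lang.InternalError. */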
> I will be submitting a pull request on GitHub with a fix for this bug. The 
> key is that the Hadoop code should handle offsets the same way libzstd 
> does, i.e. uncompressedDirectBufLen should be counted from the start of 
> uncompressedDirectBuf, not from uncompressedDirectBufOff.
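> A minimal sketch of that idea, in the same simplified notation (not 
> necessarily the exact patch):
> 
>     /* Keep both fields start-relative, mirroring libzstd's convention: */
>     uncompressedDirectBufOff = input.pos;   /* bytes consumed so far */
>     uncompressedDirectBufLen = input.size;  /* total bytes in the buffer */
>     /* "All input consumed" is then off == len rather than len == 0. */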


