[ https://issues.apache.org/jira/browse/ORC-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Panagiotis Garefalakis updated ORC-1078:
----------------------------------------
    Fix Version/s: 1.8.0
                   1.7.3
                   1.6.13

> Row group end offset doesn't accommodate all the blocks
> -------------------------------------------------------
>
>                 Key: ORC-1078
>                 URL: https://issues.apache.org/jira/browse/ORC-1078
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.6.12, 1.7.2
>            Reporter: Yu-Wen Lai
>            Assignee: Yu-Wen Lai
>            Priority: Major
>             Fix For: 1.8.0, 1.7.3, 1.6.13
>
>
> The error message in current master:
> {code:java}
> java.lang.IllegalArgumentException
>     at java.nio.Buffer.position(Buffer.java:244)
>     at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
>     at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
>     at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
>     at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
>     at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
> The same error appears slightly differently in older versions:
> {code:java}
> java.io.IOException: Seek outside of data in compressed stream Stream for column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674 limit: 36674 range 0 = 75282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623 uncompressed: 1024 to 1024 to 111956{code}
> Here is the info extracted from the problematic ORC file:
> {code:java}
> Compression: ZLIB
> Compression size: 1024
> Calendar: Julian/Gregorian
> Type: struct<col:timestamp>
> Row group indices:
>       Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
>       Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
>       Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
>       Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
> To understand this issue, we need to understand the meaning of each number in 
> the row group index. For each compressed stream, we record three numbers per 
> position: the offset into the compressed stream, followed by the number of 
> bytes left in the uncompressed buffer, and finally the number of values left 
> in the RLE writer. Take entry 3 as an example: 109907 is the position in the 
> compressed stream after we processed all the values for entry 2, 934 
> uncompressed bytes for entry 2 still need to be consumed, and 398 values for 
> entry 2 in the RLE writer still need to be consumed. There are six numbers 
> here because timestamp columns use two streams, one for seconds and the other 
> for nanoseconds.
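To make that layout concrete, here is a minimal sketch (the class and helper names are mine, not from the ORC code base) that splits entry 3's six numbers into the two per-stream triples described above:

```java
public class PositionDecode {
    // Split a row-group index entry into per-stream triples:
    // [compressed-stream offset, bytes left in the uncompressed buffer,
    //  values left in the RLE state]. A timestamp column has two streams
    // (seconds and nanoseconds), hence six numbers per entry.
    static long[] streamTriple(long[] positions, int stream) {
        return new long[] {
            positions[3 * stream],
            positions[3 * stream + 1],
            positions[3 * stream + 2]
        };
    }

    public static void main(String[] args) {
        long[] entry3 = {109907, 934, 398, 8581, 322, 18}; // from the index dump above
        System.out.println(java.util.Arrays.toString(streamTriple(entry3, 0))); // seconds stream
        System.out.println(java.util.Arrays.toString(streamTriple(entry3, 1))); // nanoseconds stream
    }
}
```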
> The issue happens when entry 2 is selected and read, because the end offset 
> estimated for this row group is incorrect. More specifically, when the 
> compression size is smaller than 4096, there are edge cases where a factor of 
> 2 cannot accommodate all the blocks (please see the code snippet below).
> {code:java}
> public static long estimateRgEndOffset(boolean isCompressed,
>     int bufferSize,
>     boolean isLast,
>     long nextGroupOffset,
>     long streamLength) {
>   // figure out the worst case last location
>   // if adjacent groups have the same compressed block offset then stretch the slop
>   // by a factor of 2 to safely accommodate the next compression block:
>   // one for the current compression block and another for the next compression block.
>   long slop = isCompressed
>       ? 2 * (OutStream.HEADER_SIZE + bufferSize)
>       : WORST_UNCOMPRESSED_SLOP;
>   return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
> }{code}
> In our case, we need slop > 934 (buffered bytes) + 398 * 4 (pending values) + 
> header bytes, but the current implementation gives slop = 1027 * 2 = 2054. 
> That makes the seek land outside the range. Here each value takes only 4 
> bytes, but in the worst case a value can take 8 bytes.
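As a quick sanity check of the arithmetic above, here is a sketch with my own helper names (3 is the ORC compression block header size; the needed-bytes estimate omits header bytes, as in the text):

```java
public class SlopCheck {
    // Slop as computed by the current implementation:
    // 2 * (HEADER_SIZE + bufferSize), with HEADER_SIZE = 3.
    static long currentSlop(int bufferSize) {
        return 2L * (3 + bufferSize);
    }

    // Bytes actually needed to finish the previous entry: the buffered
    // uncompressed bytes plus the pending RLE values at a given width.
    static long neededBytes(long bufferedBytes, long pendingValues, long bytesPerValue) {
        return bufferedBytes + pendingValues * bytesPerValue;
    }

    public static void main(String[] args) {
        long slop = currentSlop(1024);         // 2 * 1027 = 2054
        long needed = neededBytes(934, 398, 4); // 934 + 1592 = 2526
        System.out.println(slop);
        System.out.println(needed);
        System.out.println(needed > slop); // true: the seek lands outside the range
    }
}
```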
> In terms of the worst case, we might also have uncompressed blocks in a 
> compressed stream. Suppose the compression size is C; then the factor = 1 
> (buffer) + (511 * 8 + header bytes) / C, rounded up.
> C = 1024 -> the factor should be 5
> C = 512 -> the factor should be 9 ... and so forth.
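The factor rule above can be sketched as follows. The constants are assumptions taken from the text: a 3-byte compression block header, at most 511 pending RLE values, and 8 bytes per value in the worst case; the class and method names are mine.

```java
public class SlopFactor {
    static final int HEADER_SIZE = 3;          // assumed compression block header size
    static final int MAX_PENDING_VALUES = 511; // worst-case pending RLE values, per the text
    static final int MAX_BYTES_PER_VALUE = 8;  // worst-case bytes per value

    // factor = 1 (for the buffered block) plus however many compression
    // blocks of size c the worst-case pending RLE bytes can span.
    static long worstCaseFactor(int c) {
        long pendingBytes = (long) MAX_PENDING_VALUES * MAX_BYTES_PER_VALUE + HEADER_SIZE;
        return 1 + (pendingBytes + c - 1) / c; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(worstCaseFactor(1024)); // 5
        System.out.println(worstCaseFactor(512));  // 9
    }
}
```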



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
