[ https://issues.apache.org/jira/browse/ORC-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17470289#comment-17470289 ]

Gopal Vijayaraghavan commented on ORC-1078:
-------------------------------------------

From discussion with [~hsnusonic]:

{code}
MAX_SCOPE is fixed at 512 in RunLengthIntegerWriterV2. When the compression
size is smaller than 2048, we cannot guarantee that 2 blocks are enough for
all the values. Take an exaggerated example: if the compression size is
4 bytes, it is impossible to encode 511 values in 4 bytes even with
compression.
{code}
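A quick back-of-the-envelope sketch of that example (a standalone illustration, not ORC code; `HEADER_SIZE` = 3 is the ORC compressed-chunk header size and `MAX_SCOPE` = 512 is restated here from RunLengthIntegerWriterV2):

```java
public class SlopExample {
    static final int HEADER_SIZE = 3;   // ORC compressed-chunk header (3 bytes)
    static final int MAX_SCOPE = 512;   // fixed in RunLengthIntegerWriterV2

    // Current worst-case slop: a factor of 2, i.e. two compression blocks.
    static long slop(int bufferSize) {
        return 2L * (HEADER_SIZE + bufferSize);
    }

    public static void main(String[] args) {
        int bufferSize = 4;                       // the exaggerated compression size
        long available = slop(bufferSize);        // 2 * (3 + 4) = 14 bytes
        long needed = MAX_SCOPE - 1;              // 511 values need >= 511 bytes even at 1 byte each
        System.out.println(available + " < " + needed);
    }
}
```

Even at one byte per value, the two-block slop of 14 bytes cannot cover a single 511-value run, which is the mismatch the comment describes.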

> Row group end offset doesn't accommodate all the blocks
> -------------------------------------------------------
>
>                 Key: ORC-1078
>                 URL: https://issues.apache.org/jira/browse/ORC-1078
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Yu-Wen Lai
>            Assignee: Yu-Wen Lai
>            Priority: Major
>
> The error message in current master:
> {code:java}
> java.lang.IllegalArgumentException
>     at java.nio.Buffer.position(Buffer.java:244)
>     at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
>     at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
>     at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
>     at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
>     at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
> The same error can appear a little differently in older versions:
> {code:java}
> java.io.IOException: Seek outside of data in compressed stream Stream for
> column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674
> limit: 36674 range 0 = 75282 to 36674; range 1 = 151666 to 40267;
> range 2 = 228805 to 41623 uncompressed: 1024 to 1024 to 111956{code}
> Here is the info extracted from the problematic ORC file:
> {code:java}
> Compression: ZLIB
> Compression size: 1024
> Calendar: Julian/Gregorian
> Type: struct<col:timestamp>
> Row group indices:
>       Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
>       Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
>       Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
>       Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
> The issue happens when entry 2 is selected and read, because the end offset
> computed for this row group is too small. To be more specific, when the
> compression size is smaller than 2048, there is an edge case where a factor
> of 2 cannot accommodate all the blocks (see the code snippet below).
> {code:java}
> public static long estimateRgEndOffset(boolean isCompressed,
>                                        int bufferSize,
>                                        boolean isLast,
>                                        long nextGroupOffset,
>                                        long streamLength) {
>   // figure out the worst case last location
>   // if adjacent groups have the same compressed block offset then stretch the slop
>   // by a factor of 2 to safely accommodate the next compression block:
>   // one for the current compression block and another for the next compression block.
>   long slop = isCompressed ?
>       2 * (OutStream.HEADER_SIZE + bufferSize) : WORST_UNCOMPRESSED_SLOP;
>   return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
> }{code}
> In our case we need slop > 934 (buffer) + 398 * 4 + header bytes, but slop =
> 1027 * 2 = 2054. That causes seeking outside of the range.
> In the worst case we might have an uncompressed block inside a compressed
> stream. Suppose the compression size is C; then the factor should be
> 1 (buffer) + (511 * 4 + header bytes) / C:
> C = 1024 -> factor should be 3
> C = 512 -> factor should be 5 ... and so forth.
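The factor arithmetic above can be sketched as a revised slop computation (a hedged sketch of the direction described in the report, not the committed fix; `HEADER_SIZE` = 3 and `MAX_SCOPE` = 512 are restated locally, and `ceilDiv` is a hypothetical helper):

```java
public class RevisedSlop {
    static final int HEADER_SIZE = 3;   // ORC compressed-chunk header size
    static final int MAX_SCOPE = 512;   // fixed in RunLengthIntegerWriterV2

    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    // factor = 1 (current buffer) + blocks needed for up to 511 four-byte
    // values plus a chunk header, for a compression size of bufferSize.
    static long factor(int bufferSize) {
        return 1 + ceilDiv((MAX_SCOPE - 1) * 4L + HEADER_SIZE, bufferSize);
    }

    // Worst-case slop: factor compression blocks, each with its own header.
    static long slop(int bufferSize) {
        return factor(bufferSize) * (HEADER_SIZE + bufferSize);
    }

    public static void main(String[] args) {
        System.out.println("C=1024 -> factor " + factor(1024)); // 3
        System.out.println("C=512  -> factor " + factor(512));  // 5
    }
}
```

With this formula the factor degenerates to the existing 2 only once the compression size reaches 2048, which matches the threshold stated above.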



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
