Yu-Wen Lai created ORC-1078:
-------------------------------

             Summary: Row group end offset doesn't accommodate all the blocks
                 Key: ORC-1078
                 URL: https://issues.apache.org/jira/browse/ORC-1078
             Project: ORC
          Issue Type: Bug
            Reporter: Yu-Wen Lai


The error message in current master:
{code:java}
java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:244)
    at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
    at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
The same error can appear a little differently in older versions:
{code:java}
java.io.IOException: Seek outside of data in compressed stream Stream for column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674 limit: 36674 range 0 = 75282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623 uncompressed: 1024 to 1024 to 111956{code}
Here is the info extracted from the problematic orc file:
{code:java}
Compression: ZLIB
Compression size: 1024
Calendar: Julian/Gregorian
Type: struct<col:timestamp>

Row group indices:
      Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
      Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
      Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
      Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
The issue occurs when entry 2 is selected and read, because the end offset 
estimated for this row group is incorrect. More specifically, when the 
compression size is smaller than 2048, there is an edge case in which a factor 
of 2 does not accommodate all the blocks (please see the code snippet below).
{code:java}
public static long estimateRgEndOffset(boolean isCompressed,
    int bufferSize,
    boolean isLast,
    long nextGroupOffset,
    long streamLength) {
  // figure out the worst case last location
  // if adjacent groups have the same compressed block offset then stretch the slop
  // by factor of 2 to safely accommodate the next compression block.
  // One for the current compression block and another for the next compression block.
  long slop = isCompressed
      ? 2 * (OutStream.HEADER_SIZE + bufferSize)
      : WORST_UNCOMPRESSED_SLOP;
  return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
}{code}
In our case, we need slop > 934 (buffer) + 398 * 4 + header bytes = 2526 + 
header bytes, but slop = 1027 * 2 = 2054. That causes a seek outside of the 
range.
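
The arithmetic for this case can be checked with a small standalone sketch. It is not ORC code: the class and method names are hypothetical, the 3-byte header is inferred from the 1027 figure in this report, and the cost of 4 bytes per value is taken from the estimate in this report.

```java
// Hypothetical standalone check; not part of ORC.
final class SlopCheck {
  // Assumption: 3-byte compression block header, inferred from 1027 = 1024 + 3.
  static final int HEADER_SIZE = 3;

  // Slop currently granted to a compressed stream: two full blocks.
  static long currentSlop(int bufferSize) {
    return 2L * (HEADER_SIZE + bufferSize);
  }

  // Lower bound on the bytes needed past the next group's compressed offset:
  // the uncompressed offset into its block plus bytesPerValue per pending value.
  // Block header bytes are left out, so the real requirement is slightly larger.
  static long requiredBytes(int uncompressedOffset, int pendingValues, int bytesPerValue) {
    return uncompressedOffset + (long) pendingValues * bytesPerValue;
  }

  public static void main(String[] args) {
    long slop = currentSlop(1024);            // 2 * 1027 = 2054
    long needed = requiredBytes(934, 398, 4); // 934 + 1592 = 2526, from Entry 3's positions
    System.out.println("slop=" + slop + " needed>=" + needed
        + " shortfall>=" + (needed - slop));
  }
}
```

Even without counting header bytes, the requirement exceeds the granted slop by several hundred bytes for compression size 1024.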

In the worst case, the compressed stream may contain uncompressed blocks. 
Suppose the compression size is C; then the factor = 1 (buffer) + (511 * 4 + 
header bytes) / C, rounded up.
C = 1024 -> factor should be 3
C = 512 -> factor should be 5 ... and so forth.
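
The worst-case factor can be sketched as a function of the compression size. This is only an illustration of the formula above, not a proposed patch: the names are hypothetical, "header bytes" is assumed to mean a single 3-byte header, and the division is assumed to round up.

```java
// Hypothetical illustration of the worst-case factor; not part of ORC.
final class StretchFactor {
  static final int HEADER_SIZE = 3;          // assumption: one 3-byte block header
  static final int MAX_PENDING_VALUES = 511; // from the formula in this report
  static final int BYTES_PER_VALUE = 4;      // from the formula in this report

  // factor = 1 (buffer) + ceil((511 * 4 + header bytes) / C)
  static int worstCaseFactor(int compressionSize) {
    int extra = MAX_PENDING_VALUES * BYTES_PER_VALUE + HEADER_SIZE;
    // Integer ceiling division: (a + b - 1) / b for positive a and b.
    return 1 + (extra + compressionSize - 1) / compressionSize;
  }

  public static void main(String[] args) {
    for (int size : new int[] {2048, 1024, 512, 256}) {
      System.out.println("C = " + size + " -> factor " + worstCaseFactor(size));
    }
  }
}
```

Under these assumptions a compression size of 2048 yields 2, which matches the current fixed factor and is consistent with the problem only showing up for sizes below 2048.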



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
