Yu-Wen Lai created ORC-1078:
-------------------------------
Summary: Row group end offset doesn't accommodate all the blocks
Key: ORC-1078
URL: https://issues.apache.org/jira/browse/ORC-1078
Project: ORC
Issue Type: Bug
Reporter: Yu-Wen Lai
The error message in current master:
{code:java}
java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:244)
    at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
    at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
The same error appears slightly differently in older versions:
{code:java}
java.io.IOException: Seek outside of data in compressed stream Stream for
column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674
limit: 36674 range 0 = 75
282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623
uncompressed: 1024 to 1024 to 111956{code}
Here is the metadata extracted from the problematic ORC file:
{code:java}
Compression: ZLIB
Compression size: 1024
Calendar: Julian/Gregorian
Type: struct<col:timestamp>
Row group indices:
Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max:
2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max:
2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max:
2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max:
2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
The issue occurs when entry 2 is selected and read, because the end offset
estimated for this row group is too small. More specifically, when the
compression size is smaller than 2048, there is an edge case in which a factor
of 2 cannot accommodate all the blocks (please see the code snippet below).
{code:java}
public static long estimateRgEndOffset(boolean isCompressed,
                                       int bufferSize,
                                       boolean isLast,
                                       long nextGroupOffset,
                                       long streamLength) {
  // figure out the worst case last location
  // if adjacent groups have the same compressed block offset then stretch the slop
  // by factor of 2 to safely accommodate the next compression block.
  // One for the current compression block and another for the next compression block.
  long slop = isCompressed
      ? 2 * (OutStream.HEADER_SIZE + bufferSize)
      : WORST_UNCOMPRESSED_SLOP;
  return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
}{code}
In our case (934 and 398 come from the positions of entry 3), we need
slop > 934 (buffer) + 398 * 4 + header bytes = 2526 + header bytes, but
slop = 2 * 1027 = 2054. As a result, the reader seeks outside of the range.
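For reference, the arithmetic above can be re-checked with a minimal sketch. The class and method names (SlopCheck, currentSlop, requiredBytes) are hypothetical and not part of ORC. HEADER_SIZE = 3 is an assumption inferred from 1027 = 1024 + 3, and the 4 bytes per value is the estimate used in this report.

```java
// Hypothetical helper, not part of ORC: re-does the arithmetic from this report.
final class SlopCheck {
  // Assumed to match OutStream.HEADER_SIZE; inferred from 1027 = 1024 + 3.
  static final int HEADER_SIZE = 3;

  // Slop as computed by the quoted estimateRgEndOffset for a compressed stream.
  static long currentSlop(int bufferSize) {
    return 2L * (HEADER_SIZE + bufferSize);
  }

  // Bytes needed past nextGroupOffset, following the report's estimate:
  // offset inside the uncompressed buffer + run position * bytes per value + header.
  static long requiredBytes(int uncompressedOffset, int runPosition, int bytesPerValue) {
    return uncompressedOffset + (long) runPosition * bytesPerValue + HEADER_SIZE;
  }

  public static void main(String[] args) {
    long slop = currentSlop(1024);             // 2 * (3 + 1024)
    long needed = requiredBytes(934, 398, 4);  // 934 + 1592 + 3
    // The report requires slop > needed for the read to stay in range.
    System.out.println("slop=" + slop + " needed=" + needed + " enough=" + (slop > needed));
  }
}
```

With the values from entry 3 this prints slop=2054 needed=2529 enough=false, which matches the out-of-range seek described above.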
In the worst case, the compressed stream may contain an uncompressed block.
Suppose the compression size is C; then the factor = 1 (buffer) + (511 * 4 +
header bytes) / C, rounded up.
C = 1024 -> 1 + 2047 / 1024, so the factor should be 3
C = 512 -> 1 + 2047 / 512, so the factor should be 5 ... and so forth.
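A minimal sketch of the worst-case factor, under stated assumptions: the class and method names (StretchFactor, factor) are hypothetical, HEADER_SIZE = 3 is assumed to match OutStream.HEADER_SIZE, rounding the division up is my reading of the formula, and 511 values of 4 bytes each are the figures from this report. This is not a proposed patch, only a way to evaluate the formula.

```java
// Hypothetical helper, not part of ORC: evaluates the worst-case factor formula.
final class StretchFactor {
  static final int HEADER_SIZE = 3;          // assumed OutStream.HEADER_SIZE
  static final int WORST_CASE_VALUES = 511;  // figure taken from the report
  static final int BYTES_PER_VALUE = 4;      // figure taken from the report

  // factor = 1 (buffer) + ceil((511 * 4 + header bytes) / C).
  // Rounding up is an assumption; the report does not spell out the rounding.
  static int factor(int compressionSize) {
    int extra = WORST_CASE_VALUES * BYTES_PER_VALUE + HEADER_SIZE;
    return 1 + (extra + compressionSize - 1) / compressionSize;
  }

  public static void main(String[] args) {
    System.out.println("C=2048 -> factor " + factor(2048));
    System.out.println("C=1024 -> factor " + factor(1024));
  }
}
```

Under these assumptions the factor is 2 for C = 2048 and 3 for C = 1024, which is consistent with the current factor of 2 only breaking down for compression sizes below 2048.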