[ https://issues.apache.org/jira/browse/ORC-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Panagiotis Garefalakis updated ORC-1078:
----------------------------------------
Fix Version/s: 1.8.0
1.7.3
1.6.13
> Row group end offset doesn't accommodate all the blocks
> -------------------------------------------------------
>
> Key: ORC-1078
> URL: https://issues.apache.org/jira/browse/ORC-1078
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.6.12, 1.7.2
> Reporter: Yu-Wen Lai
> Assignee: Yu-Wen Lai
> Priority: Major
> Fix For: 1.8.0, 1.7.3, 1.6.13
>
>
> The error message in current master:
> {code:java}
> java.lang.IllegalArgumentException
>   at java.nio.Buffer.position(Buffer.java:244)
>   at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
>   at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
>   at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
>   at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
>   at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
> The same error can appear a little differently in older versions:
> {code:java}
> java.io.IOException: Seek outside of data in compressed stream Stream for column 15 kind DATA
> position: 111956 length: 383146 range: 0 offset: 36674 limit: 36674
> range 0 = 75282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623
> uncompressed: 1024 to 1024 to 111956{code}
> Here is the info extracted from the problematic ORC file:
> {code:java}
> Compression: ZLIB
> Compression size: 1024
> Calendar: Julian/Gregorian
> Type: struct<col:timestamp>
> Row group indices:
>   Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
>   Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
>   Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
>   Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
> To understand this issue, we need to understand the meaning of each number in
> the row group index. For each compressed stream, we need 3 numbers to record a
> position: the first is the offset in the compressed stream, followed by the
> number of bytes left in the uncompressed buffer, and finally the number of
> values left in the RLE writer. Let's take entry 3 as an example: 109907 is the
> offset in the compressed stream after all the values for entry 2 have been
> processed, 934 uncompressed bytes for entry 2 still need to be consumed, and
> 398 values for entry 2 in the RLE writer still need to be consumed. There are
> 6 numbers here because Timestamp columns use two streams, one for seconds and
> the other for nanoseconds.
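> As a purely illustrative sketch (not ORC reader code; the class name and
> variables are made up), entry 3's six numbers can be split into the two
> per-stream triples like this, assuming the first stream is DATA (seconds) and
> the second is SECONDARY (nanoseconds):
> {code:java}
> // Hypothetical sketch: split the six index positions recorded for entry 3 of
> // the Timestamp column into the per-stream triples described above.
> public class Entry3Positions {
>   public static void main(String[] args) {
>     long[] entry3 = {109907, 934, 398, 8581, 322, 18};
>     // DATA (seconds) stream: compressed offset, buffered bytes, pending RLE values
>     long[] seconds = java.util.Arrays.copyOfRange(entry3, 0, 3);
>     // SECONDARY (nanoseconds) stream: the same three numbers for the second stream
>     long[] nanos = java.util.Arrays.copyOfRange(entry3, 3, 6);
>     System.out.println("seconds: offset=" + seconds[0]
>         + " bufferedBytes=" + seconds[1] + " pendingValues=" + seconds[2]);
>     System.out.println("nanos:   offset=" + nanos[0]
>         + " bufferedBytes=" + nanos[1] + " pendingValues=" + nanos[2]);
>   }
> }{code}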
> The issue happens when entry 2 is selected and read, due to an incorrect end
> offset computed for this row group. More specifically, when the compression
> size is smaller than 4096, there are edge cases in which a factor of 2 cannot
> accommodate all the blocks (please see the code snippet below).
> {code:java}
> public static long estimateRgEndOffset(boolean isCompressed,
>                                        int bufferSize,
>                                        boolean isLast,
>                                        long nextGroupOffset,
>                                        long streamLength) {
>   // figure out the worst case last location
>   // if adjacent groups have the same compressed block offset then stretch the slop
>   // by factor of 2 to safely accommodate the next compression block.
>   // One for the current compression block and another for the next compression block.
>   long slop = isCompressed ?
>       2 * (OutStream.HEADER_SIZE + bufferSize) : WORST_UNCOMPRESSED_SLOP;
>   return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
> }{code}
> In our case we need slop > 934 (bytes left in the uncompressed buffer) + 398 * 4
> (remaining RLE values) + header bytes, but the current implementation gives
> slop = 2 * 1027 = 2054. That causes the seek to go outside the read range. Here
> each value happens to take only 4 bytes, but in the worst case a value can take
> 8 bytes.
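> To make the arithmetic concrete, here is a small sketch (assuming the 3-byte
> compression block header, OutStream.HEADER_SIZE; the class and variable names
> are made up) comparing the slop entry 2 would need against what the current
> formula produces:
> {code:java}
> // Sketch only: numbers taken from the file dump and entry 2's positions above.
> public class SlopCheck {
>   public static void main(String[] args) {
>     int headerSize = 3;        // compression block header size (OutStream.HEADER_SIZE)
>     int bufferSize = 1024;     // "Compression size: 1024"
>     long bufferedBytes = 934;  // uncompressed bytes still buffered for entry 2
>     long pendingValues = 398;  // RLE values still pending for entry 2
>     long neededSlop = bufferedBytes + pendingValues * 4 + headerSize;
>     long currentSlop = 2L * (headerSize + bufferSize);
>     // prints "needed > 2529, current = 2054": the seek can run past the range
>     System.out.println("needed > " + neededSlop + ", current = " + currentSlop);
>   }
> }{code}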
> For the worst case we might also have an uncompressed block inside the
> compressed stream. Suppose the compression size is C; then the factor should be
> 1 (buffer) + (511 * 8 + header bytes) / C:
> C = 1024 -> factor should be 5
> C = 512 -> factor should be 9 ... and so forth.
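> A minimal sketch of a factor-based slop along these lines (illustrative only;
> not the committed fix, and the method name is made up):
> {code:java}
> // Illustrative only: a slop large enough for one buffered compression block plus
> // the worst-case bytes of up to 511 pending RLE values at 8 bytes each.
> public class WorstCaseSlop {
>   static long worstCaseCompressedSlop(int bufferSize, int headerSize) {
>     long maxValueBytes = 511L * 8;                                    // worst-case RLE payload
>     long blocks = 1 + (maxValueBytes + bufferSize - 1) / bufferSize;  // buffered block + payload blocks
>     return blocks * (headerSize + bufferSize);                        // each block carries a header
>   }
>
>   public static void main(String[] args) {
>     System.out.println(worstCaseCompressedSlop(1024, 3)); // 5 * 1027 = 5135 (factor 5)
>     System.out.println(worstCaseCompressedSlop(512, 3));  // 9 *  515 = 4635 (factor 9)
>   }
> }{code}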