[ 
https://issues.apache.org/jira/browse/ORC-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Wen Lai updated ORC-1078:
----------------------------
    Description: 
The error message in current master:
{code:java}
java.lang.IllegalArgumentException
    at java.nio.Buffer.position(Buffer.java:244)
    at org.apache.orc.impl.InStream$CompressedStream.setCurrent(InStream.java:453)
    at org.apache.orc.impl.InStream$CompressedStream.readHeaderByte(InStream.java:462)
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:474)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:515){code}
The same error can appear a little differently in older versions:
{code:java}
java.io.IOException: Seek outside of data in compressed stream Stream for column 15 kind DATA position: 111956 length: 383146 range: 0 offset: 36674 limit: 36674 range 0 = 75282 to 36674; range 1 = 151666 to 40267; range 2 = 228805 to 41623 uncompressed: 1024 to 1024 to 111956{code}
Here is the info extracted from the problematic ORC file:
{code:java}
Compression: ZLIB
Compression size: 1024
Calendar: Julian/Gregorian
Type: struct<col:timestamp>

Row group indices:
      Entry 0: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 21:09:56.000999999 positions: 0,0,0,0,0,0
      Entry 1: count: 10000 hasNull: false min: 2016-03-09 11:36:36.0 max: 2021-12-06 21:53:33.000999999 positions: 37509,683,98,2630,296,3
      Entry 2: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2026-05-29 00:53:10.000999999 positions: 75265,256,229,5902,370,8
      Entry 3: count: 10000 hasNull: false min: 2010-01-01 00:00:00.0 max: 2021-12-06 18:14:52.000999999 positions: 109907,934,398,8581,322,18{code}
To understand this issue, we first need to understand the meaning of each number in
the row group index. For each compressed stream, three numbers are recorded per
position: the byte position in the compressed stream, the number of bytes left in
the uncompressed buffer, and the number of values left in the RLE writer. Let's take
entry 3 as an example: 109907 is the position in the compressed stream after all the
values through entry 2 have been processed, 934 uncompressed bytes for entry 2 still
need to be consumed, and 398 values for entry 2 still need to be consumed from the
RLE writer. There are 6 numbers per entry because timestamp columns use two streams,
one for seconds and the other for nanoseconds.
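For illustration, the six numbers of entry 3 can be read as two (compressed offset,
buffered uncompressed bytes, pending RLE values) triples, one per stream. The record
type and helper below are hypothetical and only show how the tuple splits; they are
not part of the ORC API:
{code:java}
// Hypothetical sketch: split the 6 index positions of a timestamp column
// into one triple per stream (DATA = seconds, SECONDARY = nanoseconds).
record StreamPosition(long compressedOffset,    // byte position in the compressed stream
                      long uncompressedBytes,   // bytes left in the uncompressed buffer
                      long pendingRleValues) {} // values left in the RLE writer

static StreamPosition[] splitTimestampPositions(long[] p) {
  // entry 3 above: p = {109907, 934, 398, 8581, 322, 18}
  return new StreamPosition[] {
      new StreamPosition(p[0], p[1], p[2]),  // seconds stream
      new StreamPosition(p[3], p[4], p[5])   // nanoseconds stream
  };
}
{code}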

The issue happens when entry 2 is selected and read, because the end offset estimated
for this row group is too small. To be more specific, when the compression size is
smaller than 4096, there are edge cases where a slop of 2 compression blocks cannot
accommodate all the blocks needed to finish the row group (please see the code
snippet below).
{code:java}
public static long estimateRgEndOffset(boolean isCompressed,
    int bufferSize,
    boolean isLast,
    long nextGroupOffset,
    long streamLength) {
  // figure out the worst case last location
  // if adjacent groups have the same compressed block offset then stretch the slop
  // by factor of 2 to safely accommodate the next compression block.
  // One for the current compression block and another for the next compression block.
  long slop = isCompressed ?
      2 * (OutStream.HEADER_SIZE + bufferSize) : WORST_UNCOMPRESSED_SLOP;
  return isLast ? streamLength : Math.min(streamLength, nextGroupOffset + slop);
}{code}
In our case, we need slop > 934 (bytes left in the uncompressed buffer) + 398 * 4
(bytes for the values left in the RLE writer) + header bytes, but the current
implementation gives slop = 2 * 1027 = 2054. That causes seeking outside of the
allocated range. Here each value happens to need only 4 bytes, but in the worst case
a value can take 8 bytes.
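A back-of-the-envelope check of those numbers (assuming the 3-byte ORC compression
block header for OutStream.HEADER_SIZE):
{code:java}
// Rough check of the failing case above; header size assumed to be 3 bytes.
long headerBytes    = 3;    // OutStream.HEADER_SIZE
long bufferedBytes  = 934;  // bytes left in the uncompressed buffer
long pendingValues  = 398;  // values left in the RLE writer
long bytesPerValue  = 4;    // what this file needs; worst case is 8
long requiredSlop   = bufferedBytes + pendingValues * bytesPerValue + headerBytes; // 2529
long currentSlop    = 2 * (headerBytes + 1024);                                    // 2054
// requiredSlop (2529) > currentSlop (2054), so the estimated end offset cuts the
// row group short and the reader seeks outside the range it was given.
{code}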

In terms of the worst case, a compressed stream can also contain uncompressed
(original) blocks. Suppose the compression size is C; the factor should then be
1 (for the buffered block) + (511 * 8 + header bytes) / C, rounded up:
C = 1024 -> the factor should be 5
C = 512 -> the factor should be 9, and so forth.
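A minimal sketch of that computation (only an illustration of the formula above, with
a 3-byte header assumed; it is not the actual fix):
{code:java}
// Illustrative worst-case slop factor: one block for the buffered bytes plus
// enough blocks for up to 511 pending RLE values at 8 bytes each, plus a header.
static int worstCaseSlopFactor(int compressionSize, int headerSize) {
  long worstPendingBytes = 511L * 8 + headerSize;
  // ceiling division so a partially filled block still counts as a full block
  return 1 + (int) ((worstPendingBytes + compressionSize - 1) / compressionSize);
}

// worstCaseSlopFactor(1024, 3) == 5
// worstCaseSlopFactor(512, 3)  == 9
{code}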

> Row group end offset doesn't accommodate all the blocks
> -------------------------------------------------------
>
>                 Key: ORC-1078
>                 URL: https://issues.apache.org/jira/browse/ORC-1078
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Yu-Wen Lai
>            Assignee: Yu-Wen Lai
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
