Quanlong Huang created ORC-1087:
-----------------------------------

             Summary: Seek overflow in an uncompressed chunk
                 Key: ORC-1087
                 URL: https://issues.apache.org/jira/browse/ORC-1087
             Project: ORC
          Issue Type: Bug
          Components: C++
    Affects Versions: 1.7.2, 1.7.1, 1.7.0
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang
         Attachments: scan_with_sarg.cc, seek-issue-snappy-500k.orc

Reading the attached ORC file with the SearchArgument "{{sr_return_amt > 
10000}}" using the C++ reader fails with
{code:java}
Corrupt PATCHED_BASE encoded data (pl==0)!{code}
Reading the file without the SearchArgument works. The Java reader can also 
read it with the same SearchArgument.

The attached source code (scan_with_sarg.cc) reproduces the issue. Build the 
ORC library, then compile the reproducer with
{code:bash}
g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include \
  -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ \
  -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ \
  -Lzstd_ep-prefix/src/zstd_ep-build/lib/ \
  -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd \
  -lprotobuf
{code}
Run it as
{code:bash}
$ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" ./scan_with_sarg
leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
terminate called after throwing an instance of 'orc::ParseError'
  what():  Corrupt PATCHED_BASE encoded data (pl==0)!
Aborted (core dumped)
{code}
*RCA*

The sarg introduces a seek to RowGroup 42. The following code in 
{{DecompressionStream::seek}} does not handle the case where 
uncompressedBufferLength < posInChunk. As a result, it seeks to an illegal 
position and the length computation overflows.
{code:cpp}
if (headerPosition == seekedPosition
    && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
  position.next(); // Skip the input level position.
  size_t posInChunk = position.next(); // Chunk level position.
  // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
  outputBufferLength = uncompressedBufferLength - posInChunk;
  outputBuffer = outputBufferStart + posInChunk;
  return;
}{code}
The chunk in question is uncompressed, and an uncompressed chunk is read in 
pieces rather than as a whole. The data at the target position (posInChunk) 
has not been read yet, so it is not in the buffer. We need to handle this case.
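
The arithmetic behind the overflow can be reproduced in isolation. The sketch below is not ORC code: {{remainingUnchecked}} and {{positionIsBuffered}} are hypothetical helpers, and it assumes both lengths are {{size_t}}, as {{posInChunk}} is in the snippet above. It shows that the subtraction wraps around for the reported values, and what a bounds check before taking the in-buffer path might look like.

```cpp
#include <cstddef>

// Hypothetical helper, not part of the ORC API. It mirrors the subtraction
// in the snippet above. size_t is unsigned, so the result wraps around to a
// huge value whenever posInChunk > bufferLength.
std::size_t remainingUnchecked(std::size_t bufferLength,
                               std::size_t posInChunk) {
  return bufferLength - posInChunk;
}

// Hypothetical guard: the in-buffer path is only safe when the target
// position lies inside the bytes that have already been read.
bool positionIsBuffered(std::size_t bufferLength, std::size_t posInChunk) {
  return posInChunk <= bufferLength;
}
```

With the values from the report, {{positionIsBuffered(30950, 39498)}} is false, and {{remainingUnchecked(30950, 39498)}} yields a length far larger than the 30950 bytes actually buffered.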

I think this only happens with uncompressed chunks. Compressed chunks are 
decompressed as a whole, so posInChunk is always a valid position in the 
output buffer.
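
To illustrate the piecewise case, here is a toy model of a chunk that arrives in several pieces. It is only a sketch under stated assumptions: {{PieceLocation}} and {{locate}} are hypothetical names, the layout is not ORC's, and any piece size other than the 30950 bytes reported above is made up for the example. It needs C++17 for {{std::optional}}.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Toy model only; names and layout are hypothetical, not ORC's.
struct PieceLocation {
  std::size_t pieceIndex;     // which piece holds the target byte
  std::size_t offsetInPiece;  // offset of the target byte inside that piece
};

// Walks the piece sizes in order and reports where posInChunk falls.
// Returns std::nullopt if the position is past the end of the chunk.
std::optional<PieceLocation> locate(const std::vector<std::size_t>& pieceSizes,
                                    std::size_t posInChunk) {
  std::size_t consumed = 0;  // bytes covered by the pieces visited so far
  for (std::size_t i = 0; i < pieceSizes.size(); ++i) {
    if (posInChunk < consumed + pieceSizes[i]) {
      return PieceLocation{i, posInChunk - consumed};
    }
    consumed += pieceSizes[i];
  }
  return std::nullopt;
}
```

For a chunk modelled as pieces of 30950 and 20000 bytes (the second size is invented), {{locate({30950, 20000}, 39498)}} lands in the second piece at offset 8548. The target is not in the first piece, which is why indexing into the first buffer alone is invalid.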



