Quanlong Huang created ORC-1087:
-----------------------------------
Summary: Seek overflow in an uncompressed chunk
Key: ORC-1087
URL: https://issues.apache.org/jira/browse/ORC-1087
Project: ORC
Issue Type: Bug
Components: C++
Affects Versions: 1.7.2, 1.7.1, 1.7.0
Reporter: Quanlong Huang
Assignee: Quanlong Huang
Attachments: scan_with_sarg.cc, seek-issue-snappy-500k.orc
Reading the attached ORC file with the SearchArgument {{sr_return_amt > 10000}}
using the C++ reader fails with
{code:java}
Corrupt PATCHED_BASE encoded data (pl==0)!{code}
Reading the file without the SearchArgument works. The Java reader can read
it with the same SearchArgument.
The attached source file (scan_with_sarg.cc) reproduces the issue. Build
the ORC library, then compile it with
{code:bash}
g++ scan_with_sarg.cc -o scan_with_sarg -I../c++/include -Ic++/include \
  -Lc++/src/ -Lsnappy_ep-prefix/src/snappy_ep-build/ \
  -Llz4_ep-prefix/src/lz4_ep-build/ -Lzlib_ep-prefix/src/zlib_ep-build/ \
  -Lzstd_ep-prefix/src/zstd_ep-build/lib/ \
  -Lprotobuf_ep-prefix/src/protobuf_ep-build/ -lorc -lz -lsnappy -llz4 -lzstd \
  -lprotobuf
{code}
Run it as
{code:bash}
$ LD_LIBRARY_PATH="$LD_LIBRARY_PATH:zstd_ep-prefix/src/zstd_ep-build/lib/" ./scan_with_sarg
leaf-0 = (column(id=17) <= 10000), expr = (not leaf-0)
terminate called after throwing an instance of 'orc::ParseError'
what(): Corrupt PATCHED_BASE encoded data (pl==0)!
Aborted (core dumped)
{code}
*RCA*
The SearchArgument triggers a seek to RowGroup 42. The following code in
{{DecompressionStream::seek}} does not handle the case where
uncompressedBufferLength < posInChunk. As a result, it seeks to an illegal
position and the subtraction that computes outputBufferLength overflows.
{code:cpp}
if (headerPosition == seekedPosition
    && inputBufferStartPosition <= headerPosition + 3 && inputBufferStart) {
  position.next(); // Skip the input level position.
  size_t posInChunk = position.next(); // Chunk level position.
  // Overflow here! uncompressedBufferLength=30950, posInChunk=39498
  outputBufferLength = uncompressedBufferLength - posInChunk;
  outputBuffer = outputBufferStart + posInChunk;
  return;
}{code}
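To make the arithmetic concrete, here is a minimal standalone sketch of the subtraction using the two values reported above. It assumes both operands are size_t (posInChunk is declared that way in the snippet; the type of uncompressedBufferLength is my assumption). The helper name remainingAfterSeek is hypothetical and is not part of the ORC code.

```cpp
#include <cstddef>
#include <limits>

// Values reported in this issue.
constexpr std::size_t kUncompressedBufferLength = 30950;
constexpr std::size_t kPosInChunk = 39498;

// Hypothetical helper mirroring `uncompressedBufferLength - posInChunk`.
// Unsigned subtraction is well defined: it wraps modulo 2^N when
// posInChunk is larger than bufferLength.
constexpr std::size_t remainingAfterSeek(std::size_t bufferLength,
                                         std::size_t posInChunk) {
  return bufferLength - posInChunk;
}

// The seek target lies 8548 bytes past the data buffered so far.
static_assert(kPosInChunk - kUncompressedBufferLength == 8548,
              "shortfall between seek target and buffered bytes");

// Instead of a negative value, the result wraps to a huge length.
static_assert(remainingAfterSeek(kUncompressedBufferLength, kPosInChunk) ==
                  std::numeric_limits<std::size_t>::max() - 8547,
              "length wraps around instead of going negative");
```

With a wrapped length that large, the reader would treat memory far beyond the buffer as valid stream data, which would be consistent with the decoder later reporting corrupt PATCHED_BASE data.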
The chunk being sought is an uncompressed chunk, and the whole chunk is read in
pieces. The data at the target position (posInChunk) has not been read yet. We
need to handle this case.
I think this only happens with uncompressed chunks. Compressed chunks are
decompressed as a whole, so posInChunk is always valid within the output
buffer.
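One possible shape for the missing check, as a sketch only: take the fast path only when posInChunk falls inside the bytes buffered so far, and otherwise signal the caller to fall back to a regular seek. The names SeekPlan and planSeek are hypothetical, and this is not necessarily how the actual fix will look.

```cpp
#include <cstddef>

// Hypothetical result type: describes where the fast path would land.
struct SeekPlan {
  bool inBuffer;          // true if the target is already buffered
  std::size_t offset;     // offset to add to outputBufferStart
  std::size_t remaining;  // value for outputBufferLength
};

// Hypothetical guard: only compute the new position when
// posInChunk <= uncompressedBufferLength, so the subtraction cannot wrap.
constexpr SeekPlan planSeek(std::size_t uncompressedBufferLength,
                            std::size_t posInChunk) {
  return posInChunk <= uncompressedBufferLength
             ? SeekPlan{true, posInChunk,
                        uncompressedBufferLength - posInChunk}
             : SeekPlan{false, 0, 0};
}

// With the values from this issue, the fast path is rejected.
static_assert(!planSeek(30950, 39498).inBuffer,
              "target beyond buffered bytes must not use the fast path");
// A target inside the buffered bytes keeps the current behaviour.
static_assert(planSeek(30950, 100).remaining == 30850,
              "in-range target yields the expected remaining length");
```

When inBuffer is false, the caller would need to either read the rest of the uncompressed chunk before positioning, or skip the fast path and perform a full seek on the underlying input. Which of the two is better depends on how the uncompressed chunk is buffered.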