[
https://issues.apache.org/jira/browse/ORC-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated ORC-614:
------------------------------
Affects Version/s: 1.7.0
> Implement efficient seek() in decompression streams
> ---------------------------------------------------
>
> Key: ORC-614
> URL: https://issues.apache.org/jira/browse/ORC-614
> Project: ORC
> Issue Type: Improvement
> Components: C++
> Affects Versions: 1.7.0
> Reporter: Csaba Ringhofer
> Assignee: Gang Wu
> Priority: Major
> Fix For: 1.7.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The current implementation of
> ZlibDecompressionStream/BlockDecompressionStream::seek resets the state of
> the decompressor and the underlying file reader and throws away their
> buffers. The buffers can still have usable data in the following cases:
> 1. If the new row group's start position is in the same compressed chunk we
> were reading, then we just jumped to another position within the same
> uncompressed buffer, so both the original compressed buffer and the
> decompressed buffer can be reused. This is a very common scenario with the
> default ORC configs of unaligned chunks of up to 256KB and 10K row groups, e.g.
> a chunk can contain 3 full row groups of 8-byte ints without any encoding.
> 2. If the new row group's start position is in another compressed chunk, but
> it starts in the current compressed buffer (as we have read ahead during
> file reading), then the compressed buffer can be kept and only the
> uncompressed buffer needs to be dropped. This is the usual case in Apache
> Impala, as an 8 MB block size is used, which leads to reading the whole stream
> into the buffer for typical columns.
> The lack of these optimizations led to a regression during the testing of
> https://github.com/apache/orc/pull/476, which uses seek() when a row group is
> skipped due to predicate pushdown, as every seek caused the whole stream to
> be read again.
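
The two cases in the description can be sketched as a small decision function. This is only an illustration under stated assumptions, not the actual ORC C++ code: StreamState, SeekAction, classifySeek and fullRowGroupsPerChunk are hypothetical names, and the sketch assumes a seek target is given as the stream offset of the compressed chunk plus an offset inside the decompressed chunk, and that the current chunk has been fully decompressed into the buffer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping for a decompression stream. These names are
// illustrative only and are not the actual ORC C++ class members.
struct StreamState {
  uint64_t compressedBufferStart;  // stream offset of the first compressed byte buffered
  uint64_t compressedBufferEnd;    // stream offset one past the last compressed byte buffered
  uint64_t currentChunkStart;      // stream offset of the chunk whose output is buffered
  uint64_t decompressedSize;       // decompressed bytes available for that chunk
  bool hasDecompressedChunk;       // false until a chunk has been decompressed
};

enum class SeekAction {
  ReuseBoth,            // case 1: keep compressed and decompressed buffers
  ReuseCompressedOnly,  // case 2: keep compressed buffer, drop decompressed data
  ResetAll              // fallback: reset everything and re-read from the file
};

// Decide how much buffered state survives a seek to
// (targetChunkStart, offsetInChunk).
SeekAction classifySeek(const StreamState& s, uint64_t targetChunkStart,
                        uint64_t offsetInChunk) {
  assert(s.compressedBufferStart <= s.compressedBufferEnd);
  // Case 1: the target lies in the chunk that is already decompressed, so the
  // seek is just a move inside the existing decompressed buffer.
  if (s.hasDecompressedChunk && targetChunkStart == s.currentChunkStart &&
      offsetInChunk <= s.decompressedSize) {
    return SeekAction::ReuseBoth;
  }
  // Case 2: a different chunk, but it starts inside the bytes that were
  // already read ahead. The rest of the chunk may still need further reads.
  if (targetChunkStart >= s.compressedBufferStart &&
      targetChunkStart < s.compressedBufferEnd) {
    return SeekAction::ReuseCompressedOnly;
  }
  return SeekAction::ResetAll;
}

// Arithmetic behind the "3 full row groups" remark for unencoded values.
constexpr uint64_t fullRowGroupsPerChunk(uint64_t chunkBytes,
                                         uint64_t rowsPerGroup,
                                         uint64_t bytesPerValue) {
  return chunkBytes / (rowsPerGroup * bytesPerValue);
}

// 256KB chunk, 10K rows per group, 8-byte ints: 262144 / 80000 = 3 (rounded down).
static_assert(fullRowGroupsPerChunk(256 * 1024, 10000, 8) == 3,
              "a 256KB chunk holds 3 full row groups of 8-byte ints");
```

With a state describing an 8 MB read-ahead buffer and a current chunk at offset 1000, a seek to a later row group in that same chunk would be classified as ReuseBoth, a seek to a chunk at offset 300000 as ReuseCompressedOnly, and a seek past the end of the buffer as ResetAll.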