[jira] [Updated] (ORC-614) Implement efficient seek() in decompression streams

Dongjoon Hyun (Jira) Thu, 03 Jun 2021 21:33:06 -0700


     [ 
https://issues.apache.org/jira/browse/ORC-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dongjoon Hyun updated ORC-614:
------------------------------
    Affects Version/s: 1.7.0

> Implement efficient seek() in decompression streams
> ---------------------------------------------------
>
>                 Key: ORC-614
>                 URL: https://issues.apache.org/jira/browse/ORC-614
>             Project: ORC
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 1.7.0
>            Reporter: Csaba Ringhofer
>            Assignee: Gang Wu
>            Priority: Major
>             Fix For: 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current implementation of 
> ZlibDecompressionStream/BlockDecompressionStream::seek resets the state of 
> the decompressor and the underlying file reader and throws away their 
> buffers. The buffers can still hold usable data in the following cases:
> 1. If the new row group's start position is in the same compressed chunk we 
> were reading, then we just jumped to another position within the same 
> uncompressed buffer, so both the original compressed buffer and the 
> decompressed buffer can be reused. This is a very common scenario with the 
> default ORC configs of unaligned chunks of at most 256 KB and row groups of 
> 10K rows, e.g. a chunk can contain 3 full row groups of 8-byte ints without 
> any encoding.
> 2. If the new row group's start position is in another compressed chunk, but 
> that chunk starts in the current compressed buffer (as we have read ahead 
> during file reading), then the compressed buffer can be kept and only the 
> uncompressed buffer needs to be dropped. This is the usual case in Apache 
> Impala, as an 8 MB block size is used, which leads to reading the whole 
> stream into the buffer for typical columns.
> The lack of these optimizations led to a regression during the testing of 
> https://github.com/apache/orc/pull/476, which uses seek() when a row group is 
> skipped due to predicate push down, as every seek caused the whole stream to 
> be read again.
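
The two reuse cases described above can be sketched as a small decision
function. This is only an illustration under stated assumptions, not the ORC
C++ reader's code: StreamState, SeekAction, classifySeek and all field names
are hypothetical, and the sketch assumes a seek target is given as the stream
offset of a compressed chunk plus an offset into that chunk's uncompressed
data. The constant at the end just re-derives the "3 full row groups" figure
from a 256 KB chunk and 10K rows of 8 bytes each.

```cpp
#include <cassert>
#include <cstdint>

// All names below are hypothetical and chosen for illustration only.
enum class SeekAction {
  ReuseBoth,       // case 1: target lies in the chunk that is already decompressed
  KeepCompressed,  // case 2: target chunk starts inside the read-ahead compressed buffer
  ResetAll         // neither case applies: reset and re-read from the file
};

struct StreamState {
  uint64_t currentChunkStart;   // stream offset of the chunk currently decompressed
  uint64_t uncompressedSize;    // bytes held in the decompressed buffer
  uint64_t compressedBufStart;  // stream offset of the first byte in the compressed buffer
  uint64_t compressedBufEnd;    // one past the last byte in the compressed buffer
  bool hasDecompressedChunk;    // false before the first chunk has been decompressed
};

// Decide how much buffered data survives a seek to
// (targetChunkStart, offsetInChunk).
SeekAction classifySeek(const StreamState& s,
                        uint64_t targetChunkStart,
                        uint64_t offsetInChunk) {
  assert(s.compressedBufStart <= s.compressedBufEnd);
  // Case 1: same compressed chunk, so only the read position inside the
  // decompressed buffer has to move.
  if (s.hasDecompressedChunk &&
      targetChunkStart == s.currentChunkStart &&
      offsetInChunk <= s.uncompressedSize) {
    return SeekAction::ReuseBoth;
  }
  // Case 2: a different chunk whose start is already buffered, so only the
  // decompressed buffer has to be dropped.
  if (targetChunkStart >= s.compressedBufStart &&
      targetChunkStart < s.compressedBufEnd) {
    return SeekAction::KeepCompressed;
  }
  return SeekAction::ResetAll;
}

// Arithmetic behind the example in case 1: 10K rows * 8 bytes = 80,000 bytes
// per row group, and 262,144 / 80,000 rounds down to 3.
constexpr uint64_t kChunkBytes = 256 * 1024;
constexpr uint64_t kRowGroupBytes = 10000 * 8;
constexpr uint64_t kFullRowGroupsPerChunk = kChunkBytes / kRowGroupBytes;
```

In this sketch the file reader is touched only in the ResetAll branch, which
is the situation the issue aims to make rare.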



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
