On Mar 26, 2018, at 8:19 PM, Xiening Dai <xndai....@live.com> wrote:
> 
> But that approach still doesn’t help when one column has multiple large 
> streams. Let’s say we have two streams, each 50M in size. With the 
> current reader implementation, we read a 4M chunk at a time from each 
> stream, and each read requires a seek since the chunks are 50M apart. 
> Alternatively, we could read both streams with sequential IO, but we 
> would end up holding the 100M of compressed data in memory, which is 
> not an effective use of reader memory. Note that this problem exists 
> even without predicate pushdown.

I recently tuned the IO strategy in our implementation, and when you work 
out the math, the performance advantage of large IOs falls off very quickly 
once you get past a couple of megabytes, because transfer time starts to 
dominate seek time. Given that, we also put a cap on read sizes to keep 
buffer memory lower.
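
To put rough numbers on that (the 10ms seek and 100MB/s throughput below 
are illustrative assumptions, not measurements from our system): a read 
costs roughly one seek plus size divided by throughput, so the seek 
overhead is amortized away quickly as reads grow.

// Back-of-the-envelope cost model for a single read:
// total time = seek latency + size / throughput.
public class ReadCostModel
{
    private static final double SEEK_MS = 10.0;   // assumed seek latency
    private static final double MB_PER_MS = 0.1;  // assumed 100 MB/s throughput

    public static void main(String[] args)
    {
        for (double sizeMb : new double[] {0.25, 1, 4, 16, 64}) {
            double totalMs = SEEK_MS + sizeMb / MB_PER_MS;
            double effectiveMbPerSec = sizeMb / totalMs * 1000;
            System.out.printf("%6.2f MB read: %6.1f ms, %5.1f MB/s effective%n",
                    sizeMb, totalMs, effectiveMbPerSec);
        }
    }
}

With those numbers a 4M read already gets you ~80% of device throughput, 
and going all the way to 64M only raises that to ~98%, which is why 
buffering whole 50M streams buys almost nothing over a modest read size cap.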

For us, two sequential large streams take twice the buffer memory, but the 
IO cost is effectively the same. Where we would run into problems is with 
small streams/columns sandwiched between large columns, since there is no 
potential for shared reads.
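
For context, by "shared reads" I mean coalescing adjacent small ranges into 
a single IO when the gap between them is small enough. A minimal sketch of 
the idea (DiskRange, MAX_MERGE_GAP, and MAX_READ_SIZE are hypothetical 
names and thresholds, not our actual API):

import java.util.ArrayList;
import java.util.List;

public final class RangeCoalescer
{
    private static final long MAX_MERGE_GAP = 1 << 20;  // merge across gaps up to 1M
    private static final long MAX_READ_SIZE = 8 << 20;  // cap any single IO at 8M

    // A contiguous byte range of one stream within the file
    public record DiskRange(long offset, long length)
    {
        long end()
        {
            return offset + length;
        }
    }

    // Merge ranges (sorted by offset) into shared reads
    public static List<DiskRange> coalesce(List<DiskRange> sorted)
    {
        List<DiskRange> merged = new ArrayList<>();
        DiskRange current = null;
        for (DiskRange range : sorted) {
            if (current != null
                    && range.offset() - current.end() <= MAX_MERGE_GAP
                    && range.end() - current.offset() <= MAX_READ_SIZE) {
                // Extend the current read to cover this range and the gap
                current = new DiskRange(current.offset(), range.end() - current.offset());
            }
            else {
                if (current != null) {
                    merged.add(current);
                }
                current = range;
            }
        }
        if (current != null) {
            merged.add(current);
        }
        return merged;
    }
}

With the streams of several small columns sorted by file offset, this turns 
them into a handful of shared reads, while the size cap keeps a merged IO 
from reintroducing the buffer memory problem. A large column in the middle 
creates a gap bigger than MAX_MERGE_GAP, which is exactly where the sharing 
breaks down.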

-dain
