A run of 1M values is not the common case. I am not sure how Vertica stores 
run lengths, but it appears to use variable-length integers. 
ORC does not use variable-length integers for the run length. A 
variable-length integer can represent much longer runs, but for many short 
repeating runs it wastes a lot of bytes. ORC stores the run length in a 
fixed-width field (7 bits in the older version, 9 bits in the newer one), so 
it works well for shorter runs.
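
To make the two kinds of length field concrete, here is a rough sketch. It 
assumes a base-128 varint (7 payload bits per byte, as in Protocol Buffers); 
I do not know that this is what Vertica actually uses. The function names are 
made up for illustration and are not part of ORC or Vertica.

```python
def varint_size(n: int) -> int:
    """Bytes used by an unsigned base-128 varint (7 payload bits per byte)."""
    if n < 0:
        raise ValueError("run length must be non-negative")
    size = 1
    while n >= 0x80:  # more than 7 bits remain, so another byte is needed
        n >>= 7
        size += 1
    return size


def max_run_for_bits(bits: int) -> int:
    """Largest value that fits in a fixed-width field of `bits` bits."""
    return (1 << bits) - 1


if __name__ == "__main__":
    # Assumed varint layout: the length field grows with the run length.
    for run in (5, 127, 511, 1_000_000):
        print(run, varint_size(run))
    # Fixed-width fields cap the run length instead of growing.
    print(max_run_for_bits(7), max_run_for_bits(9))
```

Under these assumptions the varint field grows with the run length, while the 
fixed-width field never grows but caps the run at 127 or 511.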

There are two versions of RLE in ORC. The old version (0.11) uses 127 as the 
maximum run length so that the length fits in the lower 7 bits of a byte. The 
new version (0.12) uses 511 as the maximum run length because it stores the 
length in 9 bits. The new version also uses a different encoding for short 
runs (<10), which saves a byte. 
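
As a rough illustration of what the cap means for the 1M case, the sketch 
below splits one long run into runs no longer than a given maximum. It only 
models the caps mentioned above (127 and 511); it is not the ORC writer, and 
split_run is a hypothetical helper name.

```python
from typing import List


def split_run(total: int, max_run: int) -> List[int]:
    """Split `total` repeated values into run lengths of at most `max_run`."""
    if total < 0 or max_run <= 0:
        raise ValueError("total must be >= 0 and max_run must be > 0")
    full, rest = divmod(total, max_run)
    # `full` runs at the cap, plus one shorter trailing run if needed.
    return [max_run] * full + ([rest] if rest else [])


if __name__ == "__main__":
    for cap in (127, 511):
        runs = split_run(1_000_000, cap)
        print(cap, len(runs), runs[-1])
```

With a cap of 127 the 1M values become 7875 runs, and with a cap of 511 they 
become 1957 runs, each carrying its own header and value.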

Thanks
Prasanth Jayachandran

On Nov 11, 2013, at 6:22 AM, qihua wu <wuqihu...@gmail.com> wrote:

> In vertica, if I have a column sorted, and the same value repeat 1M times, it 
> only used very small storage as it only stores (value, 1M). But in ORC, looks 
> like the max length is less than 200 ( not very sure, but at about the same 
> level of hundreds), why restrict the max run length? 

