Hi,

I wanted to call out an observation we made while running some benchmarks 
on ORC.
I remember there was a time when the default stripe size was 256MB; now the 
default is 64MB.

We see a big penalty from staying with the default stripe size of 64MB, 
especially when compared against Parquet's default row group size of 128MB. 
The gap is magnified further because the two formats enforce the limit 
differently: ORC counts the in-memory size, while Parquet counts the 
serialized size (except for the currently active pages, which are counted 
at memory size).
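
To make the magnification concrete, here is a minimal sketch (in Java, with 
hypothetical field names, not the real ORC or Parquet writer code) of the two 
accounting modes as I understand them. Because compressed bytes are much 
smaller than the in-memory representation, a 64MB memory-based limit flushes 
far more often than a 64MB (or 128MB) serialized-size limit:

  // Hypothetical illustration of the two flush checks; names are made up.
  class FlushAccounting {
      long limitBytes;                // the configured stripe / row group limit
      long inMemoryEstimate;          // ORC: uncompressed in-memory footprint
      long serializedClosedPages;     // Parquet: compressed bytes of closed pages
      long activePageMemory;          // Parquet: in-memory size of the open pages

      // ORC-style: flush when the in-memory estimate crosses the limit.
      boolean orcShouldFlush() {
          return inMemoryEstimate >= limitBytes;
      }

      // Parquet-style: flush when serialized bytes plus open pages cross it.
      boolean parquetShouldFlush() {
          return serializedClosedPages + activePageMemory >= limitBytes;
      }
  }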

For the same data, running ORC and Parquet with default configurations, ORC 
generates 28K stripes versus 3.5K row groups for Parquet. This is also 
reflected in the performance difference: the Parquet operation (a filtered 
read of the data) took approximately half the CPU seconds compared to ORC.

Once we adjust the ORC stripe size to 512MB, which roughly matches Parquet's 
128MB row group size in terms of the serialized size of the atomic unit in 
each format, we see the following (writer configuration sketch after the 
list):
* the number of stripes and the number of row groups are roughly equivalent
* the CPU seconds used for the operation are roughly the same
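
For reference, here is roughly how we overrode the default in our run; a 
minimal sketch assuming the core Java ORC writer API (the output path and 
schema below are placeholders). The same can be achieved by setting 
orc.stripe.size in the Configuration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.orc.OrcFile;
  import org.apache.orc.TypeDescription;
  import org.apache.orc.Writer;

  public class StripeSizeExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Config-based alternative: conf.set("orc.stripe.size", "536870912");
          TypeDescription schema =
              TypeDescription.fromString("struct<x:int,y:string>");

          Writer writer = OrcFile.createWriter(
              new Path("/tmp/example.orc"),
              OrcFile.writerOptions(conf)
                     .setSchema(schema)
                     .stripeSize(512L * 1024 * 1024)); // 512MB stripe target
          // ... write VectorizedRowBatch data here ...
          writer.close();
      }
  }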

With that background, I wanted to check on the rationale for keeping the 
default stripe size at 64MB. I see from the commit history that this was 
inherited from Hive.
It would be great to get some context around 64MB as the default, and to 
hear whether we would be better served by a higher value.

Please share your thoughts.

Thanks,
Pavan
