Hi,

I wanted to call out an observation we made while running some benchmarks on ORC. I remember there was a time when the default stripe size was 256MB; now the default is 64MB.
We see a big penalty from staying with the default stripe size of 64MB, especially when compared with Parquet's default row group size of 128MB. The limits are also enforced differently: ORC checks the in-memory size, while Parquet checks the serialized size (except for the currently active pages, which are measured at in-memory size). That difference magnifies the gap further.

For the same data, written with default configurations, ORC generated 28K stripes versus Parquet's 3.5K row groups. This shows up in the performance as well: the Parquet operation (a filtered read of the data) took roughly half the CPU seconds of the ORC one. Once we raise the ORC stripe size to 512MB, so that the serialized size of the atomic unit is roughly equivalent in both formats (see the P.S. for how we set this), we see the following:

* the number of stripes and the number of row groups are roughly equivalent
* the CPU seconds used for the operation are roughly the same

With that background, I wanted to ask about the rationale for the default stripe size of 64MB. From the commit history it looks like this was inherited from Hive. It would be great to get some context around 64MB as the default, and whether we would be better off with a higher one.

Please share your thoughts.

Thanks,
Pavan
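P.S. For reference, here is a minimal sketch of how the stripe size can be raised through the ORC core Java API. The file path, schema, and row loop are made up purely for illustration; the same effect can be had by setting orc.stripe.size in the Configuration instead of calling stripeSize():

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class StripeSizeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative one-column schema; our benchmark schema is much wider.
    TypeDescription schema = TypeDescription.fromString("struct<x:bigint>");

    // Raise the stripe size from the 64MB default to 512MB.
    // Equivalent to conf.setLong("orc.stripe.size", 512L * 1024 * 1024).
    Writer writer = OrcFile.createWriter(new Path("/tmp/stripe-demo.orc"),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .stripeSize(512L * 1024 * 1024));

    // Write a trivial column so the file is valid; a real benchmark would
    // write enough data to fill many stripes.
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    for (long r = 0; r < 100_000; ++r) {
      int row = batch.size++;
      x.vector[row] = r;
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size != 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}
```

The Parquet runs used the defaults, i.e. parquet.block.size left at 128MB.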