Hi John,

Here is my experience with the stripe size. For a given table, increasing
the stripe size increases the size of each column within a stripe, which
means the ORC reader can read a column from disk more efficiently because
it can read more data sequentially (assuming the reader and the HDFS block
are co-located). However, a larger stripe size may decrease the number of
concurrent Map tasks reading an ORC file, because a Map task needs to
process at least one stripe (it seems a stripe is not splittable right now).
If you can still get a sufficient degree of parallelism, I think increasing
the stripe size generally gives you better read efficiency within a task.

On HDDs, however, the gain in read efficiency within a Map task diminishes
as the stripe size grows. So, for a table with only a few columns (assuming
a single ORC file is used), a smaller stripe size may not significantly
hurt read efficiency in a task, and you can potentially have more
concurrent tasks reading this ORC file. So I think you need to trade off
read efficiency in a single task (larger stripe size -> better read
efficiency in a task) against the degree of parallelism (smaller stripe
size -> more concurrent tasks reading an ORC file) when determining the
right stripe size.
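
To make the tradeoff concrete, here is a rough back-of-the-envelope sketch
in Python. It is only a toy model, not anything taken from the ORC or Hive
code. It assumes that a stripe is never split across tasks, that all
columns in a stripe are the same size, and that reading one column of one
stripe costs one disk seek plus a sequential transfer. The function names,
the 10 ms seek time, the 100 MB/s bandwidth, and the 1 GB file size are all
illustrative assumptions, so substitute numbers measured on your own
cluster.

```python
MB = 1024 * 1024


def max_concurrent_tasks(file_size_bytes, stripe_size_bytes):
    """Upper bound on tasks reading one file if a stripe is never split."""
    # Integer ceiling division: a partial last stripe still counts as one.
    return (file_size_bytes + stripe_size_bytes - 1) // stripe_size_bytes


def column_read_efficiency(stripe_size_bytes, num_columns,
                           seek_seconds=0.010,
                           bandwidth_bytes_per_s=100 * MB):
    """Fraction of read time spent transferring data (toy HDD model).

    Assumes equal-sized columns and one seek per column per stripe.
    The default seek time and bandwidth are assumed, not measured.
    """
    column_bytes = stripe_size_bytes / num_columns
    transfer_seconds = column_bytes / bandwidth_bytes_per_s
    return transfer_seconds / (seek_seconds + transfer_seconds)


if __name__ == "__main__":
    file_size = 1024 * MB  # hypothetical single 1 GB ORC file
    for num_columns in (2, 50):
        for stripe_mb in (16, 64, 256):
            stripe = stripe_mb * MB
            tasks = max_concurrent_tasks(file_size, stripe)
            eff = column_read_efficiency(stripe, num_columns)
            print(f"columns={num_columns:3d} stripe={stripe_mb:4d} MB "
                  f"max_tasks={tasks:3d} efficiency={eff:.2f}")
```

Under these assumptions, the efficiency gain flattens out as the stripe
grows, and a table with few columns is already close to the maximum at a
small stripe size, while the number of possible concurrent tasks keeps
dropping.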

By the way, I have a paper studying file formats that has some related
content. Here is the link:
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-13-5.pdf

Thanks,

Yin


On Tue, Nov 12, 2013 at 8:51 PM, Lefty Leverenz <leftylever...@gmail.com> wrote:

> If you get some useful advice, let's improve the doc.
>
> -- Lefty
>
>
> On Tue, Nov 12, 2013 at 6:15 PM, John Omernik <j...@omernik.com> wrote:
>
>> I am looking for guidance (read: examples) on tuning ORC settings for my
>> data.  I see the documentation that shows the defaults, as well as a brief
>> description of what each setting is.  What I am looking for is some
>> examples of things to try.  *Note: I understand that nobody wants to make
>> sweeping declarations like "set this setting" without knowing the data.*
>> That said, I would love to see some examples, specifically around:
>>
>> orc.row.index.stride
>>
>> orc.compress.size
>>
>> orc.stripe.size
>>
>>
>> For example, I'd love to see some statements like:
>>
>>
>> If your data has lots of columns of small data, and you'd like better x,
>> try changing y setting because this allows Hive to do z when querying.
>>
>>
>> If your data has few columns of large data, try changing y and this
>> allows Hive to do z while querying.
>>
>>
>> It would be really neat to see some examples so we can get in and tune
>> our data. Right now, everything is a crapshoot for me, and I don't know if
>> there are detrimental effects that may make themselves known later.
>>
>>
>> Any input would be welcome.
>>
>
>
