[
https://issues.apache.org/jira/browse/HIVE-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048531#comment-14048531
]
Gopal V commented on HIVE-7231:
-------------------------------
Tests on 1 TB show that this does cut down on padding, but it writes
progressively smaller stripes within a block.
I saw 12 MB and 8 MB stripes being written before the 3.2 MB stripe size trigger
sets in and triggers a pad event.
{code}
Resetting stripe size via (1.0 - 0.000000) * (0.663954 * 66945840) = 44448964
Resetting stripe size via (1.0 - 0.000000) * (0.495154 * 44448964) = 22009074
Resetting stripe size via (1.0 - 0.000000) * (0.358696 * 22009074) = 7894571
Resetting stripe size via (1.0 - 0.000000) * (0.263782 * 7894571) = 2082443
Resetting stripe size via (1.0 - 0.000000) * (0.581675 * 2082443) = 1211304
Resetting stripe size via (1.0 - 0.000000) * (0.814780 * 1211304) = 986946
Resetting stripe size via (1.0 - 0.000000) * (0.772579 * 986946) = 762494
{code}
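The arithmetic in the log above can be replayed with a minimal sketch. It assumes, from the shape of the logged expression only, that the first factor is a correction term (0 in every line shown) and the second is a ratio applied to the previous stripe size. The names next_stripe_size, correction, and ratio are hypothetical and are not taken from the patch. Because the log prints the ratios rounded to six decimals, the replayed sizes only approximate the logged ones.

```python
def next_stripe_size(current_size: int, ratio: float, correction: float = 0.0) -> int:
    """Mirror the logged expression: (1.0 - correction) * (ratio * current_size).

    The result is truncated to a whole number of bytes.
    """
    return int((1.0 - correction) * (ratio * current_size))


def shrink_sequence(initial_size: int, ratios, correction: float = 0.0):
    """Apply the reset repeatedly, feeding each result into the next step."""
    sizes = []
    size = initial_size
    for ratio in ratios:
        size = next_stripe_size(size, ratio, correction)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Ratios copied from the log above (rounded to six decimals there).
    logged_ratios = [0.663954, 0.495154, 0.358696, 0.263782,
                     0.581675, 0.814780, 0.772579]
    for size in shrink_sequence(66945840, logged_ratios):
        print(size)
```

Each new size is computed from the previous, already reduced size, so any ratio below 1.0 compounds. That is consistent with the progressively smaller stripes reported above.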
I think I might undo the "as a fraction of stripe size" bit and make the
padding amount a fraction of the HDFS block size instead, so that stripe
sizes stay as consistent as possible.
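The difference between the two choices can be pictured with a small sketch. It is an illustration under assumptions, not the patch's logic: the helper names, the 5% tolerance, and the 256 MB block size are all hypothetical, and the rule "pad when the space left in the block is at or below the threshold" is a simplification.

```python
MB = 1024 * 1024


def pad_threshold(reference_size: int, tolerance: float) -> int:
    """Largest leftover space (in bytes) that would be filled with padding.

    reference_size is either the stripe size or the HDFS block size,
    depending on which quantity the tolerance is defined against.
    """
    return int(tolerance * reference_size)


def should_pad(available: int, threshold: int) -> bool:
    """Pad only when some space is left and it is within the threshold."""
    return 0 < available <= threshold


if __name__ == "__main__":
    tolerance = 0.05          # assumed value, for illustration only
    block_size = 256 * MB     # assumed HDFS block size
    block_relative = pad_threshold(block_size, tolerance)
    for stripe_size in (64 * MB, 32 * MB, 8 * MB):
        stripe_relative = pad_threshold(stripe_size, tolerance)
        print(f"stripe={stripe_size} "
              f"stripe-relative threshold={stripe_relative} "
              f"block-relative threshold={block_relative}")
```

In this sketch the stripe-relative threshold shrinks along with the stripe size, so the point at which padding takes over keeps moving, while the block-relative threshold stays fixed for the whole file.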
> Improve ORC padding
> -------------------
>
> Key: HIVE-7231
> URL: https://issues.apache.org/jira/browse/HIVE-7231
> Project: Hive
> Issue Type: Improvement
> Components: File Formats
> Affects Versions: 0.14.0
> Reporter: Prasanth J
> Assignee: Prasanth J
> Labels: orcfile
> Attachments: HIVE-7231.1.patch, HIVE-7231.2.patch, HIVE-7231.3.patch,
> HIVE-7231.4.patch, HIVE-7231.5.patch, HIVE-7231.6.patch
>
>
> Current ORC padding is not optimal because stripe sizes are fixed within a
> block. The padding overhead will be significant in some cases. Also, the
> padding percentage relative to stripe size is not configurable.