Gopal V created HIVE-7428:
-----------------------------
Summary: OrcSplit fails to account for columnar projections in its
size estimates
Key: HIVE-7428
URL: https://issues.apache.org/jira/browse/HIVE-7428
Project: Hive
Issue Type: Bug
Reporter: Gopal V
Currently, ORC generates splits based on stripe offset + stripe length.
As a result, a split reports the same size no matter which columns the query
projects, even though the footer that is already read gives the estimated
sizes for each column.
This is a holdover from FileSplit, which uses getLen() as the I/O cost of
reading a file in a map-task.
RCFile didn't have a footer with column statistics information, but ORC does.
Using those statistics in the split size estimate would be extremely useful
to reduce task overheads when processing extremely wide tables with highly
selective column projections.
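
A minimal sketch of the idea, not the actual OrcSplit code: instead of
reporting stripe length, a split could report the sum of the estimated sizes
of only the projected columns. The class name ProjectedSplitSize, the method
estimate(), and the inputs columnBytes, included and fixedOverheadBytes are
all hypothetical; the sketch assumes the caller has already derived
per-column byte estimates from the ORC footer for the stripes in the split.

```java
/**
 * Hypothetical helper illustrating a projection-aware split size estimate.
 * None of these names come from Hive or ORC.
 */
public final class ProjectedSplitSize {
    private ProjectedSplitSize() {}

    /**
     * @param columnBytes        assumed per-column size estimates (bytes) for the
     *                           stripes covered by the split, indexed by column
     * @param included           included[i] is true if column i is projected
     * @param fixedOverheadBytes assumed per-split cost that is paid regardless of
     *                           projection (for example footers and indexes)
     * @return estimated bytes the task will actually read
     */
    public static long estimate(long[] columnBytes, boolean[] included,
                                long fixedOverheadBytes) {
        if (columnBytes.length != included.length) {
            throw new IllegalArgumentException(
                "columnBytes and included must have the same length");
        }
        long total = fixedOverheadBytes;
        for (int i = 0; i < columnBytes.length; i++) {
            // Only columns the query projects contribute to the I/O cost.
            if (included[i]) {
                total += columnBytes[i];
            }
        }
        return total;
    }
}
```

With such an estimate, a query that projects two narrow columns out of a wide
table would report a much smaller split length than stripe offset + stripe
length does, so split grouping could pack more stripes into each task.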
--
This message was sent by Atlassian JIRA
(v6.2#6252)