Gopal V created HIVE-7428:
-----------------------------

             Summary: OrcSplit fails to account for columnar projections in its size estimates
                 Key: HIVE-7428
                 URL: https://issues.apache.org/jira/browse/HIVE-7428
             Project: Hive
          Issue Type: Bug
            Reporter: Gopal V


Currently, ORC generates splits based on stripe offset + stripe length.

This means that splits are exactly the same size for every columnar 
projection, even though ORC reads the footer, which gives the estimated size 
of each column.

This is a holdover from FileSplit, which uses getLen() as the I/O cost of 
reading a file in a map task.

RCFile didn't have a footer with column statistics, but ORC does, and using 
those statistics in the split size estimate would be extremely useful for 
reducing task overheads when processing extremely wide tables with highly 
selective column projections.
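
To illustrate the difference, here is a rough sketch of a projection-aware 
size estimate next to the FileSplit-style one. The class and method names are 
hypothetical and are not Hive or ORC APIs; the per-column byte counts are 
assumed to come from the footer in some form.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: none of these names exist in Hive or ORC.
final class ProjectedSplitSize {

    // FileSplit-style estimate: every column counts, projected or not.
    static long fullLength(long[] columnBytes) {
        long total = 0;
        for (long bytes : columnBytes) {
            total += bytes;
        }
        return total;
    }

    // Projection-aware estimate: sum only the columns the query reads.
    // columnBytes[i] is the assumed estimated size of column i in the split.
    static long projectedLength(long[] columnBytes, List<Integer> projectedColumns) {
        long total = 0;
        for (int column : projectedColumns) {
            total += columnBytes[column];
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sizes for a four-column split, in bytes.
        long[] columnBytes = {40L << 20, 8L << 20, 200L << 20, 8L << 20};
        List<Integer> projection = Arrays.asList(1, 3);

        System.out.println("full length:      " + fullLength(columnBytes));
        System.out.println("projected length: " + projectedLength(columnBytes, projection));
    }
}
```

With an estimate like the second one, a query that reads two narrow columns 
would report a much smaller split than one that reads the whole row.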



--
This message was sent by Atlassian JIRA
(v6.2#6252)
