[ 
https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551248#comment-13551248
 ] 

Owen O'Malley commented on HIVE-3874:
-------------------------------------

Doug, of course Trevni could be modified arbitrarily to match the needs of 
Hive. But Hive will benefit more if there is a deep integration between the 
file format and the query engine. Both HBase and Accumulo have file formats 
that were originally based on Hadoop's TFile. But the need for integration with 
the query engine was such that their projects were better served by having the 
file format in their project rather than an upstream project. 

Of course the Avro project is free to copy any of the ORC code into Trevni, but 
Hive has the need to innovate in this area without asking Avro to make changes 
and waiting for them to be released. 
                
> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: OrcFileIntro.pptx
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per a file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing light weight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the type of the rows aren't stored in the file

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to