[ 
https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331425#comment-15331425
 ] 

Prasanth Jayachandran commented on HIVE-13985:
----------------------------------------------

Here is a summary of the changes:
1) Added a new proto object, FileTail (taken from the C++ version: 
https://github.com/apache/orc/blob/master/proto/orc_proto.proto#L227)
2) OrcTail is the class that wraps FileTail and the serialized footer (ByteBuffer). 
    - FileTail is used in OrcSplit to serialize a minimal footer 
(it strips off the file-level column statistics, which are not required on the task side)
    - OrcTail is cached by LocalCache (all other objects are reconstructed from 
it)
3) The file length is now encoded in the OrcSplit, which avoids one file system call to get 
the file status on the task side
4) Added a number of unit tests based on file system counters to make sure we are not 
making excessive file system calls
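
To illustrate items 3 and 4, here is a rough sketch of the idea rather than the code in the patch. The names Split, statFileLength, resolveLength and the fsCalls counter are all hypothetical stand-ins (the real classes are OrcSplit and the Hadoop file system statistics), and the file length returned by the fake lookup is made up.

```java
public class SplitLengthSketch {
    /** Counts simulated file system round trips (stand-in for real FS counters). */
    static int fsCalls = 0;

    /** Hypothetical stand-in for a file-status lookup; the returned length is made up. */
    static long statFileLength(String path) {
        fsCalls++;
        return 4096L;
    }

    /** Hypothetical split that optionally carries the file length. */
    static final class Split {
        final String path;
        final long fileLength; // -1 when the length was not recorded in the split

        Split(String path, long fileLength) {
            this.path = path;
            this.fileLength = fileLength;
        }
    }

    /** Uses the length from the split when present, otherwise asks the file system. */
    static long resolveLength(Split split) {
        return split.fileLength >= 0 ? split.fileLength : statFileLength(split.path);
    }

    public static void main(String[] args) {
        Split withLength = new Split("/warehouse/t/000000_0", 2048L);
        Split withoutLength = new Split("/warehouse/t/000000_0", -1L);

        long a = resolveLength(withLength);    // no file system call needed
        int callsAfterFirst = fsCalls;
        long b = resolveLength(withoutLength); // falls back to one lookup
        int callsAfterSecond = fsCalls;

        System.out.println(a + " bytes, fs calls so far: " + callsAfterFirst);
        System.out.println(b + " bytes, fs calls so far: " + callsAfterSecond);
    }
}
```

A counter-based test then only has to assert on the number of calls, not on timing, which is what makes this kind of regression easy to catch.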

[~sershe] could you please review the changes?

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-2.1.patch, 
> HIVE-13985.1.patch, HIVE-13985.2.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during 
> split generation. Similarly, this jira will fix issues with additional file 
> system invocations on the task side. To avoid reading footers on the task 
> side, users can set hive.orc.splits.include.file.footer to true, which will 
> serialize the ORC footers in the splits. However, this also serializes 
> unwanted information, such as column statistics and other metadata, that is 
> not required for reading an ORC split on the task side. We can reduce the 
> payload of the ORC splits by serializing only the minimum required 
> information (stripe information, types, compression details), which can 
> potentially avoid OOMs in the application master (AM) during split 
> generation. This jira also addresses other issues concerning the AM cache. 
> The local cache used by the AM is a soft reference cache, which can introduce 
> unpredictability across multiple runs of the same query. We can cache the 
> serialized footer in the local cache and use a strong reference cache 
> instead, which should avoid memory pressure and give better predictability.
> One other improvement applies when hive.orc.splits.include.file.footer is 
> set to false: on the task side we make one additional file system call to 
> learn the size of the file. If we serialize the file length in the ORC 
> split, this call can be avoided.
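
As a rough illustration of the strong reference cache mentioned in the description: the sketch below is one simple way to get predictable eviction, using a size-bounded LRU map over serialized footers. It is an assumption for illustration only, not necessarily how LocalCache is implemented in the patch; FooterCache and the file names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class StrongFooterCacheSketch {
    /** Size-bounded LRU map that holds strong references to serialized footers. */
    static final class FooterCache extends LinkedHashMap<String, ByteBuffer> {
        private final int maxEntries;

        FooterCache(int maxEntries) {
            super(16, 0.75f, true); // access order, so get() refreshes recency
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, ByteBuffer> eldest) {
            // Eviction depends only on entry count, never on GC behavior.
            return size() > maxEntries;
        }
    }

    public static void main(String[] args) {
        FooterCache cache = new FooterCache(2);
        cache.put("a.orc", ByteBuffer.wrap(new byte[] {1}));
        cache.put("b.orc", ByteBuffer.wrap(new byte[] {2}));
        cache.get("a.orc");                                  // a.orc becomes most recent
        cache.put("c.orc", ByteBuffer.wrap(new byte[] {3})); // evicts b.orc
        System.out.println(cache.keySet());                  // prints [a.orc, c.orc]
    }
}
```

Unlike a soft reference cache, what stays in this map is decided by the access pattern alone, so repeated runs of the same query see the same hits and misses.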



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
