[jira] [Updated] (HIVE-13985) ORC improvements for reducing the file system calls in task side

Prasanth Jayachandran (JIRA) Thu, 16 Jun 2016 16:28:57 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Prasanth Jayachandran updated HIVE-13985:
-----------------------------------------
    Target Version/s: 1.3.0, 2.2.0  (was: 2.2.0)

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.3.0, 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, 
> HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, 
> HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch, 
> HIVE-13985.3.patch
>
>
> HIVE-13840 fixed some issues with addition file system invocations during 
> split generation. Similarly, this jira will fix issues with additional file 
> system invocations on the task side. To avoid reading footers on the task 
> side, users can set hive.orc.splits.include.file.footer to true which will 
> serialize the orc footers on the splits. But this has issues with serializing 
> unwanted information like column statistics and other metadata which are not 
> really required for reading orc split on the task side. We can reduce the 
> payload on the orc splits by serializing only the minimum required 
> information (stripe information, types, compression details). This will 
> decrease the payload on the orc splits and can potentially avoid OOMs in 
> application master (AM) during split generation. This jira also address other 
> issues concerning the AM cache. The local cache used by AM is soft reference 
> cache. This can introduce unpredictability across multiple runs of the same 
> query. We can cache the serialized footer in the local cache and also use 
> strong reference cache which should avoid memory pressure and will have 
> better predictability.
> One other improvement that we can do is when 
> hive.orc.splits.include.file.footer is set to false, on the task side we make 
> one additional file system call to know the size of the file. If we can 
> serialize the file length in the orc split this can be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-13985) ORC improvements for reducing the file system calls in task side

Reply via email to