[ https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prasanth Jayachandran updated HIVE-13985:
-----------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.1.0
                   1.3.0
           Status: Resolved  (was: Patch Available)

Thanks [~sershe] for the reviews! Committed to branch-2.1 and master as well.

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.3.0, 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>             Fix For: 1.3.0, 2.1.0, 2.2.0
>
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch,
> HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch,
> HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch,
> HIVE-13985.3.patch, HIVE-13985.4.patch, HIVE-13985.5.patch, HIVE-13985.6.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during
> split generation. Similarly, this jira will fix issues with additional file
> system invocations on the task side. To avoid reading footers on the task
> side, users can set hive.orc.splits.include.file.footer to true, which
> serializes the ORC footers into the splits. But this also serializes
> unwanted information, such as column statistics and other metadata, that is
> not required for reading an ORC split on the task side. We can reduce the
> payload on the ORC splits by serializing only the minimum required
> information (stripe information, types, compression details). This will
> decrease the payload on the ORC splits and can potentially avoid OOMs in
> the application master (AM) during split generation. This jira also
> addresses other issues concerning the AM cache. The local cache used by the
> AM is a soft reference cache. This can introduce unpredictability across
> multiple runs of the same query. We can cache the serialized footer in the
> local cache and also use a strong reference cache, which should avoid
> memory pressure and will have better predictability.
> One other improvement concerns the case where
> hive.orc.splits.include.file.footer is set to false: on the task side we
> make one additional file system call to learn the size of the file. If we
> serialize the file length in the ORC split, this call can be avoided.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
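
For illustration, the strong reference cache of serialized footers proposed in the description could be sketched roughly as follows. This is a minimal Java sketch, not the code from the attached patches: the class name SerializedFooterCache, its methods, the string file key, and the LRU eviction policy built on LinkedHashMap are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a bounded, strong-reference cache that stores
 * footers in serialized form (byte[]). Not Hive's actual implementation.
 */
final class SerializedFooterCache {
    private final LinkedHashMap<String, byte[]> entries;

    SerializedFooterCache(final int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        // accessOrder=true keeps entries ordered from least to most recently
        // used, so the "eldest" entry is the LRU eviction candidate.
        this.entries = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict explicitly once the bound is exceeded. Entries are
                // held by strong references, so the garbage collector never
                // drops them behind our back as it may with soft references.
                return size() > maxEntries;
            }
        };
    }

    /** Stores a defensive copy of the serialized footer for the given file key. */
    synchronized void put(String fileKey, byte[] serializedFooter) {
        entries.put(fileKey, serializedFooter.clone());
    }

    /** Returns a copy of the cached bytes, or null if the key is not cached. */
    synchronized byte[] get(String fileKey) {
        byte[] cached = entries.get(fileKey);
        return cached == null ? null : cached.clone();
    }

    synchronized int size() {
        return entries.size();
    }
}
```

With a fixed entry bound, what stays in the cache depends only on the access pattern and not on garbage-collector timing, which is the predictability argument made in the description. A real cache would more likely be bounded by total bytes than by entry count; the entry bound is used here only to keep the sketch short.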