[ https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331425#comment-15331425 ]
Prasanth Jayachandran commented on HIVE-13985:
----------------------------------------------

Here is a summary of the changes:

1) Added a new proto object, FileTail (this is from the C++ version: https://github.com/apache/orc/blob/master/proto/orc_proto.proto#L227).
2) OrcTail is the class that wraps FileTail and the serialized footer (ByteBuffer).
   - FileTail is used in OrcSplit to serialize a minimal footer (it strips off the file-level column statistics, which are not required on the task side).
   - OrcTail is cached by LocalCache (all other objects are reconstructed from it).
3) Encoded the file length in the OrcSplit, which avoids one file system call to get the file status on the task side.
4) Added a bunch of file system counter based unit tests to make sure we are not making excessive file system calls.

[~sershe] could you please review the changes?

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-2.1.patch,
>                      HIVE-13985.1.patch, HIVE-13985.2.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during
> split generation. Similarly, this jira will fix issues with additional file
> system invocations on the task side. To avoid reading footers on the task
> side, users can set hive.orc.splits.include.file.footer to true, which will
> serialize the orc footers on the splits. But this has issues with serializing
> unwanted information, like column statistics and other metadata, which is not
> really required for reading an orc split on the task side. We can reduce the
> payload on the orc splits by serializing only the minimum required
> information (stripe information, types, compression details). This will
> decrease the payload on the orc splits and can potentially avoid OOMs in the
> application master (AM) during split generation. This jira also addresses
> other issues concerning the AM cache. The local cache used by the AM is a soft
> reference cache. This can introduce unpredictability across multiple runs of
> the same query. We can cache the serialized footer in the local cache and
> also use a strong reference cache, which should avoid memory pressure and
> will have better predictability.
> One other improvement that we can make: when
> hive.orc.splits.include.file.footer is set to false, on the task side we make
> one additional file system call to know the size of the file. If we can
> serialize the file length in the orc split, this can be avoided.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
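
To illustrate the file-length point above, here is a minimal sketch of how carrying the length in the split can save one file system call on the task side, with a call counter in the spirit of the counter-based unit tests. It is not the code from the patch: SplitFileLengthSketch, CountingFileSystem, Split and resolveFileLength are hypothetical names invented for this example, and the real OrcSplit and Hadoop FileSystem APIs are deliberately not used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

public class SplitFileLengthSketch {

    /** Hypothetical stand-in for a file system that counts file status lookups. */
    static final class CountingFileSystem {
        private final Map<String, Long> lengths = new HashMap<>();
        private int statusCalls = 0;

        void addFile(String path, long length) {
            lengths.put(path, length);
        }

        /** Simulates the file status call that the task side would otherwise make. */
        long getFileLength(String path) {
            statusCalls++;
            Long length = lengths.get(path);
            if (length == null) {
                throw new IllegalArgumentException("No such file: " + path);
            }
            return length;
        }

        int statusCalls() {
            return statusCalls;
        }
    }

    /** Hypothetical split that may carry the file length recorded at split generation. */
    static final class Split {
        final String path;
        final OptionalLong fileLength;

        Split(String path, OptionalLong fileLength) {
            this.path = path;
            this.fileLength = fileLength;
        }
    }

    /** Uses the length from the split when present; otherwise asks the file system. */
    static long resolveFileLength(Split split, CountingFileSystem fs) {
        return split.fileLength.isPresent()
                ? split.fileLength.getAsLong()
                : fs.getFileLength(split.path);
    }

    public static void main(String[] args) {
        CountingFileSystem fs = new CountingFileSystem();
        fs.addFile("/warehouse/t/000000_0", 4096L);

        Split withLength = new Split("/warehouse/t/000000_0", OptionalLong.of(4096L));
        Split withoutLength = new Split("/warehouse/t/000000_0", OptionalLong.empty());

        resolveFileLength(withLength, fs);
        System.out.println("calls with length in split: " + fs.statusCalls());

        resolveFileLength(withoutLength, fs);
        System.out.println("calls after a split without length: " + fs.statusCalls());
    }
}
```

In this sketch the split that carries its length never touches the file system, while the one without it costs a single lookup, which is the call the change aims to remove.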