Pankit Thapar created HIVE-8038:
-----------------------------------
Summary: Decouple ORC file split calculation logic from FileSystem's
getFileBlockLocations() implementation
Key: HIVE-8038
URL: https://issues.apache.org/jira/browse/HIVE-8038
Project: Hive
Issue Type: Improvement
Components: File Formats
Affects Versions: 0.13.1
Reporter: Pankit Thapar
Fix For: 0.14.0
What is the Current Logic
======================
1. Get the file blocks from FileSystem.getFileBlockLocations(), which returns
an array of BlockLocation.
2. In SplitGenerator.createSplit(), check whether the split spans one block or
multiple blocks.
3. If the split spans just one block, use the array index (index =
offset/blockSize) to get the hosts of the corresponding BlockLocation, as
sketched after this list.
4. If the split spans multiple blocks, get all hosts that hold at least 80% of
the maximum amount of split data held by any single host.
5. Add the split to the list of splits.
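For illustration, a minimal sketch of this array-based lookup, assuming uniform
block sizes; the class and variable names are illustrative and not the actual
Hive code:

{code:java}
// Sketch of the current array-based lookup (illustrative, not the actual
// Hive code). The direct index into 'blocks' is only valid when every block
// except the last one is exactly blockSize bytes long.
import java.io.IOException;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

class ArrayBasedLookup {
  static String[] hostsForSplit(FileSystem fs, FileStatus file,
                                long offset, long length) throws IOException {
    BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
    long blockSize = file.getBlockSize();
    if ((offset % blockSize) + length <= blockSize) {
      // One-block case: index directly into the array.
      int index = (int) (offset / blockSize);
      return blocks[index].getHosts();
    }
    // Multi-block case: scan every block, accumulate bytes per host, then keep
    // the hosts holding at least 80% of the maximum per-host total (omitted).
    return new String[0];
  }
}
{code}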
Issue with Current Logic
=====================
The split calculation depends on the FileSystem API's block location logic.
getFileBlockLocations() returns an array, and indexing directly into that
array by offset/blockSize is only correct if the FileSystem guarantees that
all blocks are of the same size.
What is the Fix
=============
1a. Get the file blocks from FileSystem.getFileBlockLocations(), which returns
an array of BlockLocation.
1b. Convert the array into a TreeMap<offset, BlockLocation> and return it
through getLocationsWithOffSet().
2. In SplitGenerator.createSplit(), check whether the split spans one block or
multiple blocks.
3. If the split spans just one block, use TreeMap.floorEntry(key) to get the
highest entry whose offset is not greater than the split's offset, and get the
corresponding hosts (see the sketch after this list).
4a. If the split spans multiple blocks, get a submap containing all
BlockLocation entries from offset to offset + length.
4b. Get all hosts that hold at least 80% of the maximum amount of split data
held by any single host.
5. Add the split to the list of splits.
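A minimal sketch of the TreeMap-based lookup, assuming the map is keyed by each
block's starting offset; names here are illustrative, and the real change is
wired through the SHIMS layer via getLocationsWithOffSet():

{code:java}
// Sketch of the proposed TreeMap-based lookup (illustrative names only).
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

import org.apache.hadoop.fs.BlockLocation;

class TreeMapBasedLookup {
  // 1b. Index the blocks by starting offset; built once per file.
  static TreeMap<Long, BlockLocation> indexByOffset(BlockLocation[] blocks) {
    TreeMap<Long, BlockLocation> byOffset = new TreeMap<>();
    for (BlockLocation block : blocks) {
      byOffset.put(block.getOffset(), block);
    }
    return byOffset;
  }

  static String[] hostsForSplit(TreeMap<Long, BlockLocation> byOffset,
                                long offset, long length) throws IOException {
    // 3. floorEntry: the block whose start offset is the greatest one <= offset.
    Map.Entry<Long, BlockLocation> start = byOffset.floorEntry(offset);
    if (offset + length <= start.getKey() + start.getValue().getLength()) {
      // One-block case: the split ends inside the starting block.
      return start.getValue().getHosts();
    }
    // 4a. Visit only the blocks overlapping [offset, offset + length).
    NavigableMap<Long, BlockLocation> range =
        byOffset.subMap(start.getKey(), true, offset + length, false);
    // 4b. Accumulate bytes per host over 'range' and keep the hosts holding at
    // least 80% of the maximum per-host total (accumulation omitted).
    return new String[0];
  }
}
{code}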
What are the major changes in logic
==============================
1. Store the BlockLocations in a map instead of an array.
2. Call SHIMS.getLocationsWithOffSet() instead of getLocations().
3. The one-block case is checked with "if (offset + length <= start.getOffset() +
start.getLength())" instead of "if ((offset % blockSize) + length <= blockSize)";
both checks are shown side by side below.
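For clarity, the two one-block checks side by side as hypothetical helper
methods; here 'start' stands for the BlockLocation returned by floorEntry for
the split's offset:

{code:java}
import org.apache.hadoop.fs.BlockLocation;

class OneBlockCheck {
  // Old check: assumes every block is exactly blockSize bytes long.
  static boolean oldCheck(long offset, long length, long blockSize) {
    return (offset % blockSize) + length <= blockSize;
  }

  // New check: uses the actual extent of the starting block, so uniform block
  // sizes are no longer required.
  static boolean newCheck(long offset, long length, BlockLocation start) {
    return offset + length <= start.getOffset() + start.getLength();
  }
}
{code}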
What is the effect on Complexity (Big O)
=================================
1. We add an O(n) loop to build the TreeMap from the array, but it is a
one-time cost per file and is not paid for each split.
2. In the one-block case, the lookup is O(log n) in the worst case, where it
was O(1) before.
3. Getting the submap is O(log n).
4. In the multiple-block case, building the list of hosts is O(m) instead of
O(n), with m < n, because we now iterate only over the blocks that fall in the
split's offset range instead of over all block locations.
What are the benefits of the change
==============================
1. With this fix, we no longer depend on the layout of the BlockLocation array
returned by the FileSystem to find the block corresponding to an offset and
block size.
2. It is also no longer required that all blocks have the same length on every
FileSystem.
3. Previously we used blockSize for the one-block case and block.length for
the multiple-block case; now we locate the block using the actual offset and
length of each block.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)