[
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650904#action_12650904
]
Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------
a few comments:
- can u explain whether the blocks are split by racks or nodes? (the data
structure says nodeToBlock etc. - but all the comments refer to 'racks')
if we are combining splits by nodes - then wouldn't it make sense to also sort
the nodes by racks first (and perhaps only then by number of blocks)? (so that
we can combine blocks that cannot be combined within a given node with other
blocks in the same rack?)
- in getMoreSplits() - i didn't understand:
+ if (minSize == 0 || curSplitSize > minSize || !iter.hasNext()) {
it seems iter.hasNext() is being used to try to detect the end of the loop -
but iter,hasNext() can be false even in the middle of the loop - right? (the
way i am reading it - iter is being used to 'seek' to the current rack(/node)
in the nodeToBlocks array based on sorting by number of blocks in rack)
- somewhat confused by how overflow blocks (curSplitSize < minSize) are being
handled. looks like with current scheme - if there is even one overflow block
from current rack - then it will be combined with blocks available from the
next rack. this seems to have some issues:
* the racks list is going to have both the racks - but the blocks are
probably overwhelmingly dominated by the next rack
* the racks list is not cleared after the overflow blocks are dealt with in
the first split created on the next rack. so the next of splits will all have
the previous rack in the racks list unnecessarily (I presume this will lead to
incorrect inference about the locality of splits)
instead - we could have collected all the overflow blocks from each rack and
just combined them separately into splits at the end. would be simpler to
understand/code and not change the overall amount of locality much i imagine
> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
> Key: HADOOP-4565
> URL: https://issues.apache.org/jira/browse/HADOOP-4565
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Attachments: CombineMultiFile.patch, CombineMultiFile2.patch,
> CombineMultiFile3.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on
> file sizes. Each splits contains a few files an each split are roughly equal
> in size. It would be efficient if we can extend this InputFormat to create
> splits such each all the blocks in one split and either node-local or
> rack-local.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.