[
https://issues.apache.org/jira/browse/HADOOP-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651097#action_12651097
]
Joydeep Sen Sarma commented on HADOOP-4565:
-------------------------------------------
a few other comments:
- do we think this patch totally supersedes
MultiFileInputFormat/MultiFileSplit?
- if not - should CombineFileSplit extend MultiFileSplit? (the argument being
that in that case CombineFileRecordReader can work for both MultiFileSplit and
CombineFileSplit).
In general - this is not going to be the last implementation of a
multifilesplit/format - so it would be good to have the surrounding classes
(recordreaders etc.) be built in a way that more implementations of a
multifilesplit can be easily accommodated.
- CombineFileInputFormat does not implement getRecordReader (throws an
exception) - shouldn't it just be an abstract class then?
- one of the bigger problems with MultiFileInputFormat was the lack of concrete
implementations. I think it makes sense to provide at least a full
implementation of CombineFileInputFormat for text files (and perhaps
SequenceFiles) that lay users can use without writing code.
- as an aside - i don't understand now why sorting racks/nodes by number of
blocks matters at all. for each rack/node - one would coalesce blocks into
splits. whatever overflows goes into a miscellaneous bucket. this protocol does
not depend on walking through the racks/nodes in a particular order. what seems
more important is that overflow blocks are first combined by rack (but i am
confused about the whole rack vs. node thing)
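
A rough sketch of the coalescing protocol described in the last point, to make
it concrete. This is not code from the patch: CoalesceSketch, Block, pack and
coalesce are hypothetical names, each block is assumed to live on exactly one
node (replicas are ignored, which is a simplification), and a split is closed
as soon as it reaches splitSize bytes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the patch.
class CoalesceSketch {

    // A block with a single location (assumption: no replicas).
    static final class Block {
        final String rack;
        final String node;
        final long length;

        Block(String rack, String node, long length) {
            this.rack = rack;
            this.node = node;
            this.length = length;
        }
    }

    // Greedily packs blocks into splits of at least splitSize bytes.
    // Blocks that do not fill a split are appended to overflow.
    static void pack(List<Block> blocks, long splitSize,
                     List<List<Block>> splits, List<Block> overflow) {
        List<Block> current = new ArrayList<>();
        long size = 0;
        for (Block b : blocks) {
            current.add(b);
            size += b.length;
            if (size >= splitSize) {
                splits.add(current);
                current = new ArrayList<>();
                size = 0;
            }
        }
        overflow.addAll(current);
    }

    static List<List<Block>> coalesce(List<Block> blocks, long splitSize) {
        // Group blocks by node (keyed by rack/node to keep names unique).
        Map<String, List<Block>> byNode = new LinkedHashMap<>();
        for (Block b : blocks) {
            byNode.computeIfAbsent(b.rack + "/" + b.node,
                    k -> new ArrayList<>()).add(b);
        }

        List<List<Block>> splits = new ArrayList<>();

        // Pass 1: node-local splits; leftovers overflow to the node's rack.
        Map<String, List<Block>> byRack = new LinkedHashMap<>();
        for (List<Block> nodeBlocks : byNode.values()) {
            String rack = nodeBlocks.get(0).rack;
            pack(nodeBlocks, splitSize, splits,
                    byRack.computeIfAbsent(rack, k -> new ArrayList<>()));
        }

        // Pass 2: rack-local splits; leftovers go to the miscellaneous bucket.
        List<Block> misc = new ArrayList<>();
        for (List<Block> rackBlocks : byRack.values()) {
            pack(rackBlocks, splitSize, splits, misc);
        }

        // Pass 3: pack the miscellaneous bucket; a final partial split is kept.
        List<Block> rest = new ArrayList<>();
        pack(misc, splitSize, splits, rest);
        if (!rest.isEmpty()) {
            splits.add(rest);
        }
        return splits;
    }
}
```

Under the single-location assumption, the set of blocks that overflow from a
node or rack does not depend on the order in which nodes and racks are
visited; only the order of the resulting splits changes.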
> MultiFileInputSplit can use data locality information to create splits
> ----------------------------------------------------------------------
>
> Key: HADOOP-4565
> URL: https://issues.apache.org/jira/browse/HADOOP-4565
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
> Attachments: CombineMultiFile.patch, CombineMultiFile2.patch,
> CombineMultiFile3.patch
>
>
> The MultiFileInputFormat takes a set of paths and creates splits based on
> file sizes. Each split contains a few files, and the splits are roughly equal
> in size. It would be efficient if we could extend this InputFormat to create
> splits such that all the blocks in one split are either node-local or
> rack-local.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.