[ https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610930#comment-13610930 ]
Josh Wills commented on CRUNCH-165:
-----------------------------------
Yeah, I read through Pig's implementation of this last night. It's basically a
bin packing problem: we get all of the original input splits from a particular
input format, and we have a target size for each of the splits we want to run.
If one of the original splits is greater than the target size, then we run it
as is. For the splits that are smaller than the target size, we try to package
them together into a single combined split. Pig's impl is sort of complicated;
I think that doing something simple based on the first-fit decreasing heuristic
[1] will do the trick here. I'm going to make this my project for today.
[1] http://en.wikipedia.org/wiki/Bin_packing_problem
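
For illustration only, here is a minimal sketch of how first-fit decreasing could be applied to the packing described above. It is not the Crunch patch or Pig's code: it works on plain split sizes in bytes rather than Hadoop InputSplit objects, and the names SplitPacker and pack are hypothetical. It also treats a split exactly equal to the target size as "run as is", which is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Crunch or Pig: groups split sizes (bytes)
// into combined splits using the first-fit decreasing heuristic.
final class SplitPacker {

  // Sentinel capacity for a bin that must not receive further splits.
  private static final long CLOSED = -1L;

  static List<List<Long>> pack(List<Long> splitSizes, long targetSize) {
    // Sort a copy largest-first: the "decreasing" part of the heuristic.
    List<Long> sorted = new ArrayList<>(splitSizes);
    sorted.sort(Collections.reverseOrder());

    List<List<Long>> bins = new ArrayList<>();
    List<Long> remaining = new ArrayList<>(); // free capacity of each bin

    for (long size : sorted) {
      if (size >= targetSize) {
        // At or above the target: run the split as is, in a closed bin.
        List<Long> own = new ArrayList<>();
        own.add(size);
        bins.add(own);
        remaining.add(CLOSED);
        continue;
      }
      // "First fit": place the split in the first bin with enough room.
      boolean placed = false;
      for (int i = 0; i < bins.size(); i++) {
        if (remaining.get(i) >= size) {
          bins.get(i).add(size);
          remaining.set(i, remaining.get(i) - size);
          placed = true;
          break;
        }
      }
      if (!placed) {
        // No existing bin has room, so open a new combined split.
        List<Long> bin = new ArrayList<>();
        bin.add(size);
        bins.add(bin);
        remaining.add(targetSize - size);
      }
    }
    return bins;
  }
}
```

With a target size of 100 and split sizes 150, 60, 50, 40, 30 and 20, this sketch yields three groups: [150], [60, 40] and [50, 30, 20]. A real implementation would probably also want to account for data locality when grouping splits, which this sketch ignores.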
> Pipelines should automatically use CombineFileInputFormat where input consists of many small files
> --------------------------------------------------------------------------------------------------
>
> Key: CRUNCH-165
> URL: https://issues.apache.org/jira/browse/CRUNCH-165
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.4.0
> Reporter: Dave Beech
> Assignee: Josh Wills
> Attachments: CRUNCH-165.patch
>
>
> Hive had a feature introduced in HIVE-74 whereby CombineFileInputFormat would
> be used if the input data consisted of many small files, making the resulting
> mapreduce jobs more efficient by giving individual mappers more data to
> process. This would be a nice feature for Crunch to have, too.