[
https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831161#comment-13831161
]
Chao Shi commented on CRUNCH-165:
---------------------------------
Hi guys,
I've run into a problem introduced by this patch. I'm using HFileInputFormat, which
overrides FileInputFormat's listStatus() to pick up input files deeper in the
directory hierarchy. When I pass several input paths to HFileSource, the input
format is wrapped in a CrunchCombineFileInputFormat. When
CrunchCombineFileInputFormat#getSplits is called, it does not call the wrapped
HFileInputFormat#listStatus. Instead, it calls FileInputFormat's, because that is
how CombineFileInputFormat is implemented. The wrapping happens here:
{code}
if (format instanceof FileInputFormat &&
    !conf.getBoolean(RuntimeParameters.DISABLE_COMBINE_FILE, false)) {
  format = new CrunchCombineFileInputFormat<Object, Object>(job);
}
{code}
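The dispatch problem can be reproduced without Hadoop. The sketch below is only a model: BaseFormat, DeepFormat and CombiningFormat are hypothetical stand-ins for FileInputFormat, HFileInputFormat and the combining wrapper, not the real classes, and the file names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListStatusDispatchSketch {
    // Hypothetical stand-in for FileInputFormat: lists only top-level files.
    static class BaseFormat {
        List<String> listStatus() {
            return new ArrayList<>(Arrays.asList("top-level-file"));
        }

        // Splits are derived from whatever listStatus() returns.
        List<String> getSplits() {
            return listStatus();
        }
    }

    // Hypothetical stand-in for HFileInputFormat: also picks up nested files.
    static class DeepFormat extends BaseFormat {
        @Override
        List<String> listStatus() {
            List<String> files = super.listStatus();
            files.add("family/nested-hfile");
            return files;
        }
    }

    // Hypothetical stand-in for the combining wrapper. It is itself a
    // BaseFormat, so getSplits() uses the inherited listStatus() and the
    // wrapped format's override is never consulted.
    static class CombiningFormat extends BaseFormat {
        final BaseFormat wrapped;

        CombiningFormat(BaseFormat wrapped) {
            this.wrapped = wrapped;
        }
    }

    public static void main(String[] args) {
        // Used directly, the override is honoured.
        System.out.println(new DeepFormat().getSplits());
        // Once wrapped, the nested file is silently dropped.
        System.out.println(new CombiningFormat(new DeepFormat()).getSplits());
    }
}
```

In this model the second call loses the nested file because the wrapper inherits its listing logic instead of delegating to the format it wraps, which matches the symptom described above.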
A straightforward fix is to change "format instanceof FileInputFormat" to
"format.getClass() == FileInputFormat.class", but this limits the optimization
to sequence files only. I'd welcome any better ideas.
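To make the trade-off concrete, here is a small sketch comparing three wrapping conditions on hypothetical stand-in classes (BaseFormat, SeqFormat and DeepFormat are not Hadoop classes). The third check, which skips wrapping when a format overrides listStatus(), is only one possible direction to explore. It is an assumption of this sketch, not something proposed in the patch.

```java
public class WrapCheckSketch {
    // Hypothetical stand-in for FileInputFormat.
    static class BaseFormat {
        protected String listStatus() {
            return "default listing";
        }
    }

    // Hypothetical format that keeps the default listing behaviour.
    static class SeqFormat extends BaseFormat {
    }

    // Hypothetical format that customises the listing, like HFileInputFormat.
    static class DeepFormat extends BaseFormat {
        @Override
        protected String listStatus() {
            return "deep listing";
        }
    }

    // Current condition: wraps every subclass, including DeepFormat.
    static boolean wrapIfInstance(Object format) {
        return format instanceof BaseFormat;
    }

    // Exact-class condition: wraps only the single allowed class.
    static boolean wrapIfExactClass(Object format, Class<?> allowed) {
        return format.getClass() == allowed;
    }

    // Assumed alternative: wrap any subclass that leaves listStatus() alone.
    static boolean wrapIfNoOverride(Object format) {
        return format instanceof BaseFormat && !overridesListStatus(format);
    }

    // Walks from the concrete class up to (but excluding) BaseFormat and
    // reports whether any class on the way declares its own listStatus().
    static boolean overridesListStatus(Object format) {
        for (Class<?> c = format.getClass();
             c != null && c != BaseFormat.class;
             c = c.getSuperclass()) {
            try {
                c.getDeclaredMethod("listStatus");
                return true;
            } catch (NoSuchMethodException e) {
                // Not declared at this level; keep walking up.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Object seq = new SeqFormat();
        Object deep = new DeepFormat();
        System.out.println("instanceof:  seq=" + wrapIfInstance(seq)
                + " deep=" + wrapIfInstance(deep));
        System.out.println("exact class: seq=" + wrapIfExactClass(seq, SeqFormat.class)
                + " deep=" + wrapIfExactClass(deep, SeqFormat.class));
        System.out.println("no override: seq=" + wrapIfNoOverride(seq)
                + " deep=" + wrapIfNoOverride(deep));
    }
}
```

The exact-class check is safe but excludes every other format with default listing behaviour. The reflection-based check is broader, though it relies on method names and would need care in real code where listStatus() takes arguments.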
> Pipelines should automatically use CombineFileInputFormat where input
> consists of many small files
> --------------------------------------------------------------------------------------------------
>
> Key: CRUNCH-165
> URL: https://issues.apache.org/jira/browse/CRUNCH-165
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.4.0
> Reporter: Dave Beech
> Assignee: Josh Wills
> Fix For: 0.8.0
>
> Attachments: CRUNCH-165-jwills.patch, CRUNCH-165-v3.patch,
> CRUNCH-165-v4.patch, CRUNCH-165.patch
>
>
> HIVE-74 introduced a feature in Hive whereby CombineFileInputFormat is used
> when the input data consists of many small files, making the resulting
> MapReduce jobs more efficient by giving individual mappers more data to
> process. This would be a nice feature for Crunch to have, too.