[jira] [Commented] (CRUNCH-165) Pipelines should automatically use CombineFileInputFormat where input consists of many small files

Chao Shi (JIRA) Sun, 24 Nov 2013 19:40:16 -0800

    [ https://issues.apache.org/jira/browse/CRUNCH-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831161#comment-13831161 ]


Chao Shi commented on CRUNCH-165:
---------------------------------

Hi guys,

I ran into a problem introduced by this patch. I'm using HFileInputFormat, 
which overrides FileInputFormat's listStatus() to pick up input files deeper in 
the directory hierarchy. When I pass several input paths to HFileSource, the 
format is wrapped in a CrunchCombineFileInputFormat. When 
CrunchCombineFileInputFormat#getSplits is called, it does not call the wrapped 
HFileInputFormat#listStatus; it calls FileInputFormat's version instead. This 
behavior comes from CombineFileInputFormat. The wrapping happens here:

{code}
      if (format instanceof FileInputFormat
          && !conf.getBoolean(RuntimeParameters.DISABLE_COMBINE_FILE, false)) {
        format = new CrunchCombineFileInputFormat<Object, Object>(job);
      }
{code}
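
To make the dispatch problem concrete, here is a minimal, Hadoop-free sketch of 
what I believe is going on. All class names (BaseFormat, NestedFormat, 
CombiningWrapper) are hypothetical stand-ins for FileInputFormat, 
HFileInputFormat and CrunchCombineFileInputFormat. It is a simplified model, 
not the actual Crunch or Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListStatusDispatchSketch {

    // Hypothetical stand-in for FileInputFormat: lists the input paths as given.
    static class BaseFormat {
        List<String> listStatus(List<String> inputPaths) {
            return new ArrayList<>(inputPaths);
        }

        // Splits are computed from whatever this object's listStatus returns.
        List<String> getSplits(List<String> inputPaths) {
            return listStatus(inputPaths);
        }
    }

    // Hypothetical stand-in for HFileInputFormat: picks files deeper in the hierarchy.
    static class NestedFormat extends BaseFormat {
        @Override
        List<String> listStatus(List<String> inputPaths) {
            List<String> files = new ArrayList<>();
            for (String path : inputPaths) {
                files.add(path + "/nested/file");
            }
            return files;
        }
    }

    // Hypothetical stand-in for the combining wrapper: it extends the base class
    // and keeps the original format around, but inherits getSplits/listStatus
    // from BaseFormat, so the wrapped format's override is never consulted.
    static class CombiningWrapper extends BaseFormat {
        private final BaseFormat wrapped;

        CombiningWrapper(BaseFormat wrapped) {
            this.wrapped = wrapped;
        }

        BaseFormat getWrapped() {
            return wrapped;
        }
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("/data/a", "/data/b");

        // Used directly, the override is honored.
        System.out.println(new NestedFormat().getSplits(inputs));
        // prints [/data/a/nested/file, /data/b/nested/file]

        // Once wrapped, the base listing is used and the nested files are missed.
        System.out.println(new CombiningWrapper(new NestedFormat()).getSplits(inputs));
        // prints [/data/a, /data/b]
    }
}
```

In this model the wrapper holds a reference to the original format but never 
asks it which files to read, which matches the behavior I'm seeing.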

A straightforward fix is to change "format instanceof FileInputFormat" to 
"format.getClass() == FileInputFormat.class", but that would limit the 
optimization to sequence files only. Does anyone have a better idea?
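
The difference between the two checks can be sketched in plain Java. The 
classes below (BaseFormat, NestedFormat) are hypothetical and do not model the 
real Hadoop class hierarchy. The sketch only shows that instanceof accepts 
every subclass, while an exact class comparison rejects all of them.

```java
public class TypeCheckSketch {

    // Hypothetical base format and a subclass that customizes file listing.
    static class BaseFormat {}
    static class NestedFormat extends BaseFormat {}

    // Current check: true for BaseFormat and for every subclass of it.
    static boolean wrapsWithInstanceof(Object format) {
        return format instanceof BaseFormat;
    }

    // Stricter check: true only when the runtime class is exactly BaseFormat.
    static boolean wrapsWithExactClass(Object format) {
        return format.getClass() == BaseFormat.class;
    }

    public static void main(String[] args) {
        Object nested = new NestedFormat();
        System.out.println(wrapsWithInstanceof(nested));  // prints true
        System.out.println(wrapsWithExactClass(nested));  // prints false
    }
}
```

The stricter check keeps a format with a custom listStatus() from being 
wrapped, but it also excludes every other subclass, including ones that would 
be safe to combine.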

> Pipelines should automatically use CombineFileInputFormat where input 
> consists of many small files
> --------------------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-165
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-165
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 0.4.0
>            Reporter: Dave Beech
>            Assignee: Josh Wills
>             Fix For: 0.8.0
>
>         Attachments: CRUNCH-165-jwills.patch, CRUNCH-165-v3.patch, 
> CRUNCH-165-v4.patch, CRUNCH-165.patch
>
>
> Hive had a feature introduced in HIVE-74 whereby CombineFileInputFormat would 
> be used if the input data consisted of many small files, making the resulting 
> mapreduce jobs more efficient by giving individual mappers more data to 
> process. This would be a nice feature for Crunch to have, too.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
