[jira] Commented: (HADOOP-1440) JobClient should not sort input-splits

Doug Cutting (JIRA) Mon, 11 Jun 2007 14:31:46 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12503635
 ]


Doug Cutting commented on HADOOP-1440:
--------------------------------------

bq. That does not really address the problem this Jira tries to address:

I think it does.  Whether the input files are splittable is up to the input 
format.  If reduce is disabled, then I proposed (above) that the order of the 
input splits should determine the numbering of output files.  So what's not 
addressed?

Changing the kernel to base the output file names directly on the input file 
names would break a number of abstraction boundaries.  But I don't see how this 
is required.  The list of input files and output files should correspond 
one-to-one.
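
To make the proposal concrete, here is a minimal sketch of positional 
numbering, assuming map task i handles the i-th split and writes part-i.  
OutputNumbering, partName and pairInputsWithOutputs are hypothetical names 
for illustration, not Hadoop APIs, and the five-digit padding is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Hadoop.
class OutputNumbering {
    // Name the output of map task i (assumed zero-padded to five digits).
    static String partName(int taskIndex) {
        return String.format("part-%05d", taskIndex);
    }

    // Pair each input, in the order the input format returned it,
    // with the output file of the map task at the same position.
    static List<String> pairInputsWithOutputs(List<String> inputsInSplitOrder) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < inputsInSplitOrder.size(); i++) {
            pairs.add(inputsInSplitOrder.get(i) + " -> " + partName(i));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("in/a", "in/b", "in/c");
        for (String pair : pairInputsWithOutputs(inputs)) {
            System.out.println(pair);
        }
    }
}
```

As long as nothing reorders the splits between the input format and task 
assignment, the output names never need to be derived from the input names.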

> JobClient should not sort input-splits
> --------------------------------------
>
>                 Key: HADOOP-1440
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1440
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.12.3
>         Environment: All
>            Reporter: Milind Bhandarkar
>            Assignee: Milind Bhandarkar
>             Fix For: 0.14.0
>
>
> Currently, the JobClient sorts the InputSplits returned by InputFormat in 
> descending order of size, so that the map tasks corresponding to larger 
> input-splits are scheduled for execution before smaller ones. However, this 
> causes problems in applications that use -reducer NONE to produce data-sets 
> partitioned the same way as the input.
> With -reducer NONE, map task i produces part-i. However, in the typical 
> applications that use -reducer NONE, it should produce a partition that has 
> the same index as the input partition.
> (Of course, this requires that each partition should be fed in its entirety 
> to a map, rather than splitting it into blocks, but that is a separate issue.)
> Thus, either the sorting of input splits should be controllable via a 
> configuration variable, or the FileInputFormat should sort the splits and 
> the JobClient should honor the order of splits.
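
The mismatch described in the issue can be sketched as follows.  This is an 
illustration under stated assumptions, not JobClient's actual code: Split, 
taskOrder and the sortBySize switch are hypothetical names standing in for 
the real split type and for the configuration variable the issue asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the JobClient implementation.
class SplitOrdering {
    // Stand-in for an input split: a partition name and its length in bytes.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns the order in which splits are assigned to map tasks 0, 1, 2, ...
    // sortBySize is a hypothetical switch: true mimics the largest-first
    // ordering described in the issue, false honors the incoming order.
    static List<Split> taskOrder(List<Split> splits, boolean sortBySize) {
        List<Split> ordered = new ArrayList<>(splits);
        if (sortBySize) {
            ordered.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Split> input = Arrays.asList(
                new Split("partition-0", 10L),
                new Split("partition-1", 30L),
                new Split("partition-2", 20L));
        for (boolean sort : new boolean[] {true, false}) {
            List<Split> order = taskOrder(input, sort);
            for (int i = 0; i < order.size(); i++) {
                System.out.println("sortBySize=" + sort + ": map task " + i
                        + " reads " + order.get(i).name);
            }
        }
    }
}
```

With sorting on, map task 0 reads partition-1 (the largest), so part-0 no 
longer corresponds to input partition 0; with sorting off, the indices line up.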

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
