[ http://issues.apache.org/jira/browse/HADOOP-38?page=all ]

Doug Cutting updated HADOOP-38:
-------------------------------

    Fix Version: 0.1.0

> default splitter should incorporate fs block size
> -------------------------------------------------
>
>          Key: HADOOP-38
>          URL: http://issues.apache.org/jira/browse/HADOOP-38
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Reporter: Doug Cutting
>      Fix For: 0.1.0

>
> By default, the file splitting code should operate as follows.
>   inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
>   output is <file,start,length>*
>   totalSize = sum of all file sizes;
>   desiredSplitSize = totalSize / numMapTasks;
>   if (desiredSplitSize > fsBlockSize)             /* new */
>     desiredSplitSize = fsBlockSize;
>   if (desiredSplitSize < minSplitSize)
>     desiredSplitSize = minSplitSize;
>   chop input files into desiredSplitSize chunks & return them
> In other words, the numMapTasks is a desired minimum.  We'll try to chop 
> input into at least numMapTasks chunks, each ideally a single fs block.
> If there's not enough input data to create numMapTasks tasks, each with an 
> entire block, then we'll permit tasks whose input is smaller than a 
> filesystem block, down to a minimum split size.
> This handles cases where:
>   - each input record takes a lot of time to process.  In this case we want 
> to make sure we use all of the cluster.  Thus it is important to permit 
> splits smaller than the fs block size.
>   - input i/o dominates.  In this case we want to permit the placement of 
> tasks on hosts where their data is local.  This is only possible if splits 
> are fs block size or smaller.
> Are there other common cases that this algorithm does not handle well?
> The part marked 'new' above is not currently implemented, but I'd like to add 
> it.
> Does this sound reasonble?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to