[ http://issues.apache.org/jira/browse/HADOOP-38?page=all ]
Doug Cutting updated HADOOP-38:
-------------------------------
Fix Version: 0.1.0
> default splitter should incorporate fs block size
> -------------------------------------------------
>
> Key: HADOOP-38
> URL: http://issues.apache.org/jira/browse/HADOOP-38
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Reporter: Doug Cutting
> Fix For: 0.1.0
>
> By default, the file splitting code should operate as follows.
>   inputs are <file>*, numMapTasks, minSplitSize, fsBlockSize
>   output is <file,start,length>*
>   totalSize = sum of all file sizes;
>   desiredSplitSize = totalSize / numMapTasks;
>   if (desiredSplitSize > fsBlockSize) /* new */
>     desiredSplitSize = fsBlockSize;
>   if (desiredSplitSize < minSplitSize)
>     desiredSplitSize = minSplitSize;
>   chop input files into desiredSplitSize chunks & return them
> In other words, numMapTasks is a desired minimum. We'll try to chop
> input into at least numMapTasks chunks, each ideally a single fs block.
> If there's not enough input data to create numMapTasks tasks, each with an
> entire block, then we'll permit tasks whose input is smaller than a
> filesystem block, down to a minimum split size.
> This handles cases where:
>  - each input record takes a lot of time to process. In this case we want
>    to make sure we use all of the cluster. Thus it is important to permit
>    splits smaller than the fs block size.
>  - input i/o dominates. In this case we want to permit the placement of
>    tasks on hosts where their data is local. This is only possible if splits
>    are fs block size or smaller.
> Are there other common cases that this algorithm does not handle well?
> The part marked 'new' above is not currently implemented, but I'd like to add
> it.
> Does this sound reasonable?
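
For illustration, the splitting rule proposed in the description could be sketched in Java roughly as follows. This is only a sketch under stated assumptions, not the mapred implementation: the class name SplitSketch, the Split holder, and the method names are hypothetical, file sizes are passed in as a plain array rather than read from a filesystem, and the last chunk of each file is simply allowed to be shorter than desiredSplitSize (even shorter than minSplitSize).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed default splitter; not Hadoop code.
class SplitSketch {

    /** One output element: the <file,start,length> triple from the description. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + file + "," + start + "," + length + ">";
        }
    }

    /** totalSize / numMapTasks, capped at fsBlockSize, then raised to minSplitSize. */
    static long desiredSplitSize(long totalSize, int numMapTasks,
                                 long minSplitSize, long fsBlockSize) {
        // Assumption: treat a non-positive numMapTasks as 1 to avoid dividing by zero.
        long desired = totalSize / Math.max(1, numMapTasks);
        if (desired > fsBlockSize) {      // the step marked "new" in the description
            desired = fsBlockSize;
        }
        if (desired < minSplitSize) {     // applied last, so the minimum wins over the cap
            desired = minSplitSize;
        }
        // Assumption: never return 0, so the chopping loop below always advances.
        return Math.max(1, desired);
    }

    /** Chops each file independently into chunks of at most desiredSplitSize bytes. */
    static List<Split> split(String[] files, long[] sizes, int numMapTasks,
                             long minSplitSize, long fsBlockSize) {
        long totalSize = 0;
        for (long size : sizes) {
            totalSize += size;
        }
        long splitSize = desiredSplitSize(totalSize, numMapTasks, minSplitSize, fsBlockSize);

        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < files.length; i++) {
            for (long start = 0; start < sizes[i]; start += splitSize) {
                long length = Math.min(splitSize, sizes[i] - start);
                splits.add(new Split(files[i], start, length));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up sizes: two files, 2 requested map tasks, 100-byte "blocks".
        String[] files = {"a", "b"};
        long[] sizes = {250, 100};
        for (Split s : split(files, sizes, 2, 10, 100)) {
            System.out.println(s);
        }
    }
}
```

With these made-up numbers, totalSize / numMapTasks is 175, which the new check caps at the 100-byte block size, so more than numMapTasks splits come back. That is the i/o-dominated case. When the input is small relative to numMapTasks, the quotient falls below the block size and is only raised if it drops under minSplitSize, which is the case where each record is expensive to process.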
--
This message is automatically generated by JIRA.