[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563424#comment-13563424
 ] 

Jerry Chen commented on MAPREDUCE-4882:
---------------------------------------

[~gelesh]
When multiple local dirs are configured, the map task chooses the spill file 
dir on local disk according to the estimated size. A wrong estimate can lead 
to a wrong decision, such as choosing a dir with less free space based on the 
given (wrong) size while the actual spill is larger. That causes a disk-full 
error even though another dir may have enough space available.
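
To make the dependency concrete, here is a minimal sketch of a size-based 
choice. It is a simplified first-fit model, not Hadoop's actual allocation 
logic; the class name, the dir names and the free-space figures are all made 
up for illustration. The two sizes passed in are what the proposed and the 
original formulas give for the buffer values in the log quoted below.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SpillDirChoice {
    /** Hypothetical dirs with hypothetical free space, in bytes. */
    static Map<String, Long> sampleDirs() {
        Map<String, Long> free = new LinkedHashMap<>();
        free.put("/disk1/local", 100_000_000L);
        free.put("/disk2/local", 500_000_000L);
        return free;
    }

    /** Returns the first dir whose free space covers the estimate, or null. */
    static String firstFit(Map<String, Long> freeBytesByDir, long estimatedSize) {
        for (Map.Entry<String, Long> e : freeBytesByDir.entrySet()) {
            if (e.getValue() >= estimatedSize) {
                return e.getKey();
            }
        }
        return null; // no dir is large enough for the estimate
    }

    public static void main(String[] args) {
        Map<String, Long> free = sampleDirs();
        // Estimate from the proposed formula for the quoted log values.
        System.out.println(firstFit(free, 52_428_700L));
        // Estimate from the original formula for the same values.
        System.out.println(firstFit(free, 346_030_180L));
    }
}
```

In this model the same set of dirs yields a different choice depending only 
on which estimate is passed in.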

                
> Error in estimating the length of the output file in Spill Phase
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-4882
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4882
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 1.0.3
>         Environment: Any Environment
>            Reporter: Lijie Xu
>              Labels: patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The sortAndSpill() method in MapTask.java has an error in estimating the 
> length of the output file. 
> The "long size" should be "(bufvoid - bufstart) + bufend" not "(bufvoid - 
> bufend) + bufstart" when "bufend < bufstart".
> Here is the original code in MapTask.java.
>  private void sortAndSpill() throws IOException, ClassNotFoundException,
>                                        InterruptedException {
>       //approximate the length of the output file to be the length of the
>       //buffer + header lengths for the partitions
>       long size = (bufend >= bufstart
>           ? bufend - bufstart
>           : (bufvoid - bufend) + bufstart) +
>                   partitions * APPROX_HEADER_LENGTH;
>       FSDataOutputStream out = null;
> ------------------------------------------------------------------------------
> I ran a test with "TeraSort". A snippet from the mapper's log is as follows:
> MapTask: Spilling map output: record full = true
> MapTask: bufstart = 157286200; bufend = 10485460; bufvoid = 199229440
> MapTask: kvstart = 262142; kvend = 131069; length = 655360
> MapTask: Finished spill 3
> In this case, the spilled bytes should be (199229440 - 157286200) + 10485460 
> = 52428700 (about 52 MB), because the number of spilled records is 524287 and 
> each record takes 100 bytes.
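
The arithmetic above can be checked with a small standalone sketch. It only 
covers the buffer part of the estimate (the per-partition header term is left 
out), and the class and method names are illustrative, not taken from 
MapTask.java.

```java
public class SpillSizeEstimate {
    /** Bytes held in a circular buffer of length bufvoid, as proposed in the issue. */
    static long bufferedBytes(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufstart) + bufend; // wrapped: tail of buffer plus head
    }

    /** The expression as it appears in the quoted code, kept for comparison. */
    static long originalExpression(long bufstart, long bufend, long bufvoid) {
        return bufend >= bufstart
            ? bufend - bufstart
            : (bufvoid - bufend) + bufstart;
    }

    public static void main(String[] args) {
        // Values from the mapper log quoted in the issue.
        long bufstart = 157286200L;
        long bufend = 10485460L;
        long bufvoid = 199229440L;

        System.out.println(bufferedBytes(bufstart, bufend, bufvoid));      // 52428700
        System.out.println(originalExpression(bufstart, bufend, bufvoid)); // 346030180
    }
}
```

For these log values the proposed formula matches 524287 records of 100 bytes 
each, while the original expression gives 346030180.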

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
