[ 
http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12443037 ] 
            
Devaraj Das commented on HADOOP-331:
------------------------------------

A modified proposal (after Doug's comments)
   -  Keep an array (length = # of partitions) whose elements will be lists 
containing elements of the form <spill#, start-key-val-offset, 
end-key-val-offset>*. This is obtained after the sort and before the spill of 
the buffer. The spill# helps in locating the section of the sorted partition 
data within a spill.
  - Merge goes over this array element by element and for each element, it 
looks at the contained list elements and creates a file appending the sections 
one by one. Then does mergesort with just one file as input with the output set 
to the final output file (opened in append mode).  An index file is updated to 
contain <part#, start-offset, compressedlength>
  - At the end of the above, we have the final output file and the index file.
The negative with this approach is that it requires trips to disk for 
reading/writing the sections before merge. So if we had 10 sections for a 
particular partition, we would have to read 10 chunks from the sorted spills 
and write them to disk as a single file. The plus is that the key format 
doesn't need to change at all (so minimum code change) and the number of 
comparisons during merge are reduced because we merge chunks of a single 
partition at a time.

We can have the interator over keys/values as suggested by Owen and the block 
compression optimization suggested by Doug. Although the iterator won't be 
required for the map outputs if we adopt the approach outlined above, but will 
be required for the reduces, right Owen? 

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>                 Key: HADOOP-331
>                 URL: http://issues.apache.org/jira/browse/HADOOP-331
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.3.2
>            Reporter: eric baldeschwieler
>         Assigned To: Devaraj Das
>
> The current strategy of writing a file per target map is consuming a lot of 
> unused buffer space (causing out of memory crashes) and puts a lot of burden 
> on the FS (many opens, inodes used, etc).  
> I propose that we write a single file containing all output and also write an 
> index file IDing which byte range in the file goes to each reduce.  This will 
> remove the issue of buffer waste, address scaling issues with number of open 
> files and generally set us up better for scaling.  It will also have 
> advantages with very small inputs, since the buffer cache will reduce the 
> number of seeks needed and the data serving node can open a single file and 
> just keep it open rather than needing to do directory and open ops on every 
> request.
> The only issue I see is that in cases where the task output is substantiallyu 
> larger than its input, we may need to spill multiple times.  In this case, we 
> can do a merge after all spills are complete (or during the final spill).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to