[jira] Commented: (HADOOP-331) map outputs should be written to a single output file with an index

paul sutter (JIRA) Thu, 29 Jun 2006 10:35:03 -0700

    [ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12418501 ]


paul sutter commented on HADOOP-331:
------------------------------------


One more suggestion (plea): could the map-output spill extraction take place on 
the mapper side (perhaps in the HTTP server), so that the reducer can still 
retrieve fully formed map outputs, as it does currently?

Incidentally, we have individual map inputs that are 5-10GB in size and are 
compressed, so they can't be split, which could yield 50-100 spills per 
mapper. It would be nice if the reducer (or whatever process is fetching the 
files) didn't have to do that reassembly, and it happened on the mapper side 
instead.
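
A rough sketch of what mapper-side extraction could look like, under some
loud assumptions: each spill is held in memory as a byte array with a
per-reduce (offset, length) index, and "reassembly" is plain concatenation in
spill order. The class and method names (SpillExtractor, Spill, extract) are
hypothetical and are not Hadoop APIs; a real version would read spill files
from disk and might need a sorted merge rather than concatenation.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

/** Hypothetical sketch: assemble one reduce's output from many spills. */
public class SpillExtractor {

    /** One spill: raw bytes plus, per reduce, the start offset and length. */
    public static final class Spill {
        final byte[] data;
        final int[] offsets;
        final int[] lengths;

        public Spill(byte[] data, int[] offsets, int[] lengths) {
            this.data = data;
            this.offsets = offsets;
            this.lengths = lengths;
        }
    }

    /**
     * Concatenates the byte range belonging to the given reduce from every
     * spill, in spill order, so the fetcher sees a single contiguous output.
     */
    public static byte[] extract(List<Spill> spills, int reduce) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Spill s : spills) {
            out.write(s.data, s.offsets[reduce], s.lengths[reduce]);
        }
        return out.toByteArray();
    }
}
```

With something like this running in the process that serves map outputs, the
reducer would issue one request per mapper and never see the 50-100 spills.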

> map outputs should be written to a single output file with an index
> -------------------------------------------------------------------
>
>          Key: HADOOP-331
>          URL: http://issues.apache.org/jira/browse/HADOOP-331
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.3.2
>     Reporter: eric baldeschwieler
>     Assignee: Yoram Arnon
>      Fix For: 0.5.0

>
> The current strategy of writing a file per target map is consuming a lot of 
> unused buffer space (causing out of memory crashes) and puts a lot of burden 
> on the FS (many opens, inodes used, etc).  
> I propose that we write a single file containing all output and also write an 
> index file IDing which byte range in the file goes to each reduce.  This will 
> remove the issue of buffer waste, address scaling issues with number of open 
> files and generally set us up better for scaling.  It will also have 
> advantages with very small inputs, since the buffer cache will reduce the 
> number of seeks needed and the data serving node can open a single file and 
> just keep it open rather than needing to do directory and open ops on every 
> request.
> The only issue I see is that in cases where the task output is substantially 
> larger than its input, we may need to spill multiple times.  In this case, we 
> can do a merge after all spills are complete (or during the final spill).
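
To make the proposal quoted above concrete, here is a minimal sketch of a
single output with a byte-range index, assuming everything fits in memory.
The names (IndexedMapOutput, write, read) are hypothetical and do not come
from the Hadoop code base; a real implementation would write the data and
the index to two files on local disk.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch: one data blob plus an index of per-reduce ranges. */
public class IndexedMapOutput {
    final byte[] data;    // all reduce partitions, back to back
    final int[] offsets;  // offsets[r] = first byte of reduce r in data
    final int[] lengths;  // lengths[r] = number of bytes for reduce r

    private IndexedMapOutput(byte[] data, int[] offsets, int[] lengths) {
        this.data = data;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    /** Writes each reduce's bytes in order and records where they landed. */
    public static IndexedMapOutput write(byte[][] perReduce) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] offsets = new int[perReduce.length];
        int[] lengths = new int[perReduce.length];
        for (int r = 0; r < perReduce.length; r++) {
            offsets[r] = out.size();
            lengths[r] = perReduce[r].length;
            out.write(perReduce[r], 0, perReduce[r].length);
        }
        return new IndexedMapOutput(out.toByteArray(), offsets, lengths);
    }

    /** Returns only the byte range that belongs to the given reduce. */
    public byte[] read(int reduce) {
        int start = offsets[reduce];
        return Arrays.copyOfRange(data, start, start + lengths[reduce]);
    }
}
```

Serving a reduce then needs one open file and one index lookup, instead of a
directory lookup and an open per request.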

