Re: [jira] Commented: (HADOOP-331) map outputs should be written to a single output file with an index

Eric Baldeschwieler Fri, 30 Jun 2006 14:14:10 -0700

ok.

Merging on the mapper would certainly make sense in that scenario.


On Jun 29, 2006, at 10:32 AM, paul sutter (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12418501 ]
paul sutter commented on HADOOP-331:
------------------------------------
one more suggestion (plea): could the map-output-spill-extraction take place on the mapper side (perhaps in the http server), so that the reducer can retrieve fully-formed map outputs as it does currently?
incidentally, we have individual map inputs that are 5-10GB in size, which are compressed so they can't be split, which would yield potentially 50-100 spills per mapper. it would be nice if the reducer (or whatever process is fetching the files) didn't have to do that reassembly, but rather that it happened on the mapper side.
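
To make the request concrete, here is a minimal sketch of mapper-side spill extraction. It assumes each spill carries its own per-reduce index of offsets and lengths. The names `SpillExtractor`, `Spill`, and `extract` are hypothetical and are not taken from the Hadoop code base. The sketch simply concatenates segments in spill order; if the records in each spill were sorted, a real implementation would need a merge rather than a concatenation.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

/** Hypothetical sketch: assemble one reduce's output from several spills on the mapper side. */
class SpillExtractor {
    /** One spill: its bytes plus, per reduce, the {offset, length} of that reduce's segment. */
    static final class Spill {
        final byte[] data;   // stands in for the spill file contents
        final int[][] index; // index[reduce] = {offset, length}

        Spill(byte[] data, int[][] index) {
            this.data = data;
            this.index = index;
        }
    }

    /** Concatenates the segments for one reduce, in spill order, into a single output. */
    static byte[] extract(List<Spill> spills, int reduce) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Spill s : spills) {
            int offset = s.index[reduce][0];
            int length = s.index[reduce][1];
            out.write(s.data, offset, length);
        }
        return out.toByteArray();
    }
}
```

With something along these lines running on the serving side, the fetching process would see one fully-formed output per reduce regardless of how many spills the mapper produced.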
map outputs should be written to a single output file with an index
-------------------------------------------------------------------

         Key: HADOOP-331
         URL: http://issues.apache.org/jira/browse/HADOOP-331
     Project: Hadoop
        Type: Improvement
  Components: mapred
    Versions: 0.3.2
    Reporter: eric baldeschwieler
    Assignee: Yoram Arnon
     Fix For: 0.5.0
The current strategy of writing a file per target map is consuming a lot of unused buffer space (causing out of memory crashes) and puts a lot of burden on the FS (many opens, inodes used, etc).

I propose that we write a single file containing all output and also write an index file IDing which byte range in the file goes to each reduce. This will remove the issue of buffer waste, address scaling issues with number of open files and generally set us up better for scaling. It will also have advantages with very small inputs, since the buffer cache will reduce the number of seeks needed and the data serving node can open a single file and just keep it open rather than needing to do directory and open ops on every request.

The only issue I see is that in cases where the task output is substantially larger than its input, we may need to spill multiple times. In this case, we can do a merge after all spills are complete (or during the final spill).
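
The single-file-plus-index layout proposed above can be sketched as follows. This is an illustration under stated assumptions, not the eventual Hadoop implementation: the names `IndexedMapOutput`, `Range`, `write`, and `read` are hypothetical, and in-memory byte arrays stand in for the output file and the index file.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one data blob plus an index of per-reduce byte ranges. */
class IndexedMapOutput {
    /** Byte range [offset, offset + length) belonging to one reduce. */
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    final byte[] data;       // stands in for the single output file
    final List<Range> index; // stands in for the index file, one entry per reduce

    IndexedMapOutput(byte[] data, List<Range> index) {
        this.data = data;
        this.index = index;
    }

    /** Writes every partition back to back and records where each one starts and how long it is. */
    static IndexedMapOutput write(List<byte[]> partitions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<Range> index = new ArrayList<>();
        for (byte[] p : partitions) {
            index.add(new Range(out.size(), p.length));
            out.write(p, 0, p.length);
        }
        return new IndexedMapOutput(out.toByteArray(), index);
    }

    /** Returns the bytes destined for the given reduce, located through the index alone. */
    byte[] read(int reduce) {
        Range r = index.get(reduce);
        return Arrays.copyOfRange(data, (int) r.offset, (int) (r.offset + r.length));
    }
}
```

Serving a reduce then needs only one open file and one index lookup, which is the property the proposal relies on to cut down on opens and inodes.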
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
