ok.
Merging on the mapper would certainly make sense in that scenario.
On Jun 29, 2006, at 10:32 AM, paul sutter (JIRA) wrote:
[ http://issues.apache.org/jira/browse/HADOOP-331?page=comments#action_12418501 ]
paul sutter commented on HADOOP-331:
------------------------------------
one more suggestion (plea): could the map-output-spill-extraction
take place on the mapper side (perhaps in the http server), such
that the reducer is able to retrieve fully-formed map outputs as
it does currently?
incidentally, we have individual map inputs that are 5-10GB in size,
which are compressed so they can't be split, which would yield
potentially 50-100 spills per mapper. it would be nice if the
reducer (or whatever process is fetching the files) didn't have to
do that reassembly, but rather that it happened on the mapper side.
map outputs should be written to a single output file with an index
-------------------------------------------------------------------
Key: HADOOP-331
URL: http://issues.apache.org/jira/browse/HADOOP-331
Project: Hadoop
Type: Improvement
Components: mapred
Versions: 0.3.2
Reporter: eric baldeschwieler
Assignee: Yoram Arnon
Fix For: 0.5.0
The current strategy of writing a file per target map is consuming
a lot of unused buffer space (causing out of memory crashes) and
puts a lot of burden on the FS (many opens, inodes used, etc).
I propose that we write a single file containing all output and
also write an index file IDing which byte range in the file goes
to each reduce. This will remove the issue of buffer waste,
address scaling issues with number of open files and generally set
us up better for scaling. It will also have advantages with very
small inputs, since the buffer cache will reduce the number of
seeks needed and the data serving node can open a single file and
just keep it open rather than needing to do directory and open ops
on every request.
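
To make the single-file-plus-index idea concrete, here is a rough sketch in Python. It is not the Hadoop implementation: the function names, the length-prefixed record framing, and the fixed 16-byte index entry (offset, length) are all assumptions chosen for illustration.

```python
import struct

INDEX_ENTRY = struct.Struct(">QQ")   # assumed layout: (start offset, byte length)
RECORD_LEN = struct.Struct(">I")     # assumed framing: 4-byte length prefix

def write_map_output(partitions, data, index):
    """Write all partitions to one data stream and one index stream.

    partitions: list of lists of bytes records, one list per reduce.
    data, index: binary file-like objects opened for writing.
    """
    for records in partitions:
        start = data.tell()
        for rec in records:
            data.write(RECORD_LEN.pack(len(rec)))
            data.write(rec)
        # One fixed-size entry per reduce, so reduce r lives at r * 16.
        index.write(INDEX_ENTRY.pack(start, data.tell() - start))

def read_partition(data, index, reduce_id):
    """Return the records for one reduce using only its byte range."""
    index.seek(reduce_id * INDEX_ENTRY.size)
    start, length = INDEX_ENTRY.unpack(index.read(INDEX_ENTRY.size))
    data.seek(start)
    buf = data.read(length)
    records, pos = [], 0
    while pos < len(buf):
        (n,) = RECORD_LEN.unpack_from(buf, pos)
        pos += RECORD_LEN.size
        records.append(buf[pos:pos + n])
        pos += n
    return records
```

With this layout the serving side needs two open files per map task regardless of the number of reduces, and a request for reduce r is one index lookup plus one contiguous read.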
The only issue I see is that in cases where the task output is
substantially larger than its input, we may need to spill
multiple times. In this case, we can do a merge after all spills
are complete (or during the final spill).
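
The merge step could look roughly like the sketch below. It assumes (the description above does not say so) that each spill holds one run per reduce and that records within a run are already sorted by key; `merge_spills` is a hypothetical name, not an existing Hadoop function.

```python
import heapq

def merge_spills(spills, num_reduces):
    """Merge several spills into one output, partition by partition.

    spills: list of spills; each spill is a list with one run per reduce,
            and each run is a list of (key, value) pairs sorted by key
            (the sorted-run layout is an assumption of this sketch).
    Returns one merged, key-sorted list of pairs per reduce.
    """
    merged = []
    for r in range(num_reduces):
        runs = [spill[r] for spill in spills]
        # k-way merge of the sorted runs belonging to reduce r
        merged.append(list(heapq.merge(*runs, key=lambda kv: kv[0])))
    return merged
```

If the runs are not sorted, the same loop with plain concatenation per reduce would still yield one contiguous byte range per reduce, which is all the index needs.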