[ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378992 ]

paul sutter commented on HADOOP-195:
------------------------------------


Owen,

A couple of small ideas:

- Could you put the temp files on a RAM disk? This would just be a hack to 
determine whether the disk physics of small files are a factor here, both on 
the mapper end and the reducer end. Note that you need 2X the space on the 
reducer end, because it keeps two copies of the data around in small-file form. 

- If small files are shown to be a problem (as I am guessing), and (as Doug 
suggests) we want to optimize for that case, perhaps the best thing to do would 
be to send the map output data directly to the reducers, and have the reducers 
write it to disk in some log-structured format, maintaining a list of 
segments that were abandoned mid-stream and are to be ignored in the processing 
step. This way you'd have all sequential disk access.
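
To make the second idea concrete, here is a rough sketch of what the
reducer-side log could look like. It is not Hadoop code: SegmentLog, Segment,
and every method name are made up for illustration, and it assumes one
incoming transfer at a time (interleaved transfers would need per-chunk
headers in the log). Bytes are only ever appended, and an abandoned segment
stays in the log but is skipped at processing time.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hadoop code: an append-only log of map output
// segments kept by a reducer. All names here are invented for illustration.
class SegmentLog {
    // One contiguous run of bytes received from a single map task.
    static final class Segment {
        final String mapId;
        final long offset;
        long length;
        boolean committed; // set only once the whole transfer has arrived

        Segment(String mapId, long offset) {
            this.mapId = mapId;
            this.offset = offset;
        }
    }

    private final OutputStream out; // e.g. a FileOutputStream on the log file
    private final List<Segment> segments = new ArrayList<>();
    private long position;          // bytes written to the log so far
    private Segment open;           // transfer currently being received, if any

    SegmentLog(OutputStream out) {
        this.out = out;
    }

    // Start receiving the output of one map task at the tail of the log.
    void begin(String mapId) {
        if (open != null) {
            throw new IllegalStateException("a segment is already open");
        }
        open = new Segment(mapId, position);
        segments.add(open);
    }

    // Append received bytes; the log is never seeked or rewritten.
    void append(byte[] data) {
        requireOpen();
        try {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        position += data.length;
        open.length += data.length;
    }

    // The transfer finished: the segment becomes visible to processing.
    void commit() {
        requireOpen();
        open.committed = true;
        open = null;
    }

    // The transfer died mid-stream: leave the bytes in place, never commit.
    void abandon() {
        requireOpen();
        open = null;
    }

    // Segments the processing step should read, in log order.
    List<Segment> liveSegments() {
        List<Segment> live = new ArrayList<>();
        for (Segment s : segments) {
            if (s.committed) {
                live.add(s);
            }
        }
        return live;
    }

    private void requireOpen() {
        if (open == null) {
            throw new IllegalStateException("no open segment");
        }
    }
}
```

A retried map output simply gets a new segment further down the log, so a
failed transfer costs some wasted space but no seeks and no small files.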

Thanks. Sorry for the volume of responses here, but it's an area of great 
interest to us.

Paul

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3
>  Attachments: netstat.log, netstat.xls
>
> The map output data should be transferred via http instead of 
> rpc, because rpc is very slow for this application and the timeout behavior 
> is suboptimal. (The server sends data and the client ignores it because it 
> took more than 10 seconds to be received.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
