[ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378341 ]

Dominik Friedrich commented on HADOOP-195:
------------------------------------------

Paul,

The file IO tests were really pretty simple. To be honest, I don't remember my 
actual setup anymore, only that I saw little difference between normal stream 
IO and NIO. The data was nowhere near gigabytes, though.

I used default endianness, direct buffers and buffered IO. From what I had 
read before, memory mapping gives almost no performance gain over buffered IO 
for streaming IO. It makes sense if you have random access within a limited 
region of a file. In both cases the IO is buffered anyway because of the OS's 
file system buffer. From what I know about OSs, there is no copy to kernel 
space, and the file system buffer of a modern OS is hard to beat 
performance-wise. Unbuffered IO would actually decrease performance because 
the file system could not change the write order or do other tricks to reduce 
seeks. In general I don't think there is much room for file IO performance 
improvements in Hadoop, except by using e.g. APR through JNI.
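
For anyone who wants to repeat this kind of comparison, here is a minimal 
sketch. It is not my original test (I no longer have that setup); the class 
name ReadComparison, the method names and the 64 KB buffer size are all 
made up for illustration, and it assumes the java.nio.file API from later 
Java versions. Each method reads the whole file sequentially and sums the 
bytes, so the three paths can be checked against each other and timed.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class: three ways to stream a file sequentially.
public class ReadComparison {
    // Assumed buffer size; not taken from the original test.
    static final int BUF_SIZE = 64 * 1024;

    // Classic buffered stream IO.
    static long sumWithStream(Path file) throws IOException {
        long sum = 0;
        byte[] buf = new byte[BUF_SIZE];
        try (InputStream in =
                 new BufferedInputStream(Files.newInputStream(file), BUF_SIZE)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) sum += buf[i] & 0xff;
            }
        }
        return sum;
    }

    // NIO FileChannel with a direct buffer.
    static long sumWithChannel(Path file) throws IOException {
        long sum = 0;
        ByteBuffer buf = ByteBuffer.allocateDirect(BUF_SIZE);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip();                       // switch from filling to draining
                while (buf.hasRemaining()) sum += buf.get() & 0xff;
                buf.clear();                      // make the buffer writable again
            }
        }
        return sum;
    }

    // Memory-mapped read. A single mapping is limited to 2 GB, so this
    // sketch only handles files below that size.
    static long sumWithMmap(Path file) throws IOException {
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            while (map.hasRemaining()) sum += map.get() & 0xff;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        long t0 = System.nanoTime();
        long a = sumWithStream(file);
        long t1 = System.nanoTime();
        long b = sumWithChannel(file);
        long t2 = System.nanoTime();
        long c = sumWithMmap(file);
        long t3 = System.nanoTime();

        System.out.println("stream:  " + (t1 - t0) / 1_000_000 + " ms, sum=" + a);
        System.out.println("channel: " + (t2 - t1) / 1_000_000 + " ms, sum=" + b);
        System.out.println("mmap:    " + (t3 - t2) / 1_000_000 + " ms, sum=" + c);
    }
}
```

Keep in mind that whichever method runs first also warms the OS file system 
buffer for the others, so run it several times, in different orders, on a 
file larger than RAM before drawing any conclusions.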

To improve the sorting performance I'd start by looking at the algorithm 
itself, because there seem to be better algorithms out there. This huge 
difference in performance cannot be caused by a suboptimal implementation. 
The bottlenecks are file and network IO, so the goal is to reduce those. I 
haven't used Nutch/Hadoop for some time now, so I'm not up to date with the 
current code. 

This is a really interesting problem; it could be a nice project for Google's 
Summer of Code.

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3

>
> The data transfer of the map output should be transferred via http instead 
> of rpc, because rpc is very slow for this application and the timeout behavior 
> is suboptimal. (server sends data and client ignores it because it took more 
> than 10 seconds to be received.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
