[ http://issues.apache.org/jira/browse/HADOOP-195?page=comments#action_12378341 ]
Dominik Friedrich commented on HADOOP-195:
------------------------------------------

Paul,

the file IO tests were really pretty simple. To be honest, I don't remember my actual setup anymore, only that I saw little difference between normal stream IO and NIO, although the data was far from gigabytes in size. I used default endianness, direct buffers and buffered IO.

From what I read beforehand, memory mapping gives almost no performance gain over buffered IO for streaming IO. It makes a difference if you have random access within a limited region of a file. In both cases you have buffered IO anyway because of the OS's file system buffer. From what I know about operating systems, there is no copy to kernel space, and the file system buffer of a modern OS is hard to beat performance-wise. Unbuffered IO would actually decrease performance, because the file system could no longer change the write order and do other tricks to reduce seeks.

In general I don't think there is much room for file IO performance improvements in Hadoop, except by using e.g. APR through JNI. To improve the sorting performance I'd start by looking at the algorithm itself, because there seem to be better algorithms out there. This huge difference in performance cannot be caused by a suboptimal implementation. The bottlenecks are file and network IO, so the goal is to reduce those.

I haven't used Nutch/Hadoop for some time now, so I'm not up to date with the current code. This is a really interesting problem and could be a nice project for Google's Summer of Code.

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3
>
> The data transfer of the map output should be transfered via http instead
> rpc, because rpc is very slow for this application and the timeout behavior
> is suboptimal. (server sends data and client ignores it because it took more
> than 10 seconds to be received.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
