[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755823#action_12755823 ]
Jay Booth commented on HDFS-516:
--------------------------------

Yeah, I was puzzled by the performance too. I dug through the DFS code and I'm saving a bit on new socket and object creation, maybe a couple instructions here and there, but that shouldn't add up to 100 seconds for a gigabyte (approx 20 blocks). I'm calling read() a bajillion times in a row so it's conceivable (although unlikely) that I'm pegging the CPU and that's the limiting factor.

I'm busy for a couple days but will get back to you with some figures from netstat, top and whatever else I can think of, along with another streaming case that works with read(b, off, len) to see if that changes things. I'll do a little more digging into DFS as well to see if I can isolate the cause.

I definitely did run them several times on the same machine and another time on a different cluster with similar results, so it wasn't simply bad luck on the rack placement on EC2 (well maybe but unlikely).

Will report back when I have more numbers. After I get those, my roadmap for this is to add checksum support and better DatanodeInfo caching. User groups would come after that.

> Low Latency distributed reads
> -----------------------------
>
> Key: HDFS-516
> URL: https://issues.apache.org/jira/browse/HDFS-516
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Jay Booth
> Priority: Minor
> Attachments: hdfs-516-20090912.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side
> and simulated OS paging with LRU caching and lookahead on the client side.
> Some applications could include lucene searching (term->doc and doc->offset
> mappings are likely to be in local cache, thus much faster than nutch's
> current FsDirectory impl and binary search through record files (bytes at
> 1/2, 1/4, 1/8 marks are likely to be cached)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
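
For readers following the issue description, the client-side idea it mentions (simulated OS paging with LRU caching and lookahead) might look roughly like the sketch below. This is only an illustration and is not taken from hdfs-516-20090912.patch: the names PageCache and PageLoader, the fixed page size, the synchronous prefetch, and the assumption that the loader always returns a full page are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical callback that fetches one full page from the remote store. */
interface PageLoader {
    byte[] load(long pageIndex);
}

/** Sketch of a client-side page cache: LRU eviction plus a fixed lookahead. */
class PageCache {
    private final int pageSize;
    private final int lookahead;
    private final PageLoader loader;
    private final Map<Long, byte[]> pages;

    PageCache(int pageSize, final int maxPages, int lookahead, PageLoader loader) {
        this.pageSize = pageSize;
        this.lookahead = lookahead;
        this.loader = loader;
        // accessOrder=true keeps entries ordered least recently used first,
        // so removeEldestEntry evicts the LRU page once the cache is full.
        this.pages = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxPages;
            }
        };
    }

    /** Returns the unsigned byte at an absolute position, loading pages on a miss. */
    int read(long position) {
        long index = position / pageSize;
        byte[] page = pages.get(index);
        if (page == null) {
            // Miss: fetch the requested page, then prefetch the pages after it.
            page = loader.load(index);
            pages.put(index, page);
            for (long next = index + 1; next <= index + lookahead; next++) {
                // containsKey does not count as an access, so LRU order is unchanged.
                if (!pages.containsKey(next)) {
                    pages.put(next, loader.load(next));
                }
            }
        }
        return page[(int) (position % pageSize)] & 0xff;
    }
}
```

With a cache like this, a binary search over a record file keeps hitting the same few pages around the 1/2, 1/4 and 1/8 marks, which is why those bytes are likely to stay cached. A real client would also need to handle short pages at end of file and would probably prefetch asynchronously.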