[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755823#action_12755823
 ] 

Jay Booth commented on HDFS-516:
--------------------------------

Yeah, I was puzzled by the performance too.  I dug through the DFS code and I'm 
saving a bit on new socket and object creation, maybe a couple instructions 
here and there, but that shouldn't add up to 100 seconds for a gigabyte (approx 
20 blocks).  I'm calling read() a bajillion times in a row so it's conceivable 
(although unlikely) that I'm pegging the CPU and that's the limiting factor.  

I'm busy for a couple days but will get back to you with some figures from 
netstat, top and whatever else I can think of, along with another streaming 
case that works with read(b, off, len) to see if that changes things.  I'll do 
a little more digging into DFS as well to see if I can isolate the cause.  I 
definitely did run them several times on the same machine and another time on a 
different cluster with similar results, so it wasn't simply bad luck on the 
rack placement on EC2 (well maybe but unlikely).
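
For reference, the two access patterns being compared look roughly like the sketch below. It is a standalone illustration, not the HDFS-516 client: it uses a ByteArrayInputStream as a stand-in for the DFS stream, and the class and method names (ReadComparison, drainSingleByte, drainBulk) are made up for this example. Timings from an in-memory stream say nothing about the real numbers; the point is only the shape of the two loops.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadComparison {
    // Drain the stream one byte per call, as in the read() streaming case.
    public static long drainSingleByte(InputStream in) throws IOException {
        long total = 0;
        while (in.read() != -1) {
            total++;
        }
        return total;
    }

    // Drain the stream through a reusable buffer with read(b, off, len).
    // bufferSize must be positive, otherwise read() returns 0 forever.
    public static long drainBulk(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in payload held in memory; not real DFS data.
        byte[] data = new byte[16 * 1024 * 1024];

        long t0 = System.nanoTime();
        long single = drainSingleByte(new ByteArrayInputStream(data));
        long t1 = System.nanoTime();
        long bulk = drainBulk(new ByteArrayInputStream(data), 64 * 1024);
        long t2 = System.nanoTime();

        System.out.println("read():            " + single + " bytes, "
                + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("read(b, off, len): " + bulk + " bytes, "
                + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

If the per-call overhead of read() is what pegs the CPU, the gap between the two loops should widen when the same methods are pointed at the actual DFS input streams.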

Will report back when I have more numbers.  After I get those, my roadmap for 
this is to add checksum support and better DatanodeInfo caching.  User groups 
would come after that.

> Low Latency distributed reads
> -----------------------------
>
>                 Key: HDFS-516
>                 URL: https://issues.apache.org/jira/browse/HDFS-516
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jay Booth
>            Priority: Minor
>         Attachments: hdfs-516-20090912.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side 
> and simulated OS paging with LRU caching and lookahead on the client side.  
> Some applications could include Lucene searching (term->doc and doc->offset 
> mappings are likely to be in local cache, thus much faster than Nutch's 
> current FsDirectory impl) and binary search through record files (bytes at 
> 1/2, 1/4, 1/8 marks are likely to be cached).
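
The client-side idea in the description (page-sized LRU caching plus lookahead) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the attached patch: the class LruPageCache, its fetcher callback, and the parameters pageSize, maxPages and lookahead are all hypothetical names, and the fetcher is assumed to always return a full page (end-of-file handling is left out).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

public class LruPageCache {
    private final int pageSize;
    private final int lookahead;
    // Hypothetical callback standing in for a remote page read.
    private final LongFunction<byte[]> fetcher;
    private final Map<Long, byte[]> pages;
    private long fetchCount = 0;

    public LruPageCache(int pageSize, int maxPages, int lookahead,
                        LongFunction<byte[]> fetcher) {
        this.pageSize = pageSize;
        this.lookahead = lookahead;
        this.fetcher = fetcher;
        // accessOrder=true keeps entries ordered least-recently-used first,
        // so removeEldestEntry evicts the LRU page once the cache is full.
        this.pages = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxPages;
            }
        };
    }

    // Return the cached page, fetching and caching it on a miss.
    private byte[] load(long pageIndex) {
        byte[] page = pages.get(pageIndex);
        if (page == null) {
            page = fetcher.apply(pageIndex);
            fetchCount++;
            pages.put(pageIndex, page);
        }
        return page;
    }

    // Read one byte at an absolute file position, as an unsigned value.
    public int read(long position) {
        long pageIndex = position / pageSize;
        // On a miss, also pull in the following pages (lookahead).
        if (!pages.containsKey(pageIndex)) {
            for (long i = 1; i <= lookahead; i++) {
                load(pageIndex + i);
            }
        }
        byte[] page = load(pageIndex);
        return page[(int) (position % pageSize)] & 0xff;
    }

    public long getFetchCount() {
        return fetchCount;
    }
}
```

With a cache like this, a repeated binary search over a record file keeps touching the same pages around the 1/2, 1/4 and 1/8 marks, so those probes tend to be served locally after the first search.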

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
