[ https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750246#action_12750246 ]
Jay Booth commented on HDFS-516:
--------------------------------

I did some benchmarking; here are the results.

Each test ran 1000 searches to warm up, then 5000 searches to benchmark.
Each search was a binary search of a 20GB sorted sequence file of 20 million
1KB records. Tests were run from the namenode of a 4-node EC2 medium cluster
(1 namenode and 3 datanodes), with 1.7 GB of RAM per node.

Going from HDFS to RadFS with a 512MB cache gave roughly a 4x improvement in
mean search time, from 102ms to 24ms. Each search was, theoretically, 24.25
reads (log base 2 of 20 million); I did not actually measure the read count.
I only ran each set once. The 90th percentile trends the right way, although
the max is a little spiky. I'll add a 99th percentile in future benchmarks.

HDFS, baseline:
Warming with 1000 searches
Executed 5000 random searches with FS class org.apache.hadoop.hdfs.DistributedFileSystem
Done, Search Times:
Mean: 102.17840000000015
Variance: 5939.660105461091
Median: 97.0
Max: 3095.0
Min: 33.0
90th pct: 130.0

Rad, no cache:
Executed 5000 random searches with FS class org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times:
Mean: 68.55640000000002
Variance: 233.8335857571515
Median: 67.0
Max: 379.0
Min: 26.0
90th pct: 79.0

Rad, 16MB cache:
Warming with 1000 searches
Executed 5000 random searches with FS class org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times:
Mean: 42.039799999999985
Variance: 237.83818359671966
Median: 40.0
Max: 203.0
Min: 5.0
90th pct: 59.0

Rad, 128MB cache:
Warming with 1000 searches
Executed 5000 random searches with FS class org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times:
Mean: 29.850600000000007
Variance: 202.08189601920367
Median: 27.0
Max: 203.0
Min: 1.0
90th pct: 45.0

Rad, 512MB cache:
Warming with 1000 searches
Executed 5000 random searches with FS class org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times:
Mean: 24.274600000000014
Variance: 250.3052558911758
Median: 22.0
Max: 687.0
Min: 0.0
90th pct: 36.0

I could still shave a point or two by cleaning up
my caching system so that its lookahead mechanism is more graceful, but it's
not bad for now. I'll tidy it up and post a first attempt at a final patch soon.

> Low Latency distributed reads
> -----------------------------
>
>                 Key: HDFS-516
>                 URL: https://issues.apache.org/jira/browse/HDFS-516
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jay Booth
>            Priority: Minor
>         Attachments: hdfs-516-20090824.patch, hdfs-516-20090831.patch, radfs.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side
> and simulated OS paging with LRU caching and lookahead on the client side.
> Some applications could include Lucene searching (term->doc and doc->offset
> mappings are likely to be in local cache, thus much faster than Nutch's
> current FsDirectory impl) and binary search through record files (bytes at
> the 1/2, 1/4, 1/8 marks are likely to be cached).
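
For readers unfamiliar with the client-side idea in the issue description (simulated OS paging with LRU caching and lookahead), here is a minimal sketch of how such a cache might look. It is not taken from the attached patches: the class name LruPageCache, the LongFunction page source standing in for a remote read, and the page-count parameters are all hypothetical, and the sketch assumes the source can return a page for any page number.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical client-side page cache: LRU eviction plus fixed lookahead. */
final class LruPageCache {
    private final int pageSize;
    private final int lookaheadPages;
    // Stand-in for a remote page fetch (assumption, not the RadFS API).
    private final LongFunction<byte[]> source;
    private final Map<Long, byte[]> pages;
    private int remoteFetches = 0;

    LruPageCache(int pageSize, final int maxPages, int lookaheadPages,
                 LongFunction<byte[]> source) {
        this.pageSize = pageSize;
        this.lookaheadPages = lookaheadPages;
        this.source = source;
        // accessOrder=true keeps entries ordered least-recently-used first,
        // so removeEldestEntry evicts the LRU page once the cache is full.
        this.pages = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxPages;
            }
        };
    }

    /** Returns the byte at an absolute offset, fetching pages on a miss. */
    byte read(long offset) {
        long pageNo = offset / pageSize;
        byte[] page = pages.get(pageNo); // get() also refreshes recency
        if (page == null) {
            page = fetch(pageNo);
            // Lookahead: pull in the following pages that are not cached yet.
            for (int i = 1; i <= lookaheadPages; i++) {
                if (!pages.containsKey(pageNo + i)) {
                    fetch(pageNo + i);
                }
            }
        }
        return page[(int) (offset % pageSize)];
    }

    private byte[] fetch(long pageNo) {
        remoteFetches++;
        byte[] page = source.apply(pageNo);
        pages.put(pageNo, page);
        return page;
    }

    int remoteFetches() { return remoteFetches; }

    boolean isCached(long pageNo) { return pages.containsKey(pageNo); }
}
```

In a binary search, every lookup probes the same offsets near the 1/2, 1/4 and 1/8 marks first, so those pages are touched constantly and an LRU policy tends to keep them resident. Only the last few probes of each search should need a remote read, which is consistent with the mean search time falling as the cache grows in the numbers above.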