[ 
https://issues.apache.org/jira/browse/HDFS-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750246#action_12750246
 ] 

Jay Booth commented on HDFS-516:
--------------------------------

I did some benchmarking; here are the results:

Each test ran 1000 searches to warm up, then 5000 searches to benchmark.
Each search was a binary search of a 20GB sorted sequence file of 20 million 
1KB records.
Tests were run from the namenode in a 4-node EC2 medium cluster (1 namenode and 
3 datanodes), with 1.7 GB of RAM each.
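
For reference, the warm-then-measure procedure and the summary figures reported 
below could be reproduced with something like the following.  This is only a 
sketch and not the harness I actually ran: the class name SearchBench, the 
population-variance formula and the nearest-rank percentile rule are all 
assumptions, and the search itself is passed in as a plain Runnable.

```java
import java.util.Arrays;

public class SearchBench {
    /** Summary statistics over latency samples, in milliseconds. */
    public static final class Stats {
        public final double mean, variance, median, max, min, p90;

        public Stats(double[] samples) {
            double[] s = samples.clone();
            Arrays.sort(s);
            double sum = 0;
            for (double v : s) sum += v;
            mean = sum / s.length;
            double sq = 0;
            for (double v : s) sq += (v - mean) * (v - mean);
            variance = sq / s.length;    // population variance (assumption)
            median = percentile(s, 50);  // nearest rank: lower middle for even sizes
            p90 = percentile(s, 90);
            min = s[0];
            max = s[s.length - 1];
        }
    }

    /** Nearest-rank percentile on an already sorted array (assumed rule). */
    public static double percentile(double[] sorted, double pct) {
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Runs the warmup iterations untimed, then times each measured iteration. */
    public static Stats run(Runnable search, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) search.run();
        double[] ms = new double[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            search.run();
            ms[i] = (System.nanoTime() - start) / 1e6;
        }
        return new Stats(ms);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real run would perform one random search here.
        Stats st = run(() -> Math.sqrt(42.0), 1000, 5000);
        System.out.printf("Mean: %f%nMedian: %f%n90th pct: %f%n",
                st.mean, st.median, st.p90);
    }
}
```

A 99th percentile would just be another percentile(s, 99) call on the same 
sorted samples.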

From HDFS to a 512MB cached RadFS there was a 4X average improvement in search 
times, from 102ms to 24ms.
Each search was, theoretically, about 24.25 reads (log2 of 20 million); I did 
not actually measure this.
I only ran each set once.  The 90th percentile trends the right way, although 
the max is a little spiky.  I'll add a 99th percentile in future benchmarks.
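
The 24.25 figure can be checked with a small sketch like the one below.  It 
assumes an idealized binary search over n sorted records where each probe costs 
exactly one read, which is a simplification: a probe into a real sequence file 
may take more than one read.  The class and method names are hypothetical.

```java
public class ProbeCount {
    /** Base-2 logarithm. */
    public static double log2(double n) {
        return Math.log(n) / Math.log(2);
    }

    /**
     * Counts the probes a classic binary search makes over n sorted records,
     * modelling record i as having key i, until it finds the target.
     */
    public static int probes(long n, long target) {
        long lo = 0, hi = n - 1;
        int count = 0;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            count++;                      // one record read per probe (assumption)
            if (mid == target) return count;
            if (mid < target) lo = mid + 1; else hi = mid - 1;
        }
        return count;
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        System.out.printf("log2(%d) = %.2f%n", n, log2(n));
        System.out.println("probes for last record: " + probes(n, n - 1));
    }
}
```

In this model no search needs more than 25 probes, and the first few probes 
always land on the same midpoints, which is why those reads cache well.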

HDFS, baseline:
Warming with 1000 searches
Executed 5000 random searches with FS class 
org.apache.hadoop.hdfs.DistributedFileSystem
Done, Search Times:
Mean:     102.17840000000015
Variance: 5939.660105461091
Median:   97.0
Max:      3095.0
Min:      33.0
90th pct: 130.0

Rad, no cache:
Executed 5000 random searches with FS class 
org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times: 
Mean:     68.55640000000002
Variance: 233.8335857571515
Median:   67.0
Max:      379.0
Min:      26.0
90th pct: 79.0

Rad, 16MB cache:
Warming with 1000 searches
Executed 5000 random searches with FS class 
org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times: 
Mean:     42.039799999999985
Variance: 237.83818359671966
Median:   40.0
Max:      203.0
Min:      5.0
90th pct: 59.0

Rad, 128MB cache:
Warming with 1000 searches
Executed 5000 random searches with FS class 
org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times: 
Mean:     29.850600000000007
Variance: 202.08189601920367
Median:   27.0
Max:      203.0
Min:      1.0
90th pct: 45.0

Rad, 512MB cache:
Warming with 1000 searches
Executed 5000 random searches with FS class 
org.apache.hadoop.hdfs.rad.RadFileSystem
Done, Search Times:
Mean:     24.274600000000014
Variance: 250.3052558911758
Median:   22.0
Max:      687.0
Min:      0.0
90th pct: 36.0


I could still shave a point or two by cleaning up my caching system to be more 
graceful with its lookahead mechanism, but not bad for now.  I'll pretty it up 
and post a first attempt at a final patch soon.
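
For anyone unfamiliar with the idea, client-side LRU caching with lookahead can 
be sketched roughly as follows.  This is not the code in the patch: the 
LookaheadCache class, the BlockLoader interface and the fixed lookahead count 
are hypothetical, and the sketch is single-threaded with synchronous loads.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LookaheadCache {
    /** Hypothetical loader that fetches one block by index, e.g. from a datanode. */
    public interface BlockLoader {
        byte[] load(long blockIndex);
    }

    private final int lookahead;
    private final BlockLoader loader;
    private final LinkedHashMap<Long, byte[]> blocks;
    private int loads = 0;

    public LookaheadCache(final int maxBlocks, int lookahead, BlockLoader loader) {
        this.lookahead = lookahead;
        this.loader = loader;
        // accessOrder=true keeps entries ordered least recently used first.
        this.blocks = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;   // evict the LRU block when over capacity
            }
        };
    }

    /** Returns the block, loading it plus the next `lookahead` blocks on a miss. */
    public byte[] get(long blockIndex) {
        byte[] hit = blocks.get(blockIndex);   // a hit also marks the block as recent
        if (hit != null) return hit;
        byte[] wanted = fetch(blockIndex);
        for (long i = blockIndex + 1; i <= blockIndex + lookahead; i++) {
            if (!blocks.containsKey(i)) fetch(i);
        }
        blocks.get(blockIndex);                // re-touch so the requested block is newest
        return wanted;
    }

    private byte[] fetch(long blockIndex) {
        byte[] data = loader.load(blockIndex);
        loads++;
        blocks.put(blockIndex, data);
        return data;
    }

    /** Number of loader calls so far. */
    public int loadCount() {
        return loads;
    }

    /** True if the block is cached; does not change the LRU order. */
    public boolean contains(long blockIndex) {
        return blocks.containsKey(blockIndex);
    }
}
```

A sequential reader mostly hits the blocks fetched by lookahead, while a binary 
search keeps its early midpoint blocks warm through the LRU ordering.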

> Low Latency distributed reads
> -----------------------------
>
>                 Key: HDFS-516
>                 URL: https://issues.apache.org/jira/browse/HDFS-516
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Jay Booth
>            Priority: Minor
>         Attachments: hdfs-516-20090824.patch, hdfs-516-20090831.patch, 
> radfs.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I created a method for low latency random reads using NIO on the server side 
> and simulated OS paging with LRU caching and lookahead on the client side.  
> Some applications could include Lucene searching (term->doc and doc->offset 
> mappings are likely to be in local cache, thus much faster than Nutch's 
> current FsDirectory impl) and binary search through record files (bytes at 
> 1/2, 1/4, 1/8 marks are likely to be cached).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
