[ https://issues.apache.org/jira/browse/HDFS-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Booth updated HDFS-918:
---------------------------

    Attachment: hbase-hdfs-benchmarks.ods

Benchmarked on EC2 this weekend. I set up three builds: clean 0.20.2-append, a copy 
with my multiplex patch applied, and a third copy which only ports FileChannel pooling 
to the current architecture (I can submit that patch later; it's at home).
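
For reference, here is a minimal, hypothetical sketch of what FileChannel pooling could 
look like. It is not the pooling patch itself, just the general idea of reusing 
already-open FileChannels keyed by block file instead of reopening the block file for 
every read request; the names (FileChannelPool, checkout, checkin) are illustrative.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical illustration of FileChannel pooling, not the actual patch:
    // keep a small pool of open channels per block file and reuse them across
    // read requests instead of opening a new stream for every request.
    public class FileChannelPool {

      private final ConcurrentHashMap<File, ConcurrentLinkedQueue<FileChannel>> pool =
          new ConcurrentHashMap<File, ConcurrentLinkedQueue<FileChannel>>();

      /** Borrow an open channel for a block file, opening a new one if the pool is empty. */
      public FileChannel checkout(File blockFile) throws IOException {
        ConcurrentLinkedQueue<FileChannel> q = pool.get(blockFile);
        FileChannel c = (q == null) ? null : q.poll();
        if (c == null || !c.isOpen()) {
          c = new RandomAccessFile(blockFile, "r").getChannel();
        }
        return c;
      }

      /** Return a channel once the read completes so later requests can reuse it. */
      public void checkin(File blockFile, FileChannel c) {
        ConcurrentLinkedQueue<FileChannel> q = pool.get(blockFile);
        if (q == null) {
          q = new ConcurrentLinkedQueue<FileChannel>();
          ConcurrentLinkedQueue<FileChannel> prev = pool.putIfAbsent(blockFile, q);
          if (prev != null) {
            q = prev;
          }
        }
        q.add(c);
      }
    }

The point of the checkout/checkin pair is simply to avoid reopening the block file and 
churning stream objects on every request.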


All runs were with HBase block caching disabled to highlight the difference in 
filesystem access speeds.  

This ran across a fairly small dataset (a little less than 1GB), so all files are 
presumably in memory for the majority of the test duration.

Each run involved 6 clients reading 1,000,000 rows each, divided over 10 mappers.  
Cluster setup was 3x EC2 High-CPU XL: 1 NN/JT/ZK/Master and 2x DN/TT/RS.  Tests ran 
in 3 batches of 3 runs each, and the cluster was restarted between batches for each 
run type because the DN implementation changes between run types.


Topline numbers (rest are in document):

Total run averages (times in ms):

Test          clean          pool           multiplex
random        21159050.44    19448216.89    16806247
scan          436106.89      442452.54      443262.56
sequential    19298239.78    17871047.67    14987028.44

Pool is about a 7.5% gain and multiplex more like a 20% gain for random reads.
                        
Batches 2+3 only (batch 1 was a little messed up and doesn't track with the others):

Test          clean          pool           multiplex
random        20555308.67    18425017       16987643.33
scan          426849         427277.98      448031
sequential    18665323.67    16969885.83    15102404

Pool is about a 10% gain and multiplex around 17% for random reads.

Per-row times for random reads (batches 2+3 only; roughly the total run time in ms 
divided by the 6,000,000 rows read):
clean: 3.42ms
pool: 3.07ms
multiplex: 2.83ms


> Use single Selector and small thread pool to replace many instances of 
> BlockSender for reads
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-918
>                 URL: https://issues.apache.org/jira/browse/HDFS-918
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Jay Booth
>             Fix For: 0.22.0
>
>         Attachments: hbase-hdfs-benchmarks.ods, hdfs-918-20100201.patch, 
> hdfs-918-20100203.patch, hdfs-918-20100211.patch, hdfs-918-20100228.patch, 
> hdfs-918-20100309.patch, hdfs-918-branch20-append.patch, 
> hdfs-918-branch20.2.patch, hdfs-918-TRUNK.patch, hdfs-multiplex.patch
>
>
> Currently, on read requests, the DataXceiver server allocates a new thread 
> per request, which must allocate its own buffers and leads to 
> higher-than-optimal CPU and memory usage by the sending threads.  If we had a 
> single selector and a small threadpool to multiplex request packets, we could 
> theoretically achieve higher performance while taking up fewer resources and 
> leaving more CPU on datanodes available for MapReduce, HBase or whatever.  This 
> can be done without changing any wire protocols.
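
To make the quoted proposal concrete, here is a rough, hypothetical sketch of the 
single-selector idea, not the attached patch: one selector thread watches all client 
sockets for writability, and a small fixed thread pool sends the next chunk for 
whichever connections are ready, rather than parking one sender thread per read for 
its whole lifetime. The names (MultiplexedReadServer, ReadContext, addRead), the pool 
size of 4, and the 64KB chunk size are assumptions for illustration.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical illustration only -- one selector plus a small thread pool
    // multiplexing block reads over many client connections.
    public class MultiplexedReadServer implements Runnable {

      /** Per-connection state: the open block file and how much of it is left to send. */
      static class ReadContext {
        final SocketChannel client;
        final FileChannel blockFile;
        long position;
        final long end;
        ReadContext(SocketChannel client, FileChannel blockFile, long offset, long length) {
          this.client = client;
          this.blockFile = blockFile;
          this.position = offset;
          this.end = offset + length;
        }
      }

      private final Selector selector;
      private final ExecutorService pool = Executors.newFixedThreadPool(4); // small, shared
      private final ConcurrentLinkedQueue<ReadContext> pending =
          new ConcurrentLinkedQueue<ReadContext>();

      public MultiplexedReadServer() throws IOException {
        this.selector = Selector.open();
      }

      /** Queue a client read request; the selector thread registers it on its next pass. */
      public void addRead(SocketChannel client, FileChannel blockFile,
                          long offset, long length) {
        pending.add(new ReadContext(client, blockFile, offset, length));
        selector.wakeup();
      }

      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            // Register newly queued reads from the selector thread itself.
            ReadContext ctx;
            while ((ctx = pending.poll()) != null) {
              ctx.client.configureBlocking(false);
              ctx.client.register(selector, SelectionKey.OP_WRITE, ctx);
            }
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
              final SelectionKey key = it.next();
              it.remove();
              if (!key.isValid() || !key.isWritable()) {
                continue;
              }
              key.interestOps(0); // ignore this socket until its current chunk is sent
              pool.submit(new Runnable() {
                public void run() {
                  sendChunk(key);
                }
              });
            }
          }
        } catch (IOException e) {
          // sketch only: a real server would log and keep serving other connections
        }
      }

      /** Send one chunk for this connection, then hand control back to the selector. */
      private void sendChunk(SelectionKey key) {
        ReadContext ctx = (ReadContext) key.attachment();
        try {
          long chunk = Math.min(64 * 1024, ctx.end - ctx.position);
          // Zero-copy transfer of the next chunk from the block file to the client socket.
          long sent = ctx.blockFile.transferTo(ctx.position, chunk, ctx.client);
          ctx.position += sent;
          if (ctx.position >= ctx.end) {
            key.cancel();                           // finished this read
          } else {
            key.interestOps(SelectionKey.OP_WRITE); // more to send
            selector.wakeup();
          }
        } catch (IOException e) {
          key.cancel();
        }
      }
    }

The trade is one long-lived thread per client versus a short per-chunk task, which is 
where the lower thread and buffer overhead described in the issue would come from.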

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
