Hi Khaled,
Is it the same with fewer threads? 1500 handler threads seems like a lot.
Typically a good number of threads depends on the hardware (number of cores, 
number of spindles, etc). I cannot think of any type of scenario where more 
than 100 would give any improvement.

How large is the payload per KV retrieved that way? If it is large (a few
hundred KB), you definitely want to lower the number of handler threads.
How much heap do you give the region server? Does the working set fit into the
cache? (In the metrics, check whether the eviction count keeps going up; if it
does, the working set does not fit into the cache.)

If the working set does not fit into the cache (eviction count goes up) then 
HBase will need to bring a new block in from disk on each Get (assuming the 
Gets are more or less random as far as the server is concerned).
In that case you'll benefit from reducing the HFile block size (from 64k to 8k
or even 4k).
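
As a rough back-of-the-envelope sketch of why the block size matters here,
assume every Get misses the cache and reads exactly one data block from disk
(index and bloom blocks are ignored). The class and method names below are made
up for illustration; the ~1000 Gets and the 64k/8k sizes are the figures from
this thread.

```java
class BlockReadCost {
    // Bytes read from disk if each Get loads one whole block (assumption: no cache hits).
    static long bytesReadFromDisk(long gets, long blockSizeBytes) {
        return gets * blockSizeBytes;
    }

    public static void main(String[] args) {
        long gets = 1000; // roughly the number of keys read per scatter/gather
        System.out.println("64k blocks: " + bytesReadFromDisk(gets, 64 * 1024) + " bytes");
        System.out.println("8k blocks:  " + bytesReadFromDisk(gets, 8 * 1024) + " bytes");
    }
}
```

Under those assumptions, 8k blocks pull one eighth of the bytes off disk for
the same Gets, at the cost of a larger block index.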

Lastly, I don't think we tested the performance of using reverse scan this way;
there is probably room to optimize this.
Can you restructure your keys to allow forward scanning? For example, you could
store the time as Long.MAX_VALUE - time. Or you could invert all the bits of
the time portion of the key, so that it sorts the other way. Then you could do
a forward scan.
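
A minimal sketch of both key encodings, assuming the row key is some prefix
followed by an 8-byte big-endian time and that times are non-negative. The
class, the method names, and the key layout are hypothetical, not taken from
your schema. It uses only the JDK, so the byte comparison is written out by
hand to mimic the unsigned lexicographic order HBase uses for row keys.

```java
import java.nio.ByteBuffer;

class ReversedTimeKey {
    // Long.MAX_VALUE - time: for non-negative times, newer times give smaller values.
    static long reverse(long timeMillis) {
        return Long.MAX_VALUE - timeMillis;
    }

    // Alternative: flip every bit, which also reverses the unsigned byte order.
    static long invert(long timeMillis) {
        return ~timeMillis;
    }

    // Row key = prefix followed by the 8 big-endian bytes of the encoded time.
    static byte[] rowKey(byte[] prefix, long encodedTime) {
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(encodedTime)
                .array();
    }

    // Unsigned lexicographic byte comparison, the order row keys are sorted in.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] prefix = "sensor1".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] older = rowKey(prefix, reverse(1000L));
        byte[] newer = rowKey(prefix, reverse(2000L));
        // A negative result means the newer entry sorts first.
        System.out.println(compareKeys(newer, older));
    }
}
```

With keys built this way, a forward scan that starts at
rowKey(prefix, reverse(T)) should hit the newest entry at or before T as its
first row, as long as an entry with that prefix exists.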

Let us know how it goes.

-- Lars


----- Original Message -----
From: Khaled Elmeleegy <[email protected]>
To: "[email protected]" <[email protected]>
Cc: 
Sent: Thursday, October 2, 2014 12:12 AM
Subject: HBase read performance

Hi,

I am trying to do a scatter/gather on hbase (0.98.6.1), where I have a client 
reading ~1000 keys from an HBase table. These keys happen to fall on the same 
region server. For my reads I use reverse scan to read each key as I want the 
key prior to a specific time stamp (time stamps are stored in reverse order). I 
don't believe gets can accomplish that, right? So I use a scan, with caching
set to 1.

I use 2000 reader threads in the client and on HBase, I've set 
hbase.regionserver.handler.count to 1500. With this setup, my scatter gather is 
very slow and can take up to 10s in total. An individual getScanner(..) call,
timed on the client side, can easily take a few hundred ms. I also got the
following metrics from the region server in question:

"queueCallTime_mean" : 2.190855525775637,
"queueCallTime_median" : 0.0,
"queueCallTime_75th_percentile" : 0.0,
"queueCallTime_95th_percentile" : 1.0,
"queueCallTime_99th_percentile" : 556.9799999999818,

"processCallTime_min" : 0,
"processCallTime_max" : 12755,
"processCallTime_mean" : 105.64873440912682,
"processCallTime_median" : 0.0,
"processCallTime_75th_percentile" : 2.0,
"processCallTime_95th_percentile" : 7917.95,
"processCallTime_99th_percentile" : 8876.89,

"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,

Where "delta" is the name of the table I am querying.

In addition to all this, I monitored the hardware resources (CPU, disk, and 
network) of both the client and the region server and nothing seems anywhere 
near saturation. So I am puzzled by what's going on and where this time is 
going.

A few things to note based on the above measurements: the medians of both IPC
processCallTime and queueCallTime are basically zero (ms I presume, right?).
However, scanNext_median is 101 (ms too, right?). I am not sure how this adds
up. Also, even though the 101 figure seems outrageously high and I don't know
why, all these scans should still be happening in parallel, so the overall call
should finish fast, given that no hardware resource is contended, right? But
this is not what's happening, so I have to be missing something(s).

So, any help is appreciated there.

Thanks,
Khaled
