Hi,
I am trying to do a scatter/gather on HBase (0.98.6.1), where a client reads
~1000 keys from an HBase table. These keys all happen to fall on the same
region server. For each read I use a reverse scan, because I want the key
just prior to a specific timestamp (timestamps are stored in reverse order). I
don't believe gets can accomplish that, right? So I use a scan, with caching set
to 1.
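To illustrate the lookup semantics I am after, here is a minimal in-memory
sketch in Python (not HBase client code). It treats a reverse scan that returns
a single row as "find the greatest row key at or before the start row". The
function name reverse_scan_first and the sample row keys are hypothetical and
only there for illustration.

```python
from bisect import bisect_right
from typing import List, Optional


def reverse_scan_first(sorted_rows: List[bytes], start_row: bytes) -> Optional[bytes]:
    """Return the greatest row key <= start_row, or None if there is none.

    In-memory stand-in for a reverse scan limited to one row; it only models
    the ordering semantics, not any HBase client behaviour.
    """
    # bisect_right gives the index of the first key strictly greater than
    # start_row, so the key just before that index is the one we want.
    idx = bisect_right(sorted_rows, start_row)
    return sorted_rows[idx - 1] if idx > 0 else None


# Hypothetical row keys, kept in lexicographic order.
rows = sorted([b"k1#0003", b"k1#0007", b"k2#0001", b"k2#0009"])
```

As far as I understand, a plain get needs the exact row key, which is why I
fall back to a scan for this kind of "nearest at or before" lookup.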
I use 2000 reader threads in the client, and on the HBase side I've set
hbase.regionserver.handler.count to 1500. With this setup, my scatter/gather is
very slow and can take up to 10 s in total. Timing an individual getScanner(..)
call on the client side, it can easily take a few hundred ms. I also got the
following metrics from the region server in question:
"queueCallTime_mean" : 2.190855525775637,
"queueCallTime_median" : 0.0,
"queueCallTime_75th_percentile" : 0.0,
"queueCallTime_95th_percentile" : 1.0,
"queueCallTime_99th_percentile" : 556.9799999999818,
"processCallTime_min" : 0,
"processCallTime_max" : 12755,
"processCallTime_mean" : 105.64873440912682,
"processCallTime_median" : 0.0,
"processCallTime_75th_percentile" : 2.0,
"processCallTime_95th_percentile" : 7917.95,
"processCallTime_99th_percentile" : 8876.89,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
"namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
Here, "delta" is the name of the table I am querying.
In addition to all this, I monitored the hardware resources (CPU, disk, and
network) of both the client and the region server, and nothing seems anywhere
near saturation. So I am puzzled by what's going on and where this time is
going.
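For concreteness, the client-side pattern and per-call timing can be sketched
roughly as below. This is a simplified Python sketch, not my actual client:
lookup is a hypothetical stand-in for whatever issues one reverse scan, and
timed_call and scatter_gather are names made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def timed_call(lookup: Callable[[bytes], object], key: bytes) -> Tuple[object, float]:
    """Run one lookup and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = lookup(key)
    return result, (time.perf_counter() - start) * 1000.0


def scatter_gather(
    lookup: Callable[[bytes], object], keys: Sequence[bytes], threads: int
) -> Tuple[List[object], List[float], float]:
    """Issue all lookups in parallel; return results, per-call ms, and total ms."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map preserves input order, so results[i] belongs to keys[i].
        pairs = list(pool.map(lambda k: timed_call(lookup, k), keys))
    total_ms = (time.perf_counter() - start) * 1000.0
    results = [r for r, _ in pairs]
    per_call_ms = [ms for _, ms in pairs]
    return results, per_call_ms, total_ms
```

In my setup, threads would be 2000 and keys would hold the ~1000 row keys; the
total is bounded below by the slowest individual call.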
A few things to note based on the above measurements: the medians of both the
IPC processCallTime and queueCallTime are basically zero (in ms, I presume,
right?). However, scanNext_median is 101 (also ms, right?). I am not sure how
this adds up. Also, even though the 101 figure seems outrageously high and I
don't know why, all these scans should still be happening in parallel, so the
overall call should finish fast, given that no hardware resource is contended,
right? But this is not what's happening, so I must be missing something.
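One back-of-the-envelope check on the "parallel, so it should be fast"
reasoning: a gather finishes only when its slowest call finishes. The sketch
below assumes, purely for illustration, that the ~1000 calls are independent
and that each has a 5% chance of landing at or above the reported
processCallTime 95th percentile (7917.95). I have not verified either
assumption, and prob_any_slow is just a name I made up.

```python
def prob_any_slow(num_calls: int, slow_fraction: float) -> float:
    """Probability that at least one of num_calls independent calls is 'slow'."""
    return 1.0 - (1.0 - slow_fraction) ** num_calls


# Assumed inputs: ~1000 keys per gather, 5% of calls at or above the
# 95th-percentile processCallTime reported by the region server.
p = prob_any_slow(num_calls=1000, slow_fraction=0.05)
print(f"P(at least one slow call) = {p:.6f}")
```

Under those assumptions the probability is effectively 1, so the total time
would follow the tail of the distribution rather than the median.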
Any help is appreciated.
Thanks,
Khaled