Do check again on the heap size of the region servers. The default unconfigured size is 1G, which is too small for much of anything. Check your RS logs -- look for lines produced by the JvmPauseMonitor thread. They usually correlate with long GC pauses or other process-freeze events.
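For example, in conf/hbase-env.sh (on 0.98 the value is in MB; 8000 below is just an illustrative figure -- size it to your working set):

  export HBASE_HEAPSIZE=8000

The pause monitor entries look roughly like the below (exact wording varies by version):

  INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or
  host machine (eg GC): pause of approximately 8123ms
  GC pool 'ParNew' had collection(s): count=1 time=8456ms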
Get is implemented as a Scan of a single row, so a reverse scan of a single row should be functionally equivalent. In practice, though, I have seen discrepancies between the latencies reported by the RS and the latencies experienced by the client. I've not investigated this area thoroughly.
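If you want to sanity-check the client side, the single-row reverse scan I have in mind looks something like this against the 0.98 client API (untested sketch; the table name and row key are placeholders):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReverseGet {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "delta");      // placeholder table name
      byte[] startRow = Bytes.toBytes("some-key");   // the key you'd hand to a Get
      Scan scan = new Scan(startRow);
      scan.setReversed(true);  // walk backwards starting at startRow
      scan.setCaching(1);      // we only ever want one row back
      scan.setMaxVersions(1);
      ResultScanner scanner = table.getScanner(scan);
      try {
        // First result is the greatest row key <= startRow: the "reverse get".
        Result r = scanner.next();
        System.out.println(r);
      } finally {
        scanner.close();
        table.close();
      }
    }
  }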
On Thu, Oct 2, 2014 at 10:05 AM, Khaled Elmeleegy <[email protected]> wrote:
> Thanks Lars for your quick reply.
>
> Yes, performance is similar with fewer handlers (I tried with 100 first).
>
> The payload is not big, ~1KB or so. The working set doesn't seem to fit in memory, as there are many cache misses. However, disk is far from being a bottleneck; I checked using iostat. I also verified that neither the network nor the CPU of the region server or the client is a bottleneck. This leads me to believe that this is likely a software bottleneck, possibly due to a misconfiguration on my side. I just don't know how to debug it. A clear disconnect I see is the individual request latency as reported by metrics on the region server (IPC processCallTime vs scanNext) vs what's measured on the client. Does this sound right? Any ideas on how to better debug it?
>
> About this trick with the timestamps to be able to do a forward scan, thanks for pointing it out. Actually, I am aware of it. The problem I have is, sometimes I want to get the key after a particular timestamp and sometimes I want to get the key before it, so just relying on the key order doesn't work. Ideally, I want a reverse get(). I thought reverse scan could do the trick though.
>
> Khaled
>
> ----------------------------------------
> > Date: Thu, 2 Oct 2014 09:40:37 -0700
> > From: [email protected]
> > Subject: Re: HBase read performance
> > To: [email protected]
> >
> > Hi Khaled,
> > is it the same with fewer threads? 1500 handler threads seems to be a lot. Typically a good number of threads depends on the hardware (number of cores, number of spindles, etc). I cannot think of any type of scenario where more than 100 would give any improvement.
> >
> > How large is the payload per KV retrieved that way? If large (as in a few 100k) you definitely want to lower the number of handler threads.
> > How much heap do you give the region server? Does the working set fit into the cache? (i.e. in the metrics, do you see the eviction count going up? If so, it does not fit into the cache.)
> >
> > If the working set does not fit into the cache (eviction count goes up), then HBase will need to bring a new block in from disk on each Get (assuming the Gets are more or less random as far as the server is concerned).
> > In that case you'll benefit from reducing the HFile block size (from 64k to 8k or even 4k).
> >
> > Lastly, I don't think we tested the performance of using reverse scan this way; there is probably room to optimize it.
> > Can you restructure your keys to allow forward scanning? For example you could store the time as MAX_LONG-time. Or you could invert all the bits of the time portion of the key, so that it sorts the other way. Then you could do a forward scan.
> >
> > Let us know how it goes.
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Khaled Elmeleegy <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Sent: Thursday, October 2, 2014 12:12 AM
> > Subject: HBase read performance
> >
> > Hi,
> >
> > I am trying to do a scatter/gather on HBase (0.98.6.1), where I have a client reading ~1000 keys from an HBase table. These keys happen to fall on the same region server. For my reads I use reverse scan to read each key, as I want the key prior to a specific timestamp (timestamps are stored in reverse order). I don't believe Gets can accomplish that, right? So I use scan, with caching set to 1.
> >
> > I use 2000 reader threads in the client, and on HBase I've set hbase.regionserver.handler.count to 1500. With this setup, my scatter/gather is very slow and can take up to 10s in total. Timing an individual getScanner(..) call on the client side, it can easily take a few hundred ms. I also got the following metrics from the region server in question:
> >
> > "queueCallTime_mean" : 2.190855525775637,
> > "queueCallTime_median" : 0.0,
> > "queueCallTime_75th_percentile" : 0.0,
> > "queueCallTime_95th_percentile" : 1.0,
> > "queueCallTime_99th_percentile" : 556.9799999999818,
> >
> > "processCallTime_min" : 0,
> > "processCallTime_max" : 12755,
> > "processCallTime_mean" : 105.64873440912682,
> > "processCallTime_median" : 0.0,
> > "processCallTime_75th_percentile" : 2.0,
> > "processCallTime_95th_percentile" : 7917.95,
> > "processCallTime_99th_percentile" : 8876.89,
> >
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_min" : 89,
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_max" : 11300,
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_mean" : 654.4949739797315,
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_median" : 101.0,
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_75th_percentile" : 101.0,
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_95th_percentile" : 101.0,
> > "namespace_default_table_delta_region_87be70d7710f95c05cfcc90181d183b4_metric_scanNext_99th_percentile" : 113.0,
> >
> > Where "delta" is the name of the table I am querying.
> >
> > In addition to all this, I monitored the hardware resources (CPU, disk, and network) of both the client and the region server, and nothing seems anywhere near saturation. So I am puzzled by what's going on and where this time is going.
> >
> > A few things to note based on the above measurements: both medians of IPC processCallTime and queueCallTime are basically zero (ms, I presume, right?). However, scanNext_median is 101 (ms too, right?). I am not sure how this adds up. Also, even though the 101 figure seems outrageously high and I don't know why, still, all these scans should be happening in parallel, so the overall call should finish fast, given that no hardware resource is contended, right? But this is not what's happening, so I have to be missing something(s).
> >
> > So, any help is appreciated here.
> >
> > Thanks,
> > Khaled
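
P.S. For anyone following along, the forward-scan key trick Lars describes above comes down to something like this (sketch only; the id and the timestamp values are made up):

  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class InvertedTimeKeys {
    public static void main(String[] args) {
      byte[] id = Bytes.toBytes("someId");  // placeholder entity id
      long ts = 1412269537000L;             // event time being written
      // Write path: append MAX_LONG - ts so newer entries sort first.
      byte[] rowKey = Bytes.add(id, Bytes.toBytes(Long.MAX_VALUE - ts));

      long t = 1412269600000L;              // query time
      // Read path: forward scan starting at (id, MAX_LONG - t); the first
      // result is the newest entry with timestamp <= t.
      Scan scan = new Scan(Bytes.add(id, Bytes.toBytes(Long.MAX_VALUE - t)));
      scan.setCaching(1);
      System.out.println(Bytes.toStringBinary(rowKey) + " / " + scan);
    }
  }

As Khaled notes, this only helps when you always want the latest entry at or before t; asking for the earliest entry after t in the same table puts the key order against you.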
