Check the region server logs; if the servers are blocking on something, it should show up there. On
CDH3 the logs are in /var/log/hbase/. You may also want to turn on DEBUG-level logging, either in
log4j (e.g. log4j.logger.org.apache.hadoop.hbase=DEBUG in conf/log4j.properties) or through the web
interface. Finally, all of your requests are going to just one region
server...npin-172-16-12-204.np.local...so it may be stuck trying to split a region or something
similar. You could try pre-splitting the table, which may help; a rough sketch follows.
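Something along these lines should work (a minimal sketch against the 0.90.x Java client API; it
assumes YCSB's default table name "usertable", the "family" column family from your command line,
and YCSB's default "user<digits>" key format -- adjust names to your setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitUsertable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Pre-create the YCSB table with explicit split points so the load is
    // spread across all five region servers from the first insert.
    HTableDescriptor desc = new HTableDescriptor("usertable");
    desc.addFamily(new HColumnDescriptor("family"));

    // YCSB keys look like "user<digits>", so splitting on the first digit
    // ("user1" .. "user9") yields 10 regions, two per server.
    byte[][] splits = new byte[9][];
    for (int i = 1; i <= 9; i++) {
      splits[i - 1] = Bytes.toBytes("user" + i);
    }
    admin.createTable(desc, splits);
  }
}

With the table created up front, the load phase never has to wait on region splits.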
~Jeff
On 7/29/2011 10:57 AM, Eric Hauser wrote:
Hi,
I've been running different experiments against a 5-node cluster with YCSB.
We have been testing a number of different configurations, so I have
been repeatedly wiping the cluster and setting it up again (we
configure everything via Chef). At one point, I was able to get the
following stats from our cluster, which I was pretty happy with:
YCSB Client 0.1
Command line: -load -db com.yahoo.ycsb.db.HBaseClient
-Pworkloads/workloada -p columnfamily=family -p recordcount=10000000
-s
[OVERALL], RunTime(ms), 1057645.0
[OVERALL], Throughput(ops/sec), 9454.96834949345
[INSERT], Operations, 10000000
[INSERT], AverageLatency(ms), 0.0915235
[INSERT], MinLatency(ms), 0
[INSERT], MaxLatency(ms), 6925
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 0
[INSERT], Return=0, 10000000
However, with our most recent server builds, I seem to very quickly
deadlock something in HBase. I've gone back through all of our old
revisions and reverted a number of different configuration settings,
but I can't figure out why the cluster is now so slow. Our terasort
M/R tests return the same values as before, so I do not believe that
anything is wrong external to HBase.
The behavior that I see when I kick off the tests is this:
[UPDATE], 0, 4765
[UPDATE], 1, 248
[UPDATE], 2, 0
[UPDATE], 3, 0
[UPDATE], 4, 0
Basically, it kicks off a large number of inserts and HBase grinds to
a halt. Some of the writes do get inserted (usually ~50), but then
everything stops. Here's what I see on the region servers:
npin-172-16-12-203.np.local:60030 1311956094792 requests=50, regions=1, usedHeap=151, maxHeap=16358
npin-172-16-12-204.np.local:60030 1311956094776 requests=5, regions=2, usedHeap=157, maxHeap=16358
npin-172-16-12-205.np.local:60030 1311956093804 requests=0, regions=0, usedHeap=134, maxHeap=16358
npin-172-16-12-206.np.local:60030 1311956093809 requests=0, regions=0, usedHeap=134, maxHeap=16358
npin-172-16-12-207.np.local:60030 1311956094799 requests=0, regions=0, usedHeap=134, maxHeap=16358
Total: servers: 5, requests=55, regions=3
I did thread dumps on both the masters and region servers during this
time and did not see anything interesting. I'm using 0.90.3-CDH3U1.
Anyone have a suggestion on where to look next?
--
Jeff Whiting
Qualtrics Senior Software Engineer
je...@qualtrics.com