I've just started using hbase, and have encountered a perplexing bug.
The bug occurs on one set of Linux boxes, and not on another set, even
though they're both x86_64 Linux, and both are running -identical- JVM
releases.

I've attached a description of the problem below, but really, what I'm
wondering is whether there's a description someplace of the various places
to turn on instrumentation in hbase, so I can figure out what's wrong.  I
plan to do a lot of work with hbase in the future, so knowing how to
debug it is in some sense more important than finding out the fix for
this particular bug.

I really am looking to learn how to fish here.  I'm sure I can slowly
dig around and find all the various tracing facilities and such, but I
figured there might be a cheat-sheet someplace ....

Thanks,
--chet--

================================================================

Basically, I set up hadoop 0.20.0 + hbase 0.20.6, in a cluster with 1
namenode, and anywhere from 2-5 datanodes which are also
regionservers.  I'm running a single zookeeper node, since this is
just for testing.  Furthermore, all these machines are isolated,
high-performance, SMP, with lots of memory.  Modern Intel/AMD boxes.

The cluster which "works" runs Fedora 9 on Opteron, and the one that
"fails" runs RHEL5 on Intel Xeon (something-or-other -- I forget).

The test I'm running is the Yahoo! Cloud Serving Benchmark (YCSB).  I'm just
trying to load 1m records, and on the cluster that fails, I get,
variously:

(a) a load will fail with an error like:

com.yahoo.ycsb.DBException: 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact 
region server  -- nothing found, no 'location' returned, tableName=usertable, 
reload=true -- for region , row 'user1000015788', but failed after 11 
attempts.
Exceptions:
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region usertable,,1294095537393
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address
listed in .META. for region usertable,,1294095537393

(b) a load will succeed, but there won't be 1m rows (where I use the
"count" command in "hbase shell" to count).

(c) sometimes, a "truncate" will fail, with an error of the form
above.  The step that fails is the "disable" step.

Java stack-dumps from the regionservers don't show any threads doing
anything interesting.  I don't know how to interrogate ZooKeeper;
perhaps there's something messed up in there ....
