I've just started using hbase, and have encountered a perplexing bug. The bug occurs on one set of Linux boxes, and not on another set, even though they're both x86_64 Linux, and both are running -identical- JVM releases.
I've attached a description of the problem below, but what I'm really wondering is whether there's a description somewhere of the various places to turn on instrumentation in hbase, so I can figure out what's wrong. I plan to do a lot of work with hbase in the future, so knowing how to debug it is in some sense more important than finding the fix for this particular bug. I really am looking to learn how to fish here. I'm sure I can slowly dig around and find all the various tracing facilities and such, but I figured there might be a cheat-sheet someplace ....

Thanks,
--chet--

================================================================

Basically, I set up hadoop 0.20.0 + hbase 0.20.6 in a cluster with 1 namenode and anywhere from 2-5 datanodes, which are also regionservers. I'm running a single zookeeper node, since this is just for testing. Furthermore, all these machines are isolated, high-performance SMP boxes with lots of memory. Modern Intel/AMD boxes. The cluster which "works" runs Fedora 9 on Opteron, and the one that "fails" runs RHEL5 on Intel Xeon (something-or-other -- I forget).

The test I'm running is the Yahoo! Cloud Serving Benchmark (YCSB). I'm just trying to load 1m records, and on the cluster that fails, I get, variously:

(a) a load will fail with an error like:

com.yahoo.ycsb.DBException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server -- nothing found, no 'location' returned, tableName=usertable, reload=true -- for region , row 'user1000015788', but failed after 11 attempts.
Exceptions:
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in .META. for region usertable,,1294095537393
org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in .META. for region usertable,,1294095537393

(b) a load will succeed, but there won't be 1m rows (I use the "count" command in "hbase shell" to count them).

(c) sometimes a "truncate" will fail with an error of the form above. The step which fails is the "disable" step.

Java stack-dumps from the regionservers don't show any threads doing anything interesting. I don't know how to interrogate Zookeeper; perhaps there's something messed-up in there ....