Daniel Leffel wrote:
After experiencing a region server that would not exit (HBASE-617), I tried
to bring hbase back up (after first shutting down and restarting
DFS).

There are around 370 regions. The first 250 were assigned to region servers
within 5 minutes of startup. The rest of the regions took the better part of
the day to become assigned to a region server. A quick inspection of the
regionserver logs showed messages like the following:

2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: Received
MSG_REPORT_PROCESS_OPEN : categories,2864153,1211005494348 from
10.254.26.31:60020

These messages are sent over to the master by the regionserver as a kind of ping saying "I'm still alive and working on whatever it was you gave me".

Can you tell what was happening by looking in the regionserver logs?

Was it that all regions had been given to a single regionserver, and it was busy replaying edits before bringing the regions online? (There is a single worker thread per regionserver. If there are lots of edits to replay, it can take seconds to minutes to bring a region online.)
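
One rough way to check whether a single regionserver was doing all the work is to tally the MSG_REPORT_PROCESS_OPEN lines by the address that sent them. The sketch below is only an illustration: it assumes each message sits on one line in the log (the sample above is wrapped by the mail client) and that the line layout matches that sample. The function name summarize_open_reports is made up for this example and is not part of hbase.

```python
import re
from collections import Counter, defaultdict

# Assumed layout, taken from the single sample line quoted in this thread:
#   ... Received MSG_REPORT_PROCESS_OPEN : <region name> from <host:port>
OPEN_REPORT = re.compile(
    r"Received MSG_REPORT_PROCESS_OPEN : (?P<region>.+?) from (?P<server>\S+)\s*$"
)


def summarize_open_reports(lines):
    """Count 'still opening' pings and distinct regions per regionserver.

    Hypothetical helper for illustration; not an hbase API.
    """
    pings = Counter()            # server -> number of ping lines seen
    regions = defaultdict(set)   # server -> distinct regions it reported on
    for line in lines:
        match = OPEN_REPORT.search(line)
        if match is None:
            continue  # not a MSG_REPORT_PROCESS_OPEN line
        server = match.group("server")
        pings[server] += 1
        regions[server].add(match.group("region"))
    return pings, regions


# Usage with the one line quoted above (joined back onto a single line).
sample = [
    "2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: "
    "Received MSG_REPORT_PROCESS_OPEN : categories,2864153,1211005494348 "
    "from 10.254.26.31:60020",
]
pings, regions = summarize_open_reports(sample)
for server, count in pings.most_common():
    print(server, count, "pings,", len(regions[server]), "distinct regions")
```

If nearly all of the pings come from one address and cover many distinct regions, that would point at one regionserver working through a long queue of opens.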

Did the regions come online gradually or all in a lump?

After waiting for all the regions to be assigned (and an absence of the
above message appearing in the log), I started a MapReduce job that iterates
over all regions. Immediately, the above mentioned region began to show up
in the logs again with the above message and the job failed with an
IOException because it couldn't locate blocks.

I ran fsck on /hbase and sure enough, blocks are missing from the following
file (although it reports a size of 0 as what's missing - I presume it just
doesn't know):

/hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000

The above looks like the innocuous messages described in https://issues.apache.org/jira/browse/HBASE-509.

St.Ack
