On Mon, Apr 16, 2012 at 8:21 AM, Bryan Beaudreault
<bbeaudrea...@hubspot.com> wrote:
> We've recently had a problem where regions will get stuck in transition for
> a long period of time.  In fact, they don't ever appear to get
> out-of-transition unless we take manual action.  Last time this happened I
> restarted the master and they were cleared out.  This time I wanted to
> consult the list first.
>

Yeah, sometimes the master's notion of cluster state goes out of
agreement w/ conditions on the ground, and a master restart, which
forces it to reconsult the cluster, is the only way to clear up
certain states (much has been fixed around the issues that gave rise
to these conditions in later HBase versions, but you probably figured
that, and it's probably of little immediate help to you at the moment).

> I checked the admin ui for all 24 of our servers, and the region does not
> appear to be hosted anywhere.  If I look in hdfs, I do see the region there
> and it has 2 files.  The first instance of this region in my HMaster logs
> is:
>
> 12/04/15 17:48:06 INFO master.HMaster: balance
>> hri=visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.,
>> src=XXXXXXXXX.ec2.internal,60020,1334064456919,
>> dest=XXXXXXXX.ec2.internal,60020,1334064197946
>> 12/04/15 17:48:06 INFO master.AssignmentManager: Server
>> serverName=XXXXXXXX.ec2.internal,60020,1334064456919, load=(requests=0,
>> regions=0, usedHeap=0, maxHeap=0) returned
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Received close for
>> visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> but we are not serving it for 703fed4411f2d6ff4b3ea80506fb635e
>

This seems like a classic case of the master being out of whack w/ the
cluster: it's trying to rebalance a region that is not where it thinks it is.
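You can double-check what .META. itself says about where that region is
assigned by looking it up in the shell, roughly like the below (the row key
is the full region name; the \x escapes need double quotes so the shell
turns them into bytes):

  ./bin/hbase shell
  hbase> get ".META.", "visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.", "info:server"

If info:server is empty, or points at a server that denies hosting the
region (as in the NotServingRegionException above), the master's in-memory
state is the odd one out.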


> It then keeps saying the same few logs every ~30 mins:
>
> 12/04/15 18:18:18 INFO master.AssignmentManager: Regions in transition
>> timed out:

Yeah, every 30 mins a timeout checker runs.
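The interval and the in-transition timeout are configurable via the
timeoutmonitor settings; names and defaults below are from memory, so check
them against your hbase-default.xml before relying on them:

  <property>
    <name>hbase.master.assignment.timeoutmonitor.period</name>
    <value>10000</value>    <!-- how often the checker wakes up, in ms -->
  </property>
  <property>
    <name>hbase.master.assignment.timeoutmonitor.timeout</name>
    <value>1800000</value>  <!-- how long a region may sit in transition, in ms -->
  </property>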


>>  visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> state=PENDING_CLOSE, ts=1334526491544, server=null
>> 12/04/15 18:18:18 INFO master.AssignmentManager: Region has been
>> PENDING_CLOSE for too long, running forced unassign again on
>> region=visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> 12/04/15 18:18:18 INFO master.AssignmentManager: Server
>> serverName=XXXXXXXXX.ec2.internal,60020,1334064456919, load=(requests=0,
>> regions=0, usedHeap=0, maxHeap=0) returned
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: Received close for
>> visitor-activities-a2,\x00\x02EG120909,1333750824238.703fed4411f2d6ff4b3ea80506fb635e.
>> but we are not serving it for 703fed4411f2d6ff4b3ea80506fb635e
>
>
> Any ideas how I can avoid this, or a better solution than restarting the
> HMaster?
>

Can you grep this region in your master log so we can see its history?
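Something like the below, run against wherever your master log lives (the
path is just an example; adjust for your install):

  grep 703fed4411f2d6ff4b3ea80506fb635e /var/log/hbase/hbase-*-master-*.log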
If it's not deployed anywhere and all your data is online, restarting
the master might be the only thing you can do in 0.90.x-era HBase to
get rid of the above.  You could also try deleting that znode from zk.
Fire up the zk command line by doing ./bin/hbase zkcli, then do 'help';
you should be able to figure it out from there.  If you can't find the
above znode in zk, then for sure it's only in the master's head and a
master restart is the way to go (in later HBase versions, should this
condition arise, there is an API you can poke to clear the above so you
don't have to restart the master).
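For the zkcli route, the session would look roughly like this; /hbase/unassigned
is the default 0.90.x layout for regions-in-transition (it moves around in later
versions), so poke around with ls if yours differs:

  ./bin/hbase zkcli
  ls /hbase/unassigned
  delete /hbase/unassigned/703fed4411f2d6ff4b3ea80506fb635e

The znode is named for the encoded region name (the trailing hash), so the ls
should show 703fed4411f2d6ff4b3ea80506fb635e if the master's claim is backed
by zk at all.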

St.Ack
