Re: Some regions never get online after a region server crashes

Shuai Lin Fri, 20 May 2016 06:21:02 -0700

Because of the "opening regions" rpc call sent by master to the region
server node6 got timed out after 1 minutes?


*RPC call was sent:*

2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Assigning 22 region(s) tors-node6.example.com,60020,1458723856883

*After 1 minute:*

2016-04-30 13:11:47,780 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
with rs-node6.example.com,60020,1458723856883 in order to assign regions,
java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed
on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=4, waitTime=60001, operationTimeout=60000 expired.

2016-04-30 13:11:47,783 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Force region state
offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}


I have checked hbase source code, but don't find any specific timeout
settings for "open region" rpc call I can use. So I guess the it's using
the default "hbase.rpc.timeout", which defaults to 60secs. And since there
are 20+ regions being assigned to node6 almost at the same moment, node6
gets overloaded and can't finish opening all of them within one minute.

So this looks like a hbase bug to me (regions never get online when the
region server failed to handle the OpenRegionRequest before the rpc
timeout), am I right?


On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
> when it was opening, leading to RegionAlreadyInTransitionException.
>
> Was there any clue in master log why the close request was sent ?
>
> Cheers
>
> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8. The
> > RS logs on node6 can be found here <
> http://paste.openstack.org/raw/496174/
> > >
> >  .
> >
> > Thanks!
> >
> > Shuai
> >
> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Can you pastebin related server log w.r.t.
> > d1c7f3f455f2529da82a2f713b5ee067
> > > from rs-node6 ?
> > >
> > > Which release of hbase are you using ?
> > >
> > > Cheers
> > >
> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <linshuai2...@gmail.com>
> > wrote:
> > >
> > > > Hi list,
> > > >
> > > > Last weekend I got a region server crashed, but some regions never
> got
> > > > online again on other RSes. I've gone through the logs, and here is
> the
> > > > timeline about some of the events:
> > > >
> > > > * 13:03:50 on of the region server, rs-node7, died because of a disk
> > > > failure. Master started to split rs-node7's WALs
> > > >
> > > >
> > > > 2016-04-30 13:03:50,953 INFO
> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> Splitting
> > > > logs for rs-node7.example.com,60020,1458724695128 before assignment;
> > > > region
> > > > count=133
> > > > 2016-04-30 13:03:50,966 DEBUG
> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
> > logs
> > > to
> > > > split
> > > > 2016-04-30 13:03:50,966 INFO
> > > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33
> > logs
> > > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
> > > > ,60020,1458724695128-splitting]
> > > > for [rs-node7.example.com,60020,1458724695128]
> > > >
> > > > * 13:10:47 WAL splits done, master began to re-assign regions
> > > >
> > > > 2016-04-30 13:10:47,655 INFO
> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> > Reassigning
> > > > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
> > carrying
> > > > (and 0 regions(s) that were opening on this server)
> > > > 2016-04-30 13:10:47,665 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
> > > > region(s) across 6 server(s), round-robin=true
> > > > 2016-04-30 13:10:47,667 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node1.example.com,60020,1458720625688
> > > > 2016-04-30 13:10:47,667 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node2.example.com,60020,1458721110988
> > > > 2016-04-30 13:10:47,667 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node3.example.com,60020,1458721713906
> > > > 2016-04-30 13:10:47,679 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
> > region(s)
> > > to
> > > > rs-node4.example.com,60020,1458722335527
> > > > 2016-04-30 13:10:47,691 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node5.example.com,60020,1458722992296
> > > > 2016-04-30 13:10:47,702 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node6.example.com,60020,1458723856883
> > > >
> > > > * 13:11:47 the opening regions rpc sent by master to region servers
> > timed
> > > > out after 1 minutes
> > > >
> > > > 2016-04-30 13:11:47,780 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Unable to
> communicate
> > > > with rs-node3.example.com,60020,1458721713906 in order to assign
> > regions
> > > > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
> > > failed
> > > > on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
> > > Call
> > > > id=4, waitTime=60001, operationTimeout=60000 expired.
> > > > 2016-04-30 13:11:47,782 INFO
> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning
> 22
> > > > regions to server rs-node6.example.com,60020,1458723856883,
> > reassigning
> > > > them
> > > > 2016-04-30 13:11:47,782 INFO
> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning
> 23
> > > > regions to server rs-node4.example.com,60020,1458722335527,
> > reassigning
> > > > them
> > > > 2016-04-30 13:11:47,782 INFO
> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning
> 22
> > > > regions to server rs-node3.example.com,60020,1458721713906,
> > reassigning
> > > > them
> > > > 2016-04-30 13:11:47,783 DEBUG
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> > > > offline {a65660e421f114e93862194f7cc35644 state=OPENING,
> > > ts=1462021907753,
> > > > server=rs-node6.example.com,60020,1458723856883}
> > > >
> > > >
> > > > * After that, part of the regions (40 out of 130 regions) never got
> > > online,
> > > > and the following lines were logged repeatly in master log:
> > > >
> > > > 2016-04-30 13:12:37,188 DEBUG
> > > > org.apache.hadoop.hbase.master.AssignmentManager: update
> > > > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
> ts=1462021957087,
> > > > server=rs-node6.example.com,60020,1458723856883} the timestamp.
> > > > 2016-04-30 13:12:37,188 DEBUG
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region is already
> in
> > > > transition; waiting up to 10668ms
> > > >
> > > > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067'
> > > > /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1|wc
> -l
> > > > 484
> > > >
> > > >
> > > > I've searched in mailing list archive and hbase JIRA, but didn't find
> > any
> > > > similar situations like this one. The most similar one is HBASE-14407
> > > > <https://issues.apache.org/jira/browse/HBASE-14407> , but after
> > reading
> > > > its
> > > > discussion I don't think that's the same problem.
> > > >
> > > > Anyone have a clue? Thanks!
> > > >
> > > > Regards,
> > > > Shuai
> > > >
> > >
> >
>

Re: Some regions never get online after a region server crashes

Reply via email to