Hi Ted, I'm kind of confused: is this normal behaviour or a bug in HBase? To me it looks like a bug; should I file a JIRA?
Thanks,
Shuai

On Fri, May 27, 2016 at 8:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> There were 7 regions Master tried to close which were opening but not yet
> served.
>
> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>
> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin <linshuai2...@gmail.com>
> wrote:
>
>> Here is the complete log on node6 between 13:10:47 and 13:11:47:
>> http://paste.openstack.org/raw/505826/
>>
>> The master asked node6 to open several regions. Node6 opened the first 4
>> very fast (within 1 second) and got stuck at the 5th one. But there were
>> no errors at that time.
>>
>> On Wed, May 25, 2016 at 10:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> In AssignmentManager#assign(), you should find:
>>>
>>> // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>> // regions will be assigned individually.
>>> long maxWaitTime = System.currentTimeMillis() +
>>>     this.server.getConfiguration().
>>>         getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>>
>>> BTW can you see what caused rs-node6 to not respond around 13:11:47 ?
>>>
>>> Cheers
>>>
>>> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <linshuai2...@gmail.com>
>>> wrote:
>>>
>>>> Because the "opening regions" RPC call sent by the master to the
>>>> region server node6 timed out after 1 minute?
>>>>
>>>> *RPC call was sent:*
>>>>
>>>> 2016-04-30 13:10:47,702 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager:
>>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>>
>>>> *After 1 minute:*
>>>>
>>>> 2016-04-30 13:11:47,780 INFO
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>>>> with rs-node6.example.com,60020,1458723856883 in order to assign
>>>> regions, java.io.IOException: Call to
>>>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>>>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>> waitTime=60001, operationTimeout=60000 expired.
>>>>
>>>> 2016-04-30 13:11:47,783 DEBUG
>>>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>>>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>>>
>>>> I have checked the HBase source code, but I don't find any specific
>>>> timeout setting for the "open region" RPC call that I could use. So I
>>>> guess it's using the default "hbase.rpc.timeout", which defaults to 60
>>>> secs. And since there are 20+ regions being assigned to node6 at almost
>>>> the same moment, node6 gets overloaded and can't finish opening all of
>>>> them within one minute.
>>>>
>>>> So this looks like an HBase bug to me (regions never get online when
>>>> the region server fails to handle the OpenRegionRequest before the RPC
>>>> timeout), am I right?
>>>>
>>>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE
>>>>> request when it was opening, leading to
>>>>> RegionAlreadyInTransitionException.
>>>>>
>>>>> Was there any clue in master log why the close request was sent ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <linshuai2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Hi Ted,
>>>>> >
>>>>> > The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH
>>>>> > 5.4.8. The RS logs on node6 can be found here:
>>>>> > http://paste.openstack.org/raw/496174/
>>>>> >
>>>>> > Thanks!
>>>>> >
>>>>> > Shuai
>>>>> >
>>>>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>> >
>>>>> > > Can you pastebin related server log w.r.t.
>>>>> > > d1c7f3f455f2529da82a2f713b5ee067 from rs-node6 ?
>>>>> > >
>>>>> > > Which release of hbase are you using ?
>>>>> > >
>>>>> > > Cheers
>>>>> > >
>>>>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <linshuai2...@gmail.com>
>>>>> > > wrote:
>>>>> > >
>>>>> > > > Hi list,
>>>>> > > >
>>>>> > > > Last weekend a region server crashed, but some regions never got
>>>>> > > > online again on other RSes. I've gone through the logs, and here
>>>>> > > > is the timeline of some of the events:
>>>>> > > >
>>>>> > > > * 13:03:50 one of the region servers, rs-node7, died because of
>>>>> > > > a disk failure. Master started to split rs-node7's WALs
>>>>> > > >
>>>>> > > > 2016-04-30 13:03:50,953 INFO
>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>> > > > Splitting logs for rs-node7.example.com,60020,1458724695128
>>>>> > > > before assignment; region count=133
>>>>> > > > 2016-04-30 13:03:50,966 DEBUG
>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch
>>>>> > > > of logs to split
>>>>> > > > 2016-04-30 13:03:50,966 INFO
>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started
>>>>> > > > splitting 33 logs in
>>>>> > > > [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>>>>> > > > ,60020,1458724695128-splitting]
>>>>> > > > for [rs-node7.example.com,60020,1458724695128]
>>>>> > > >
>>>>> > > > * 13:10:47 WAL splits done, master began to re-assign regions
>>>>> > > >
>>>>> > > > 2016-04-30 13:10:47,655 INFO
>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>> > > > Reassigning 133 region(s) that
>>>>> > > > rs-node7.example.com,60020,1458724695128 was carrying
>>>>> > > > (and 0 regions(s) that were opening on this server)
>>>>> > > > 2016-04-30 13:10:47,665 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning
>>>>> > > > 133 region(s) across 6 server(s), round-robin=true
>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>> > > > region(s) to rs-node1.example.com,60020,1458720625688
>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>> > > > region(s) to rs-node2.example.com,60020,1458721110988
>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>> > > > region(s) to rs-node3.example.com,60020,1458721713906
>>>>> > > > 2016-04-30 13:10:47,679 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
>>>>> > > > region(s) to rs-node4.example.com,60020,1458722335527
>>>>> > > > 2016-04-30 13:10:47,691 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>> > > > region(s) to rs-node5.example.com,60020,1458722992296
>>>>> > > > 2016-04-30 13:10:47,702 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>> > > > region(s) to rs-node6.example.com,60020,1458723856883
>>>>> > > >
>>>>> > > > * 13:11:47 the "open regions" RPCs sent by the master to the
>>>>> > > > region servers timed out after 1 minute
>>>>> > > >
>>>>> > > > 2016-04-30 13:11:47,780 INFO
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Unable to
>>>>> > > > communicate with rs-node3.example.com,60020,1458721713906 in
>>>>> > > > order to assign regions
>>>>> > > > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
>>>>> > > > failed on local exception:
>>>>> > > > org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
>>>>> > > > id=4, waitTime=60001, operationTimeout=60000 expired.
>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>> > > > assigning 22 regions to server
>>>>> > > > rs-node6.example.com,60020,1458723856883, reassigning them
>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>> > > > assigning 23 regions to server
>>>>> > > > rs-node4.example.com,60020,1458722335527, reassigning them
>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>> > > > assigning 22 regions to server
>>>>> > > > rs-node3.example.com,60020,1458721713906, reassigning them
>>>>> > > > 2016-04-30 13:11:47,783 DEBUG
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Force region
>>>>> > > > state offline {a65660e421f114e93862194f7cc35644 state=OPENING,
>>>>> > > > ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>>>>> > > >
>>>>> > > > * After that, part of the regions (40 out of 130 regions) never
>>>>> > > > got online, and the following lines were logged repeatedly in
>>>>> > > > the master log:
>>>>> > > >
>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: update
>>>>> > > > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>> > > > ts=1462021957087,
>>>>> > > > server=rs-node6.example.com,60020,1458723856883} the timestamp.
>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region is
>>>>> > > > already in transition; waiting up to 10668ms
>>>>> > > >
>>>>> > > > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
>>>>> > > >     /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>>>> > > > 484
>>>>> > > >
>>>>> > > > I've searched the mailing list archive and HBase JIRA, but
>>>>> > > > didn't find any similar situation. The most similar one is
>>>>> > > > HBASE-14407
>>>>> > > > <https://issues.apache.org/jira/browse/HBASE-14407>, but after
>>>>> > > > reading its discussion I don't think that's the same problem.
>>>>> > > >
>>>>> > > > Anyone have a clue? Thanks!
>>>>> > > >
>>>>> > > > Regards,
>>>>> > > > Shuai
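For anyone finding this thread later: the two configuration keys discussed above ("hbase.regionserver.rpc.startup.waittime" and "hbase.rpc.timeout") are ordinary hbase-site.xml properties. A minimal sketch of raising them follows; the values are purely illustrative, and whether raising timeouts merely works around the suspected bug rather than fixing it is exactly what this thread is debating:

```xml
<!-- hbase-site.xml (illustrative values, not recommendations) -->
<property>
  <!-- How long the master waits on the bulk-assign OPEN RPC before
       falling back to assigning regions individually (quoted above
       from AssignmentManager#assign); default 60000 ms. -->
  <name>hbase.regionserver.rpc.startup.waittime</name>
  <value>180000</value>
</property>
<property>
  <!-- General client/master RPC timeout; the OpenRegionRequest in the
       logs above expired at this default (operationTimeout=60000). -->
  <name>hbase.rpc.timeout</name>
  <value>180000</value>
</property>
```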