Will do that, thanks!

On Sat, Jun 4, 2016 at 6:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> I think this sounds like a bug.
>
> Search in HBase JIRA first. If there is no JIRA with the same symptom,
> consider filing one.
>
> Cheers
>
> On Fri, Jun 3, 2016 at 1:10 AM, Shuai Lin <linshuai2...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> I'm kind of confused: is this normal behaviour or a bug in hbase? To me
>> it looks like a bug; should I file a JIRA?
>>
>> Thanks,
>>
>> Shuai
>>
>> On Fri, May 27, 2016 at 8:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> There were 7 regions the Master tried to close which were opening but
>>> not yet served.
>>>
>>> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>>>
>>> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin <linshuai2...@gmail.com>
>>> wrote:
>>>
>>>> Here is the complete log on node6 between 13:10:47 and 13:11:47:
>>>> http://paste.openstack.org/raw/505826/
>>>>
>>>> The master asked node6 to open several regions. Node6 opened the first
>>>> 4 very fast (within 1 second) and got stuck at the 5th one. But there
>>>> were no errors at that time.
>>>>
>>>> On Wed, May 25, 2016 at 10:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> In AssignmentManager#assign(), you should find:
>>>>>
>>>>>   // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>>>>   // regions will be assigned individually.
>>>>>   long maxWaitTime = System.currentTimeMillis() +
>>>>>     this.server.getConfiguration().
>>>>>       getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>>>>
>>>>> BTW can you see what caused rs-node6 to not respond around 13:11:47?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <linshuai2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Because the "opening regions" rpc call sent by the master to the
>>>>>> region server node6 timed out after 1 minute?
>>>>>>
>>>>>> *RPC call was sent:*
>>>>>>
>>>>>> 2016-04-30 13:10:47,702 INFO
>>>>>> org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>>>>
>>>>>> *After 1 minute:*
>>>>>>
>>>>>> 2016-04-30 13:11:47,780 INFO
>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>>>>>> with rs-node6.example.com,60020,1458723856883 in order to assign
>>>>>> regions, java.io.IOException: Call to
>>>>>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>>>>>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>>>> waitTime=60001, operationTimeout=60000 expired.
>>>>>>
>>>>>> 2016-04-30 13:11:47,783 DEBUG
>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>>>>>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>>>>>
>>>>>> I have checked the hbase source code, but I don't find any specific
>>>>>> timeout setting for the "open region" rpc call that I can use. So I
>>>>>> guess it's using the default "hbase.rpc.timeout", which defaults to
>>>>>> 60 secs. And since there are 20+ regions being assigned to node6 at
>>>>>> almost the same moment, node6 gets overloaded and can't finish opening
>>>>>> all of them within one minute.
>>>>>>
>>>>>> So this looks like an hbase bug to me (regions never get online when
>>>>>> the region server fails to handle the OpenRegionRequest before the rpc
>>>>>> timeout), am I right?
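>>>>>>
>>>>>> To double check that guess, I wrote a tiny probe (my own throwaway
>>>>>> class, not HBase code) that prints the effective client rpc timeout;
>>>>>> as far as I can tell it falls back to the 60000 ms default when
>>>>>> nothing is set in hbase-site.xml:
>>>>>>
>>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>
>>>>>>   public class RpcTimeoutProbe {
>>>>>>     public static void main(String[] args) {
>>>>>>       // Picks up hbase-site.xml from the classpath.
>>>>>>       Configuration conf = HBaseConfiguration.create();
>>>>>>       // Client-side timeout for a single rpc call; the master acts
>>>>>>       // as a client when it sends OpenRegionRequest to a region
>>>>>>       // server.
>>>>>>       System.out.println("hbase.rpc.timeout = "
>>>>>>           + conf.getLong("hbase.rpc.timeout", 60000));
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> The "waitTime=60001, operationTimeout=60000" in the exception above
>>>>>> suggests we are indeed running on that default.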
>>>>>>
>>>>>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE
>>>>>>> request while it was opening, leading to
>>>>>>> RegionAlreadyInTransitionException.
>>>>>>>
>>>>>>> Was there any clue in the master log about why the close request was
>>>>>>> sent?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <linshuai2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > Hi Ted,
>>>>>>> >
>>>>>>> > The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH
>>>>>>> > 5.4.8. The RS logs on node6 can be found here:
>>>>>>> > http://paste.openstack.org/raw/496174/
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> >
>>>>>>> > Shuai
>>>>>>> >
>>>>>>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yuzhih...@gmail.com>
>>>>>>> > wrote:
>>>>>>> >
>>>>>>> > > Can you pastebin the related server log w.r.t.
>>>>>>> > > d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>>>>>> > >
>>>>>>> > > Which release of hbase are you using?
>>>>>>> > >
>>>>>>> > > Cheers
>>>>>>> > >
>>>>>>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin
>>>>>>> > > <linshuai2...@gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > > Hi list,
>>>>>>> > > >
>>>>>>> > > > Last weekend one of my region servers crashed, but some
>>>>>>> > > > regions never got online again on the other RSes. I've gone
>>>>>>> > > > through the logs, and here is a timeline of some of the events:
>>>>>>> > > >
>>>>>>> > > > * 13:03:50 one of the region servers, rs-node7, died because
>>>>>>> > > > of a disk failure. The master started to split rs-node7's WALs:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:03:50,953 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>>> > > > Splitting logs for rs-node7.example.com,60020,1458724695128
>>>>>>> > > > before assignment; region count=133
>>>>>>> > > > 2016-04-30 13:03:50,966 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling
>>>>>>> > > > batch of logs to split
>>>>>>> > > > 2016-04-30 13:03:50,966 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started
>>>>>>> > > > splitting 33 logs in
>>>>>>> > > > [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>>>>>>> > > > ,60020,1458724695128-splitting]
>>>>>>> > > > for [rs-node7.example.com,60020,1458724695128]
>>>>>>> > > >
>>>>>>> > > > * 13:10:47 WAL splitting was done, and the master began to
>>>>>>> > > > re-assign the regions:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:10:47,655 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>>> > > > Reassigning 133 region(s) that
>>>>>>> > > > rs-node7.example.com,60020,1458724695128 was carrying (and 0
>>>>>>> > > > regions(s) that were opening on this server)
>>>>>>> > > > 2016-04-30 13:10:47,665 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk
>>>>>>> > > > assigning 133 region(s) across 6 server(s), round-robin=true
>>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node1.example.com,60020,1458720625688
>>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node2.example.com,60020,1458721110988
>>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node3.example.com,60020,1458721713906
>>>>>>> > > > 2016-04-30 13:10:47,679 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
>>>>>>> > > > region(s) to rs-node4.example.com,60020,1458722335527
>>>>>>> > > > 2016-04-30 13:10:47,691 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node5.example.com,60020,1458722992296
>>>>>>> > > > 2016-04-30 13:10:47,702 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node6.example.com,60020,1458723856883
>>>>>>> > > >
>>>>>>> > > > * 13:11:47 the "open regions" rpc calls sent by the master to
>>>>>>> > > > the region servers timed out after 1 minute:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:11:47,780 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Unable to
>>>>>>> > > > communicate with rs-node3.example.com,60020,1458721713906 in
>>>>>>> > > > order to assign regions
>>>>>>> > > > java.io.IOException: Call to
>>>>>>> > > > rs-node3.example.com/172.16.6.3:60020 failed on local exception:
>>>>>>> > > > org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>>>>> > > > waitTime=60001, operationTimeout=60000 expired.
>>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>>> > > > assigning 22 regions to server
>>>>>>> > > > rs-node6.example.com,60020,1458723856883, reassigning them
>>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>>> > > > assigning 23 regions to server
>>>>>>> > > > rs-node4.example.com,60020,1458722335527, reassigning them
>>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>>> > > > assigning 22 regions to server
>>>>>>> > > > rs-node3.example.com,60020,1458721713906, reassigning them
>>>>>>> > > > 2016-04-30 13:11:47,783 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Force region
>>>>>>> > > > state offline {a65660e421f114e93862194f7cc35644 state=OPENING,
>>>>>>> > > > ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>>>>>>> > > >
>>>>>>> > > > * After that, part of the regions (40 out of 130 regions)
>>>>>>> > > > never got online, and the following lines were logged
>>>>>>> > > > repeatedly in the master log:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: update
>>>>>>> > > > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>>> > > > ts=1462021957087,
>>>>>>> > > > server=rs-node6.example.com,60020,1458723856883} the timestamp.
>>>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region is
>>>>>>> > > > already in transition; waiting up to 10668ms
>>>>>>> > > >
>>>>>>> > > > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
>>>>>>> > > >   /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>>>>>> > > > 484
>>>>>>> > > >
>>>>>>> > > > I've searched the mailing list archive and hbase JIRA, but
>>>>>>> > > > didn't find any similar situation to this one. The closest is
>>>>>>> > > > HBASE-14407
>>>>>>> > > > <https://issues.apache.org/jira/browse/HBASE-14407>, but after
>>>>>>> > > > reading its discussion I don't think it's the same problem.
>>>>>>> > > >
>>>>>>> > > > Anyone have a clue? Thanks!
>>>>>>> > > >
>>>>>>> > > > Regards,
>>>>>>> > > > Shuai
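
P.S. For anyone who lands on this thread later: a quick way to confirm
that regions are permanently stuck in transition (rather than just slow to
open) is to dump the master's regions-in-transition map. The sketch below
is my own throwaway program (the class name is mine) against the 1.0.x
client API; the same information should also be available from the master
web UI and from "status 'detailed'" in the hbase shell.

  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.master.RegionState;

  public class ListRegionsInTransition {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        // The master tracks every region in transition as encoded region
        // name -> RegionState (PENDING_OPEN, OPENING, ...).
        Map<String, RegionState> rit =
            admin.getClusterStatus().getRegionsInTransition();
        System.out.println(rit.size() + " region(s) in transition");
        for (Map.Entry<String, RegionState> e : rit.entrySet()) {
          System.out.println(e.getKey() + " -> " + e.getValue());
        }
      }
    }
  }

If the count stays flat and the same encoded names keep showing up run
after run, the regions are genuinely stuck, matching the repeated
"Region is already in transition" lines in the master log above.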