Will do that, thanks!

On Sat, Jun 4, 2016 at 6:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> I think this sounds like a bug.
>
> Search in HBase JIRA first. If there is no JIRA with the same symptom,
> consider filing one.
>
> Cheers
>
> On Fri, Jun 3, 2016 at 1:10 AM, Shuai Lin <linshuai2...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> I'm kind of confused: is this normal behaviour or a bug in hbase? To me
>> it looks like a bug; should I file a JIRA?
>>
>> Thanks,
>>
>> Shuai
>>
>> On Fri, May 27, 2016 at 8:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> There were 7 regions the Master tried to close which were opening but
>>> not yet served.
>>>
>>> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>>>
>>> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin <linshuai2...@gmail.com>
>>> wrote:
>>>
>>>> Here is the complete log on node6 between 13:10:47 and 13:11:47:
>>>> http://paste.openstack.org/raw/505826/
>>>>
>>>> The master asked node6 to open several regions. Node6 opened the first
>>>> 4 very fast (within 1 second) and got stuck at the 5th one. But there
>>>> were no errors at that time.
>>>>
>>>> On Wed, May 25, 2016 at 10:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> In AssignmentManager#assign(), you should find:
>>>>>
>>>>>   // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>>>>   // regions will be assigned individually.
>>>>>   long maxWaitTime = System.currentTimeMillis() +
>>>>>     this.server.getConfiguration().
>>>>>       getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>>>>
>>>>> BTW can you see what caused rs-node6 to not respond around 13:11:47?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <linshuai2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Because the "opening regions" rpc call sent by the master to the
>>>>>> region server node6 timed out after 1 minute?
>>>>>>
>>>>>> *RPC call was sent:*
>>>>>>
>>>>>> 2016-04-30 13:10:47,702 INFO
>>>>>> org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>>>>
>>>>>> *After 1 minute:*
>>>>>>
>>>>>> 2016-04-30 13:11:47,780 INFO
>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>>>>>> with rs-node6.example.com,60020,1458723856883 in order to assign
>>>>>> regions, java.io.IOException: Call to
>>>>>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>>>>>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>>>> waitTime=60001, operationTimeout=60000 expired.
>>>>>>
>>>>>> 2016-04-30 13:11:47,783 DEBUG
>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>>>>>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>>>>>
>>>>>> I have checked the hbase source code, but I don't find any specific
>>>>>> timeout setting for the "open region" rpc call that I can use. So I
>>>>>> guess it's using the default "hbase.rpc.timeout", which defaults to
>>>>>> 60 secs. And since there are 20+ regions being assigned to node6 at
>>>>>> almost the same moment, node6 gets overloaded and can't finish opening
>>>>>> all of them within one minute.
>>>>>>
>>>>>> So this looks like an hbase bug to me (regions never get online when
>>>>>> the region server fails to handle the OpenRegionRequest before the rpc
>>>>>> timeout), am I right?
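>>>>>>
>>>>>> To double check that guess, I wrote a tiny probe (my own throwaway
>>>>>> class, not HBase code) that prints the effective client rpc timeout;
>>>>>> as far as I can tell it falls back to the 60000 ms default when
>>>>>> nothing is set in hbase-site.xml:
>>>>>>
>>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>
>>>>>>   public class RpcTimeoutProbe {
>>>>>>     public static void main(String[] args) {
>>>>>>       // Picks up hbase-site.xml from the classpath.
>>>>>>       Configuration conf = HBaseConfiguration.create();
>>>>>>       // Client-side timeout for a single rpc call; the master acts
>>>>>>       // as a client when it sends OpenRegionRequest to a region
>>>>>>       // server.
>>>>>>       System.out.println("hbase.rpc.timeout = "
>>>>>>           + conf.getLong("hbase.rpc.timeout", 60000));
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>> The "waitTime=60001, operationTimeout=60000" in the exception above
>>>>>> suggests we are indeed running on that default.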
>>>>>>
>>>>>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE
>>>>>>> request while it was opening, leading to
>>>>>>> RegionAlreadyInTransitionException.
>>>>>>>
>>>>>>> Was there any clue in the master log about why the close request was
>>>>>>> sent?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <linshuai2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > Hi Ted,
>>>>>>> >
>>>>>>> > The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH
>>>>>>> > 5.4.8. The RS logs on node6 can be found here:
>>>>>>> > http://paste.openstack.org/raw/496174/
>>>>>>> >
>>>>>>> > Thanks!
>>>>>>> >
>>>>>>> > Shuai
>>>>>>> >
>>>>>>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yuzhih...@gmail.com>
>>>>>>> > wrote:
>>>>>>> >
>>>>>>> > > Can you pastebin the related server log w.r.t.
>>>>>>> > > d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>>>>>> > >
>>>>>>> > > Which release of hbase are you using?
>>>>>>> > >
>>>>>>> > > Cheers
>>>>>>> > >
>>>>>>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin
>>>>>>> > > <linshuai2...@gmail.com> wrote:
>>>>>>> > >
>>>>>>> > > > Hi list,
>>>>>>> > > >
>>>>>>> > > > Last weekend one of my region servers crashed, but some
>>>>>>> > > > regions never got online again on the other RSes. I've gone
>>>>>>> > > > through the logs, and here is a timeline of some of the events:
>>>>>>> > > >
>>>>>>> > > > * 13:03:50 one of the region servers, rs-node7, died because
>>>>>>> > > > of a disk failure. The master started to split rs-node7's WALs:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:03:50,953 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>>> > > > Splitting logs for rs-node7.example.com,60020,1458724695128
>>>>>>> > > > before assignment; region count=133
>>>>>>> > > > 2016-04-30 13:03:50,966 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling
>>>>>>> > > > batch of logs to split
>>>>>>> > > > 2016-04-30 13:03:50,966 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started
>>>>>>> > > > splitting 33 logs in
>>>>>>> > > > [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>>>>>>> > > > ,60020,1458724695128-splitting]
>>>>>>> > > > for [rs-node7.example.com,60020,1458724695128]
>>>>>>> > > >
>>>>>>> > > > * 13:10:47 WAL splitting was done, and the master began to
>>>>>>> > > > re-assign the regions:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:10:47,655 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>>> > > > Reassigning 133 region(s) that
>>>>>>> > > > rs-node7.example.com,60020,1458724695128 was carrying (and 0
>>>>>>> > > > regions(s) that were opening on this server)
>>>>>>> > > > 2016-04-30 13:10:47,665 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk
>>>>>>> > > > assigning 133 region(s) across 6 server(s), round-robin=true
>>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node1.example.com,60020,1458720625688
>>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node2.example.com,60020,1458721110988
>>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node3.example.com,60020,1458721713906
>>>>>>> > > > 2016-04-30 13:10:47,679 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
>>>>>>> > > > region(s) to rs-node4.example.com,60020,1458722335527
>>>>>>> > > > 2016-04-30 13:10:47,691 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node5.example.com,60020,1458722992296
>>>>>>> > > > 2016-04-30 13:10:47,702 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>>> > > > region(s) to rs-node6.example.com,60020,1458723856883
>>>>>>> > > >
>>>>>>> > > > * 13:11:47 the "open regions" rpc calls sent by the master to
>>>>>>> > > > the region servers timed out after 1 minute:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:11:47,780 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Unable to
>>>>>>> > > > communicate with rs-node3.example.com,60020,1458721713906 in
>>>>>>> > > > order to assign regions
>>>>>>> > > > java.io.IOException: Call to
>>>>>>> > > > rs-node3.example.com/172.16.6.3:60020 failed on local exception:
>>>>>>> > > > org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>>>>> > > > waitTime=60001, operationTimeout=60000 expired.
>>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>>> > > > assigning 22 regions to server
>>>>>>> > > > rs-node6.example.com,60020,1458723856883, reassigning them
>>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>>> > > > assigning 23 regions to server
>>>>>>> > > > rs-node4.example.com,60020,1458722335527, reassigning them
>>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>>> > > > assigning 22 regions to server
>>>>>>> > > > rs-node3.example.com,60020,1458721713906, reassigning them
>>>>>>> > > > 2016-04-30 13:11:47,783 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Force region
>>>>>>> > > > state offline {a65660e421f114e93862194f7cc35644 state=OPENING,
>>>>>>> > > > ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>>>>>>> > > >
>>>>>>> > > > * After that, part of the regions (40 out of 130 regions)
>>>>>>> > > > never got online, and the following lines were logged
>>>>>>> > > > repeatedly in the master log:
>>>>>>> > > >
>>>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: update
>>>>>>> > > > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>>> > > > ts=1462021957087,
>>>>>>> > > > server=rs-node6.example.com,60020,1458723856883} the timestamp.
>>>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region is
>>>>>>> > > > already in transition; waiting up to 10668ms
>>>>>>> > > >
>>>>>>> > > > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
>>>>>>> > > >   /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>>>>>> > > > 484
>>>>>>> > > >
>>>>>>> > > > I've searched the mailing list archive and hbase JIRA, but
>>>>>>> > > > didn't find any similar situation to this one. The closest is
>>>>>>> > > > HBASE-14407
>>>>>>> > > > <https://issues.apache.org/jira/browse/HBASE-14407>, but after
>>>>>>> > > > reading its discussion I don't think it's the same problem.
>>>>>>> > > >
>>>>>>> > > > Anyone have a clue? Thanks!
>>>>>>> > > >
>>>>>>> > > > Regards,
>>>>>>> > > > Shuai
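
P.S. For anyone who lands on this thread later: a quick way to confirm
that regions are permanently stuck in transition (rather than just slow to
open) is to dump the master's regions-in-transition map. The sketch below
is my own throwaway program (the class name is mine) against the 1.0.x
client API; the same information should also be available from the master
web UI and from "status 'detailed'" in the hbase shell.

  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.master.RegionState;

  public class ListRegionsInTransition {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        // The master tracks every region in transition as encoded region
        // name -> RegionState (PENDING_OPEN, OPENING, ...).
        Map<String, RegionState> rit =
            admin.getClusterStatus().getRegionsInTransition();
        System.out.println(rit.size() + " region(s) in transition");
        for (Map.Entry<String, RegionState> e : rit.entrySet()) {
          System.out.println(e.getKey() + " -> " + e.getValue());
        }
      }
    }
  }

If the count stays flat and the same encoded names keep showing up run
after run, the regions are genuinely stuck, matching the repeated
"Region is already in transition" lines in the master log above.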