[jira] [Comment Edited] (HBASE-20671) Merged region brought back to life causing RS to be killed by Master

Tak Lon (Stephen) Wu (JIRA) Sat, 18 Aug 2018 11:54:08 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584709#comment-16584709
 ]


Tak Lon (Stephen) Wu edited comment on HBASE-20671 at 8/18/18 6:53 PM:
-----------------------------------------------------------------------

Hi all, I am not 100% sure yet, but I recently set {{hbase.readonly}} to true 
on hbase-2.1.0 for a read replica cluster and found that {{hbase:namespace}} 
cannot be assigned during read replica cluster startup: startup loops forever 
because {{isTableAssigned}} keeps returning false for the {{hbase:namespace}} 
table.

I found that the patch for HBASE-20702 skips `empty` rows, but it seems that 
rows for a system table and a data table, e.g. {{hbase:namespace}} and 
{{hbase-test}} (a data table), should not be considered `empty` rows. I made 
the band-aid change below and the cluster was able to start again. (Update: 
the data table {{hbase-test}} also cannot be moved to offline and cannot be 
assigned during startup.)

Captured `empty row` messages (are these rows really empty?):
{noformat}
2018-08-18 05:10:44,735 INFO  [Thread-15] assignment.RegionStateStore: Load 
hbase:meta entry region=75f5bf7e777efcb255003a25f558d7c6, regionState=null, 
lastHost=null, regionLocation=null, openSeqNum=-1
2018-08-18 05:10:44,735 WARN  [Thread-15] assignment.AssignmentManager: 
Skipping empty 
row=keyvalues={hbase-test,,1534568583995.75f5bf7e777efcb255003a25f558d7c6./info:regioninfo/1534569044664/Put/vlen=56/seqid=0}
2018-08-18 05:10:44,736 INFO  [Thread-15] assignment.RegionStateStore: Load 
hbase:meta entry region=2148e3cbfc06d918ebeeb5fdcdbea246, regionState=null, 
lastHost=null, regionLocation=null, openSeqNum=-1
2018-08-18 05:10:44,736 WARN  [Thread-15] assignment.AssignmentManager: 
Skipping empty 
row=keyvalues={hbase:namespace,,1534568481536.2148e3cbfc06d918ebeeb5fdcdbea246./info:regioninfo/1534569044667/Put/vlen=41/seqid=0}
{noformat}

My band-aid, which did not work for the data table during startup:
{noformat}
  private void loadMeta() throws IOException {
    // TODO: use a thread pool
    regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
      @Override
      public void visitRegionState(Result result, final RegionInfo regionInfo, final State state,
          final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
        // <-- added to unblock the read replica cluster
        if (!regionInfo.getTable().equals(TableName.NAMESPACE_TABLE_NAME)) {
          if (state == null && regionLocation == null && lastHost == null
              && openSeqNum == SequenceId.NO_SEQUENCE_ID) {
            // This is a row with nothing in it.
            LOG.warn("Skipping empty row={}", result);
            return;
          }
        }
{noformat}
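
To make the effect of the check easier to see, here is a minimal standalone sketch that models it outside of HBase. It is not the AssignmentManager code: {{MetaRow}}, {{looksEmpty}} and {{skippedWithBandAid}} are hypothetical names made up for illustration, and {{NO_SEQUENCE_ID = -1}} is an assumption taken from the {{openSeqNum=-1}} values in the log above. The two rows use the region names from that log.

```java
public class EmptyRowCheckSketch {
    // Assumed stand-in for SequenceId.NO_SEQUENCE_ID (the log above shows openSeqNum=-1).
    static final long NO_SEQUENCE_ID = -1L;

    /** Hypothetical, simplified view of the values handed to visitRegionState. */
    static final class MetaRow {
        final String table;
        final String encodedRegionName;
        final String state;          // null when no region state was loaded
        final String regionLocation; // null when no location was loaded
        final String lastHost;       // null when no last host was loaded
        final long openSeqNum;

        MetaRow(String table, String encodedRegionName, String state,
                String regionLocation, String lastHost, long openSeqNum) {
            this.table = table;
            this.encodedRegionName = encodedRegionName;
            this.state = state;
            this.regionLocation = regionLocation;
            this.lastHost = lastHost;
            this.openSeqNum = openSeqNum;
        }
    }

    /** Same condition as the quoted check: no state, no location, no last host, no seq id. */
    static boolean looksEmpty(MetaRow row) {
        return row.state == null
            && row.regionLocation == null
            && row.lastHost == null
            && row.openSeqNum == NO_SEQUENCE_ID;
    }

    /** Band-aid variant: rows of hbase:namespace are never skipped. */
    static boolean skippedWithBandAid(MetaRow row) {
        return !"hbase:namespace".equals(row.table) && looksEmpty(row);
    }

    public static void main(String[] args) {
        MetaRow namespace = new MetaRow("hbase:namespace",
            "2148e3cbfc06d918ebeeb5fdcdbea246", null, null, null, NO_SEQUENCE_ID);
        MetaRow data = new MetaRow("hbase-test",
            "75f5bf7e777efcb255003a25f558d7c6", null, null, null, NO_SEQUENCE_ID);

        for (MetaRow row : new MetaRow[] {namespace, data}) {
            System.out.println(row.table + " region=" + row.encodedRegionName
                + " skippedByOriginalCheck=" + looksEmpty(row)
                + " skippedWithBandAid=" + skippedWithBandAid(row));
        }
    }
}
```

Under these assumptions, both rows satisfy the original condition even though each still carries an {{info:regioninfo}} cell, and the band-aid only rescues the {{hbase:namespace}} row. The {{hbase-test}} row is still skipped, which is consistent with the data table not being assigned during startup.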

I will file another JIRA to fix this once I have narrowed it down, but I 
wanted to give you a heads-up that there is an issue after this patch.


was (Author: taklwu):
Hi all, I am not 100% sure yet, but I recently set {{hbase.readonly}} to true 
on hbase-2.1.0 for a read replica cluster and found that {{hbase:namespace}} 
cannot be assigned during read replica cluster startup: startup loops forever 
because {{isTableAssigned}} keeps returning false for the {{hbase:namespace}} 
table.

I found that the patch for HBASE-20702 skips `empty` rows, but it seems that 
rows for system tables, e.g. {{hbase:namespace}}, should not be considered 
empty. I made the band-aid change below and the cluster was able to start 
again.
{noformat}
  private void loadMeta() throws IOException {
    // TODO: use a thread pool
    regionStateStore.visitMeta(new RegionStateStore.RegionStateVisitor() {
      @Override
      public void visitRegionState(Result result, final RegionInfo regionInfo, final State state,
          final ServerName regionLocation, final ServerName lastHost, final long openSeqNum) {
        // <-- added to unblock the read replica cluster
        if (!regionInfo.getTable().equals(TableName.NAMESPACE_TABLE_NAME)) {
          if (state == null && regionLocation == null && lastHost == null
              && openSeqNum == SequenceId.NO_SEQUENCE_ID) {
            // This is a row with nothing in it.
            LOG.warn("Skipping empty row={}", result);
            return;
          }
        }
{noformat}

I will file another JIRA to fix this once I have narrowed it down, but I 
wanted to give you a heads-up that there is an issue after this patch.

> Merged region brought back to life causing RS to be killed by Master
> --------------------------------------------------------------------
>
>                 Key: HBASE-20671
>                 URL: https://issues.apache.org/jira/browse/HBASE-20671
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.0.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>         Attachments: 0001-Test-for-HBASE-20671.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-000003.hwx.site.log.zip, 
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-000002.hwx.site.log.zip,
>  workaround.txt
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then 
> ended up assigning the children region back out to the cluster. There is a 
> log message which appears to indicate that RegionStates acknowledges that it 
> doesn't know what this region is as it's replaying the pv2 WAL; however, it 
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=20000] master.HMaster: 
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c 
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG 
> [master/ctr-e138-1518143905142-336066-01-000003:20000] 
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS; 
> MergeTableRegionsProcedure table=tabletwo_merge, 
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b], 
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-000003:20000] 
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO  
> [master/ctr-e138-1518143905142-336066-01-000003:20000] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-000003:20000] 
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b 
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO  
> [master/ctr-e138-1518143905142-336066-01-000003:20000] 
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!! 
> rit=OFFLINE, location=null, table=tabletwo_merge, 
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to 
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=20000] 
> assignment.AssignmentManager: Killing 
> ctr-e138-1518143905142-336066-01-000002.hwx.site,16020,1527654546619: Not 
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
