[jira] [Commented] (HBASE-20978) [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure

stack (JIRA) Mon, 30 Jul 2018 15:45:10 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562627#comment-16562627
 ]


stack commented on HBASE-20978:
-------------------------------

bq. So the problem here is that, there are some holes in the wal logging, which 
means that we have finished the sub procedure but haven't woken up the parent 
procedure?

I think there is a 'hole' but its more like if we finish the subprocedure and 
crash, its possible we don't wake the parent for some reason. Maybe its an 
issue on load of the master WAL procedures? I was going to try and repro in a 
test..

> [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure
> ----------------------------------------------------------------
>
>                 Key: HBASE-20978
>                 URL: https://issues.apache.org/jira/browse/HBASE-20978
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.0.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.2
>
>
> Testing tip of branch-2.0, ran into this:
> {code}
> 2018-07-29 01:45:33,002 INFO  [master/ve0524:16000] master.HMaster: Master 
> has completed initialization 13.854sec                                        
>                                                                    2018-07-29 
> 01:45:33,003 INFO  [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on 
> 533fb79ba23b27e9e0715b51daeb30c1                                              
>                                                     2018-07-29 01:45:33,003 
> WARN  [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: pid=1820, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031
>   at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>                                                                               
>                                      at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
>                                                                               
>                                          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
> {code}
> It then shows as the below in the UI:
> {code}
> Id    Parent  State   Owner   Type    Start Time      Last Update     Errors  
> Parameters
> 1820          WAITING stack   MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031       Sun Jul 29 
> 01:33:37 PDT 2018    Sun Jul 29 01:33:38 PDT 2018            [ { state => [ 
> '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace => 
> 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, 
> startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==', 
> offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { 
> hostName => 've0538.halxg.cloudera.com', port => '16020', startCode => 
> '1532847421672' }, destinationServer => { hostName => 
> 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' } 
> } ]
> {code}
> This is what we'd just read from hbase:meta:
> {code}
> 2018-07-29 01:45:32,802 INFO  [master/ve0524:16000] 
> assignment.RegionStateStore: Load hbase:meta entry 
> region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED, 
> lastHost=ve0538.halxg.cloudera.com,16020,1532847421672, 
> regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672, 
> openSeqNum=1544600
> {code}
> Before this, we'd just logged this:
> 2018-07-29 01:33:39,786 INFO  [PEWorker-14] assignment.RegionStateStore: 
> pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1, 
> regionState=CLOSED
> Going back in history, we do the above each time the Master gets restarted so 
> the region is offlined and never brought back online.
> It is failing here:
> {code}
>   private void execProcedure(final RootProcedureState procStack,
>       final Procedure<TEnvironment> procedure) {
>     Preconditions.checkArgument(procedure.getState() == 
> ProcedureState.RUNNABLE,
>         procedure.toString());
> {code}
> Its the parent move region that is trying to run and failing. It is not 
> RUNNABLE? Because the subprocedure was 'done' but not fully?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-20978) [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure

Reply via email to