[ 
https://issues.apache.org/jira/browse/HBASE-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569530#comment-16569530
 ] 

stack commented on HBASE-20978:
-------------------------------

branch-2.0 has HBASE-20846 now. My long-run test failed with another instance 
of this issue. Notes below but just repeat of above. Still need to dig. Making 
this critical.

The region was being moved because of a balance. The unassign had completed. 
Here are the log lines:

{code}
....
2018-08-05 01:19:58,555 INFO  [Thread-19] procedure.MasterProcedureScheduler: 
pid=4965, state=WAITING:MOVE_REGION_ASSIGN, hasLock=false; MoveRegionProcedure 
hri=e7c120b0eb913346e4bead908ebed468, 
source=ve0536.halxg.cloudera.com,16020,1533402328997, 
destination=ve0534.halxg.cloudera.com,16020,1533457059157 checking lock on 
e7c120b0eb913346e4bead908ebed468
...
2018-08-05 01:19:58,555 INFO  [Thread-19] procedure.MasterProcedureScheduler: 
pid=4967, ppid=4965, state=SUCCESS, hasLock=false; UnassignProcedure 
table=IntegrationTestBigLinkedList, region=e7c120b0eb913346e4bead908ebed468, 
server=ve0536.halxg.cloudera.com,16020,1533402328997 checking lock on 
e7c120b0eb913346e4bead908ebed468
...
CRASH!
...
2018-08-05 01:20:03,698 INFO  [Thread-19] assignment.RegionStateStore: Load 
hbase:meta entry region=e7c120b0eb913346e4bead908ebed468, regionState=CLOSED, 
lastHost=ve0536.halxg.cloudera.com,16020,1533402328997, 
regionLocation=ve0536.halxg.cloudera.com,16020,1533402328997, openSeqNum=2267217
...
2018-08-05 01:20:03,968 INFO  [Thread-19] master.HMaster: Master has completed 
initialization 10.371sec                                                        
                                                                      
2018-08-05 01:20:03,969 WARN  [PEWorker-8] procedure2.ProcedureExecutor: Worker 
terminating UNNATURALLY null                                                    
                                                                     
java.lang.IllegalArgumentException: NOT RUNNABLE! pid=4965, 
state=WAITING:MOVE_REGION_ASSIGN, hasLock=true; MoveRegionProcedure 
hri=e7c120b0eb913346e4bead908ebed468, 
source=ve0536.halxg.cloudera.com,16020,1533402328997, 
destination=ve0534.halxg.cloudera.com,16020,1533457059157                       
                                                                                
                                                                                
at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
                                                                                
                                            at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1502)
                                                                                
                                                   at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1298)
                                                                                
                                                at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
  at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1805)
2018-08-05 01:20:03,972 INFO  [Thread-19] quotas.MasterQuotaManager: Quota 
support disabled
2018-08-05 01:20:03,972 INFO  [Thread-19] zookeeper.ZKWatcher: not a secure 
deployment, proceeding
{code}



> [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure
> ----------------------------------------------------------------
>
>                 Key: HBASE-20978
>                 URL: https://issues.apache.org/jira/browse/HBASE-20978
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.0.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.2
>
>
> Testing tip of branch-2.0, ran into this:
> {code}
> 2018-07-29 01:45:33,002 INFO  [master/ve0524:16000] master.HMaster: Master 
> has completed initialization 13.854sec                                        
>                                                                    2018-07-29 
> 01:45:33,003 INFO  [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on 
> 533fb79ba23b27e9e0715b51daeb30c1                                              
>                                                     2018-07-29 01:45:33,003 
> WARN  [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: pid=1820, 
> state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031
>   at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
>                                                                               
>                                      at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
>                                                                               
>                                          at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
> {code}
> It then shows as the below in the UI:
> {code}
> Id    Parent  State   Owner   Type    Start Time      Last Update     Errors  
> Parameters
> 1820          WAITING stack   MoveRegionProcedure 
> hri=533fb79ba23b27e9e0715b51daeb30c1, 
> source=ve0538.halxg.cloudera.com,16020,1532847421672, 
> destination=ve0540.halxg.cloudera.com,16020,1532853151031       Sun Jul 29 
> 01:33:37 PDT 2018    Sun Jul 29 01:33:38 PDT 2018            [ { state => [ 
> '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace => 
> 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, 
> startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==', 
> offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { 
> hostName => 've0538.halxg.cloudera.com', port => '16020', startCode => 
> '1532847421672' }, destinationServer => { hostName => 
> 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' } 
> } ]
> {code}
> This is what we'd just read from hbase:meta:
> {code}
> 2018-07-29 01:45:32,802 INFO  [master/ve0524:16000] 
> assignment.RegionStateStore: Load hbase:meta entry 
> region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED, 
> lastHost=ve0538.halxg.cloudera.com,16020,1532847421672, 
> regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672, 
> openSeqNum=1544600
> {code}
> Before this, we'd just logged this:
> 2018-07-29 01:33:39,786 INFO  [PEWorker-14] assignment.RegionStateStore: 
> pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1, 
> regionState=CLOSED
> Going back in history, we do the above each time the Master gets restarted so 
> the region is offlined and never brought back online.
> It is failing here:
> {code}
>   private void execProcedure(final RootProcedureState procStack,
>       final Procedure<TEnvironment> procedure) {
>     Preconditions.checkArgument(procedure.getState() == 
> ProcedureState.RUNNABLE,
>         procedure.toString());
> {code}
> Its the parent move region that is trying to run and failing. It is not 
> RUNNABLE? Because the subprocedure was 'done' but not fully?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to