[ https://issues.apache.org/jira/browse/HBASE-20978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569530#comment-16569530 ]
stack commented on HBASE-20978: ------------------------------- branch-2.0 has HBASE-20846 now. My long-run test failed with another instance of this issue. Notes below but just repeat of above. Still need to dig. Making this critical. The region was being moved because of a balance. The unassign had completed. Here are the log lines: {code} .... 2018-08-05 01:19:58,555 INFO [Thread-19] procedure.MasterProcedureScheduler: pid=4965, state=WAITING:MOVE_REGION_ASSIGN, hasLock=false; MoveRegionProcedure hri=e7c120b0eb913346e4bead908ebed468, source=ve0536.halxg.cloudera.com,16020,1533402328997, destination=ve0534.halxg.cloudera.com,16020,1533457059157 checking lock on e7c120b0eb913346e4bead908ebed468 ... 2018-08-05 01:19:58,555 INFO [Thread-19] procedure.MasterProcedureScheduler: pid=4967, ppid=4965, state=SUCCESS, hasLock=false; UnassignProcedure table=IntegrationTestBigLinkedList, region=e7c120b0eb913346e4bead908ebed468, server=ve0536.halxg.cloudera.com,16020,1533402328997 checking lock on e7c120b0eb913346e4bead908ebed468 ... CRASH! ... 2018-08-05 01:20:03,698 INFO [Thread-19] assignment.RegionStateStore: Load hbase:meta entry region=e7c120b0eb913346e4bead908ebed468, regionState=CLOSED, lastHost=ve0536.halxg.cloudera.com,16020,1533402328997, regionLocation=ve0536.halxg.cloudera.com,16020,1533402328997, openSeqNum=2267217 ... 2018-08-05 01:20:03,968 INFO [Thread-19] master.HMaster: Master has completed initialization 10.371sec 2018-08-05 01:20:03,969 WARN [PEWorker-8] procedure2.ProcedureExecutor: Worker terminating UNNATURALLY null java.lang.IllegalArgumentException: NOT RUNNABLE! pid=4965, state=WAITING:MOVE_REGION_ASSIGN, hasLock=true; MoveRegionProcedure hri=e7c120b0eb913346e4bead908ebed468, source=ve0536.halxg.cloudera.com,16020,1533402328997, destination=ve0534.halxg.cloudera.com,16020,1533457059157 at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1502) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1298) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76) at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1805) 2018-08-05 01:20:03,972 INFO [Thread-19] quotas.MasterQuotaManager: Quota support disabled 2018-08-05 01:20:03,972 INFO [Thread-19] zookeeper.ZKWatcher: not a secure deployment, proceeding {code} > [amv2] Worker terminating UNNATURALLY during MoveRegionProcedure > ---------------------------------------------------------------- > > Key: HBASE-20978 > URL: https://issues.apache.org/jira/browse/HBASE-20978 > Project: HBase > Issue Type: Bug > Components: amv2 > Affects Versions: 2.0.1 > Reporter: stack > Assignee: stack > Priority: Critical > Fix For: 2.0.2 > > > Testing tip of branch-2.0, ran into this: > {code} > 2018-07-29 01:45:33,002 INFO [master/ve0524:16000] master.HMaster: Master > has completed initialization 13.854sec > 2018-07-29 > 01:45:33,003 INFO [PEWorker-4] procedure.MasterProcedureScheduler: pid=1820, > state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure > hri=533fb79ba23b27e9e0715b51daeb30c1, > source=ve0538.halxg.cloudera.com,16020,1532847421672, > destination=ve0540.halxg.cloudera.com,16020,1532853151031 checking lock on > 533fb79ba23b27e9e0715b51daeb30c1 > 2018-07-29 01:45:33,003 > WARN [PEWorker-4] procedure2.ProcedureExecutor: Worker terminating > UNNATURALLY null > java.lang.IllegalArgumentException: pid=1820, > state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure > hri=533fb79ba23b27e9e0715b51daeb30c1, > source=ve0538.halxg.cloudera.com,16020,1532847421672, > destination=ve0540.halxg.cloudera.com,16020,1532853151031 > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) > > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1458) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249) > > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763) > {code} > It then shows as the below in the UI: > {code} > Id Parent State Owner Type Start Time Last Update Errors > Parameters > 1820 WAITING stack MoveRegionProcedure > hri=533fb79ba23b27e9e0715b51daeb30c1, > source=ve0538.halxg.cloudera.com,16020,1532847421672, > destination=ve0540.halxg.cloudera.com,16020,1532853151031 Sun Jul 29 > 01:33:37 PDT 2018 Sun Jul 29 01:33:38 PDT 2018 [ { state => [ > '1', '2' ] }, { regionId => '1532851768240', tableName => { namespace => > 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, > startKey => 'VttDLvXHdcmzwqNdrNoUFg==', endKey => 'WGFV8k+hFqhcIJGiKZ8L4Q==', > offline => 'false', split => 'false', replicaId => '0' }, { sourceServer => { > hostName => 've0538.halxg.cloudera.com', port => '16020', startCode => > '1532847421672' }, destinationServer => { hostName => > 've0540.halxg.cloudera.com', port => '16020', startCode => '1532853151031' } > } ] > {code} > This is what we'd just read from hbase:meta: > {code} > 2018-07-29 01:45:32,802 INFO [master/ve0524:16000] > assignment.RegionStateStore: Load hbase:meta entry > region=533fb79ba23b27e9e0715b51daeb30c1, regionState=CLOSED, > lastHost=ve0538.halxg.cloudera.com,16020,1532847421672, > regionLocation=ve0538.halxg.cloudera.com,16020,1532847421672, > openSeqNum=1544600 > {code} > Before this, we'd just logged this: > 2018-07-29 01:33:39,786 INFO [PEWorker-14] assignment.RegionStateStore: > pid=1823 updating hbase:meta row=533fb79ba23b27e9e0715b51daeb30c1, > regionState=CLOSED > Going back in history, we do the above each time the Master gets restarted so > the region is offlined and never brought back online. > It is failing here: > {code} > private void execProcedure(final RootProcedureState procStack, > final Procedure<TEnvironment> procedure) { > Preconditions.checkArgument(procedure.getState() == > ProcedureState.RUNNABLE, > procedure.toString()); > {code} > Its the parent move region that is trying to run and failing. It is not > RUNNABLE? Because the subprocedure was 'done' but not fully? -- This message was sent by Atlassian JIRA (v7.6.3#76005)