[ https://issues.apache.org/jira/browse/HBASE-19218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299221#comment-16299221 ]
Yi Liang commented on HBASE-19218: ---------------------------------- I have seen this error in my cluster as well {quote} 2017-11-08 00:25:25,260 INFO [ProcExecWrkr-18] procedure.MasterProcedureScheduler: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94 hbase:namespace hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94. 2017-11-08 00:25:25,263 INFO [ProcExecWrkr-18] assignment.AssignProcedure: Start pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, location=node_5,16020,1510084579728; forceNewPlan=false, retain=true ..... 2017-11-08 00:25:26,040 INFO [ProcExecWrkr-23] procedure.ServerCrashProcedure: pid=42 found RIT pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, location=node_5,16020,1510100695534 2017-11-08 00:25:26,040 WARN [ProcExecWrkr-23] assignment.RegionTransitionProcedure: Remote call failed pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, location=node_5,16020,1510100695534; exception=ServerCrashProcedure pid=42, server=node_5,16020,1510100695534 2017-11-08 00:25:26,041 INFO [ProcExecWrkr-23] assignment.AssignProcedure: Retry=1 of max=10; pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OPENING, location=node_5,16020,1510100695534 2017-11-08 00:25:26,193 INFO [ProcExecWrkr-25] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as node_2,16020,1510100696388 2017-11-08 00:25:26,195 INFO [ProcExecWrkr-25] assignment.RegionTransitionProcedure: Dispatch pid=44, ppid=43, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740; rit=OPENING, location=node_2,16020,1510100696388 2017-11-08 00:25:26,346 INFO [ProcedureDispatcherTimeoutThread] procedure.RSProcedureDispatcher: Using procedure batch rpc execution for serverName=node_2,16020,1510100696388 version=2097152 2017-11-08 00:25:27,187 INFO [ProcExecWrkr-4] hbase.MetaTableAccessor: Updated table hbase:meta state to ENABLED in META 2017-11-08 00:25:27,187 INFO [ProcExecWrkr-4] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as node_2,16020,1510100696388 2017-11-08 00:25:27,209 INFO [ProcExecWrkr-22] assignment.RegionTransitionProcedure: Dispatch pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94; rit=OFFLINE, location=null 2017-11-08 00:25:27,210 INFO [ProcExecWrkr-21] assignment.RegionTransitionProcedure: Dispatch pid=39, ppid=36, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:acl, region=24aadcb52fdc43e2ebcffe95d39b43ab; rit=OPENING, location=node_2,16020,1510100696388 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94 java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) {quote} I track the AssignProcedure of hbase:namespace to above log 1. AssignProcedure created for table=hbase:namespace; and state is RUNNABLE:REGION_TRANSITION_QUEUE 2. ServerCrashProcedure happens and will handle AssignProcedure of table=hbase:namespace, and from the log, this AssignProcedure state has already been set to REGION_TRANSITION_DISPATCH when SCP handle it. The SCP will set the AssignProcedure back to REGION_TRANSITION_QUEUE and offline related region. 3. However, The AssignProcedure of hbase:namespace resume from state REGION_TRANSITION_DISPATCH, and SCP in step2 has already offline the region and set location as null, so null pointer exception may happen The problem happens at step 3 above, the AssignProcedure should resume from state REGION_TRANSITION_QUEUE, but it actually from REGION_TRANSITION_DISPATCH. This could happen, since when SCP call remoteCallFailed for AssignProcedure of hbase:namespace, ProcedureExecutor is running AssignProcedure of hbase:namespace as state REGION_TRANSITION_DISPATCH at same time. and if SCP's handleFailure(which set region location as null) for hbase:namespace happens before AssignProcedure#addToRemoteDispatcher, and then null pointer happens And if we can catch the nullpointerexception and set state back to REGION_TRANSITION_QUEUE, that may work for this problem > Master stuck thinking hbase:namespace is assigned after restart preventing > intialization > ---------------------------------------------------------------------------------------- > > Key: HBASE-19218 > URL: https://issues.apache.org/jira/browse/HBASE-19218 > Project: HBase > Issue Type: Bug > Reporter: Josh Elser > Assignee: Umesh Agashe > Priority: Critical > Fix For: 2.0.0-beta-1 > > Attachments: > hbase-hbase-master-ctr-e134-1499953498516-282290-01-000003.hwx.site.log.zip, > hbase-site.xml > > > Our [~romil.choksi] brought this one to my attention after trying to get some > cluster tests running. > The Master seems to have gotten stuck never initializing after it thinks that > hbase:namespace was already deployed on the cluster when it actually was not. > On a Master restart, it reads the location out of meta and assumes that it's > there (I assume this invalid entry is the issue): > {noformat} > 2017-11-08 00:29:17,556 INFO > [ctr-e134-1499953498516-282290-01-000003:20000.masterManager] > assignment.RegionStateStore: Load hbase:meta entry region={ENCODED => > f147f204a579b885c351bdc0a7ebbf94, NAME => > 'hbase:namespace,,1510084256045.f147f204a579b885c351bdc0a7ebbf94.', STARTKEY > => '', ENDKEY => ''} regionState=OPENING > lastHost=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 > regionLocation=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510100695534 > {noformat} > Prior to this, the RS5 went through the ServerCrashProcedure, but it looks > like this bailed out unexpectedly: > {noformat} > 2017-11-08 00:25:25,187 WARN > [ctr-e134-1499953498516-282290-01-000003:20000.masterManager] > master.ServerManager: Expiration of > ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 but > server not online > 2017-11-08 00:25:25,187 INFO [ProcExecWrkr-5] > procedure.ServerCrashProcedure: Start pid=36, > state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure > server=ctr-e134-1499953498516-282290-01-000003.hwx.site,16020,1510084580111, > splitWal=t > rue, meta=false > 2017-11-08 00:25:25,188 INFO > [ctr-e134-1499953498516-282290-01-000003:20000.masterManager] > master.ServerManager: Processing expiration of > ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728 on > ctr-e134-1499953498516-28 > 2290-01-000003.hwx.site,20000,1510100690324 > ... > 2017-11-08 00:25:27,211 ERROR [ProcExecWrkr-22] procedure2.ProcedureExecutor: > CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=hbase:namespace, region=f147f204a579b885c351bdc0a7ebbf94 > java.lang.NullPointerException > at > java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:171) > at > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:223) > at > org.apache.hadoop.hbase.master.assignment.AssignProcedure.updateTransition(AssignProcedure.java:252) > at > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:309) > at > org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:82) > at > org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731) > 2017-11-08 00:25:27,239 FATAL [ProcExecWrkr-22] procedure2.ProcedureExecutor: > CODE-BUG: Uncaught runtime exception for pid=37, > state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException > via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=hbase:namespace, > region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; > ServerCrashProcedure > server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, > splitWal=true, meta=false > java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198) > at > org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731) > 2017-11-08 00:25:27,344 FATAL [ProcExecWrkr-8] procedure2.ProcedureExecutor: > CODE-BUG: Uncaught runtime exception for pid=37, > state=FAILED:SERVER_CRASH_FINISH, exception=java.lang.NullPointerException > via CODE-BUG: Uncaught runtime exception: pid=40, ppid=37, > state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure > table=hbase:namespace, > region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; > ServerCrashProcedure > server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, > splitWal=true, meta=false > java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_FINISH > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:236) > at > org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:56) > at > org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:198) > at > org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:859) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1353) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1309) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1178) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:77) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1731) > 2017-11-08 00:25:27,356 INFO [ProcExecWrkr-5] procedure2.ProcedureExecutor: > Rolled back pid=37, state=ROLLEDBACK, > exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime > exception: pid=40, ppid=37, state=RUNNABLE:REGION_TRANSITION_QUEUE; > AssignProcedure table=hbase:namespace, > region=f147f204a579b885c351bdc0a7ebbf94:java.lang.NullPointerException; > ServerCrashProcedure > server=ctr-e134-1499953498516-282290-01-000005.hwx.site,16020,1510084579728, > splitWal=true, meta=false exec-time=2.1650sec > {noformat} > Shortly after this, the master was restarted. > My hunch is that because the SCP crashed, we never invalidated the meta > entries and got us stuck thinking the region was assigned when it wasn't? -- This message was sent by Atlassian JIRA (v6.4.14#64029)