[ https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635924#comment-16635924 ]
stack edited comment on HBASE-21259 at 10/3/18 12:00 AM:
---------------------------------------------------------

[~allan163]
* meta has a region in CLOSING state against a server that has no mention in the fs, is not online, nor has a znode, so it is 'unknown' to the system.
* I try to move the region 'manually' via hbck2 from CLOSING to CLOSED -- i.e. unassign -- so I can assign it elsewhere. The CLOSING dispatch fails because there is no such server, so the UP expires the server, which queues an SCP.
* The SCP runs. Finds no logs to split. Finds the stuck UP and calls its handleFailure. The UP then moves to CLOSED and all is good.
* Except, the SCP has now resulted in there being a deadserver entry.
* So, when the next region that references the 'unknown' server comes along, it goes to unassign, fails, and tries to queue a server expiration...
* But the attempt at expiration is rejected because 'there is one in progress already' (the server has an entry in dead servers -- see ServerManager#expireServer, and the sketch after this list), so we skip out without queuing a new SCP.
* This second UP, and all subsequent regions that were pointing to the 'unknown' server, end up 'hung', suspended waiting for someone to wake them. I have to call 'bypass' on each to get them out of suspend. I cannot unassign the regions, not in bulk at least.

If a server is dead we should not be reviving it. It causes more headache than it solves. My first patch stopped us reviving a server if it is unknown, but it messed up startup. Let me try to be more clinical.
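To make the rejected-expiration path concrete, here is a minimal, self-contained Java sketch of the interplay the bullets above describe. The names here (the expireServer method, the dead-servers set) paraphrase ServerManager#expireServer and the DeadServer bookkeeping; this is an illustration of the logic under those assumptions, not the actual branch-2.1 code.

{code}
import java.util.HashSet;
import java.util.Set;

public class ExpireServerSketch {
  // Stand-in for the DeadServer bookkeeping: servers an SCP has been queued for.
  private static final Set<String> DEAD_SERVERS = new HashSet<>();

  // Stand-in for ServerManager#expireServer; returns true if an SCP was queued.
  static boolean expireServer(String serverName) {
    if (DEAD_SERVERS.contains(serverName)) {
      // 'there is one in progress already' -- but that earlier SCP may have
      // finished long ago, so nothing will ever wake the suspended UP.
      System.out.println("Skipping expire of " + serverName + "; already in dead servers");
      return false;
    }
    DEAD_SERVERS.add(serverName);
    System.out.println("Queued SCP for " + serverName);
    return true;
  }

  public static void main(String[] args) {
    String unknown = "unknown-server,22101,1536675314426";  // illustrative name
    expireServer(unknown);   // first UP: SCP queued; the SCP runs and wakes the UP
    // ...the SCP completes, but the dead-servers entry is never removed...
    if (!expireServer(unknown)) {  // second UP: rejected, no new SCP
      System.out.println("Second UP suspends forever; only hbck2 'bypass' frees it");
    }
  }
}
{code}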
> [amv2] Revived deadservers; recreated serverstatenode
> -----------------------------------------------------
>
>                 Key: HBASE-21259
>                 URL: https://issues.apache.org/jira/browse/HBASE-21259
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.2.0, 2.1.1, 2.0.3
>
>         Attachments: HBASE-21259.branch-2.1.001.patch
>
>
> On startup, I see servers being revived; i.e. their serverstatenode is
> getting marked online even though it's just been processed by
> ServerCrashProcedure. It looks like this (in a patched server that reports
> whenever a serverstatenode is created):
> {code}
> 2018-09-29 03:45:40,963 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, state=SUCCESS; ServerCrashProcedure server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, meta=false in 1.0130sec
> ...
> 2018-09-29 03:45:43,733 INFO org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! vb1442.halxg.cloudera.com,22101,1536675314426
> java.lang.RuntimeException: WHERE AM I?
>   at org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
>   at org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
>   at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
>   at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
>   at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
>   at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
>   at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
> {code}
> See how we've just finished an SCP, which will have removed the
> serverstatenode... but then we come across an unassign that references the
> server that was just processed. The unassign attempts to update the
> serverstatenode, and therein we create one if one is not present. We
> shouldn't be creating one; see the sketch below.
> I think I see this a lot because I am scheduling unassigns with hbck2. The
> servers crash and then come up with SCPs doing cleanup of the old server
> while unassign procedures still sit in the procedure executor queue to be
> processed... but this could happen at any time on a cluster should an
> unassign happen to get scheduled near an SCP.
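For concreteness, a minimal sketch of the create-on-miss behavior the stack trace points at in RegionStates#getOrCreateServer, next to the naive guard the comment above mentions (the 'first patch' that refused to revive unknown servers but, per the comment, messed up startup, since at master startup every server is still unknown). All names here are illustrative stand-ins under those assumptions, not the real HBase internals.

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RegionStatesSketch {
  // Stand-in for the per-server state node that SCP removes during crash cleanup.
  static final class ServerStateNode {
    final String serverName;
    ServerStateNode(String serverName) { this.serverName = serverName; }
  }

  private final ConcurrentHashMap<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();
  private final Set<String> deadServers;  // fed by the DeadServer bookkeeping

  RegionStatesSketch(Set<String> deadServers) {
    this.deadServers = deadServers;
  }

  // Create-on-miss, as described in the report: an unassign that arrives just
  // after an SCP silently 'revives' the server by recreating its node.
  ServerStateNode getOrCreateServer(String serverName) {
    return serverMap.computeIfAbsent(serverName, ServerStateNode::new);
  }

  // The naive guard: refuse to recreate a node for a server SCP has already
  // processed, forcing the caller to handle the dead-server case explicitly.
  ServerStateNode getServerIfNotDead(String serverName) {
    if (deadServers.contains(serverName)) {
      throw new IllegalStateException(serverName
          + " already processed by SCP; refusing to recreate its serverstatenode");
    }
    return serverMap.computeIfAbsent(serverName, ServerStateNode::new);
  }
}
{code}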