[jira] [Comment Edited] (HBASE-21259) [amv2] Revived deadservers; recreated serverstatenode

2018-10-02 Thread stack (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16636257#comment-16636257 ]

stack edited comment on HBASE-21259 at 10/3/18 12:03 AM:
-

.001 seems to work mostly (it has worked 99% of the time... trying to figure 
where the holes are). It is simply an undo of all the places we auto-create 
ServerStateNodes so that we don't create one long after a server has been dead 
and gone (messing up UP#remoteCallFailed processing).

Let me figure where the holes are and see if I can do a test too.
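
For illustration, here is a minimal sketch of the direction .001 takes (stand-in types and names for a quick illustration, not the actual HBase internals): make the server-node lookup a plain get instead of a get-or-create, so a straggling unassign cannot quietly resurrect a server an SCP has already cleaned up.

{code}
// Simplified stand-ins for illustration only; not the real RegionStates/ServerStateNode.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum ServerState { ONLINE, SPLITTING, OFFLINE }

class ServerStateNode {
  final String serverName;
  // Fresh nodes default ONLINE, which is what "revives" a dead server.
  ServerState state = ServerState.ONLINE;
  ServerStateNode(String serverName) { this.serverName = serverName; }
}

class RegionStatesSketch {
  private final ConcurrentMap<String, ServerStateNode> serverMap = new ConcurrentHashMap<>();

  // Before: any lookup quietly created a node if none was present.
  ServerStateNode getOrCreateServer(String serverName) {
    return serverMap.computeIfAbsent(serverName, ServerStateNode::new);
  }

  // The .001 idea: a plain lookup that returns null for servers we no longer
  // track, so callers (e.g. remote-call-failure handling) must treat the
  // server as unknown/dead instead of operating on a brand-new ONLINE node.
  ServerStateNode getServerNode(String serverName) {
    return serverMap.get(serverName);
  }
}
{code}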



> [amv2] Revived deadservers; recreated serverstatenode
> -
>
> Key: HBASE-21259
> URL: https://issues.apache.org/jira/browse/HBASE-21259
> Project: HBase
>  Issue Type: Bug
>  Components: amv2
>Affects Versions: 2.1.0
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 2.2.0, 2.1.1, 2.0.3
>
> Attachments: HBASE-21259.branch-2.1.001.patch
>
>
> On startup, I see servers being revived; i.e. their serverstatenode is
> getting marked online even though it's just been processed by
> ServerCrashProcedure. It looks like this (in a patched server that reports
> whenever a serverstatenode is created):
> {code}
> 2018-09-29 03:45:40,963 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, 
> state=SUCCESS; ServerCrashProcedure 
> server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, 
> meta=false in 1.0130sec
> ...
> 2018-09-29 03:45:43,733 INFO 
> org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! 
> vb1442.halxg.cloudera.com,22101,1536675314426
> java.lang.RuntimeException: WHERE AM I?
> at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
> at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
> at 
> org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
> at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
> at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
> {code}
> See how we've just finished a SCP, which will have removed the
> serverstatenode... but then we come across an unassign that references the
> server that was just processed. The unassign will attempt to update the
> serverstatenode and therein we create one if one is not present. We shouldn't
> be creating one.
> I think I see this a lot because I am scheduling unassigns with hbck2. The
> servers crash and then the cluster comes up with SCPs doing cleanup of the old
> servers and with unassign procedures still in the procedure executor queue
> waiting to be processed, but it could happen at any time on a cluster should
> an unassign happen to get scheduled near an SCP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21259) [amv2] Revived deadservers; recreated serverstatenode

2018-10-02 Thread stack (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635924#comment-16635924 ]

stack edited comment on HBASE-21259 at 10/3/18 12:00 AM:
-

[~allan163]

 * meta has a region in CLOSING state against a server that has no mention in
fs, is not online, nor does it have a znode, so it is 'unknown' to the system.
 * I try to move the region 'manually' via hbck2 from CLOSING to CLOSED -- i.e.
unassign -- so I can assign it elsewhere. The CLOSING dispatch fails because
there is no such server, and the UP expires the server, which queues a SCP.
 * The SCP runs. Finds no logs to split. Finds the stuck UP and calls its
handleFailure. The UP then moves to CLOSED and all is good.
 * Except, the SCP has now resulted in there being a deadserver entry.
 * So, when the next region that references the 'unknown' server comes along,
it goes to unassign, fails, and tries to queue a server expiration.
 * But the attempt at expiration is rejected because 'there is one in progress
already' (the server already has an entry in dead servers -- see
ServerManager#expireServer and the sketch after this list), so we skip out
without queuing a new SCP.
 * This second UP and all subsequent regions that were pointing to the 
'unknown' server end up 'hung', suspended waiting for someone to wake them.
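
To make the rejection step concrete, here is a rough, self-contained sketch (simplified names; the real ServerManager#expireServer differs) of why the second expiration attempt goes nowhere:

{code}
// Illustrative sketch only: the server already sits in the dead-servers set
// from the first SCP, so the second expire request is refused, no new SCP is
// queued, and the suspended UnassignProcedure is never woken.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ServerManagerSketch {
  private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

  /** @return true if a crash procedure was scheduled for this server. */
  boolean expireServer(String serverName) {
    if (deadServers.contains(serverName)) {
      // "Expiration already in progress" -- skip out without scheduling
      // another ServerCrashProcedure, which is exactly the hole: nothing is
      // left to wake an unassign that is suspended waiting on an SCP.
      return false;
    }
    deadServers.add(serverName);
    // ... in the real code a ServerCrashProcedure would be submitted here;
    // its completion is what wakes procedures suspended against this server.
    return true;
  }
}
{code}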

I have to call 'bypass' on each to get them out of suspend. I cannot unassign 
the regions, not in bulk at least.

If a server is dead we should not be reviving it. It causes more headache than
it solves.

My first patch stopped us reviving a server if it is unknown, but it messed up
startup. Let me try and be more clinical.




[jira] [Comment Edited] (HBASE-21259) [amv2] Revived deadservers; recreated serverstatenode

2018-10-01 Thread stack (JIRA)


[ https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634569#comment-16634569 ]

stack edited comment on HBASE-21259 at 10/1/18 8:28 PM:


Scenario is this:

 * meta has regions that reference a regionserver that is long gone. It was 
processed (or not if all MasterProcWALs have been removed) many restarts ago.
 * The table is borked. Some regions are not unassigned though their table is.
 * We run a mass unassign. Because the table has many unassigned regions, it
takes a while.
 * The first unassign queues a SCP for the long-dead server. It quickly runs
through the SCP and finishes... no logs to split.
 * Soon after, another scheduled unassign for the same server is run. It queues
an SCP (remember, if the unassign is against a server that is not online, we
queue an SCP and then wait on the SCP to wake the unassign so we do proper
unassign cleanup in the handleRIT callback)... only in this case, the server is
in the deadserver list and has already been processed, so this last unassign
just hangs forever because the check for server state creates a new
serverstatenode, and new serverstatenodes default ONLINE (see the sketch after
this list).
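
Pulling the pieces together, a hedged, self-contained sketch (invented names, not the real procedure code) of why that last unassign hangs: the state check manufactures a default-ONLINE node, while the expire request is refused because the server is already in dead servers, so no SCP is ever queued to wake the procedure.

{code}
// Illustration only. Combines the two behaviours described above: auto-created
// server state defaulting to ONLINE plus a refused expiration, leaving the
// unassign suspended with nothing scheduled to wake it.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class HungUnassignSketch {
  enum State { ONLINE, OFFLINE }

  private final Map<String, State> serverStates = new ConcurrentHashMap<>();
  private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

  void unassignAgainstLongDeadServer(String server) {
    // The first unassign's SCP already ran and recorded the server as dead.
    deadServers.add(server);

    // 1. The later unassign's server-state check auto-creates an entry; new
    //    entries default to ONLINE, so the long-dead server looks alive again.
    boolean looksOnline =
        serverStates.computeIfAbsent(server, s -> State.ONLINE) == State.ONLINE;

    // 2. The dispatch still fails (no such server), so we ask for an
    //    expiration, expecting the resulting SCP to wake us later...
    if (looksOnline && deadServers.contains(server)) {
      // 3. ...but it is refused: an SCP for this server "already ran", so no
      //    new one is queued. The procedure suspends forever; only an hbck2
      //    'bypass' gets it unstuck.
      return;
    }
    // (Never reached in this scenario.) A ServerCrashProcedure would be
    // submitted here in the real code.
  }
}
{code}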

It is sort of wonky and not 'usual' but I've been trashing my cluster and then
trying to repair with hbck2. This is how I run into the odd state reported
above. In particular, on start, the load of meta will put all regions into RIT.
If no online server is associated, the regions are considered STUCK. I then do
a bulk assign or unassign of the OPENING/CLOSING regions to clean up the
RITs... (tens of thousands on this big cluster) and then I run into the issue
described here, where a bunch of unassigns end up suspended, never to be woken
up.

A test would be sort of tough given the state is not normal.

Thanks [~allan163]


