stack created HBASE-18551:
-----------------------------
Summary: [AMv2] UnassignProcedure and crashed regionservers
Key: HBASE-18551
URL: https://issues.apache.org/jira/browse/HBASE-18551
Project: HBase
Issue Type: Bug
Components: amv2
Reporter: stack
This has been an obsession of [~uagashe]'s and mine over the last few days:
what should an UnassignProcedure do when it dispatches a CLOSE but the CLOSE
fails because of a ConnectException or SocketTimeout?
+ We used to let UnassignProcedure continue, presuming the Region would be
closed since the server is dead. BUT, if the unassign was part of a
MoveProcedure, the unassign would proceed and the Move would then run WITHOUT
first splitting logs. Bad.
+ So, we made UnassignProcedure fail and let the upper layers take care of
the failure. See HBASE-18491, which enabled this behavior. BUT, we have since
figured that even if the UP completes as a failure, it gives up the Region
lock on completion, so another procedure (say, an AssignProcedure) could
cut in before the ServerCrashProcedure has finished, and again there could be
data loss.
+ Now we are thinking the UP should hold on to the Region lock until it is
signalled by a ServerCrashProcedure, and only then let go of the region. The UP
has context that is hard to pass to another procedure. Waiting on an SCP has
the UP living on for what could be a good amount of time. That might be ok if
we can suspend the procedure.
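To make the third option concrete, here is a toy model of the idea, not HBase code. Every class and method name in it (RegionLockSketch, onCloseFailed, onServerCrashHandled) is hypothetical, and it assumes the region lock can be modelled as a simple owner map rather than the real procedure-framework lock.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these names exist in HBase. */
public class RegionLockSketch {
    /** Region name -> id of the procedure currently holding its lock. */
    private final Map<String, String> lockOwner = new HashMap<>();
    /** Region name -> id of an unassign parked until crash handling is done. */
    private final Map<String, String> suspendedUnassigns = new HashMap<>();

    /** Try to take the region lock; false if another procedure holds it. */
    public boolean tryLock(String region, String procId) {
        return lockOwner.putIfAbsent(region, procId) == null;
    }

    /** Release the lock, but only if procId is the current owner. */
    public void unlock(String region, String procId) {
        lockOwner.remove(region, procId);
    }

    /**
     * The CLOSE failed with a connection error. Instead of completing
     * (which would release the lock), the unassign parks and keeps the lock.
     */
    public void onCloseFailed(String region, String unassignProcId) {
        suspendedUnassigns.put(region, unassignProcId);
    }

    /**
     * Signal from crash handling, assumed to arrive after log splitting.
     * Only now does the parked unassign let go of the region lock.
     */
    public void onServerCrashHandled(String region) {
        String procId = suspendedUnassigns.remove(region);
        if (procId != null) {
            unlock(region, procId);
        }
    }

    public static void main(String[] args) {
        RegionLockSketch s = new RegionLockSketch();
        s.tryLock("r1", "unassign-1");
        s.onCloseFailed("r1", "unassign-1");
        // An assign cannot cut in while the unassign is parked.
        System.out.println("assign before SCP: " + s.tryLock("r1", "assign-2"));
        s.onServerCrashHandled("r1");
        System.out.println("assign after SCP: " + s.tryLock("r1", "assign-2"));
    }
}
```

In this sketch the assign is refused until the crash handler signals, which is the ordering we want. It says nothing about how a real procedure would be suspended and woken; that is the open question.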
There is a good sample scenario that came up while working on the
no-regions-on-master issue, HBASE-18511. When meta is not on master,
TestSplitTransactionOnCluster fails. Although the test body completes, the
tests commonly kill a RegionServer, and the teardown runs before we've noticed
the aborted RS. So the disable of the table in the teardown, preparatory to
deleting the test table as part of cleanup, goes to unassign regions, but the
unassign fails against the aborted server.
Good stuff.