Sean Busbey created HBASE-22263:
-----------------------------------

             Summary: Master creates duplicate ServerCrashProcedure on 
initialization, leading to assignment hanging in region-dense clusters
                 Key: HBASE-22263
                 URL: https://issues.apache.org/jira/browse/HBASE-22263
             Project: HBase
          Issue Type: Bug
          Components: proc-v2, Region Assignment
    Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.5.0
            Reporter: Sean Busbey
            Assignee: Sean Busbey


h3. Problem:

During Master initialization we
 # restore existing procedures that still need to run from prior active Master 
instances
 # look for signs that Region Servers died and need recovery while we were out, and schedule a ServerCrashProcedure (SCP) for each of them
 # turn on the assignment manager

The normal course of events for a ServerCrashProcedure is to attempt a bulk assignment so that the set of regions from the dead RS is handled together if possible. However, if the assignment manager is not ready yet, the SCP waits and retries a bit later.

Note that currently step #2 has no notion of whether a previous active Master instance has already performed this check. This means we might schedule an SCP for a ServerName (host, port, start code) that already has an SCP scheduled. Ideally, such a duplicate would be a no-op.

However, before step #2 schedules the SCP it first marks the region server as dead and not yet processed, with the expectation that the SCP it just created will check for log splitting work and then mark the server as processed so that its regions become eligible for assignment. At the same time, any restored SCPs that are already past the log splitting step will still be waiting for the AssignmentManager. As part of being restored, they do not update the current Master instance to show that they are past the point of WAL processing.

Once the AssignmentManager starts in step #3 the restored SCP continues; it eventually gets to the assignment phase and finds that its server is marked as dead and in need of WAL processing. Such assignments are skipped with a log message, so as we iterate over the regions to assign we skip all of them. This non-intuitively shifts the “no-op” status from the newer SCP we scheduled in step #2 to the older SCP that was restored in step #1.

Bulk assignment works by sending the assign calls via a pool to allow more 
parallelism. Once we’ve set up the pool we just wait to see if the region state 
updates to online. Unfortunately, since all of the assigns got skipped, we’ll 
never change the state for any of these regions. That means the bulk assign, 
and the older SCP that started it, will wait until it hits a timeout.

By default the timeout for a bulk assignment is the smaller of {{(# Regions in the plan * 10s)}} and {{(# Regions on the most loaded RS in the plan * 1s + 60s + # of RegionServers in the cluster * 30s)}}. For even modest clusters with several hundred regions per region server, this means the “no-op” SCP will end up waiting on the order of tens of minutes (e.g. ~50 minutes for an average region density of 300 regions per region server on a 100-node cluster; ~11 minutes for 300 regions per region server on a 10-node cluster). During this time, the SCP holds one of the available procedure execution slots for both the overall pool and the specific server queue.
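
For concreteness, a minimal sketch (not taken from the HBase code base) that plugs the example cluster sizes above into that formula; per the examples, it assumes the plan covers roughly one region server's worth of regions:
{code:java}
// Back-of-the-envelope check of the default bulk assignment timeout described above.
public class BulkAssignTimeoutEstimate {
  // min(#regions in plan * 10s, #regions on most loaded RS * 1s + 60s + #RS * 30s)
  static long timeoutSeconds(int regionsInPlan, int regionsOnMostLoadedRs, int numRegionServers) {
    long perRegionCap = regionsInPlan * 10L;
    long perServerCap = regionsOnMostLoadedRs * 1L + 60L + numRegionServers * 30L;
    return Math.min(perRegionCap, perServerCap);
  }

  public static void main(String[] args) {
    // 300 regions/RS on a 100-node cluster: min(3000s, 3360s) = 3000s, i.e. ~50 minutes
    System.out.println(timeoutSeconds(300, 300, 100) / 60 + " minutes");
    // 300 regions/RS on a 10-node cluster: min(3000s, 660s) = 660s, i.e. ~11 minutes
    System.out.println(timeoutSeconds(300, 300, 10) / 60 + " minutes");
  }
}
{code}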

As previously mentioned, restored SCPs will retry their submission if the assignment manager has not yet been activated (done in step #3); this can cause them to be scheduled after the newer SCPs (created in step #2). Thus the order of execution of no-op and usable SCPs can vary from run to run of master initialization.

This means that unless you get lucky with SCP ordering, impacted regions will remain in transition (RIT) for an extended period of time. If you get particularly unlucky and a critical system table is among the regions being recovered, then master initialization itself will end up blocked on this sequence of SCP timeouts. If there are enough of them to exceed the master initialization timeouts, the situation can become self-sustaining as additional master failovers cause even more duplicate SCPs to be scheduled.
h3. Indicators:
 * Master appears to hang, failing to assign regions to available region servers.
 * Master appears to hang during initialization, showing that it is waiting on the meta or namespace regions.
 * Repeated master restarts allow some progress to be made on assignments for a 
limited period of time.
 * Master UI shows a large number of Server Crash Procedures in RUNNABLE state 
and the number increases by roughly the number of Region Servers on master 
restart.
 * Master log shows a large number of messages saying that assignment of a region failed because the region was last seen on a region server that has not yet been processed. These messages come from the AssignmentManager logger. Normally this message should only occur when a Region Server dies just before an assignment is about to happen. When this combination of issues occurs, the message shows up repeatedly: every time a defunct SCP is processed, it is logged again for each of that server's regions.

Example of the aforementioned message:
{code:java}
2019-04-15 11:19:04,610 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Skip assigning test6,5f89022e,1555251200249.946ff80e7602e66853d33899819983a1., 
it is on a dead but not processed yet server: 
regionserver-8.hbase.example.com,22101,1555349031179
{code}
h3. Reproduction:

The procedure we currently have to reproduce this issue requires specific timings that can be hard to get right, so it may take multiple attempts.

Before starting, the test cluster should have the following properties:
 * Only 1 master. Not a strict requirement but it helps in the following steps.
 * Hundreds of regions per region server. If you need more regions, fire up your preferred data generation tool and have it create a large enough table; the regions can stay empty, there is no need to fill them with actual data (one possible way to create such a table is sketched after this list).
 * At least ten times more region servers than available CPUs on the master. If the number of CPUs is too high, set {{hbase.master.procedure.threads}} in the safety valves to the number of region servers divided by 10. For example, if you have 24 cores and a 10-node cluster, set the configuration to 2 or 3.
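
One possible way to create the empty pre-split table mentioned above, sketched against the 1.x client API; the table name, column family, key range, and region count here are illustrative assumptions, not requirements of the reproduction:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateEmptyRegions {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // "scp_repro" and the hex key range are arbitrary; 3000 regions gives
      // ~300 regions per RS on a 10-node test cluster. The regions stay empty.
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("scp_repro"));
      desc.addFamily(new HColumnDescriptor("f"));
      admin.createTable(desc, Bytes.toBytes("00000000"), Bytes.toBytes("ffffffff"), 3000);
    }
  }
}
{code}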

Set up your environment as follows:
 * Access to cluster-wide shutdown/startup via whatever control plane you use
 * Access to master-role specific restart via whatever control plane you use
 * A shell on the master node that tails the master log (use {{tail -F}} with a 
capital F to ride over log rolls).
 * A shell on the master node that’s ready to grab the HMaster’s pid and {{kill 
-9}} it.

The procedure is:
 * Start with a cluster as described above.
 * Restart the entire HBase cluster.
 * When the master log shows “Clean cluster startup. Assigning user regions” 
and it starts assigning regions, {{kill -9}} the master.
 * Stop the entire HBase cluster.
 * Restart the entire HBase cluster.
 * Once the master shows “Found regions out on cluster or in RIT”, {{kill -9}} 
the master again.
 * Restart only the Master.
 * (Optionally repeat the above two steps)

You know that you have hit the bug when the indicators above show up. If the Master still seems able to assign a lot of regions, {{kill -9}} it again and restart the Master role.
h3. Workaround:

The identified and tested workaround for mitigating this problem involves configuration tunings that speed up region assignment on the Region Servers and reduce the time spent by BulkAssigner threads on the Master side (a sketch of the corresponding hbase-site.xml overrides follows the lists below). We do not recommend these settings for normal operations.
 * {{hbase.master.procedure.threads}} - This property defines the general procedure pool size, which in the context of this issue is the pool for executing SCPs. Increasing this pool allows more SCPs for different Region Servers to run in parallel, allowing more region assignments to be processed. However, at the core of this problem is the fact that the Master may have multiple SCPs for the same Region Server, and those do not run in parallel, so tuning this parameter alone is not sufficient. We recommend setting it to the number of Region Servers in the cluster, so that in the normal scenario where there is one SCP per Region Server, all of them can run in parallel.
 * {{hbase.bulk.assignment.perregion.open.time}} - This property determines how long a bulk assignment thread in the Master's BulkAssigner waits for all of its regions to get assigned. Setting it to a value as low as 100 (milliseconds) allows the no-op SCPs to complete faster, which opens up execution slots for SCPs that can do actual assignment work.
 * {{hbase.master.namespace.init.timeout}} - The Master has a time limit on how long assignment of the namespace table may take. Given that we want to limit Master restarts, this is better adjusted upwards.

Since this issue is especially pronounced on clusters with a large number of regions per region server, the following additional config can also help:
 * {{hbase.regionserver.executor.openregion.threads}} - This controls the number of threads on each Region Server that handle region assignment (open) requests. Provided individual Region Servers are not already overloaded, tuning this value higher than the default (3) should help expedite region assignment.
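
For reference, a sketch of what these tunings could look like as hbase-site.xml (safety valve) overrides; the values below are illustrative assumptions for a cluster of roughly 100 Region Servers and, per the note above, should be reverted once recovery completes:
{code:xml}
<!-- Illustrative values only; size hbase.master.procedure.threads to your RS count. -->
<property>
  <name>hbase.master.procedure.threads</name>
  <value>100</value> <!-- roughly one execution slot per Region Server -->
</property>
<property>
  <name>hbase.bulk.assignment.perregion.open.time</name>
  <value>100</value> <!-- milliseconds; lets no-op SCPs give up on bulk-assign waits quickly -->
</property>
<property>
  <name>hbase.master.namespace.init.timeout</name>
  <value>3600000</value> <!-- milliseconds; extra headroom for namespace assignment -->
</property>
<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>10</value> <!-- default is 3; applies on each Region Server -->
</property>
{code}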

h3. Acknowledgements:

Thanks to the ton of folks who helped diagnose, chase down, and document this 
issue, its reproduction, and the workaround. Especially [~jdcryans], 
[~wchevreuil], [~an...@apache.org], [~avirmani], Shamik Dave, [~esteban], and 
[~elserj].


