[ https://issues.apache.org/jira/browse/HBASE-22263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bo Cui reassigned HBASE-22263:
------------------------------

    Assignee: Bo Cui  (was: Sean Busbey)

> Master creates duplicate ServerCrashProcedure on initialization, leading to 
> assignment hanging in region-dense clusters
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22263
>                 URL: https://issues.apache.org/jira/browse/HBASE-22263
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, Region Assignment
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
>            Reporter: Sean Busbey
>            Assignee: Bo Cui
>            Priority: Critical
>         Attachments: HBASE-22263-branch-1.v0.add.patch, 
> HBASE-22263-branch-1.v0.patch
>
>
> h3. Problem:
> During Master initialization we:
>  # restore existing procedures that still need to run from prior active
> Master instances
>  # look for signs that Region Servers have died and need to be recovered
> while we were out, and schedule a ServerCrashProcedure (SCP) for each of them
>  # turn on the assignment manager
> The normal turn of events for a ServerCrashProcedure is to attempt a bulk
> assignment of the set of regions that were on the RS, if possible. However,
> if the assignment manager isn’t ready yet, the SCP waits around and retries
> a bit later.
> Note that currently #2 has no notion of whether or not a previous active
> Master instance has already done this check. This means we might schedule an
> SCP for a ServerName (host, port, start code) that already has an SCP
> scheduled. Ideally, such a duplicate should be a no-op.
> However, before step #2 schedules the SCP it first marks the region server as
> dead and not yet processed, with the expectation that the SCP it just created
> will check whether there is log splitting work and then mark the server as
> ready for region assignment. At the same time, any restored SCPs that are
> already past the log splitting step will still be waiting for the
> AssignmentManager. As part of being restored, they do not update the current
> master instance’s state to show that they are already past the point of WAL
> processing.
> Once the AssignmentManager starts in #3 the restored SCP continues; it will
> eventually get to the assignment phase and find that its server is marked as
> dead and in need of WAL processing. Such assignments are skipped with a log
> message. Thus, as we iterate over the regions to assign, we’ll skip all of
> them. This non-intuitively shifts the “no-op” status from the newer SCP we
> scheduled at #2 to the older SCP that was restored in #1.
> Bulk assignment works by sending the assign calls via a pool to allow more 
> parallelism. Once we’ve set up the pool we just wait to see if the region 
> state updates to online. Unfortunately, since all of the assigns got skipped, 
> we’ll never change the state for any of these regions. That means the bulk 
> assign, and the older SCP that started it, will wait until it hits a timeout.
> By default the timeout for a bulk assignment is the smaller of {{(# Regions
> in the plan * 10s)}} or {{(# Regions in the most loaded RS in the plan * 1s +
> 60s + # of RegionServers in the cluster * 30s)}}. For even modest clusters
> with several hundred regions per region server, this means the “no-op” SCP
> will end up waiting tens of minutes (e.g. ~50 minutes for an average region
> density of 300 regions per region server on a 100-node cluster; ~11 minutes
> for 300 regions per region server on a 10-node cluster). During this time,
> the SCP will hold one of the available procedure execution slots for both
> the overall pool and for the specific server queue.
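> To make those numbers concrete, here is the arithmetic behind the two
> examples, taking “# Regions in the plan” to be the ~300 regions that were on
> the crashed server:
> {code}
> 300 regions/RS, 100-node cluster:
>   300 * 10s                     = 3,000s
>   300 * 1s + 60s + 100 * 30s    = 3,360s
>   min(3,000s, 3,360s)           = 3,000s  (~50 minutes)
>
> 300 regions/RS, 10-node cluster:
>   300 * 10s                     = 3,000s
>   300 * 1s + 60s + 10 * 30s     =   660s
>   min(3,000s, 660s)             =   660s  (~11 minutes)
> {code}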
> As previously mentioned, restored SCPs will retry their submission if the
> assignment manager has not yet been activated (done in #3); this can cause
> them to be scheduled after the newer SCPs (created in #2). Thus the order of
> execution of no-op and usable SCPs can vary from run to run of master
> initialization.
> This means that unless you get lucky with SCP ordering, impacted regions will 
> remain as RIT for an extended period of time. If you get particularly unlucky 
> and a critical system table is included in the regions that are being 
> recovered, then master initialization itself will end up blocked on this 
> sequence of SCP timeouts. If there are enough of them to exceed the master
> initialization timeouts, then the situation can be self-sustaining, as
> additional master failovers cause even more duplicative SCPs to be scheduled.
> h3. Indicators:
>  * Master appears to hang, failing to assign regions to available region
> servers.
>  * Master appears to hang during initialization; the log shows it waiting for
> the meta or namespace regions.
>  * Repeated master restarts allow some progress to be made on assignments for 
> a limited period of time.
>  * Master UI shows a large number of Server Crash Procedures in RUNNABLE
> state, and the number increases by roughly the number of Region Servers on
> each master restart.
>  * Master log shows a large number of messages saying that assignment of a
> region was skipped because the region was last seen on a region server that
> has not yet been processed. These messages come from the AssignmentManager
> logger. Normally this message should only occur when a Region Server dies
> just before some assignment is about to happen. When this combination of
> issues occurs, the message repeats: every time a defunct SCP is processed,
> it is logged once for each region.
> Example of the aforementioned message:
> {code:java}
> 2019-04-15 11:19:04,610 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning 
> test6,5f89022e,1555251200249.946ff80e7602e66853d33899819983a1., it is on a 
> dead but not processed yet server: 
> regionserver-8.hbase.example.com,22101,1555349031179
> {code}
> h3. Reproduction:
> The procedure we currently have to reproduce this issue requires specific
> timings that can be hard to get right, so it might require multiple tries.
> Before starting, the test cluster should have the following properties:
>  * Only 1 master. Not a strict requirement but it helps in the following 
> steps.
>  * Hundreds of regions per region server. If you need more regions, fire up
> your preferred data generation tool and tell it to create a large enough
> table. Those regions can be empty; there is no need to fill them with actual
> data.
>  * At least ten times more region servers than available CPUs on the master.
> If the number of CPUs is too high, set {{hbase.master.procedure.threads}} in
> the safety valves to the number of region servers divided by 10. For example,
> if you have 24 cores and a 10-node cluster, set the configuration to 2 or 3
> (see the example snippet after this list).
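> A minimal sketch of that setting, assuming it is applied through the master’s
> {{hbase-site.xml}} (the value of 2 follows the 24-core / 10-node example
> above):
> {code:xml}
> <!-- Illustrative only: shrink the Master procedure pool for the reproduction. -->
> <property>
>   <name>hbase.master.procedure.threads</name>
>   <value>2</value>
> </property>
> {code}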
> Set up your environment the following way:
>  * Access to cluster-wide shutdown/startup via whatever control plane you use.
>  * Access to master-role-specific restarts via whatever control plane you use.
>  * A shell on the master node that tails the master log (use {{tail -F}} with 
> a capital F to ride over log rolls).
>  * A shell on the master node that’s ready to grab the HMaster’s pid and 
> {{kill -9}} it.
> The procedure is:
>  * Start with a cluster as described above.
>  * Restart the entire HBase cluster.
>  * When the master log shows “Clean cluster startup. Assigning user regions” 
> and it starts assigning regions, {{kill -9}} the master.
>  * Stop the entire HBase cluster.
>  * Restart the entire HBase cluster.
>  * Once the master shows “Found regions out on cluster or in RIT”, {{kill 
> -9}} the master again.
>  * Restart only the Master.
>  * (Optionally repeat the above two steps)
> You know that you have hit the bug when the indicators above show up.
>  If the master seems to be able to assign a lot of regions again, {{kill -9}}
> it again and restart the Master role.
> h3. Workaround:
> The identified and tested workaround to mitigate this problem involves
> configuration tunings to speed up region assignment on Region Servers and to
> reduce the time spent by BulkAssignment threads on the Master side. We do not
> recommend these settings for normal operations. An example
> {{hbase-site.xml}} snippet with illustrative values follows the list below.
>  * {{hbase.master.procedure.threads}} - This property defines the general
> procedure pool size and, in the context of this issue, is the pool for
> executing SCPs. Increasing this pool allows more SCPs for different Region
> Servers to run in parallel, allowing more region assignments to be
> processed. However, at the core of this problem is the fact that the Master
> may have multiple SCPs for the same Region Server. These are not run in
> parallel; therefore tuning this parameter alone will not be sufficient. We
> recommend setting this parameter to the number of Region Servers in the
> cluster, so that under normal scenarios where there is one SCP for each
> Region Server, they can all run in parallel.
>  * {{hbase.bulk.assignment.perregion.open.time}} - This property determines
> how long a bulk assignment thread on the Master's BulkAssigner should wait
> for all of its regions to get assigned. Setting it to a value as low as 100
> (milliseconds) allows the no-op SCPs to complete faster, which opens up
> execution slots for SCPs that can do actual assignment work.
>  * {{hbase.master.namespace.init.timeout}} - The master has a time limit for 
> how long it takes to assign the namespace table. Given that we want to limit 
> master restarts, this is better adjusted upwards.
> Since this issue is especially pronounced on clusters with a large number of
> regions per region server, the following additional config can also help:
>  * {{hbase.regionserver.executor.openregion.threads}} - This controls the
> number of threads on each Region Server responsible for handling region
> assignment (open) requests. Provided individual Region Servers are not
> already overloaded, tuning this value higher than the default (3) should help
> expedite region assignment.
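> As an illustration, a sketch of how these tunings might look in
> {{hbase-site.xml}} (the 100 ms value is the one suggested above; the other
> values are only examples and should be sized to your cluster; the thread
> count here assumes roughly 100 Region Servers):
> {code:xml}
> <!-- Master side; illustrative values only. -->
> <property>
>   <name>hbase.master.procedure.threads</name>
>   <!-- roughly one slot per Region Server, e.g. a ~100-node cluster -->
>   <value>100</value>
> </property>
> <property>
>   <name>hbase.bulk.assignment.perregion.open.time</name>
>   <!-- milliseconds; lets the no-op SCPs give up on their bulk assign quickly -->
>   <value>100</value>
> </property>
> <property>
>   <name>hbase.master.namespace.init.timeout</name>
>   <!-- example: one hour, to reduce the chance of further master restarts -->
>   <value>3600000</value>
> </property>
> <!-- Region Server side -->
> <property>
>   <name>hbase.regionserver.executor.openregion.threads</name>
>   <!-- example value, raised from the default of 3 -->
>   <value>20</value>
> </property>
> {code}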
> h3. Acknowledgements:
> Thanks to the ton of folks who helped diagnose, chase down, and document this 
> issue, its reproduction, and the workaround. Especially [~jdcryans], 
> [~wchevreuil], [~an...@apache.org], [~avirmani], Shamik Dave, [~esteban], and 
> [~elserj].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
