[ https://issues.apache.org/jira/browse/HBASE-22263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999812#comment-16999812 ]
Pankaj Kumar commented on HBASE-22263:
--------------------------------------

We also hit this problem in our production environment (HBase 1.3.1), where it took more than 40 minutes to assign 20k regions. [~busbey], how did the testing on your internal branch go?


Master creates duplicate ServerCrashProcedure on initialization, leading to assignment hanging in region-dense clusters
------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-22263
                 URL: https://issues.apache.org/jira/browse/HBASE-22263
             Project: HBase
          Issue Type: Bug
          Components: proc-v2, Region Assignment
    Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
            Reporter: Sean Busbey
            Assignee: Sean Busbey
            Priority: Critical
         Attachments: HBASE-22263-branch-1.v0.patch

h3. Problem:

During Master initialization we:
# restore existing procedures that still need to run from prior active Master instances
# look for signs that Region Servers have died and need to be recovered while we were out, and schedule a ServerCrashProcedure (SCP) for each of them
# turn on the assignment manager

In the normal turn of events a ServerCrashProcedure will attempt to use a bulk assignment to maintain the set of regions that were on the Region Server, if possible. However, if the assignment manager is not ready yet, it waits around and retries a bit later.

Note that currently step #2 has no notion of whether or not a previous active Master instance has already done this check. This means we might schedule an SCP for a ServerName (host, port, start code) that already has an SCP scheduled. Ideally, such a duplicate should be a no-op.

However, before step #2 schedules the SCP it first marks the region server as dead and not yet processed, with the expectation that the SCP it just created will check for log splitting work and then mark the server as processed so its regions become eligible for assignment. At the same time, any restored SCPs that are already past the log splitting step will still be waiting for the AssignmentManager. As part of restoring themselves, they do not update the current master instance to show that they are past the point of WAL processing.

Once the AssignmentManager starts in step #3 the restored SCP continues; it eventually gets to the assignment phase and finds that its server is marked as dead and in need of WAL processing. Such assignments are skipped with a log message, so as we iterate over the regions to assign we skip all of them. This non-intuitively shifts the "no-op" status from the newer SCP scheduled in step #2 to the older SCP restored in step #1.

Bulk assignment works by sending the assign calls via a pool to allow more parallelism. Once we have set up the pool we just wait to see if the region state updates to online. Unfortunately, since all of the assigns got skipped, the state for these regions never changes. That means the bulk assign, and the older SCP that started it, will wait until it hits a timeout.

By default the timeout for a bulk assignment is the smaller of {{(# Regions in the plan * 10s)}} and {{(# Regions in the most loaded RS in the plan * 1s + 60s + # of RegionServers in the cluster * 30s)}}. For even modest clusters with several hundred regions per region server, this means the "no-op" SCP ends up waiting tens of minutes (e.g. ~50 minutes for an average region density of 300 regions per region server on a 100 node cluster, and ~11 minutes for 300 regions per region server on a 10 node cluster). During this time, the SCP holds one of the available procedure execution slots for both the overall pool and for the specific server queue.
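To make the timeout arithmetic concrete, here is a minimal sketch of that default calculation. The constants come from the formula quoted above; the class, method, and parameter names are illustrative only and do not correspond to HBase's internal API.

{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Rough illustration of the default bulk assignment timeout described above.
 * The constants mirror the quoted defaults; the names here are illustrative
 * and are not HBase internals.
 */
public class BulkAssignTimeoutEstimate {

  static long timeoutMillis(int regionsInPlan, int regionsOnMostLoadedRs, int regionServers) {
    // Option A: 10 seconds per region in the plan.
    long perRegionCap = TimeUnit.SECONDS.toMillis(10L * regionsInPlan);
    // Option B: 1 second per region on the most loaded server, plus a 60 second
    // base, plus 30 seconds per region server in the cluster.
    long perServerCap = TimeUnit.SECONDS.toMillis(
        1L * regionsOnMostLoadedRs + 60L + 30L * regionServers);
    // The bulk assigner waits for the smaller of the two.
    return Math.min(perRegionCap, perServerCap);
  }

  public static void main(String[] args) {
    // 300 regions per region server on a 100 node cluster: prints 50.0 minutes.
    System.out.println(timeoutMillis(300, 300, 100) / 60000.0 + " minutes");
    // 300 regions per region server on a 10 node cluster: prints 11.0 minutes.
    System.out.println(timeoutMillis(300, 300, 10) / 60000.0 + " minutes");
  }
}
{code}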
As previously mentioned, restored SCPs will retry their submission if the assignment manager has not yet been activated (that happens in step #3), so they can end up scheduled after the newer SCPs created in step #2. Thus the order of execution of no-op and usable SCPs can vary from run to run of master initialization.

This means that unless you get lucky with SCP ordering, impacted regions will remain in RIT for an extended period of time. If you get particularly unlucky and a critical system table is included in the regions being recovered, then master initialization itself ends up blocked on this sequence of SCP timeouts. If there are enough of them to exceed the master initialization timeouts, the situation can be self-sustaining, as additional master failovers cause even more duplicative SCPs to be scheduled.

h3. Indicators:

* Master appears to hang, failing to assign regions to available region servers.
* Master appears to hang during initialization, showing that it is waiting for the meta or namespace regions.
* Repeated master restarts allow some progress to be made on assignments for a limited period of time.
* Master UI shows a large number of Server Crash Procedures in RUNNABLE state, and the number increases by roughly the number of Region Servers on each master restart.
* Master log shows a large number of messages saying that assignment of a region failed because it was last seen on a region server that has not yet been processed. These messages come from the AssignmentManager logger. This message should normally only occur when a Region Server dies just before some assignment is about to happen. When this combination of issues happens, the message shows up repeatedly; every time another defunct SCP is processed it is logged again for each region.

Example of the aforementioned message:

{code:java}
2019-04-15 11:19:04,610 INFO org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning test6,5f89022e,1555251200249.946ff80e7602e66853d33899819983a1., it is on a dead but not processed yet server: regionserver-8.hbase.example.com,22101,1555349031179
{code}

h3. Reproduction:

The procedure we currently have to reproduce this issue requires specific timings that can be hard to get right, so it may take multiple tries.

Before starting, the test cluster should have the following properties:

* Only 1 master. Not a strict requirement, but it helps in the following steps.
* Hundreds of regions per region server. If you need more, fire up your preferred data generation tool and tell it to create a large enough table. Those regions can be empty; there is no need to fill them with actual data.
* At least ten times more region servers than available CPUs on the master. If the number of CPUs is too high, set {{hbase.master.procedure.threads}} in the safety valves to the number of region servers divided by 10. For example, if you have 24 cores and a 10-node cluster, set the configuration to 2 or 3.

Set up your environment the following way:

* Access to cluster-wide shutdown/startup via whatever control plane you use.
* Access to master-role-specific restart via whatever control plane you use.
* A shell on the master node that tails the master log (use {{tail -F}} with a capital F to ride over log rolls).
* A shell on the master node that is ready to grab the HMaster's pid and {{kill -9}} it.

The procedure is:

* Start with a cluster as described above.
* Restart the entire HBase cluster.
* When the master log shows "Clean cluster startup. Assigning user regions" and it starts assigning regions, {{kill -9}} the master.
* Stop the entire HBase cluster.
* Restart the entire HBase cluster.
* Once the master shows "Found regions out on cluster or in RIT", {{kill -9}} the master again.
* Restart only the Master.
* (Optionally repeat the above two steps.)

You know that you hit the bug when the given indicators show up. If it seems like the master is able to assign a lot of regions again, try {{kill -9}} once more and restart the Master role.

h3. Workaround:

The identified and tested workaround to mitigate this problem involves configuration tunings that speed up region assignment on the Region Servers and reduce the time spent by bulk assignment threads on the Master side. We do not recommend these settings for normal operations. A sketch of applying them appears at the end of this section.

* {{hbase.master.procedure.threads}} - This property defines the general procedure pool size, and in the context of this issue, the pool for executing SCPs. Increasing this pool allows SCPs for different Region Servers to run in parallel, so more region assignments can be processed. However, at the core of this problem is the fact that the Master may have multiple SCPs for the same Region Server; these are not run in parallel, so tuning this parameter alone is not sufficient. We recommend setting this parameter to the number of Region Servers in the cluster, so that under normal scenarios where there is one SCP per Region Server, all of them can run in parallel.
* {{hbase.bulk.assignment.perregion.open.time}} - This property determines how long a bulk assignment thread on the Master's BulkAssigner waits for all of its regions to get assigned. Setting it to a value as low as 100 (milliseconds) allows the no-op SCPs to complete faster, which opens up execution slots for SCPs that can do actual assignment work.
* {{hbase.master.namespace.init.timeout}} - The master has a time limit for how long it takes to assign the namespace table. Given that we want to limit master restarts, this is better adjusted upwards.

Since this issue is especially pronounced on clusters with a large number of regions per region server, the following additional config can also help:

* {{hbase.regionserver.executor.openregion.threads}} - This is the number of threads on each Region Server responsible for handling region assignment (open) requests. Provided individual Region Servers are not already overloaded, tuning this value higher than the default (3) should help expedite region assignment.
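As a rough illustration of the tunings listed above, the sketch below applies them programmatically via HBaseConfiguration. In practice these values would normally be set in hbase-site.xml (for example through a management tool's safety valve) on the Master and Region Servers. The concrete numbers are assumptions for a hypothetical 100-node cluster, not values prescribed by the report beyond what the bullets describe.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

/**
 * Sketch of the workaround tunings, using example values for a hypothetical
 * 100-node cluster. Revert these once the incident is over; they are not
 * recommended for normal operations.
 */
public class Hbase22263WorkaroundSketch {
  public static Configuration apply() {
    Configuration conf = HBaseConfiguration.create();
    // One SCP execution slot per Region Server (here: 100 Region Servers),
    // so non-duplicate SCPs can run in parallel.
    conf.setInt("hbase.master.procedure.threads", 100);
    // Let no-op bulk assignments give up after 100 ms instead of waiting out
    // the default timeout.
    conf.setInt("hbase.bulk.assignment.perregion.open.time", 100);
    // Give the master more time (example: 1 hour, in milliseconds) to assign
    // the namespace table, to avoid further master restarts.
    conf.setLong("hbase.master.namespace.init.timeout", 3600000L);
    // More open-region handlers per Region Server than the default of 3
    // (10 is an example value).
    conf.setInt("hbase.regionserver.executor.openregion.threads", 10);
    return conf;
  }
}
{code}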
h3. Acknowledgements:

Thanks to the ton of folks who helped diagnose, chase down, and document this issue, its reproduction, and the workaround. Especially [~jdcryans], [~wchevreuil], [~an...@apache.org], [~avirmani], Shamik Dave, [~esteban], and [~elserj].

--
This message was sent by Atlassian Jira
(v8.3.4#803005)