[ https://issues.apache.org/jira/browse/HBASE-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114519#comment-17114519 ]
HBase QA commented on HBASE-23269:
----------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 6m 1s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} branch-1.4 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 15s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 28s{color} | {color:green} branch-1.4 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} branch-1.4 passed with JDK v1.8.0_252 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} branch-1.4 passed with JDK v1.7.0_262 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 11s{color} | {color:green} branch-1.4 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 3m 4s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} branch-1.4 passed with JDK v1.8.0_252 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} branch-1.4 passed with JDK v1.7.0_262 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 51s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 54s{color} | {color:green} branch-1.4 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 17s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} the patch passed with JDK v1.8.0_252 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} the patch passed with JDK v1.7.0_262 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 2m 57s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 2m 19s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.7. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} the patch passed with JDK v1.8.0_252 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} the patch passed with JDK v1.7.0_262 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 4s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}114m 38s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 57s{color} | {color:green} hbase-rsgroup in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}174m 24s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.9 Server=19.03.9 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-844/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hbase/pull/844 |
| JIRA Issue | HBASE-23269 |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux d789a9ecdf76 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-844/out/precommit/personality/provided.sh |
| git revision | branch-1.4 / 25298ea |
| Default Java | 1.7.0_262 |
| Multi-JDK versions | /usr/lib/jvm/zulu-8-amd64:1.8.0_252 /usr/lib/jvm/zulu-7-amd64:1.7.0_262 |
| Test Results | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-844/2/testReport/ |
| Max. process+thread count | 3824 (vs. ulimit of 10000) |
| modules | C: hbase-server hbase-rsgroup U: . |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-844/2/console |
| versions | git=1.9.1 maven=3.0.5 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |

This message was automatically generated.

> Hbase crashed due to two versions of regionservers when rolling upgrading
> -------------------------------------------------------------------------
>
>                 Key: HBASE-23269
>                 URL: https://issues.apache.org/jira/browse/HBASE-23269
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.4.0, 1.4.2, 1.4.9, 1.4.10, 1.4.11
>            Reporter: Jianzhen Xu
>            Assignee: Jianzhen Xu
>            Priority: Critical
>         Attachments: 9.png, image-2019-11-07-14-49-41-253.png, image-2019-11-07-14-50-11-877.png, image-2019-11-07-14-51-38-858.png
>
>
> Currently, when HBase has the rs_group feature enabled and is upgraded to a
> higher version, assignment of the meta table may fail, which eventually makes
> the whole cluster unavailable (availability drops to 0). This applies to all
> hbase-1.4.* versions that include the rs_group feature. It also occurs when a
> version below 1.4 that carries the rs_group patch is upgraded to 1.4.
> When this happens during an upgrade:
> * During a rolling upgrade of regionservers, the problem always occurs if the
> first rs to be upgraded is not in the same rs_group as the meta table.
> The phenomenon is as follows:
> !image-2019-11-07-14-50-11-877.png!
> !image-2019-11-07-14-51-38-858.png!
> The reason is as follows: during the rolling upgrade of the first
> regionserver node (denoted RS1), RS1 starts up and re-registers with ZK. The
> master notices this through the watcher in RegionServerTracker and eventually
> calls HMaster.checkIfShouldMoveSystemRegionAsync().
> The logic of this method is as follows:
>
> {code:java}
> public void checkIfShouldMoveSystemRegionAsync() {
>   new Thread(new Runnable() {
>     @Override
>     public void run() {
>       try {
>         synchronized (checkIfShouldMoveSystemRegionLock) {
>           // RS register on ZK after reports startup on master
>           List<HRegionInfo> regionsShouldMove = new ArrayList<>();
>           for (ServerName server : getExcludedServersForSystemTable()) {
>             regionsShouldMove.addAll(getCarryingSystemTables(server));
>           }
>           if (!regionsShouldMove.isEmpty()) {
>             List<RegionPlan> plans = new ArrayList<>();
>             for (HRegionInfo regionInfo : regionsShouldMove) {
>               RegionPlan plan = getRegionPlan(regionInfo, true);
>               if (regionInfo.isMetaRegion()) {
>                 // Must move meta region first.
>                 balance(plan);
>               } else {
>                 plans.add(plan);
>               }
>             }
>             for (RegionPlan plan : plans) {
>               balance(plan);
>             }
>           }
>         }
>       } catch (Throwable t) {
>         LOG.error(t);
>       }
>     }
>   }).start();
> }{code}
>
> # First, getExcludedServersForSystemTable() is executed: it finds the highest
> version among all regionservers and returns every RS below that version,
> labeled LowVersionRSList.
> # If step 1 returns any servers, iterate over them. If an RS in the list
> carries a region of a system table, add that region to the list of regions
> that need to move. If the first RS upgraded at this point is not in the
> rs_group where the system table is located, the region of the meta table is
> added to regionsShouldMove.
> # Get a RegionPlan for each region in regionsShouldMove, with the parameter
> forceNewPlan set to true:
> ## Get all regionservers whose version is below the highest version;
> ## Exclude the regionservers from 3.1 from all online RSs. The result is that
> only the upgraded RSs remain in the collection, marked as destServers;
> ## Since forceNewPlan is set to true, the destination server is obtained
> through balancer.randomAssignment(region, destServers). Since the rs_group
> feature is enabled, the balancer here is RSGroupBasedLoadBalancer. The logic
> in this method is:
> ### The destServers obtained in 3.2 are intersected with all online
> regionservers in the rs_group of the current region. When the region belongs
> to a system table and the upgraded RS is not in the same rs_group, the result
> here is null. If null is returned, the destination regionserver is hard-coded
> as BOGUS_SERVER_NAME (localhost,1);
> Therefore, when the master assigns the region of the system table to
> localhost,1, the assignment naturally fails. If you were not aware of the
> master logic above and this problem occurs, you can upgrade any node in the
> rs_group where the system table is located, and the cluster will recover
> automatically.
> During an actual upgrade, you will rarely be aware of this problem without
> reading the master code. However, the official documentation does not state
> that, when the rs_group feature is used, the rs_group where the system table
> is located needs to be upgraded first. It is easy to run into this situation
> and eventually crash. According to the code comment, system tables are
> assigned to the RSs with the highest version for compatibility purposes.
> Therefore, without changing the code logic, the official documentation could
> note that the rs_group of the system table should be upgraded first when a
> cluster using the rs_group feature is upgraded.
>
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
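
The destination selection described in steps 1 to 3 of the quoted issue can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not HBase code: the class `SystemRegionPlanSketch`, its methods, the string server names, and the integer versions are all hypothetical, and the first surviving candidate is picked instead of a random one so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-alone model of the destination choice; not HBase code. */
final class SystemRegionPlanSketch {
  // Placeholder destination used when no candidate is left,
  // mirroring BOGUS_SERVER_NAME (localhost,1) from the issue text.
  static final String BOGUS_SERVER_NAME = "localhost,1";

  /** Step 1: servers whose version is below the highest version online. */
  static List<String> excludedServers(Map<String, Integer> versionByServer) {
    int highest = Integer.MIN_VALUE;
    for (int version : versionByServer.values()) {
      highest = Math.max(highest, version);
    }
    List<String> excluded = new ArrayList<>();
    for (Map.Entry<String, Integer> e : versionByServer.entrySet()) {
      if (e.getValue() < highest) {
        excluded.add(e.getKey());
      }
    }
    return excluded;
  }

  /** Step 3: keep only upgraded servers, then intersect with the region's rs_group. */
  static String chooseDestination(Map<String, Integer> versionByServer,
                                  Set<String> groupServers) {
    List<String> destServers = new ArrayList<>(versionByServer.keySet());
    destServers.removeAll(excludedServers(versionByServer)); // steps 3.1 and 3.2
    destServers.retainAll(groupServers);                     // rs_group intersection
    // Assumption: first candidate instead of a random one, for determinism.
    return destServers.isEmpty() ? BOGUS_SERVER_NAME : destServers.get(0);
  }

  public static void main(String[] args) {
    Map<String, Integer> versions = new LinkedHashMap<>();
    versions.put("rs1", 2); // first upgraded RS, outside the system rs_group
    versions.put("rs2", 1);
    versions.put("rs3", 1); // carries meta, inside the system rs_group
    versions.put("rs4", 1);
    Set<String> systemGroup = new HashSet<>(Arrays.asList("rs3", "rs4"));

    System.out.println(chooseDestination(versions, systemGroup)); // localhost,1

    versions.put("rs3", 2); // upgrade one node of the system rs_group
    System.out.println(chooseDestination(versions, systemGroup)); // rs3
  }
}
```

In this model the only upgraded server is outside the system table's rs_group, so the intersection is empty and the placeholder destination is chosen. Upgrading any one server inside that rs_group makes the intersection non-empty again, which matches the recovery described in the issue.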