[jira] [Created] (HBASE-27861) Increases the concurrency of off-peak compactions
Xiao Zhang created HBASE-27861:
--
Summary: Increases the concurrency of off-peak compactions
Key: HBASE-27861
URL: https://issues.apache.org/jira/browse/HBASE-27861
Project: HBase
Issue Type: Improvement
Environment: HBase 2.0.2, Hadoop 3.1.1
Reporter: Xiao Zhang
Assignee: Xiao Zhang

Off-peak compaction selection is currently tracked globally: only one off-peak compaction can run at a time, so system resources are under-used during the off-peak window. This is unfriendly to clusters whose workload has pronounced peaks and valleys.

Code:
{code:java}
private static final AtomicBoolean offPeakCompactionTracker = new AtomicBoolean();

// Normal case - coprocessor is not overriding file selection.
if (!compaction.hasSelection()) {
  boolean isUserCompaction = priority == Store.PRIORITY_USER;
  boolean mayUseOffPeak =
    offPeakHours.isOffPeakHour() && offPeakCompactionTracker.compareAndSet(false, true);
  try {
    compaction.select(this.filesCompacting, isUserCompaction, mayUseOffPeak,
      forceMajor && filesCompacting.isEmpty());
  } catch (IOException e) {
    if (mayUseOffPeak) {
      offPeakCompactionTracker.set(false);
    }
    throw e;
  }
  assert compaction.hasSelection();
  if (mayUseOffPeak && !compaction.getRequest().isOffPeak()) {
    // Compaction policy doesn't want to take advantage of off-peak.
    offPeakCompactionTracker.set(false);
  }
}
{code}

I think this could be optimized by allowing the user to set the number of concurrent off-peak compactions.
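To illustrate the proposal above, one option is to replace the single AtomicBoolean with a counter capped by a configurable limit. The sketch below is not the actual HBase patch: the class name and the configuration key "hbase.offpeak.compaction.max.concurrent" are hypothetical, and a real change would have to plug into the selection path shown in the snippet above.

{code:java}
// Hypothetical sketch: a permit-based tracker instead of a single global AtomicBoolean.
import java.util.concurrent.atomic.AtomicInteger;

public class OffPeakCompactionTracker {
  // e.g. read from a (hypothetical) config key "hbase.offpeak.compaction.max.concurrent"
  private final int maxConcurrent;
  private final AtomicInteger running = new AtomicInteger();

  public OffPeakCompactionTracker(int maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
  }

  /** Try to claim an off-peak slot; plays the role of compareAndSet(false, true) above. */
  public boolean tryAcquire() {
    while (true) {
      int current = running.get();
      if (current >= maxConcurrent) {
        return false; // all off-peak slots in use, fall back to the normal compaction ratio
      }
      if (running.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Release the slot; plays the role of offPeakCompactionTracker.set(false) above. */
  public void release() {
    running.decrementAndGet();
  }
}
{code}

With maxConcurrent = 1 this behaves exactly like the existing AtomicBoolean, so a default of 1 would keep the current behavior while letting operators raise the limit for the off-peak window.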
Re: [VOTE] Merge feature branch HBASE-27109 back to master
+1 Thanks all for the efforts!

Best Regards,
Yu

On Fri, 12 May 2023 at 10:17, tianhang tang wrote:
> +1
>
> 张铎(Duo Zhang) wrote on Wed, 10 May 2023 at 21:20:
> > Oh, it seems finally the 3 VOTE emails are all sent...
> >
> > Sorry for the spam...
> >
> > Liangjun He <2005hit...@163.com> wrote on Wed, 10 May 2023 at 19:36:
> > > +1
> > >
> > > At 2023-05-10 01:13:12, "张铎(Duo Zhang)" wrote:
> > > > The issue is about moving replication queue storage from zookeeper to an hbase table. This is the last piece of persistent data on zookeeper, so after this feature is merged we can finally say that all data on zookeeper can be removed while restarting a cluster.
> > > >
> > > > Let me paste the release note here:
> > > >
> > > > > We introduced a table based replication queue storage in this issue. The queue data will be stored in the hbase:replication table. This is the last piece of persistent data on zookeeper, so after this change we are OK to clean up all the data on zookeeper: it is now all transient, and a cluster restart can fix everything.
> > > > >
> > > > > The data structure has changed a bit: we now only store an offset per WAL group instead of storing all the WAL files for a WAL group. Please see the replication internals section in our ref guide for more details.
> > > > >
> > > > > To break the cyclic dependency issue (creating a new WAL writer requires writing to the replication queue storage first, but with table based replication queue storage you first need a WAL writer when you want to update the table), we no longer record a queue when creating a new WAL writer instance. The downside of this change is that the logic for claiming queues and the WAL cleaner is much more complicated. See AssignReplicationQueuesProcedure and ReplicationLogCleaner for more details if you are interested.
> > > > >
> > > > > Notice that we will use a separate WAL provider for the hbase:replication table, so you will see a new WAL file for the region server which holds the hbase:replication table. If we did not do this, updates to the hbase:replication table would also generate WAL edits in the WAL files we need to track in replication, and then lead to more updates to the hbase:replication table since we have advanced the replication offset. This way we would generate a lot of garbage in our WAL files even if we write nothing to the cluster, so a separate WAL provider which is not tracked by replication is necessary here.
> > > > >
> > > > > The data migration will be done automatically during a rolling upgrade; of course, migration via a full cluster restart is also supported, but please make sure you restart the master with the new code first. The replication peers will be disabled during the migration and no claiming queue will be scheduled at the same time, so you may see a lot of unfinished SCPs during the migration. Do not worry: it will not block normal failover, and all regions will be assigned. The replication peers will be enabled again after the migration is done; no manual operations are needed.
> > > > >
> > > > > The ReplicationSyncUp tool is also affected. The goal of this tool is to replicate data to the peer cluster while the source cluster is down. But if we store the replication queue data in an hbase table, it is impossible for us to get the newest data if the source cluster is down. So here we choose to read from the region directory directly to load all the replication queue data into memory and do the sync up work. We may lose the newest data, so we may need to replicate more data this way, but it will not affect correctness.
> > > >
> > > > The nightly job is here:
> > > >
> > > > https://ci-hbase.apache.org/job/HBase%20Nightly/job/HBASE-27109%252Ftable_based_rqs/
> > > >
> > > > Mostly fine; the failed UTs are unrelated and flaky. For example, in build #73 the failed UT is TestAdmin1.testCompactionTimestamps, which is not related to replication, and it only failed in the jdk11 build but passed in the jdk8 build.
> > > >
> > > > This is the PR against the master branch:
> > > >
> > > > https://github.com/apache/hbase/pull/5202
> > > >
> > > > The PR is big, as we have 16 commits on the feature branch.
> > > >
> > > > The VOTE will be open for at least 72 hours.
> > > >
> > > > [+1] Agree
> > > > [+0] Neutral
> > > > [-1] Disagree (please include actionable feedback)
> > > >
> > > > Thanks.
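To make the "offset per WAL group" wording in the release note above concrete, here is a tiny illustrative sketch. It is plain in-memory Java with hypothetical names, not the classes added on the HBASE-27109 branch; in the actual feature the per-group offsets are persisted in the hbase:replication table rather than in a map.

{code:java}
// Hypothetical illustration of "one offset per WAL group" (not the actual HBASE-27109 code).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReplicationQueueOffsets {

  /** An offset is the current WAL file of the group plus the position already replicated. */
  public static final class WalGroupOffset {
    final String walFile;
    final long position;

    WalGroupOffset(String walFile, long position) {
      this.walFile = walFile;
      this.position = position;
    }
  }

  // One entry per WAL group, instead of keeping the full list of WAL files per group.
  private final Map<String, WalGroupOffset> offsets = new ConcurrentHashMap<>();

  /** Record how far replication has progressed for a WAL group. */
  public void advance(String walGroup, String walFile, long position) {
    offsets.put(walGroup, new WalGroupOffset(walFile, position));
  }

  /** Where replication should resume for a WAL group, or null if nothing is recorded yet. */
  public WalGroupOffset resumeFrom(String walGroup) {
    return offsets.get(walGroup);
  }
}
{code}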
Re: [VOTE] Merge feature branch HBASE-27109 back to master
+1

张铎(Duo Zhang) wrote on Wed, 10 May 2023 at 21:20:
> Oh, it seems finally the 3 VOTE emails are all sent...
>
> Sorry for the spam...
[jira] [Created] (HBASE-27860) Fix build error against Hadoop 3.3.5
Shuhei Yamasaki created HBASE-27860:
---
Summary: Fix build error against Hadoop 3.3.5
Key: HBASE-27860
URL: https://issues.apache.org/jira/browse/HBASE-27860
Project: HBase
Issue Type: Bug
Components: build, hadoop3
Reporter: Shuhei Yamasaki

Building with Hadoop 3.3.5 fails with the following messages: packages that are not on the allowed list end up in the shaded jar.

{code:java}
$ mvn clean install -DskipTests -Phadoop-3.0 -Dhadoop-three.version=3.3.5
...
[INFO] --- exec:1.6.0:exec (check-jar-contents-for-stuff-with-hadoop) @ hbase-shaded-with-hadoop-check-invariants ---
[ERROR] Found artifact with unexpected contents: '/home/yamasakisua/hbase/hbase-shaded/hbase-shaded-client/target/hbase-shaded-client-2.4.17.jar'
Please check the following and either correct the build or update the allowed list with reasoning.
com/
com/sun/
com/sun/jersey/
com/sun/jersey/json/
{code}
[DISCUSS] potential presentation for community over code
Hi guys,

Apologies for the delay. Following our recent meetup, I was wondering if we could initiate a discussion and compile a list of potential topics to be presented at the Community Over Code conference 2023 [1]. Please note that there is also an Asia edition [2] with a closer proposal deadline.

Based on the conversations during the meetup, I suggest starting this thread with the following topics. However, feel free to add any additional ideas that come to mind as well.

1. What does the stable version (2.5.x) of HBase provide? What are the use cases (maybe including the use cases of 2.4)? And where will HBase move forward?
   a. Security updates with TLS.
   b. HBase incremental backup.
   c. Reduced dependency on ZooKeeper.
   d. Prometheus integration.

Furthermore, we can discuss potential presenters for the Asia edition and/or the international edition.

[1] https://communityovercode.org/
[2] https://www.bagevent.com/event/cocasia-2023-EN

Thanks,
Stephen
[jira] [Resolved] (HBASE-27851) TestListTablesByState is silently failing due to a surefire bug
[ https://issues.apache.org/jira/browse/HBASE-27851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang resolved HBASE-27851.
---
Fix Version/s: 2.6.0, 3.0.0-alpha-4
Hadoop Flags: Reviewed
Assignee: Jonathan Albrecht
Resolution: Fixed

Pushed to master and branch-2. Thanks [~jonathan.albrecht] for contributing!

> TestListTablesByState is silently failing due to a surefire bug
> ---
>
> Key: HBASE-27851
> URL: https://issues.apache.org/jira/browse/HBASE-27851
> Project: HBase
> Issue Type: Bug
> Components: test
> Environment: Maven hbase-server tests
> Reporter: Jonathan Albrecht
> Assignee: Jonathan Albrecht
> Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4
>
> Surefire version 3.0.0-M6 has a bug where tests end up being removed from the test results if they fail with a long exception message. See: https://issues.apache.org/jira/browse/SUREFIRE-2079
>
> org.apache.hadoop.hbase.master.TestListTablesByState is currently failing in CI due to an error. However, it does not show up in the Test Results because of the surefire bug.
>
> If you download the raw test_logs from the build artifacts, you will find the file:
> /home/jenkins/jenkins-home/workspace/HBase_Nightly_master/output-jdk8-hadoop3/archiver/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestListTablesByState.txt
>
> which contains:
> {{---}}
> {{Test set: org.apache.hadoop.hbase.master.TestListTablesByState}}
> {{---}}
> {{Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.929 s - in org.apache.hadoop.hbase.master.TestListTablesByState}}
>
> and the file:
> /home/jenkins/jenkins-home/workspace/HBase_Nightly_master/output-jdk8-hadoop3/archiver/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestListTablesByState-output.txt
>
> which contains exceptions like:
> {{...}}
> {{2023-05-04T11:41:56,262 INFO [RPCClient-NioEventLoopGroup-4-3 {}] client.RawAsyncHBaseAdmin$TableProcedureBiConsumer(2603): Operation: CREATE, Table Name: default:test failed with org.apache.hadoop.hbase.DoNotRetryIOException: Table test should have at least one column family.}}
> {{ at}}
> {{...}}
>
> I found this while testing the final surefire 3.0.0 release, which fixes the bug; the test then shows up as failing.
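To illustrate the surefire behavior described above, here is a minimal, hypothetical reproducer shape: a JUnit 4 test that fails with a very long exception message. It is not related to TestListTablesByState itself; it only shows the kind of failure that SUREFIRE-2079 is about.

{code:java}
// Hypothetical minimal example of a failure with a very long exception message.
import org.junit.Test;

public class LongMessageFailureExample {

  @Test
  public void failsWithLongMessage() {
    StringBuilder message = new StringBuilder("simulated failure: ");
    for (int i = 0; i < 100_000; i++) {
      message.append('x'); // make the exception message very large
    }
    // Per the issue above, with surefire 3.0.0-M6 a failure like this could disappear from
    // the test report (e.g. "Tests run: 0"); with 3.0.0 it is reported as a failure.
    throw new RuntimeException(message.toString());
  }
}
{code}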
[jira] [Created] (HBASE-27859) HMaster.getCompactionState can throw NPE when region state is closed
guluo created HBASE-27859:
-
Summary: HMaster.getCompactionState can throw NPE when region state is closed
Key: HBASE-27859
URL: https://issues.apache.org/jira/browse/HBASE-27859
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 2.4.13, 2.3.7
Environment: hbase 2.4.13
Reporter: guluo
Attachments: 2023-05-11_001147.png

Steps to reproduce:

# Create the table:
{code:java}
create 'hbase_region_test', 'info', SPLITS => ['3', '7']
{code}
# Write data:
{code:java}
put 'hbase_region_test', '10010', 'info:name', 'Tom'
put 'hbase_region_test', '20010', 'info:name', 'Tom'
put 'hbase_region_test', '30010', 'info:name', 'Tom'
put 'hbase_region_test', '40010', 'info:name', 'Tom'
put 'hbase_region_test', '50010', 'info:name', 'Tom'
put 'hbase_region_test', '60010', 'info:name', 'Tom'
put 'hbase_region_test', '70010', 'info:name', 'Tom'
put 'hbase_region_test', '80010', 'info:name', 'Tom'
put 'hbase_region_test', '90010', 'info:name', 'Tom'
{code}
# Close a region of hbase_region_test: !2023-05-11_001147.png!
# Call HMaster.getCompactionState. The method can be triggered by opening the HBase UI page with the 'hbase_region_test' table details, which simply fetches the compaction state of 'hbase_region_test'.
# HMaster logs an NPE from HMaster.getCompactionState.
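A minimal sketch of the kind of defensive check that would avoid this NPE. It is not the actual HMaster code: RegionView and its serverName field are hypothetical stand-ins for whatever master-side lookup returns null once a region is closed. The idea is simply to skip unassigned regions instead of dereferencing a null server reference.

{code:java}
// Illustrative only: skip regions that have no hosting server instead of dereferencing null.
import java.util.List;

public class CompactionStateSketch {

  enum CompactionState { NONE, MINOR, MAJOR, MAJOR_AND_MINOR }

  /** Hypothetical stand-in for the master-side region record; not an HBase class. */
  static final class RegionView {
    final String regionName;
    final String serverName; // null when the region is closed / unassigned

    RegionView(String regionName, String serverName) {
      this.regionName = regionName;
      this.serverName = serverName;
    }
  }

  /** Aggregate a table's compaction state, ignoring regions that are not open. */
  static CompactionState getCompactionState(List<RegionView> regions) {
    CompactionState result = CompactionState.NONE;
    for (RegionView region : regions) {
      if (region.serverName == null) {
        // A closed region has no region server to query, so skip it rather than
        // hitting the NullPointerException described in this issue.
        continue;
      }
      // ... ask the hosting region server for this region's state and merge it into result ...
    }
    return result;
  }
}
{code}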
[jira] [Created] (HBASE-27858) Update surefire version to 3.0.0 and use the SurefireForkNodeFactory
Jonathan Albrecht created HBASE-27858:
-
Summary: Update surefire version to 3.0.0 and use the SurefireForkNodeFactory
Key: HBASE-27858
URL: https://issues.apache.org/jira/browse/HBASE-27858
Project: HBase
Issue Type: Improvement
Components: test
Environment: maven unit tests
Reporter: Jonathan Albrecht

The final surefire version 3.0.0 has been released, so we can update from 3.0.0-M6 to 3.0.0. The final version has several fixes and is more stable.

SurefireForkNodeFactory is a new strategy for controlling how the surefire forked nodes communicate with the main Maven process. It uses a TCP channel instead of process pipes. It helps fix some "corrupted channel" messages seen in the s390x build and should be more reliable on all platforms.

I have been testing locally on amd64 and s390x and have not had any issues.

Ref: [https://maven.apache.org/surefire/maven-surefire-plugin/examples/process-communication.html]
[jira] [Created] (HBASE-27857) HBaseClassTestRule: system exit not restored if test times out may cause test to hang
Jonathan Albrecht created HBASE-27857:
-
Summary: HBaseClassTestRule: system exit not restored if test times out may cause test to hang
Key: HBASE-27857
URL: https://issues.apache.org/jira/browse/HBASE-27857
Project: HBase
Issue Type: Bug
Components: test
Environment: maven unit tests
Reporter: Jonathan Albrecht

HBaseClassTestRule applies a timeout rule and a system exit rule to tests. The timeout rule throws an exception if it hits the timeout threshold. Since the timeout rule is applied after the system exit rule, the system exit rule does not see the exception and does not re-enable the default system exit behavior, which can cause maven to hang on some tests.

I saw the hang happen when certain tests timed out on s390x, but it could happen on any platform.

If the org.apache.hadoop.hbase.TestTimeout.infiniteLoop test is enabled and run, it will generate a *-jvmRun1.dump file which shows that the org.apache.hadoop.hbase.TestSecurityManager is still enabled:

{quote}
# Created at 2023-04-27T15:51:58.947
org.apache.hadoop.hbase.SystemExitRule$SystemExitInTestException
  at org.apache.hadoop.hbase.TestSecurityManager.checkExit(TestSecurityManager.java:32)
  at java.base/java.lang.Runtime.exit(Runtime.java:114)
  at java.base/java.lang.System.exit(System.java:1752)
  at org.apache.maven.surefire.booter.ForkedBooter.acknowledgedExit(ForkedBooter.java:381)
  at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:178)
  at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507)
  at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495)
...
{quote}
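For context on the ordering point above, here is a minimal sketch (not the actual HBaseClassTestRule or SystemExitRule implementation) of a JUnit 4 rule that restores global state in a finally block. Whether that finally runs when a timeout rule throws depends on which rule wraps the other, which is the ordering problem this issue describes.

{code:java}
// Conceptual sketch of rule ordering; names and the boolean flag are illustrative only.
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

public class RestoreStateRule implements TestRule {

  // Stand-in for "a test-only SecurityManager that intercepts System.exit is installed".
  static volatile boolean exitIntercepted = false;

  @Override
  public Statement apply(Statement base, Description description) {
    return new Statement() {
      @Override
      public void evaluate() throws Throwable {
        exitIntercepted = true;    // install the test-only behavior
        try {
          base.evaluate();         // run whatever statements this rule wraps
        } finally {
          exitIntercepted = false; // restore the default behavior even if a wrapped rule throws
        }
      }
    };
  }

  // Usage note: with RuleChain.outerRule(new RestoreStateRule()).around(Timeout.seconds(10)),
  // the restore wraps the timeout, so its finally block still runs when the timeout throws.
  // If the wrapping order is reversed, the restore can be skipped, which is the kind of
  // leftover state this issue reports.
}
{code}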
[jira] [Created] (HBASE-27856) Add hadolint binary to operator-tools yetus environment
Nick Dimiduk created HBASE-27856:
Summary: Add hadolint binary to operator-tools yetus environment
Key: HBASE-27856
URL: https://issues.apache.org/jira/browse/HBASE-27856
Project: HBase
Issue Type: Task
Components: build
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
Fix For: hbase-operator-tools-1.3.0

Since we're adding dockerfiles via HBASE-27827, let's also have a pre-commit check for them.