[jira] [Created] (HBASE-27861) Increases the concurrency of off-peak compactions

2023-05-11 Thread Xiao Zhang (Jira)
Xiao Zhang created HBASE-27861:
--

 Summary: Increases the concurrency of off-peak compactions
 Key: HBASE-27861
 URL: https://issues.apache.org/jira/browse/HBASE-27861
 Project: HBase
  Issue Type: Improvement
 Environment: HBase 2.0.2

Hadoop 3.1.1
Reporter: Xiao Zhang
Assignee: Xiao Zhang


Off-peak compaction is currently tracked by a single global flag, so only one 
off-peak compaction can run at a time. This does not make good use of system 
resources during the off-peak period, and it is unfriendly to clusters whose 
workload has pronounced peaks and valleys.

Code:
{code:java}
private static final AtomicBoolean offPeakCompactionTracker = new AtomicBoolean();

// Normal case - coprocessor is not overriding file selection.
if (!compaction.hasSelection()) {
  boolean isUserCompaction = priority == Store.PRIORITY_USER;
  boolean mayUseOffPeak =
    offPeakHours.isOffPeakHour() && offPeakCompactionTracker.compareAndSet(false, true);
  try {
    compaction.select(this.filesCompacting, isUserCompaction, mayUseOffPeak,
      forceMajor && filesCompacting.isEmpty());
  } catch (IOException e) {
    if (mayUseOffPeak) {
      offPeakCompactionTracker.set(false);
    }
    throw e;
  }
  assert compaction.hasSelection();
  if (mayUseOffPeak && !compaction.getRequest().isOffPeak()) {
    // Compaction policy doesn't want to take advantage of off-peak.
    offPeakCompactionTracker.set(false);
  }
}{code}
 

I think this could be improved by allowing the user to configure the number of 
concurrent off-peak compactions.
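
For illustration, below is a minimal sketch of what a counter-based tracker could look like. The configuration key name and the class are hypothetical (they do not exist in HBase today); the idea is simply to replace the single global AtomicBoolean with a bounded counter so that up to N off-peak compactions can run at the same time.
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only; not existing HBase code. The config key name below
// is an assumption for illustration.
public class OffPeakCompactionSlots {

  /** Hypothetical configuration key for the maximum concurrent off-peak compactions. */
  public static final String MAX_OFFPEAK_COMPACTIONS_KEY =
      "hbase.offpeak.compaction.max.concurrent";

  private final int maxConcurrent;
  private final AtomicInteger running = new AtomicInteger();

  public OffPeakCompactionSlots(int maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
  }

  /** Try to reserve an off-peak slot; plays the role of compareAndSet(false, true) above. */
  public boolean tryAcquire() {
    while (true) {
      int current = running.get();
      if (current >= maxConcurrent) {
        return false; // all off-peak slots are in use
      }
      if (running.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Release a slot; plays the role of offPeakCompactionTracker.set(false) above. */
  public void release() {
    running.decrementAndGet();
  }
}
{code}
The selection code above would then call tryAcquire() where it currently calls compareAndSet(false, true), and release() wherever it currently resets the flag to false, both on selection failure and when the policy declines the off-peak request.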

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [VOTE] Merge feature branch HBASE-27109 back to master

2023-05-11 Thread Yu Li
+1

Thanks all for the efforts!

Best Regards,
Yu


On Fri, 12 May 2023 at 10:17, tianhang tang  wrote:

> +1
>
> 张铎(Duo Zhang) wrote on Wed, May 10, 2023 at 21:20:
> >
> > Oh, it seems finally the 3 VOTE emails are all sent...
> >
> > Sorry for the spam...
> >
> > Liangjun He <2005hit...@163.com> wrote on Wed, May 10, 2023 at 19:36:
> >
> > > +1
> > >
> > >
> > > At 2023-05-10 01:13:12, "张铎(Duo Zhang)"  wrote:
> > > > The issue is about moving replication queue storage from zookeeper to
> > > > a hbase table. This is the last piece of persistent data on zookeeper.
> > > > So after this feature merged, we are finally fine to say that all data
> > > > on zookeeper can be removed while restarting a cluster.
> > > >
> > > > Let me paste the release note here
> > > >
> > > >> We introduced a table based replication queue storage in this issue.
> > > >> The queue data will be stored in hbase:replication table. This is the
> > > >> last piece of persistent data on zookeeper. So after this change, we
> > > >> are OK to clean up all the data on zookeeper, as now they are all
> > > >> transient, a cluster restarting can fix everything.
> > > >>
> > > >> The data structure has been changed a bit as now we only support an
> > > >> offset for a WAL group instead of storing all the WAL files for a WAL
> > > >> group. Please see the replication internals section in our ref guide
> > > >> for more details.
> > > >>
> > > >> To break the cyclic dependency issue, i.e, creating a new WAL writer
> > > >> requires writing to replication queue storage first but with table
> > > >> based replication queue storage, you first need a WAL writer when you
> > > >> want to update to table, now we will not record a queue when creating
> > > >> a new WAL writer instance. The downside for this change is that, the
> > > >> logic for claiming queue and WAL cleaner are much more complicated.
> > > >> See AssignReplicationQueuesProcedure and ReplicationLogCleaner for
> > > >> more details if you have interest.
> > > >>
> > > >> Notice that, we will use a separate WAL provider for hbase:replication
> > > >> table, so you will see a new WAL file for the region server which
> > > >> holds the hbase:replication table. If we do not do this, the update to
> > > >> hbase:replication table will also generate some WAL edits in the WAL
> > > >> file we need to track in replication, and then lead to more updates to
> > > >> hbase:replication table since we have advanced the replication offset.
> > > >> In this way we will generate a lot of garbage in our WAL file, even if
> > > >> we write nothing to the cluster. So a separated WAL provider which is
> > > >> not tracked by replication is necessary here.
> > > >>
> > > >> The data migration will be done automatically during rolling
> > > >> upgrading, of course the migration via a full cluster restart is also
> > > >> supported, but please make sure you restart master with new code
> > > >> first. The replication peers will be disabled during the migration and
> > > >> no claiming queue will be scheduled at the same time. So you may see a
> > > >> lot of unfinished SCPs during the migration but do not worry, it will
> > > >> not block the normal failover, all regions will be assigned. The
> > > >> replication peers will be enabled again after the migration is done,
> > > >> no manual operations needed.
> > > >>
> > > >> The ReplicationSyncUp tool is also affected. The goal of this tool is
> > > >> to replicate data to peer cluster while the source cluster is down.
> > > >> But if we store the replication queue data in a hbase table, it is
> > > >> impossible for us to get the newest data if the source cluster is
> > > >> down. So here we choose to read from the region directory directly to
> > > >> load all the replication queue data in memory, and do the sync up
> > > >> work. We may lose the newest data so in this way we need to replicate
> > > >> more data but it will not affect correctness.
> > > >>
> > > >
> > > > The nightly job is here
> > > >
> > > > https://ci-hbase.apache.org/job/HBase%20Nightly/job/HBASE-27109%252Ftable_based_rqs/
> > > >
> > > > Mostly fine, the failed UTs are not related and are flaky, for example,
> > > > build #73, the failed UT is TestAdmin1.testCompactionTimestamps, which
> > > > is not related to replication and it only failed in jdk11 build but
> > > > passed in jdk8 build.
> > > >
> > > > This is the PR against the master branch.
> > > >
> > > > https://github.com/apache/hbase/pull/5202
> > > >
> > > > The PR is big as we have 16 commits on the feature branch.
> > > >
> > > > The VOTE will be open for at least 72 hours.
> > > >
> > > > [+1] Agree
> > > > [+0] Neutral
> > > > [-1] Disagree (please include actionable feedback)
> > > >
> > > > Thanks.
> > >
>


Re: [VOTE] Merge feature branch HBASE-27109 back to master

2023-05-11 Thread tianhang tang
+1

张铎(Duo Zhang) wrote on Wed, May 10, 2023 at 21:20:
>
> Oh, it seems finally the 3 VOTE emails are all sent...
>
> Sorry for the spam...
>
> Liangjun He <2005hit...@163.com> wrote on Wed, May 10, 2023 at 19:36:
>
> > +1
> >
> >
> > At 2023-05-10 01:13:12, "张铎(Duo Zhang)"  wrote:
> > > The issue is about moving replication queue storage from zookeeper to a
> > > hbase table. This is the last piece of persistent data on zookeeper. So
> > > after this feature merged, we are finally fine to say that all data on
> > > zookeeper can be removed while restarting a cluster.
> > >
> > > Let me paste the release note here
> > >
> > >> We introduced a table based replication queue storage in this issue.
> > >> The queue data will be stored in hbase:replication table. This is the
> > >> last piece of persistent data on zookeeper. So after this change, we
> > >> are OK to clean up all the data on zookeeper, as now they are all
> > >> transient, a cluster restarting can fix everything.
> > >>
> > >> The data structure has been changed a bit as now we only support an
> > >> offset for a WAL group instead of storing all the WAL files for a WAL
> > >> group. Please see the replication internals section in our ref guide
> > >> for more details.
> > >>
> > >> To break the cyclic dependency issue, i.e, creating a new WAL writer
> > >> requires writing to replication queue storage first but with table
> > >> based replication queue storage, you first need a WAL writer when you
> > >> want to update to table, now we will not record a queue when creating a
> > >> new WAL writer instance. The downside for this change is that, the
> > >> logic for claiming queue and WAL cleaner are much more complicated. See
> > >> AssignReplicationQueuesProcedure and ReplicationLogCleaner for more
> > >> details if you have interest.
> > >>
> > >> Notice that, we will use a separate WAL provider for hbase:replication
> > >> table, so you will see a new WAL file for the region server which holds
> > >> the hbase:replication table. If we do not do this, the update to
> > >> hbase:replication table will also generate some WAL edits in the WAL
> > >> file we need to track in replication, and then lead to more updates to
> > >> hbase:replication table since we have advanced the replication offset.
> > >> In this way we will generate a lot of garbage in our WAL file, even if
> > >> we write nothing to the cluster. So a separated WAL provider which is
> > >> not tracked by replication is necessary here.
> > >>
> > >> The data migration will be done automatically during rolling upgrading,
> > >> of course the migration via a full cluster restart is also supported,
> > >> but please make sure you restart master with new code first. The
> > >> replication peers will be disabled during the migration and no claiming
> > >> queue will be scheduled at the same time. So you may see a lot of
> > >> unfinished SCPs during the migration but do not worry, it will not
> > >> block the normal failover, all regions will be assigned. The
> > >> replication peers will be enabled again after the migration is done, no
> > >> manual operations needed.
> > >>
> > >> The ReplicationSyncUp tool is also affected. The goal of this tool is
> > >> to replicate data to peer cluster while the source cluster is down. But
> > >> if we store the replication queue data in a hbase table, it is
> > >> impossible for us to get the newest data if the source cluster is down.
> > >> So here we choose to read from the region directory directly to load
> > >> all the replication queue data in memory, and do the sync up work. We
> > >> may lose the newest data so in this way we need to replicate more data
> > >> but it will not affect correctness.
> > >>
> > >
> > > The nightly job is here
> > >
> > > https://ci-hbase.apache.org/job/HBase%20Nightly/job/HBASE-27109%252Ftable_based_rqs/
> > >
> > > Mostly fine, the failed UTs are not related and are flaky, for example,
> > > build #73, the failed UT is TestAdmin1.testCompactionTimestamps, which
> > > is not related to replication and it only failed in jdk11 build but
> > > passed in jdk8 build.
> > >
> > > This is the PR against the master branch.
> > >
> > > https://github.com/apache/hbase/pull/5202
> > >
> > > The PR is big as we have 16 commits on the feature branch.
> > >
> > > The VOTE will be open for at least 72 hours.
> > >
> > > [+1] Agree
> > > [+0] Neutral
> > > [-1] Disagree (please include actionable feedback)
> > >
> > > Thanks.
> >


[jira] [Created] (HBASE-27860) Fix build error against Hadoop 3.3.5

2023-05-11 Thread Shuhei Yamasaki (Jira)
Shuhei Yamasaki created HBASE-27860:
---

 Summary: Fix build error against Hadoop 3.3.5
 Key: HBASE-27860
 URL: https://issues.apache.org/jira/browse/HBASE-27860
 Project: HBase
  Issue Type: Bug
  Components: build, hadoop3
Reporter: Shuhei Yamasaki


Building with Hadoop 3.3.5 fails with the following messages.

Some packages are not included in the allowed list for the shaded jar:
{code:java}
$ mvn clean install -DskipTests -Phadoop-3.0 -Dhadoop-three.version=3.3.5
...
[INFO] --- exec:1.6.0:exec (check-jar-contents-for-stuff-with-hadoop) @ hbase-shaded-with-hadoop-check-invariants ---
[ERROR] Found artifact with unexpected contents: '/home/yamasakisua/hbase/hbase-shaded/hbase-shaded-client/target/hbase-shaded-client-2.4.17.jar'
    Please check the following and either correct the build or update
    the allowed list with reasoning.
    com/
    com/sun/
    com/sun/jersey/
    com/sun/jersey/json/
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[DISCUSS] potential presentation for community over code

2023-05-11 Thread Tak Lon (Stephen) Wu
Hi guys,

Apologies for the delay. Following our recent meetup, I was wondering if we
could initiate a discussion and compile a list of potential topics to be
presented at the Community over Code conference 2023 [1]. Please note that
there is also an Asia edition [2] with an earlier proposal deadline.

Based on the conversations during the meetup, I suggest starting this
thread with the following topics. However, feel free to add any additional
ideas that come to mind as well.

1. What does the stable version (2.5.x) of HBase provide? What are the use
cases (maybe including the use cases of 2.4)? And how will HBase move
forward?
  a. Security updates with TLS.
  b. HBase incremental backup.
  c. Reduced dependency on ZooKeeper.
  d. Prometheus integration.

Furthermore, we can discuss potential presenters for the Asia edition and/or
the international edition.

[1] https://communityovercode.org/
[2] https://www.bagevent.com/event/cocasia-2023-EN


Thanks,
Stephen


[jira] [Resolved] (HBASE-27851) TestListTablesByState is silently failing due to a surefire bug

2023-05-11 Thread Duo Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-27851.
---
Fix Version/s: 2.6.0
   3.0.0-alpha-4
 Hadoop Flags: Reviewed
 Assignee: Jonathan Albrecht
   Resolution: Fixed

Pushed to master and branch-2.

Thanks [~jonathan.albrecht] for contributing!

> TestListTablesByState is silently failing due to a surefire bug
> ---
>
> Key: HBASE-27851
> URL: https://issues.apache.org/jira/browse/HBASE-27851
> Project: HBase
>  Issue Type: Bug
>  Components: test
> Environment: Maven hbase-server tests
>Reporter: Jonathan Albrecht
>Assignee: Jonathan Albrecht
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-alpha-4
>
>
> Surefire version 3.0.0-M6 has a bug where tests end up being removed from the 
> test results if they fail with a long exception message. See:
>  
> https://issues.apache.org/jira/browse/SUREFIRE-2079
>  
> org.apache.hadoop.hbase.master.TestListTablesByState is currently failing in 
> CI due to an error. However, it does not show up in the Test Results because 
> of the surefire bug.
>  
> If you download the raw test_logs from the build artifacts, you will find the 
> files:
> /home/jenkins/jenkins-home/workspace/HBase_Nightly_master/output-jdk8-hadoop3/archiver/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestListTablesByState.txt
>  
> which contains:
> {{---}}
> {{Test set: org.apache.hadoop.hbase.master.TestListTablesByState}}
> {{---}}
> {{Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 14.929 s - 
> in org.apache.hadoop.hbase.master.TestListTablesByState}}
>  
> and 
> /home/jenkins/jenkins-home/workspace/HBase_Nightly_master/output-jdk8-hadoop3/archiver/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestListTablesByState-output.txt
>  
> which contains exceptions like:
> {{...}}
> {{2023-05-04T11:41:56,262 INFO  [RPCClient-NioEventLoopGroup-4-3 {}] 
> client.RawAsyncHBaseAdmin$TableProcedureBiConsumer(2603): Operation: CREATE, 
> Table Name: default:test failed with 
> org.apache.hadoop.hbase.DoNotRetryIOException: Table test should have at 
> least one column family.}}
> {{        at}}
> {{...}}
>  
> I found this while testing the final surefire 3.0.0 version, which fixes the 
> bug, and the test then shows up as failing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27859) HMaster.getCompactionState can throw NPE when region state is closed

2023-05-11 Thread guluo (Jira)
guluo created HBASE-27859:
-

 Summary: HMaster.getCompactionState can throw NPE when region 
state is closed
 Key: HBASE-27859
 URL: https://issues.apache.org/jira/browse/HBASE-27859
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.4.13, 2.3.7
 Environment: hbase 2.4.13
Reporter: guluo
 Attachments: 2023-05-11_001147.png

Steps to reproduce:
 # create table 
{code:java}
create 'hbase_region_test', 'info', SPLITS => ['3', '7']  {code}

 # write data
{code:java}
put 'hbase_region_test', '10010', 'info:name', 'Tom'
put 'hbase_region_test', '20010', 'info:name', 'Tom'
put 'hbase_region_test', '30010', 'info:name', 'Tom'
put 'hbase_region_test', '40010', 'info:name', 'Tom'
put 'hbase_region_test', '50010', 'info:name', 'Tom'
put 'hbase_region_test', '60010', 'info:name', 'Tom'
put 'hbase_region_test', '70010', 'info:name', 'Tom'
put 'hbase_region_test', '80010', 'info:name', 'Tom'
put 'hbase_region_test', '90010', 'info:name', 'Tom' {code}

 # close a region of hbase_region_test  !2023-05-11_001147.png!
 # call HMaster.getCompactionState: this method can be triggered simply by 
opening the HBase UI page with the 'hbase_region_test' table details, which 
fetches the table's compaction state
 # HMaster prints NPE logs from HMaster.getCompactionState
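
For illustration only, here is a minimal, self-contained sketch of the kind of defensive check that could avoid the NPE. It assumes (based on the reproduction above, not on the actual HMaster source) that the failure comes from aggregating compaction state over the table's regions while a closed region has no hosting server, so a naive per-region lookup dereferences null. The class and method names below are made up for the example.
{code:java}
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the suspected failure mode; not the real HMaster code.
public class CompactionStateSketch {

  enum CompactionState { NONE, MINOR, MAJOR }

  /** Simplified stand-in for a region's state; serverName is null when the region is closed. */
  static class RegionSnapshot {
    final String name;
    final String serverName;
    final CompactionState state;

    RegionSnapshot(String name, String serverName, CompactionState state) {
      this.name = name;
      this.serverName = serverName;
      this.state = state;
    }
  }

  static CompactionState getCompactionState(List<RegionSnapshot> regions) {
    CompactionState result = CompactionState.NONE;
    for (RegionSnapshot region : regions) {
      if (region.serverName == null) {
        // Closed/unassigned region: it cannot be compacting, so skip it instead
        // of dereferencing a null server entry (the suspected source of the NPE).
        continue;
      }
      if (region.state != CompactionState.NONE) {
        result = region.state;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<RegionSnapshot> regions = Arrays.asList(
        new RegionSnapshot("r1", "rs1,16020,1683700000000", CompactionState.MINOR),
        new RegionSnapshot("r2", null, CompactionState.NONE)); // closed region
    System.out.println(getCompactionState(regions)); // prints MINOR, no NPE
  }
}
{code}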



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27858) Update surefire version to 3.0.0 and use the SurefireForkNodeFactory

2023-05-11 Thread Jonathan Albrecht (Jira)
Jonathan Albrecht created HBASE-27858:
-

 Summary: Update surefire version to 3.0.0 and use the 
SurefireForkNodeFactory
 Key: HBASE-27858
 URL: https://issues.apache.org/jira/browse/HBASE-27858
 Project: HBase
  Issue Type: Improvement
  Components: test
 Environment: maven unit tests
Reporter: Jonathan Albrecht


The final Surefire version 3.0.0 has been released, so we can update from 
3.0.0-M6 to 3.0.0. The final version has several fixes and is more stable.
 
SurefireForkNodeFactory is a new strategy to control how the Surefire forked 
nodes communicate with the main Maven process. It uses a TCP channel instead of 
process pipes. It helps fix some "corrupted channel" messages seen in the 
s390x build and should be more reliable on all platforms. I have been testing 
locally on amd64 and s390x and have not had any issues.
 
Ref: 
[https://maven.apache.org/surefire/maven-surefire-plugin/examples/process-communication.html]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27857) HBaseClassTestRule: system exit not restored if test times out may cause test to hang

2023-05-11 Thread Jonathan Albrecht (Jira)
Jonathan Albrecht created HBASE-27857:
-

 Summary: HBaseClassTestRule: system exit not restored if test 
times out may cause test to hang
 Key: HBASE-27857
 URL: https://issues.apache.org/jira/browse/HBASE-27857
 Project: HBase
  Issue Type: Bug
  Components: test
 Environment: maven unit tests
Reporter: Jonathan Albrecht


HBaseClassTestRule applies a timeout rule and a system exit rule to tests. The 
timeout rule throws an exception if it hits the timeout threshold. Since the 
timeout rule is applied after the system exit rule, the system exit rule does 
not see the exception and does not re-enable the default system exit behavior, 
which can cause Maven to hang on some tests. I saw the hang happen when certain 
tests timed out on s390x, but it could happen on any platform.
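
For illustration, here is a minimal, self-contained sketch of the ordering issue using plain JUnit 4 rules. The SystemExitGuard class below is a made-up stand-in for HBase's SystemExitRule, and the RuleChain arrangement is only one possible way to make the restoration robust; it is not the actual HBaseClassTestRule code or the committed fix.
{code:java}
import java.util.concurrent.TimeUnit;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.RuleChain;
import org.junit.rules.TestRule;
import org.junit.rules.Timeout;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Hypothetical sketch, not HBase's HBaseClassTestRule/SystemExitRule.
public class RuleOrderingSketch {

  /** Illustrative stand-in for a rule that forbids System.exit during tests. */
  static class SystemExitGuard implements TestRule {
    @Override
    public Statement apply(Statement base, Description description) {
      return new Statement() {
        @Override
        public void evaluate() throws Throwable {
          SecurityManager previous = System.getSecurityManager();
          System.setSecurityManager(new SecurityManager() {
            @Override
            public void checkExit(int status) {
              throw new SecurityException("System.exit is not allowed in tests");
            }
            @Override
            public void checkPermission(java.security.Permission perm) {
              // allow everything else
            }
          });
          try {
            base.evaluate();
          } finally {
            // Restored even if the wrapped (inner) timeout rule throws, which
            // is the behavior the report says is currently missing.
            System.setSecurityManager(previous);
          }
        }
      };
    }
  }

  // The guard is the outer rule, so its finally block runs after the inner
  // timeout rule fires and the default System.exit behavior is put back.
  @Rule
  public final TestRule chain = RuleChain
      .outerRule(new SystemExitGuard())
      .around(Timeout.builder().withTimeout(30, TimeUnit.SECONDS).build());

  @Test
  public void example() {
    // test body
  }
}
{code}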
 
If the org.apache.hadoop.hbase.TestTimeout.infiniteLoop test is enabled and run, 
it will generate a *-jvmRun1.dump file which shows that the 
org.apache.hadoop.hbase.TestSecurityManager is still enabled:
 
{quote}# Created at 2023-04-27T15:51:58.947
org.apache.hadoop.hbase.SystemExitRule$SystemExitInTestException
        at org.apache.hadoop.hbase.TestSecurityManager.checkExit(TestSecurityManager.java:32)
        at java.base/java.lang.Runtime.exit(Runtime.java:114)
        at java.base/java.lang.System.exit(System.java:1752)
        at org.apache.maven.surefire.booter.ForkedBooter.acknowledgedExit(ForkedBooter.java:381)
        at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:178)
        at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495)
...{quote}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-27856) Add hadolint binary to operator-tools yetus environment

2023-05-11 Thread Nick Dimiduk (Jira)
Nick Dimiduk created HBASE-27856:


 Summary: Add hadolint binary to operator-tools yetus environment
 Key: HBASE-27856
 URL: https://issues.apache.org/jira/browse/HBASE-27856
 Project: HBase
  Issue Type: Task
  Components: build
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: hbase-operator-tools-1.3.0


Since we're adding dockerfiles via HBASE-27827, let's also have a pre-commit 
check for them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)