Re: Time for 2.5.6 (last release this year)

2023-12-06 Thread Duo Zhang
You must mean 2.5.7 :)

I think we should include HBASE-28248, as HBASE-28210 and HBASE-28212
have already been committed to branch-2.5.

Andrew Purtell wrote on Wed, Dec 6, 2023 at 08:13:
>
> Some important fixes in branch-2.5. Let's do one more 2.5 release this
> year, 2.5.6.
>
> If you have any pending work, please try to commit it before Thursday of
> this week. I will start the work of cutting RC0 then. If you are aware of
> any potentially blocking issues feel free to raise them on this thread.
>
> --
> Best regards,
> Andrew


[jira] [Created] (HBASE-28248) Race between RegionRemoteProcedureBase and rollback operation could lead to ROLLEDBACK state being persisted to procedure store

2023-12-06 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-28248:
-

 Summary: Race between RegionRemoteProcedureBase and rollback 
operation could lead to ROLLEDBACK state being persisted to procedure store
 Key: HBASE-28248
 URL: https://issues.apache.org/jira/browse/HBASE-28248
 Project: HBase
  Issue Type: Bug
Reporter: Duo Zhang


This in turn causes procedure loading to fail.

This is because RegionRemoteProcedureBase.persistAndWake is not executed in a 
PEWorker, so even though rollback holds the procedureExecutionLock, the two 
can run concurrently.

So it is possible that rollback sets the state to ROLLEDBACK and deletes the 
procedure, and then persistAndWake persists the ROLLEDBACK state to the 
procedure store.
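
The interleaving described above can be modelled in a few lines of Python. This is only an illustrative sketch, not HBase code: ToyStore, ToyProcedure and the two persist_and_wake variants are hypothetical names, and the guard shown (taking the same lock as rollback and re-checking the state) is just one possible way to close the window, not necessarily the fix for this issue.

```python
import threading


class ToyStore:
    """In-memory stand-in for a procedure store (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def update(self, proc_id, state):
        self.rows[proc_id] = state

    def delete(self, proc_id):
        self.rows.pop(proc_id, None)


class ToyProcedure:
    def __init__(self, proc_id, store):
        self.proc_id = proc_id
        self.store = store
        self.state = "RUNNABLE"
        # Plays the role of the procedureExecutionLock in this model.
        self.lock = threading.Lock()

    def rollback(self):
        # Runs on a worker thread while holding the execution lock.
        with self.lock:
            self.state = "ROLLEDBACK"
            self.store.delete(self.proc_id)

    def persist_and_wake_unsafe(self):
        # Does not take the lock, so it may run after rollback() and
        # write the ROLLEDBACK state back into the store.
        self.store.update(self.proc_id, self.state)

    def persist_and_wake_guarded(self):
        # Takes the same lock and skips the write once rollback is done.
        with self.lock:
            if self.state == "ROLLEDBACK":
                return False
            self.store.update(self.proc_id, self.state)
            return True


def demo():
    """Replay the bad ordering deterministically for both variants."""
    store = ToyStore()
    unsafe = ToyProcedure(1, store)
    unsafe.rollback()
    unsafe.persist_and_wake_unsafe()
    guarded = ToyProcedure(2, store)
    guarded.rollback()
    guarded.persist_and_wake_guarded()
    return dict(store.rows)
```

Calling demo() leaves a ROLLEDBACK row behind only for procedure 1, the unguarded variant; procedure 2 stays deleted.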



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28247) Add java.base/sun.net.dns and java.base/sun.net.util export to jdk11 JVM test flags

2023-12-06 Thread Istvan Toth (Jira)
Istvan Toth created HBASE-28247:
---

 Summary: Add java.base/sun.net.dns and java.base/sun.net.util  
export to jdk11 JVM test flags
 Key: HBASE-28247
 URL: https://issues.apache.org/jira/browse/HBASE-28247
 Project: HBase
  Issue Type: Bug
  Components: java
Affects Versions: 2.5.6, 2.4.17, 3.0.0-alpha-4, 2.6.0
Reporter: Istvan Toth
Assignee: Istvan Toth


While testing with JDK17 we have found that we need to add 
{noformat}
  --add-exports java.base/sun.net.dns=ALL-UNNAMED
  --add-exports java.base/sun.net.util=ALL-UNNAMED
{noformat}
on top of what is already defined in _hbase-surefire.jdk11.flags_, otherwise 
RS and Master startup fails in the Hadoop security code.

While this does not affect the test suite (at least not the commonly run 
tests), I consider hbase-surefire.jdk11.flags to be an unofficial reference for 
getting HBase to run on newer JDK versions.






[jira] [Created] (HBASE-28246) Expose region cached size over JMX metrics and report in the RS UI

2023-12-06 Thread Wellington Chevreuil (Jira)
Wellington Chevreuil created HBASE-28246:


 Summary: Expose region cached size over JMX metrics and report in 
the RS UI
 Key: HBASE-28246
 URL: https://issues.apache.org/jira/browse/HBASE-28246
 Project: HBase
  Issue Type: Improvement
  Components: BucketCache
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
 Attachments: Screenshot 2023-12-06 at 22.58.17.png

With a large file-based bucket cache, the prefetch executor can take a long time 
to finish caching the whole dataset. It would be useful to report what 
percentage of each region's data is already cached, in order to give an idea 
of how much work the prefetch executor has done.

This PR adds JMX metrics for the region cache % and also reports the same in the 
RS UI "Store File Metrics" tab as below:

!Screenshot 2023-12-06 at 22.58.17.png|width=658,height=114!
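
For illustration, the reported percentage could be computed along these lines. This is a rough sketch only: the function name, its parameters, and the conventions for empty regions and for clamping at 100 are assumptions, not what the PR actually implements.

```python
def region_cached_percent(cached_bytes, region_bytes):
    """Percentage of a region's store file data that is currently cached."""
    if region_bytes <= 0:
        # Assumed convention: an empty region reports 0% rather than failing.
        return 0.0
    # Clamp at 100 in case the two counters are sampled at different times.
    return min(100.0, 100.0 * cached_bytes / region_bytes)
```

For example, a region with 1024 bytes of store file data and 256 bytes in the cache would report 25%.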





Re: [DISCUSS] End support for hadoop 2.10?

2023-12-06 Thread Bryan Beaudreault
Thanks for the input.

My concern with waiting on hbase 3.x is that it's already been pending for
years, and comes with many big architectural changes. It will probably be a
risky upgrade for users, and we will end up supporting hbase 2.x for years
to come. This is probably a separate discussion, but I do wonder if we
should target a specific major release cadence (yearly) so that we can move
forward on deprecations, etc. Not every major release has to be huge
(ideally isn't).

I agree we need to support hadoop-2.x for a while, but we can keep that
support in hbase 2.5. This is how we've handled other hadoop versions
according to our compatibility matrix.

On Wed, Dec 6, 2023 at 1:53 AM 张铎(Duo Zhang)  wrote:

> Better also send the email to user@hbase to see what our users think.
>
> I think we could change the default profile to hadoop3, but better
> still have the hadoop2 profile as there could still be users on
> hadoop-2.x.
>
> We will completely drop the hadoop2 support in hbase 3.x.
>
> Tak Lon (Stephen) Wu wrote on Wed, Dec 6, 2023 at 12:08:
> >
> > When Wei-Chiu and I were working on Ozone support via HBASE-27769, we asked
> > once when we could start supporting hadoop-3.3+. The answer from Duo was
> > that the HBase community supports the oldest version of hadoop
> > https://hadoop.apache.org/releases.html (it was 2.10, 3.2.4 and 3.3.6).
> >
> > If this strategy remains, then once 2.10 becomes EOL HBase 2.6 should be
> > able to support 3.2.x and 3.3.x. At the same time, IMO 3.2.x is also an
> > inactive release version, so we can discuss whether we should just change
> > our hadoop baseline to 3.3.6, maybe starting from HBase 3.0+.
> >
> > -Stephen
> >
> > On Tue, Dec 5, 2023 at 7:51 AM Bryan Beaudreault <bbeaudrea...@apache.org>
> > wrote:
> >
> > > On the hdfs dev list, they are talking about EOL Hadoop 2.10 (and thus
> > > 2.x). They may cherry-pick back critical CVE fixes but not create any more
> > > releases. Of course, the decision is not final yet, but I wonder if we
> > > should make a similar decision for supporting 2.10 in hbase.
> > >
> > > Given that 2.6 is soon, we could mark the end of support in that release.
> > > While it may seem like a major change, there is some precedent for this.
> > > Looking at our compatibility matrix, we have dropped support for Hadoop
> > > releases in minor releases in the past.
> > >
> > > Dropping support for Hadoop 2 in HBase 2.6 would allow us to start cleaning
> > > up our POMs and some of the hacks we've had to do to reflect around Hadoop
> > > releases. It may also free up Jenkins capacity since we can turn off some
> > > builds for our primary branches.
> > >
>


[jira] [Created] (HBASE-28245) Sync internal protobuf version for hbase to be same as hbase-thirdparty

2023-12-06 Thread Nihal Jain (Jira)
Nihal Jain created HBASE-28245:
--

 Summary: Sync internal protobuf version for hbase to be same as 
hbase-thirdparty
 Key: HBASE-28245
 URL: https://issues.apache.org/jira/browse/HBASE-28245
 Project: HBase
  Issue Type: Task
Reporter: Nihal Jain
Assignee: Nihal Jain








[jira] [Created] (HBASE-28244) ProcedureTestingUtility.restart is broken sometimes after HBASE-28199

2023-12-06 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-28244:
-

 Summary: ProcedureTestingUtility.restart is broken sometimes after 
HBASE-28199
 Key: HBASE-28244
 URL: https://issues.apache.org/jira/browse/HBASE-28244
 Project: HBase
  Issue Type: Sub-task
Reporter: Duo Zhang


In ProcedureTestingUtility.restart, we reuse the same ProcedureExecutor, 
so when restarting, we need to make sure that no procedures are still being 
executed and then clear the scheduler.

But after HBASE-28199, we may add procedures back to the scheduler after a 
CompletableFuture is completed, so even if all the PEWorkers are terminated we 
could still add things to the scheduler, which may break some tests.

We need to find a way to deal with this.





[jira] [Created] (HBASE-28243) Bump jackson version to 2.15.2

2023-12-06 Thread Nihal Jain (Jira)
Nihal Jain created HBASE-28243:
--

 Summary:  Bump jackson version to 2.15.2 
 Key: HBASE-28243
 URL: https://issues.apache.org/jira/browse/HBASE-28243
 Project: HBase
  Issue Type: Improvement
Reporter: Nihal Jain
Assignee: Nihal Jain


We should bump jackson to 2.15.2, as hbase-thirdparty has already moved to this 
version in HBASE-28093.

Also, 2.14.1 is affected by sonatype-2022-6438 
(https://github.com/FasterXML/jackson-core/issues/861).





Re: Debugging slowness in Gets

2023-12-06 Thread Nick Dimiduk
Hi Lars,

I took a look through my k8s PRs against hbase-operator-tools and see no
mention at all of async-profiler; I have nothing to point you towards. I
also no longer have access to those systems, so I'm afraid I can't be of much
further help.

I do remember there was some coordination required to enable capture of
kernel call stacks [0], but I don't recall the details of deploying this in
the kubernetes environment. Further down the doc [1], there is some mention
about running within a container, fiddling with the security profile.

[0]: https://github.com/async-profiler/async-profiler#basic-usage
[1]:
https://github.com/async-profiler/async-profiler#profiling-java-in-a-container

On Wed, Dec 6, 2023 at 10:03 AM Lars Francke  wrote:

> Thanks Nick,
> I'll take a look at that.
>
> I did add async-profiler to our image[1] but haven't had a chance to
> test it yet.
> Do you remember if you had to run the container with extra privileges?
>
> I just opened this issue before I saw your mail as it turns out that
> not all args are exposed after all :)
> https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this
> ticket)
>
> I'll continue this journey and will report back with any issues and
> will see if I can improve anything
>
> [1] <https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d>
>
> On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk  wrote:
> >
> > For what it’s worth, we deployed async-profiler into the regionserver
> > container image and it all worked as expected. But it’s not a sidecar
> > container, it’s on the same image as the region server.
> >
> > If you can get the async profiler into your container image, installed
> > where the RS can find it (as described in the online book; double-check you
> > have a version of AP that’s compatible with your version of the profiler
> > servlet), you should be able to use the profiling http endpoint on the RS.
> > It’ll run async-profiler with the arguments you specify (read the servlet
> > code, all args are exposed). You can then download the flamegraph via HTTP
> > as well …
> >
> > Well, most of the time. I have run into issues where the file wasn’t served
> > correctly and I had to download it from the region server file system
> > (annoying to do from a container). There’s probably a closed Jira where I
> > scratch my head in public.
> >
> > On Wed, 6 Dec 2023 at 08:15, Lars Francke wrote:
> >
> > > > > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > > > > time in our very latency-sensitive production. It has no noticeable
> > > > > > overhead in my experience and doesn't need any special dependencies.
> > > >
> > > > I have to admit, I have never used async-profiler. Shame on me.
> > > > That is a fabulous hint and I'll read up on it immediately.
> > >
> > > I now did read up on it, tried it locally, stumbled over
> > > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> > > 2.4 fails weirdly using Java 21 only to find out (I should have read
> > > the whole docs earlier) that it's hard to run async-profiler in a
> > > container.
> > > For us, this is all running on Kubernetes, so we'll test that today.
> > >
> > > Testing it locally, it looked very promising.
> > >
> > >
> > >
> > >
> > >
> > > > >
> > > > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke <lars.fran...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > > > > cannot run a Profiler but I _can_ run modified code.
> > > > > > >
> > > > > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > > > > long certain operations take at various points in the code (e.g. [2]).
> > > > > > > I've been doing this a few rounds and have now arrived at the StoreScanner.
> > > > > > >
> > > > > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > > > > I am a HBase committer but I haven't been able to keep up with the
> > > > > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > > > > workings anymore and would appreciate a hint.
> > > > > > >
> > > > > > > I suspect it is slowness related to storage access.
> > > > > > > I was not able to find any logs or tweaks to log "slow storage"
> > > > > > > access, does such a thing exist?
> > > > > > > And something else that'd help me: Can anyone point me (if it exists)
> > > > > > > at the (vicinity of the) code that actually reads from HDFS at the
> > > > > > > end? There are so many layers.
> > > > > > Thank you!
> > > > > >
> > > > > > Cheers,
> > > > > > Lars
> > > > > >
> > > > > >
> > > > > > [1] <
> > > > > >
> > >
> https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.

Re: Debugging slowness in Gets

2023-12-06 Thread Lars Francke
Thanks Nick,
I'll take a look at that.

I did add async-profiler to our image[1] but haven't had a chance to
test it yet.
Do you remember if you had to run the container with extra privileges?

I just opened this issue before I saw your mail as it turns out that
not all args are exposed after all :)
https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this ticket)

I'll continue this journey, report back with any issues, and
see if I can improve anything.

[1] <https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d>


On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk  wrote:
>
> For what it’s worth, we deployed async-profiler into the regionserver
> container image and it all worked as expected. But it’s not a sidecar
> container, it’s on the same image as the region server.
>
> If you can get the async profiler into your container image, installed
> where the RS can find it (as described in the online book; double-check you
> have a version of AP that’s compatible with your version of the profiler
> servlet), you should be able to use the profiling http endpoint on the RS.
> It’ll run async-profiler with the arguments you specify (read the servlet
> code, all args are exposed). You can then download the flamegraph via HTTP
> as well …
>
> Well, most of the time. I have run into issues where the file wasn’t served
> correctly and I had to download it from the region server file system
> (annoying to do from a container). There’s probably a closed Jira where I
> scratch my head in public.
>
> On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:
>
> > > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > > time in our very latency-sensitive production. It has no noticeable
> > > > overhead in my experience and doesn't need any special dependencies.
> > >
> > > I have to admit, I have never used async-profiler. Shame on me.
> > > That is a fabulous hint and I'll read up on it immediately.
> >
> > I now did read up on it, tried it locally, stumbled over
> > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> > 2.4 fails weirdly using Java 21 only to find out (I should have read
> > the whole docs earlier) that it's hard to run async-profiler in a
> > container.
> > For us, this is all running on Kubernetes, so we'll test that today.
> >
> > Testing it locally, it looked very promising.
> >
> >
> >
> >
> >
> > > >
> > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > > cannot run a Profiler but I _can_ run modified code.
> > > > >
> > > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > > long certain operations take at various points in the code (e.g. [2]).
> > > > > I've been doing this a few rounds and have now arrived at the StoreScanner.
> > > > >
> > > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > > I am a HBase committer but I haven't been able to keep up with the
> > > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > > workings anymore and would appreciate a hint.
> > > > >
> > > > > I suspect it is slowness related to storage access.
> > > > > I was not able to find any logs or tweaks to log "slow storage"
> > > > > access, does such a thing exist?
> > > > > And something else that'd help me: Can anyone point me (if it exists)
> > > > > at the (vicinity of the) code that actually reads from HDFS at the
> > > > > end? There are so many layers.
> > > > >
> > > > > Thank you!
> > > > >
> > > > > Cheers,
> > > > > Lars
> > > > >
> > > > >
> > > > > [1] <https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5>
> > > > > [2] <https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297>


[jira] [Created] (HBASE-28242) ProfileServlet does not allow selecting all events (e.g. itimer)

2023-12-06 Thread Lars Francke (Jira)
Lars Francke created HBASE-28242:


 Summary: ProfileServlet does not allow selecting all events (e.g. 
itimer)
 Key: HBASE-28242
 URL: https://issues.apache.org/jira/browse/HBASE-28242
 Project: HBase
  Issue Type: Improvement
Reporter: Lars Francke


In ProfileServlet we currently force the use of certain events because we use 
an enum with allowed values.

async-profiler can support selecting multiple events (comma-separated) and it 
supports parameters as well.

Example from the README: {{event=cpu,alloc=2m,lock=10ms}}

The enum is also missing {{itimer}}, which is the suggested event when 
kernel-level access to perf events is not available (e.g. in a container).

I understand that this probably has to do with security because we don't want 
to allow users passing in arbitrary things into the command line.
At the very least I'd like to add itimer support but if someone has an idea how 
we can (easily and safely) support more events I'm all ears.
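
One possible direction, sketched here in Python rather than in the servlet's Java, is to accept the comma-separated form but check every token against an allowlist before anything reaches the command line. The allowlists, the value pattern, and the parse_event_spec name are all hypothetical and only meant to illustrate the idea; which events and options to admit would be a separate decision.

```python
import re

# Hypothetical allowlists, for illustration only.
ALLOWED_EVENTS = {"cpu", "alloc", "lock", "wall", "itimer"}
ALLOWED_OPTIONS = {"alloc", "lock"}
# A number with an optional short unit suffix, e.g. "2m" or "10ms".
VALUE_RE = re.compile(r"[0-9]+[a-z]{0,2}")


def parse_event_spec(spec):
    """Split a spec such as 'event=cpu,alloc=2m,lock=10ms' into pairs.

    Raises ValueError for any key or value that is not explicitly allowed.
    """
    parsed = []
    for token in spec.split(","):
        key, sep, value = token.partition("=")
        if key == "event":
            if value not in ALLOWED_EVENTS:
                raise ValueError(f"event not allowed: {value!r}")
        elif key in ALLOWED_OPTIONS:
            # An option may be bare ("alloc") or carry a simple value.
            if sep and not VALUE_RE.fullmatch(value):
                raise ValueError(f"bad value for {key}: {value!r}")
        else:
            raise ValueError(f"option not allowed: {key!r}")
        parsed.append((key, value))
    return parsed
```

With this, the README example and event=itimer both parse, while a value carrying shell metacharacters is rejected before it is ever passed along.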





Re: Debugging slowness in Gets

2023-12-06 Thread Nick Dimiduk
For what it’s worth, we deployed async-profiler into the regionserver
container image and it all worked as expected. But it’s not a sidecar
container, it’s on the same image as the region server.

If you can get the async profiler into your container image, installed
where the RS can find it (as described in the online book; double-check you
have a version of AP that’s compatible with your version of the profiler
servlet), you should be able to use the profiling http endpoint on the RS.
It’ll run async-profiler with the arguments you specify (read the servlet
code, all args are exposed). You can then download the flamegraph via HTTP
as well …

Well, most of the time. I have run into issues where the file wasn’t served
correctly and I had to download it from the region server file system
(annoying to do from a container). There’s probably a closed Jira where I
scratch my head in public.

On Wed, 6 Dec 2023 at 08:15, Lars Francke  wrote:

> > > Also, are you sure you couldn't use async-profiler? We use this all the
> > > time in our very latency-sensitive production. It has no noticeable
> > > overhead in my experience and doesn't need any special dependencies.
> >
> > I have to admit, I have never used async-profiler. Shame on me.
> > That is a fabulous hint and I'll read up on it immediately.
>
> I now did read up on it, tried it locally, stumbled over
> https://issues.apache.org/jira/browse/HBASE-25685 and the fact that
> 2.4 fails weirdly using Java 21 only to find out (I should have read
> the whole docs earlier) that it's hard to run async-profiler in a
> container.
> For us, this is all running on Kubernetes, so we'll test that today.
>
> Testing it locally, it looked very promising.
>
>
>
>
>
> > >
> > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke wrote:
> > >
> > > > Hi,
> > > >
> > > > I am debugging an issue where we see some Get requests taking 2-5s.
> > > > We do see "responseTooSlow" etc. and this is in an environment where I
> > > > cannot run a Profiler but I _can_ run modified code.
> > > >
> > > > So what I did was I added a stupid "MethodTimer"[1] which records how
> > > > long certain operations take at various points in the code (e.g. [2]).
> > > > I've been doing this a few rounds and have now arrived at the StoreScanner.
> > > >
> > > > I'm wondering if anyone has better ideas on how to diagnose this?
> > > > I am a HBase committer but I haven't been able to keep up with the
> > > > changes in the last 5-6 years so I'm not too familiar with the inner
> > > > workings anymore and would appreciate a hint.
> > > >
> > > > I suspect it is slowness related to storage access.
> > > > I was not able to find any logs or tweaks to log "slow storage"
> > > > access, does such a thing exist?
> > > > And something else that'd help me: Can anyone point me (if it exists)
> > > > at the (vicinity of the) code that actually reads from HDFS at the
> > > > end? There are so many layers.
> > > >
> > > > Thank you!
> > > >
> > > > Cheers,
> > > > Lars
> > > >
> > > >
> > > > [1] <https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5>
> > > > [2] <https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297>