Hi, quick update. We did run with my patch for better 2.x support and got some profiles. It does indeed look like slowness in HDFS connections. I still have on my list to see if I can add some "slow logging" in the relevant paths. Happy to update this thread.
Cheers, Lars On Thu, Dec 7, 2023 at 9:11 AM Lars Francke <lars.fran...@gmail.com> wrote: > > Hi Nick, > > yes, thank you. We have it running in containers now but privileged > containers are not allowed. > I created this https://github.com/apache/hbase/pull/5566 to allow the > "itimer" thing and upgrade to async-profiler in general. > I hope to have some results next week, happy to update this thread. > > I also still hope to be able to add some "slow reads" logging once I > identify the relevant bits of code (haven't started looking yet and it > might not be feasible). > > Cheers, > Lars > > > On Wed, Dec 6, 2023 at 12:25 PM Nick Dimiduk <ndimi...@apache.org> wrote: > > > > Hi Lars, > > > > I took a look through my k8s PR's against hbase-operator-tools and see no > > mention at all of async-profiler; I have nothing to point you towards. I > > also no longer have access to those systems, so I'm afraid I'm not much > > further help. > > > > I do remember there was some coordination required to enable capture of > > kernel call stacks [0], but I don't recall the details of deploying this in > > the kubernetes environment. Further down the doc [1], there is some mention > > about running within a container, fiddling with the security profile. > > > > [0]: https://github.com/async-profiler/async-profiler#basic-usage > > [1]: > > https://github.com/async-profiler/async-profiler#profiling-java-in-a-container > > > > On Wed, Dec 6, 2023 at 10:03 AM Lars Francke <lars.fran...@gmail.com> wrote: > > > > > Thanks Nick, > > > I'll take a look at that. > > > > > > I did add async-profiler to our image[1] but haven't had a chance to > > > test it yet. > > > Do you remember if you had to run the container with extra privileges? > > > > > > I just opened this issue before I saw your mail as it turns out that > > > not all args are exposed after all :) > > > https://issues.apache.org/jira/browse/HBASE-28242 (ideas welcome in this > > > ticket) > > > > > > I'll continue this journey and will report back with any issues and > > > will see if I can improve anything > > > > > > [1] < > > > https://github.com/stackabletech/docker-images/commit/ba957b37c8d4c679b918f3815e28d974b0bd008d > > > > > > > > > > On Wed, Dec 6, 2023 at 9:06 AM Nick Dimiduk <ndimi...@apache.org> wrote: > > > > > > > > For what it’s worth, we deployed async-profiler into the regionserver > > > > container image and it all worked as expected. But it’s not a sidecar > > > > container, it’s on the same image as the region server. > > > > > > > > If you can get the async profiler into your container image, installed > > > > where the RS can find it (as described in the online book; double-check > > > you > > > > have a version of AP that’s compatible with your version of the profiler > > > > servlet), you should be able to use the profiling http endpoint on the > > > RS. > > > > It’ll run async-profiler with the arguments you specify (read the > > > > servlet > > > > code, all args are exposed). You can then download the flamegraph via > > > HTTP > > > > as well … > > > > > > > > Well, most of the time. I have run into issues where the file wasn’t > > > served > > > > correctly and I had to download it from the region server file system > > > > (annoying to do from a container). There’s probably a closed Jira where > > > > I > > > > scratch my head in public. > > > > > > > > On Wed, 6 Dec 2023 at 08:15, Lars Francke <lars.fran...@gmail.com> > > > wrote: > > > > > > > > > > > Also, are you sure you couldn't use async-profiler? We use this > > > all the > > > > > > > time in our very latency-sensitive production. It has no > > > > > > > noticeable > > > > > > > overhead in my experience and doesn't need any special > > > dependencies. > > > > > > > > > > > > I have to admit, I have never used async-profiler. Shame on me. > > > > > > That is a fabulous hint and I'll read up on it immediately. > > > > > > > > > > I now did read up on it, tried it locally, stumbled over > > > > > https://issues.apache.org/jira/browse/HBASE-25685 and the fact that > > > > > 2.4 fails weirdly using Java 21 only to find out (I should have read > > > > > the whole docs earlier) that it's hard to run async-profiler in a > > > > > container. > > > > > For us, this is all running on Kubernetes, so we'll test that today. > > > > > > > > > > Testing i tlocally it looked very promising. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Dec 5, 2023 at 3:46 PM Lars Francke < > > > lars.fran...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > I am debugging an issue where we see some Get requests taking > > > 2-5s. > > > > > > > > We do see "responseTooSlow" etc. and this is in an environment > > > where > > > > > I > > > > > > > > cannot run a Profiler but I _can_ run modified code. > > > > > > > > > > > > > > > > So what I did was I added a stupid "MethodTimer"[1] which > > > records how > > > > > > > > long certain operations take at various points in the code (e.g. > > > > > [2]). > > > > > > > > I've been doing this a few rounds and have now arrived at the > > > > > StoreScanner. > > > > > > > > > > > > > > > > I'm wondering if anyone has better ideas on how to diagnose > > > > > > > > this? > > > > > > > > I am a HBase committer but I haven't been able to keep up with > > > the > > > > > > > > changes in the last 5-6 years so I'm not too familiar with the > > > inner > > > > > > > > workings anymore and would appreciate a hint. > > > > > > > > > > > > > > > > I suspect it is slowness related to storage access. > > > > > > > > I was not able to find any logs or tweaks to log "slow storage" > > > > > > > > access, does such a thing exist? > > > > > > > > And something else that'd help me: Can anyone point me (if it > > > exists) > > > > > > > > at the (vicinity of the) code that actually reads from HDFS at > > > the > > > > > > > > end? There are so many layers. > > > > > > > > > > > > > > > > Thank you! > > > > > > > > > > > > > > > > Cheers, > > > > > > > > Lars > > > > > > > > > > > > > > > > > > > > > > > > [1] < > > > > > > > > > > > > > > > > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L150C5-L150C5 > > > > > > > > > > > > > > > > > [2] < > > > > > > > > > > > > > > > > https://github.com/stackabletech/docker-images/blob/8349f29f8aded8a01a8d1dbf7a90776ede1764ca/hbase/stackable/patches/2.4.12/005-STACKABLE-profiling-2.4.12.patch#L289-L297 > > > > > > > > > > > > > > > > > > > > > > > > >