> I am also not completely sure what you meant by "manager", what
> manager? Is that some terminology from your work or something we have
> here? Genuinely asking what you mean by that, I am lost a bit here.

Sorry, sleep deprived with a 3-month-old at the moment… manager == sidecar. The sidecar is adding async-profiler to its API; there was a thread about it a while back.

> When it comes to API, we are not touching anything already there. We
> expose this through brand new
> org.apache.cassandra.profiler.AsyncProfilerMBean.

Adding a new API isn’t a breaking change, but the point I made in the sidecar thread is that the “execute” function takes the same arguments that async-profiler does, and those could change on us over time because it’s a 3rd-party API. Exposing a 3rd-party API puts us at risk, since we normally support things for 10+ years: if they make a breaking change, then Cassandra effectively makes that change too. Will we detect this? To us it’s just a string, so how would we know this happened in time to protect our users?

> On Dec 12, 2025, at 6:45 AM, Štefan Miklošovič <[email protected]> wrote:
>
> Hi Jon, answers below
>
> On Fri, Dec 12, 2025 at 2:19 AM Jon Haddad <[email protected]> wrote:
>>
>> +1 to including it, conceptually. It's easily the best tool for diagnosing
>> perf issues that I've used. I've got a few questions / thoughts about
>> implementation details & user ergonomics.
>>
>> - Capturing call stacks in modern kernels require some params to be set.
>> Are we going to be able to check the requirements are met and give the user
>> feedback?
>
> Indeed, we go to inform a user on two occasions. First, the check will
> be executed in the context of Startup Checks "framework" we already
> have in place in Cassandra, reading respective parameters from /proc
> and a message will be logged if values of these parameters are not
> "ideal". We do not go to fail the startup if they are not though. Just
> a warning, because a user can always set it while Cassandra runs. No
> need to _fail_ the startup.
>
> However, later on, if you go to profile via "nodetool profile start"
> and these two are not set as they should be we will fail and inform a
> user that they need to set them first.
>
>> - Profiling in containers is a little weird [1]. Same type of issue as my
>> first point.
>
> I have run this in a container (Docker Compose) and I just did not
> need to do anything. It just ... worked. I think this will be on a
> user to ensure all is in place if anything special is needed. We are
> also not dealing with any "pids" here as profiling is running in JVM
> via AsyncProfiler API. (2)
>
>> - Getting allocation profiles requires debug symbols. More ergonomics.
>
> That is an old recommendation in the context of Cassandra 6.0 this
> lands in, no? Which runs on 11+. They say "Prior to JDK 11" which does
> not happen here.
>
> https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilingModes.md#installing-debug-symbols
>
>> - The profiler moves a lot faster than we do. Are we going to bump the
>> async profiler in bug fix C* releases or are we freezing the version?
>
> I would update major versions of async profiler only in major versions
> of Cassandra. Patch versions of AsyncProfiler might be updated within
> patch versions of Cassandra. That makes the most sense to me.
>
> If you want to use something more recent without Cassandra providing
> it first, you can basically do this and it should just work.
>
>> - Can I still attach using the asprof tool? Will there be an issue if I
>> attach a newer version of the profiler?
>
> As said, the fact whether we can profile in Cassandra via in-built
> profiler is driven by a system property, defaults to false. When set
> to false, that means the logic which would check kernel parameters or
> which would instantiate the AsyncProfiler object (as shown in (2))
> would not be exercised at all. Hence nothing "async-related" would be
> instantiated in Cassandra etc. Then you can just take the async
> profiler as you know it and run bin/asprof for Cassandra's PID as you
> are used to. That also answers what happens if you use a newer version
> - it would act the very same way.
>
>> - Are we relocating the jars, or does Corretto?
>
> The current patch does it in such a way that we are depending on
> AsyncProfiler and it will be eventually included in release tarball.
> So if you start Cassandra, that library will be on the class path
> (even though until a system property is set to true which enables it,
> it will not be possible to use it and it is not in any way
> instantiated or initialized, it is also not possible to enable it in
> runtime).
>
> (1)
> https://github.com/apache/cassandra/blob/1b6e538c98db4287795692b7df88aa4940c3a7af/doc/modules/cassandra/pages/managing/operating/async-profiler.adoc#using-a-different-async-profiler-version
> (2)
> https://github.com/async-profiler/async-profiler/blob/master/docs/IntegratingAsyncProfiler.md#example-usage-with-the-api
>
>>
>> Thanks!
>> Jon
>>
>> [1]
>> https://github.com/async-profiler/async-profiler/blob/master/docs/ProfilingInContainer.md
>>
>> On Thu, Dec 11, 2025 at 1:12 PM Josh McKenzie <[email protected]> wrote:
>>>
>>> If we expose whatever API the 3rd party has and they drift or break it in
>>> the future, we could introduce a shim that would keep prior ergonomics at
>>> that time w/sane defaults or graceful handling of removals.
>>>
>>> Think "manager" is referring to the sidecar here.
>>>
>>> On Thu, Dec 11, 2025, at 2:03 PM, Štefan Miklošovič wrote:
>>>
>>> Can you help me to understand what you mean by that? I have a feeling
>>> I am missing something here or we are not on the same page.
>>>
>>> When it comes to API, we are not touching anything already there. We
>>> expose this through brand new
>>> org.apache.cassandra.profiler.AsyncProfilerMBean.
>>>
>>> So we are not really breaking anything here?
>>>
>>> I am also not completely sure what you meant by "manager", what
>>> manager? Is that some terminology from your work or something we have
>>> here? Genuinely asking what you mean by that, I am lost a bit here.
>>>
>>> If you mean that "we start to call AsyncProfiler and then in later
>>> versions these guys decide that they will change how it is called" I
>>> do not think that is really an issue here, is it? A user does not deal
>>> with that directly anyway at all, only via MBean and there will
>>> presumably always be a way to start and stop profiling, that is
>>> basically at the very core of what that library is doing, no?
>>>
>>> On Thu, Dec 11, 2025 at 7:03 PM David Capwell <[email protected]> wrote:
>>>>
>>>> If disabled, which is default,
>>>>
>>>>
>>>> I def won’t block on this, I just want us to think about these possible
>>>> problems before we touch a public API; ill leave it to
>>>> author(s)/reviewer(s).
>>>>
>>>> One thing that has been brought up in a different context is if we can
>>>> make breaking changes to public facing APIs if the thing is disabled by
>>>> default (debug tables is the example); I personally don’t have clarity
>>>> here for the project so hard to say.
>>>>
>>>> TL;DR I am +0
>>>>
>>>> On Dec 11, 2025, at 3:30 AM, Štefan Miklošovič <[email protected]>
>>>> wrote:
>>>>
>>>> Oh wow! Thanks Dmitry for all these references. I think that the fact
>>>> Corretto includes that into JDK is the testament of the quality.
>>>>
>>>> David, I hope this answers your concerns pretty much?
>>>>
>>>> On Thu, Dec 11, 2025 at 12:27 PM Dmitry Konstantinov <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>> + 1 from my side
>>>>
>>>> 1) it is well known mature profiling tool, there are other cases when
>>>> Apache projects embedded it, for example:
>>>> - https://issues.apache.org/jira/browse/HADOOP-18055
>>>> - https://issues.apache.org/jira/browse/HBASE-29045
>>>> - https://issues.apache.org/jira/browse/FLINK-33325
>>>> 2) Apache-2.0 license
>>>> 3) the dependency has a small size (less than 1Mb) and does not have
>>>> transitive dependencies to other 3rd parties
>>>> 4) the main contributors are now in Amazon, it is even included into
>>>> Corretto JDK now
>>>> (https://aws.amazon.com/about-aws/whats-new/2025/10/amazon-corretto-october-2025-quarterly-updates/
>>>> )
>>>> 5) the logic is disabled by default, so no impact if you do not use it.
>>>>
>>>>
>>>> On Wed, 10 Dec 2025 at 18:08, Štefan Miklošovič <[email protected]>
>>>> wrote:
>>>>
>>>>
>>>> This capability is disabled by default, it is driven by a system
>>>> property you have to set to true in order to be able to get an
>>>> instance of AsyncProfiler which does the actual profiling. If
>>>> disabled, which is default, then any calls via nodetool which needs
>>>> AsyncProfiler (start, stop, status) would return a message that
>>>> profiling is not enabled.
>>>>
>>>> Not sure if this answers your concerns but without knowingly turning
>>>> it on nothing happens.
>>>>
>>>> On Wed, Dec 10, 2025 at 6:28 PM David Capwell <[email protected]> wrote:
>>>>
>>>>
>>>> I have no issues adding it. I think my only real comment would be the
>>>> same as with manager; w/e we expose to the public api (in this case
>>>> Nodetool) we have to support, so if a 3rd party lib breaks compatibility
>>>> that puts us in a bind if we didn’t think about that up front.
>>>>
>>>> Having async-profiler exposed makes it easier to profile is a good thing.
>>>> Manager has (or is in the process of adding) API auth so we can lock down
>>>> async-profiler to those “allowed” but do we have similar in Nodetool? We
>>>> had an issue in the past that async-profiler would trigger a JVM crash
>>>> (JVM bug), so we had to limit calls to it until it was fixed.
>>>>
>>>> On Dec 10, 2025, at 9:00 AM, Štefan Miklošovič <[email protected]>
>>>> wrote:
>>>>
>>>> Worth to mention that we were also contemplating about the inclusion
>>>> of jfr-convert so a user can also convert raw JFR files to e.g. HTML
>>>> with heatmaps but we evaluated that it is not necessary. Sure, it
>>>> would be comfortable, but ultimately not needed. Conversion of such a
>>>> file via nodetool, on server side, is just not a good idea, it is not
>>>> a job of a server to convert anything.
>>>>
>>>> In majority of cases, people using the profiler just want to get a
>>>> HTML with cpu / allocation profile, it can even gather JFR files as
>>>> such and fetch it is, it is just that the conversion as such can
>>>> happen on client's side instead.
>>>>
>>>> I am +1 for introducing the core async profiler library only.
>>>>
>>>> On Wed, Dec 10, 2025 at 5:46 PM Bernardo Botella
>>>> <[email protected]> wrote:
>>>>
>>>>
>>>> Hi everyone!
>>>>
>>>> I’d like to propose adding the async-profiler library to the Cassandra
>>>> project. This will enable us to add a new nodetool command to do profiling
>>>> tasks on the process running Cassandra. This information can be useful to
>>>> debug a wide range of potential issues and performance optimizations.
>>>> CASSANDRA-20854 captures the effort and the details of the proposal, and
>>>> this PR proposes its implementation.
>>>>
>>>> I want to note that this feature was already discussed in this thread, and
>>>> this one only want to make sure that no one has any concerns about adding
>>>> the library as a dependency.
>>>>
>>>> What is async-profiler?
>>>> async-profiler is a low overhead sampling profiler for Java that does not
>>>> suffer from the Safepoint bias problem. It features HotSpot-specific API
>>>> to collect stack traces and to track memory allocations. The profiler
>>>> works with OpenJDK and other Java runtimes based on the HotSpot JVM.
>>>>
>>>> Unlike traditional Java profilers, async-profiler monitors non-Java
>>>> threads (e.g., GC and JIT compiler threads) and shows native and kernel
>>>> frames in stack traces.
>>>>
>>>> What can be profiled:
>>>>
>>>> CPU time
>>>> Allocations in Java Heap
>>>> Native memory allocations and leaks
>>>> Contended locks
>>>> Hardware and software performance counters like cache misses, page faults,
>>>> context switches
>>>> and more.
>>>>
>>>>
>>>> We propose to add async-profiler 4.2 as a dependency to Cassandra.
>>>>
>>>> Any concerns?
>>>> Bernardo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Dmitry Konstantinov
>>>>
>>>>
>>>
>>>
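
On the "to us it's just a string" concern from the top of this thread: one possible shape for the shim Josh mentions is a small guard that parses the command before it ever reaches the profiler and rejects anything outside a pinned allowlist. The sketch below is only an illustration under stated assumptions. The class name ProfilerCommandGuard and both allowlists are hypothetical and are not part of the patch or of async-profiler, and it assumes the comma-separated "action,key=value" command form that async-profiler accepts. It deliberately does not call the profiler library at all.

```java
import java.util.Set;

// Hypothetical guard, not part of Cassandra or async-profiler.
public class ProfilerCommandGuard {
    // Assumed allowlists; a real list would be pinned to the bundled profiler version.
    private static final Set<String> ACTIONS = Set.of("start", "stop", "status");
    private static final Set<String> OPTION_KEYS = Set.of("event", "interval", "file");

    /** Returns the command unchanged when every part is allowlisted, otherwise throws. */
    public static String validate(String command) {
        if (command == null || command.isBlank())
            throw new IllegalArgumentException("empty profiler command");

        String[] parts = command.split(",");
        String action = parts[0].trim();
        if (!ACTIONS.contains(action))
            throw new IllegalArgumentException("unsupported action: " + action);

        // Remaining parts are either "key=value" pairs or bare flags.
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            String key = eq < 0 ? part : part.substring(0, eq);
            if (!OPTION_KEYS.contains(key))
                throw new IllegalArgumentException("unsupported option: " + key);
        }
        return command;
    }

    public static void main(String[] args) {
        System.out.println(validate("start,event=cpu,interval=10ms"));
        try {
            validate("start,shiny_new_flag=1");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With something like this in front of the MBean, an option that the 3rd-party library renames or removes would show up as a failing allowlist test during a version bump, rather than as a silent behavior change for users.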
