[
https://issues.apache.org/jira/browse/CASSANDRA-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17928218#comment-17928218
]
Dmitry Konstantinov edited comment on CASSANDRA-20250 at 2/18/25 11:38 PM:
---------------------------------------------------------------------------
I have implemented logic based on PhantomReference to release the metric ids
used by metric objects (I plan to cover it with tests and commit soon). I used
the LocalPool logic related to PhantomReferences as an example.
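A minimal sketch of how releasing ids through PhantomReference could look. This
is not the actual patch: MetricIdAllocator, IdRef, allocate and
releaseCollected are hypothetical names, and the real id bookkeeping may differ.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: hands out metric ids and reclaims them once the owning
// metric object has been garbage collected.
final class MetricIdAllocator {
    // Phantom reference that remembers which id its referent owned.
    static final class IdRef extends PhantomReference<Object> {
        final int id;

        IdRef(Object metric, int id, ReferenceQueue<Object> queue) {
            super(metric, queue);
            this.id = id;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // The references themselves must stay strongly reachable, otherwise they
    // could be collected together with their referents and never be enqueued.
    private final Set<IdRef> liveRefs = new HashSet<>();
    private final ArrayDeque<Integer> freeIds = new ArrayDeque<>();
    private int nextId = 0;

    // Assigns an id to the metric, reusing a released id when one is available.
    synchronized IdRef allocate(Object metric) {
        releaseCollected();
        Integer reused = freeIds.poll();
        int id = reused != null ? reused : nextId++;
        IdRef ref = new IdRef(metric, id, queue);
        liveRefs.add(ref);
        return ref;
    }

    // Drains the queue and returns the ids of collected metrics to the free list.
    synchronized int releaseCollected() {
        int released = 0;
        Reference<?> r;
        while ((r = queue.poll()) != null) {
            IdRef ref = (IdRef) r;
            if (liveRefs.remove(ref)) {
                freeIds.add(ref.id);
                released++;
            }
        }
        return released;
    }
}
```

The allocator never holds a hard reference to the metric object itself, only
to the IdRef, so the metric can still become phantom reachable.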
In the case of ThreadLocalMetrics I see the following resolvable issue: if I
track ThreadLocalMetrics objects themselves, I cannot keep a hard reference to
them, because otherwise the PhantomReference would never trigger. At the same
time I need an accurate value for count, so I cannot lose the content of a
ThreadLocalMetrics object when its thread is dead: the count logic still needs
access to it until the content has been transferred to a summary. So, we need
a list of ThreadLocalMetrics (or a list of ThreadLocalMetrics inner arrays) to
be able to calculate and return an actual count.
One possible way to resolve this issue is to use a PhantomReference to track
another object associated with the thread, for example the Thread itself:
!image-2025-02-18-23-22-19-983.png|width=600!
So, we have a schema like this. We keep hard references to ThreadLocalMetrics
objects and access them to calculate a count even if a thread is dead; we use
a PhantomReference that refers to the Thread object itself to trigger the
eventual transfer of the ThreadLocalMetrics values to a summary and the
removal of the ThreadLocalMetrics object from the shared list, which releases
it.
This is the simplest way to handle unused ThreadLocalMetrics that I have found
so far.
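The schema can be sketched roughly as follows. All names here
(ThreadLocalMetricsSketch, PhantomThreadTracker, track, drain) are hypothetical,
a single long stands in for the real inner array, and coarse synchronization
is used only to keep the example short.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-thread counter holder. It must not refer
// back to its Thread, or the Thread would never become phantom reachable.
final class ThreadLocalMetricsSketch {
    volatile long value; // written only by the owning thread
}

final class PhantomThreadTracker {
    private final ReferenceQueue<Thread> queue = new ReferenceQueue<>();
    // Key: phantom reference to the owning Thread. Value: hard reference to its
    // metrics, so the values stay readable after the thread has died.
    private final Map<Reference<? extends Thread>, ThreadLocalMetricsSketch> shared =
            new HashMap<>();
    private long summary;

    // The reference is returned only so that the example can be exercised.
    synchronized PhantomReference<Thread> track(Thread owner, ThreadLocalMetricsSketch metrics) {
        PhantomReference<Thread> ref = new PhantomReference<>(owner, queue);
        shared.put(ref, metrics);
        return ref;
    }

    // Summary plus everything not yet transferred, including dead threads.
    synchronized long count() {
        long total = summary;
        for (ThreadLocalMetricsSketch m : shared.values()) {
            total += m.value;
        }
        return total;
    }

    // Transfers the values of collected threads to the summary and removes
    // their metrics from the shared map, which releases them.
    synchronized int drain() {
        int transferred = 0;
        Reference<? extends Thread> ref;
        while ((ref = queue.poll()) != null) {
            ThreadLocalMetricsSketch m = shared.remove(ref);
            if (m != null) {
                summary += m.value;
                transferred++;
            }
        }
        return transferred;
    }

    synchronized int trackedCount() {
        return shared.size();
    }
}
```

Because count() and drain() share one lock, a value is never counted twice or
missed while it moves from the shared map to the summary.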
An additional thought about PhantomReferences: if the referent object (the
Thread in our case) is long-lived and has been promoted to the old generation,
we may wait quite a long time for an old gen GC to collect it, so the related
ThreadLocalMetrics can be kept alive much longer (which adds to the overhead
of getting a count) compared to the option where we clean up
ThreadLocalMetrics with a periodic task that checks thread aliveness directly.
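For comparison, a rough sketch of the periodic-task alternative. The names
(ThreadCounts, AlivenessSweeper, sweep, schedule) are hypothetical; checking
for Thread.State.TERMINATED rather than !isAlive() is my assumption, chosen so
that threads which have not started yet are not swept.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-thread counter holder.
final class ThreadCounts {
    volatile long value; // written only by the owning thread
}

final class AlivenessSweeper {
    // A hard reference to the Thread is fine here: the entry is removed by
    // the first sweep that runs after the thread has terminated.
    private final Map<Thread, ThreadCounts> shared = new HashMap<>();
    private long summary;

    synchronized void track(Thread owner, ThreadCounts counts) {
        shared.put(owner, counts);
    }

    synchronized long count() {
        long total = summary;
        for (ThreadCounts c : shared.values()) {
            total += c.value;
        }
        return total;
    }

    // Moves the values of terminated threads to the summary and drops them.
    synchronized int sweep() {
        int removed = 0;
        Iterator<Map.Entry<Thread, ThreadCounts>> it = shared.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Thread, ThreadCounts> e = it.next();
            if (e.getKey().getState() == Thread.State.TERMINATED) {
                summary += e.getValue().value;
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // Runs sweep() periodically; the period is chosen by the caller.
    ScheduledFuture<?> schedule(ScheduledExecutorService executor, long periodSeconds) {
        return executor.scheduleWithFixedDelay(
                this::sweep, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    synchronized int trackedCount() {
        return shared.size();
    }
}
```

With this variant the cleanup delay is bounded by the sweep period instead of
depending on when the GC processes the Thread object.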
> Optimize Counter, Meter and Histogram metrics using thread local counters
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-20250
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20250
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Observability/Metrics
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 5.1_profile_cpu.html,
> 5.1_profile_cpu_without_metrics.html, 5.1_tl4_profile_cpu.html,
> Histogram_AtomicLong.png, async_profiler_cpu_profiles.zip,
> cpu_profile_insert.html, image-2025-02-18-23-22-19-983.png, jmh-result.json,
> vmstat.log, vmstat_without_metrics.log
>
>
> Cassandra collects a lot of metrics, and many of them are collected per
> table, so the number of metric instances is multiplied by the number of
> tables. On the one hand this gives better observability; on the other hand
> metrics are not free, there is an overhead associated with them:
> 1) CPU overhead: in the case of a simple CPU-bound load, I already see about
> 5.5% of total CPU spent on metrics in CPU flamegraphs for a read load and 11%
> for a write load.
> Example: [^cpu_profile_insert.html] (search by "codahale" pattern). The
> flamegraph is captured using Async profiler build:
> async-profiler-3.0-29ee888-linux-x64
> 2) Memory overhead: we spend memory on the entities used to aggregate metrics,
> such as LongAdders and reservoirs, plus on MBeans (String concatenation within
> object names is a major cause of it: for each table+metric name combination a
> new String is created).
> LongAdder is used by Dropwizard Counter/Meter and Histogram metrics for
> counting purposes. It has a severe memory overhead and, while it scales better
> than AtomicLong, we still have to pay some cost for the concurrent operations.
> Additionally, in the case of Meter, we have non-optimal behaviour where we
> count the same things several times.
> The idea (suggested by [~benedict]) is to switch to thread-local counters
> which we can store in a common thread-local array to reduce memory overhead.
> In this way we can avoid concurrent update overheads/contention and reduce
> the memory footprint as well.