[
https://issues.apache.org/jira/browse/CASSANDRA-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023417#comment-18023417
]
Dmitry Konstantinov commented on CASSANDRA-20250:
-------------------------------------------------
Test run results (https://pre-ci.cassandra.apache.org/job/cassandra/107/)
* [^CASSANDRA-20250_ci_summary.html]
* [^CASSANDRA-20250_results_details.tar.xz]
The following failed tests are not related to the current ticket changes:
dtest-latest.cqlsh_tests.test_cqlsh.TestCqlsh test_unicode_invalid_request_error
dtest-latest.client_request_metrics_test.TestClientRequestMetrics test_client_request_metrics
dtest.cqlsh_tests.test_cqlsh.TestCqlsh test_unicode_invalid_request_error
dtest.client_request_metrics_test.TestClientRequestMetrics test_client_request_metrics
distributed.test.log.InProgressSequenceCoordinationTest bootstrapProgressTest-_jdk17_x86_64
distributed.test.log.InProgressSequenceCoordinationTest decommissionProgressTest-_jdk17_x86_64
distributed.test.log.InProgressSequenceCoordinationTest inProgressSequenceRetryTest-_jdk17_x86_64
fuzz.topology.AccordBootstrapTest bootstrapFuzzTest-cassandra.testtag_IS_UNDEFINED
fuzz.topology.JournalGCTest journalGCTest-_jdk11_x86_64
fuzz.topology.JournalGCTest journalGCTest-_jdk17_x86_64
simulator.test.AccordHarrySimulationTest test-cassandra.testtag_IS_UNDEFINED
simulator.test.ShortAccordSimulationTest simulationTest-_jdk11_x86_64
simulator.test.SingleNodeSingleTableASTTest normal-cassandra.testtag_IS_UNDEFINED
transport.AuthMessageSizeLimitTest sendTooBigAuthMultiFrameMessage-latest_jdk17_x86_64
TestClientRequestMetrics also fails on the current trunk
(https://ci-cassandra.apache.org/job/Cassandra-trunk).
> Optimize Counter, Meter and Histogram metrics using thread local counters
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-20250
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20250
> Project: Apache Cassandra
> Issue Type: New Feature
> Components: Observability/Metrics
> Reporter: Dmitry Konstantinov
> Assignee: Dmitry Konstantinov
> Priority: Normal
> Fix For: 5.x
>
> Attachments: 5.1_profile_cpu.html,
> 5.1_profile_cpu_without_metrics.html, 5.1_tl4_profile_cpu.html,
> CASSANDRA-20250_ci_summary.html, CASSANDRA-20250_results_details.tar.xz,
> Histogram_AtomicLong.png, async_profiler_cpu_profiles.zip,
> cas_reverse_graph_metrics.png, cpu_profile_insert.html,
> image-2025-02-18-23-22-19-983.png, jmh-result.json, vmstat.log,
> vmstat_without_metrics.log
>
> Time Spent: 11h 40m
> Remaining Estimate: 0h
>
> Cassandra collects a lot of metrics, and many of them are collected per
> table, so the number of metric instances is multiplied by the number of
> tables. On the one hand this gives better observability; on the other hand,
> metrics are not free, and there is an overhead associated with them:
> 1) CPU overhead: for a simple CPU-bound load, I already see about 5.5% of
> total CPU spent on metrics in CPU flame graphs for read load and 11% for
> write load.
> Example: [^cpu_profile_insert.html] (search for the "codahale" pattern). The
> flame graph was captured using this Async profiler build:
> async-profiler-3.0-29ee888-linux-x64
> 2) Memory overhead: we spend memory on the entities used to aggregate
> metrics, such as LongAdders and reservoirs, and on MBeans (String
> concatenation within object names is a major cause of this: a new String is
> created for each table + metric name combination).
> LongAdder is used by the Dropwizard Counter, Meter and Histogram metrics for
> counting purposes. It has a severe memory overhead, and while it scales
> better than AtomicLong, we still have to pay some cost for the concurrent
> operations. Additionally, in the case of Meter we have non-optimal
> behaviour: we count the same things several times.
> The idea (suggested by [~benedict]) is to switch to thread-local counters,
> which we can store in a common thread-local array to reduce memory overhead.
> In this way we can avoid the overhead and contention of concurrent updates
> and reduce the memory footprint as well.
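
For illustration only, the thread-local counter idea described above could be sketched roughly as follows. This is not the code from the ticket's patch: the class name ThreadLocalCounters, the fixed MAX_COUNTERS capacity and the registry of per-thread arrays are all assumptions made for the example. Each counter gets an index into a per-thread array, each thread writes only to its own array, and a reader sums that index across all arrays.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: one shared array of counter slots per thread.
final class ThreadLocalCounters {
    // Assumed fixed capacity; a real implementation would need to grow.
    static final int MAX_COUNTERS = 1024;

    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    // Every per-thread array, so that readers can aggregate across threads.
    // Arrays of finished threads are kept here so their counts are not lost.
    private static final List<AtomicLongArray> ALL_SLOTS = new CopyOnWriteArrayList<>();

    private static final ThreadLocal<AtomicLongArray> SLOTS = ThreadLocal.withInitial(() -> {
        AtomicLongArray slots = new AtomicLongArray(MAX_COUNTERS);
        ALL_SLOTS.add(slots);
        return slots;
    });

    static Counter newCounter() {
        int id = NEXT_ID.getAndIncrement();
        if (id >= MAX_COUNTERS)
            throw new IllegalStateException("too many counters");
        return new Counter(id);
    }

    static final class Counter {
        private final int id; // index of this counter in every per-thread array

        private Counter(int id) { this.id = id; }

        // Only the owning thread writes its slot, so no CAS loop is needed;
        // lazySet publishes the new value without a full fence.
        void inc(long n) {
            AtomicLongArray slots = SLOTS.get();
            slots.lazySet(id, slots.get(id) + n);
        }

        // Sums the slot over all threads; the result may lag in-flight updates.
        long sum() {
            long total = 0;
            for (AtomicLongArray slots : ALL_SLOTS)
                total += slots.get(id);
            return total;
        }
    }
}
```

In this sketch a counter costs one long slot per thread, and updates never contend because each thread touches only its own array. The trade-off is that reads have to walk every per-thread array, which seems acceptable for metrics that are updated far more often than they are read.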
--
This message was sent by Atlassian Jira
(v8.20.10#820010)