[ https://issues.apache.org/jira/browse/CASSANDRA-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886529#comment-17886529 ]

Dmitry Konstantinov commented on CASSANDRA-19702:
-------------------------------------------------

Please find below the analysis of the test failure, together with a suggestion 
on how to fix it. It took more time than I initially expected because I 
discovered some issues that are not directly related to the test failure but 
are important in general.

== Background
The failing test was introduced as part of CASSANDRA-8670 
([https://github.com/aweisberg/cassandra/commit/1b2e4f31871a4d0cadfb3bce0d1e38c711ecdeef])
 to verify the fix for native memory leaking through JDK-level thread-local 
caches for direct buffers.
The test sends a few CQL inserts with a large column value (63 MiB) and asserts 
that the Cassandra server's direct byte buffer memory usage does not increase 
significantly afterwards.
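The test's pass/fail criterion can be sketched roughly as follows (a minimal sketch; the real dtest reads direct buffer usage from the server, e.g. via the java.nio:type=BufferPool,name=direct MXBean over JMX — the helper name and the 2x threshold below are illustrative assumptions, not the test's actual values):

```python
# Sketch of the test's criterion: direct buffer memory should not grow
# substantially after a few large inserts. grew_substantially() and the
# 2x threshold are illustrative assumptions, not the dtest's actual logic.

MIB = 1024 * 1024

def grew_substantially(before_bytes: int, after_bytes: int,
                       max_growth_factor: float = 2.0) -> bool:
    """Return True if direct buffer usage grew beyond the allowed factor."""
    return after_bytes > before_bytes * max_growth_factor

# Usage roughly stable -> test passes
assert not grew_substantially(10 * MIB, 12 * MIB)
# A 63 MiB value retained in a thread-local cache -> test fails
assert grew_substantially(10 * MIB, 10 * MIB + 63 * MIB)
```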

Observation: [https://bugs.openjdk.org/browse/JDK-8147468] - a JDK-level fix 
for the issue (the -Djdk.nio.maxCachedBufferSize option) was implemented some 
time after CASSANDRA-8670 and backported to JDK 8.
I have created CASSANDRA-19976 to consider whether it makes sense to add 
-Djdk.nio.maxCachedBufferSize to the default Cassandra JVM settings.
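If CASSANDRA-19976 is accepted, the change would presumably be a single line in conf/jvm-server.options; the property name comes from JDK-8147468, while the 1 MiB value below is only an illustrative choice, not a recommendation from this ticket:

```
# Cap the per-thread cache of temporary direct buffers (JDK-8147468).
# Buffers larger than this limit are freed instead of being cached.
# 1 MiB here is an illustrative value only.
-Djdk.nio.maxCachedBufferSize=1048576
```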

== Root cause analysis

I have observed two causes for the test failure:
1) Sometimes the test fails with an OutOfMemoryError in one of the started 
Cassandra server nodes. Heap dump analysis shows a kind of memory leak through 
thread locals; more details are captured in CASSANDRA-19942. The issue has been 
fixed recently in CASSANDRA-19785.

2) The primary reason for the test failure is hitting a specific corner case of 
the node overload logic. The test sends a big request message. The v5 protocol 
is used, so the message is split into multiple frames which are re-aggregated 
on the server side; more precisely, the 
org.apache.cassandra.transport.CQLMessageHandler#processFirstFrameOfLargeMessage
 method is used to process the first frame of the large request. The attached 
diagram shows it in more detail: [^cassandra_rate_limit.svg]
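To make the framing step concrete, here is a rough Python model of the split/re-aggregate behaviour. The 128 KiB maximum frame payload is an illustrative assumption; real v5 framing also adds headers, CRCs, and optional compression, which are omitted here:

```python
# Rough model of v5 large-message handling: a message bigger than one frame
# is split into payload-sized chunks by the client and re-aggregated on the
# server. The 128 KiB payload size is an illustrative assumption.

FRAME_PAYLOAD = 128 * 1024

def split_into_frames(message: bytes) -> list:
    return [message[i:i + FRAME_PAYLOAD]
            for i in range(0, len(message), FRAME_PAYLOAD)]

def reassemble(frames: list) -> bytes:
    # Conceptually, the server accumulates frames, starting from
    # processFirstFrameOfLargeMessage for the first one.
    return b"".join(frames)

message = bytes(63 * 1024 * 1024)   # 63 MiB request body, as in the test
frames = split_into_frames(message)
assert len(frames) == 504           # 63 MiB / 128 KiB
assert reassemble(frames) == message
```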

Observations:
 - the test message size is bigger than both the global and the per-endpoint 
limits: the global limit is controlled by the 
native_transport_max_request_data_in_flight config option and defaults to 
Xmx / 10. The test is executed with Xmx = 495 MiB, so 
native_transport_max_request_data_in_flight = 49 MiB.
The same applies to the per-endpoint limit: it is controlled by the 
native_transport_max_request_data_in_flight_per_ip option with a default of 
Xmx / 40 = 12 MiB. So both limits are smaller than the big value sent by the 
test.

 - when we cannot allocate the required capacity for a big message, we still 
process it, and it is not actually classified as an overload. The overload 
logic was changed when v5 protocol frames were introduced, and it was later 
adjusted to avoid client starvation in commit 
[https://github.com/beobal/cassandra/commit/0aae5b4921a3ce6c21a8a0e977624c877b19cd5b]
 (part of CASSANDRA-14973).

 - the error log message "Could not aquire capacity while processing native 
protocol message" was added recently as part of the CASSANDRA-19534 commit, but 
for an already existing logic branch; I suppose it was done to highlight that 
we are hitting a pathological case.

 - the generic logic of the Python dtest checks the logs for any errors and 
fails because it observes this error message.
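The arithmetic above, and the corner case it triggers, can be illustrated with a small sketch (the limiter below is a toy model of the capacity check, not Cassandra's implementation):

```python
# Default in-flight limits derived from the heap size (values in MiB).
XMX_MIB = 495
GLOBAL_LIMIT = XMX_MIB // 10   # native_transport_max_request_data_in_flight
PER_IP_LIMIT = XMX_MIB // 40   # native_transport_max_request_data_in_flight_per_ip
MESSAGE_MIB = 63               # column value size sent by the test

assert GLOBAL_LIMIT == 49
assert PER_IP_LIMIT == 12
assert MESSAGE_MIB > GLOBAL_LIMIT and MESSAGE_MIB > PER_IP_LIMIT

def process_large_message(size_mib, in_flight_mib=0):
    """Toy model of the corner case: when capacity cannot be acquired for a
    large message, an error is logged, but the message is still processed
    rather than being rejected as an overload."""
    acquired = in_flight_mib + size_mib <= min(GLOBAL_LIMIT, PER_IP_LIMIT)
    if not acquired:
        # Matches the log line quoted in this ticket (typo included).
        print("ERROR: Could not aquire capacity while processing "
              "native protocol message")
    return "processed"  # processed either way

assert process_large_message(MESSAGE_MIB) == "processed"
```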

== Suggested changes

The current test was not designed to test overload use cases, so I suggest the 
following fix plan:
1) fix the test itself by configuring 
native_transport_max_request_data_in_flight and 
native_transport_max_request_data_in_flight_per_ip so that the messages are 
handled by the normal processing logic. MR for the change: 
[https://github.com/apache/cassandra-dtest/pull/268]

2) I will create a separate ticket to improve handling of the pathological 
overload branch. Currently, we do not have real protection against overloading 
a Cassandra server with big CQL messages.
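A sketch of the dtest change: raise both limits above the 63 MiB message size before starting the cluster. The 100MiB values below are illustrative only (see the MR for the actual ones); set_configuration_options is the standard ccm cluster API used by dtests:

```python
# Sketch of the dtest fix: raise both in-flight limits above the 63 MiB
# test value so the message goes through the normal processing path.
# The 100MiB values are illustrative; see the MR for the actual ones.

config_overrides = {
    'native_transport_max_request_data_in_flight': '100MiB',
    'native_transport_max_request_data_in_flight_per_ip': '100MiB',
}

def parse_mib(value):
    """Parse a Cassandra data-storage value like '100MiB' into an int."""
    assert value.endswith('MiB')
    return int(value[:-len('MiB')])

# Both limits must exceed the 63 MiB column value sent by the test.
assert all(parse_mib(v) > 63 for v in config_overrides.values())

# In the dtest itself this would be applied before cluster start, e.g.:
#   cluster.set_configuration_options(values=config_overrides)
```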

> Test failure: largecolumn_test.TestLargeColumn
> ----------------------------------------------
>
>                 Key: CASSANDRA-19702
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19702
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Brandon Williams
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 4.1.x
>
>         Attachments: cassandra_rate_limit.svg, ci_summary.html
>
>
> https://app.circleci.com/pipelines/github/driftx/cassandra/1671/workflows/980121dd-ac7f-4747-91fa-391d961c9c5b/jobs/93813/tests
> {quote}
> failed on teardown with "Unexpected error found in node logs (see stdout for 
> full details). Errors: [[node1] 'ERROR [epollEventLoopGroup-5-8] 2024-06-13 
> 15:06:46,808 NoSpamLogger.java:111 - Could not aquire capacity while 
> processing native protocol message', [node2] 'ERROR [epollEventLoopGroup-5-4] 
> 2024-06-13 15:06:48,851 NoSpamLogger.java:111 - Could not aquire capacity 
> while processing native protocol message']"
> {quote}
