[ https://issues.apache.org/jira/browse/CASSANDRA-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17886529#comment-17886529 ]
Dmitry Konstantinov commented on CASSANDRA-19702:
-------------------------------------------------
Please find below the analysis of the test failure, with a suggestion for how to fix it. It took more time than I initially expected due to discovering some issues which are not directly related to the test failure but are important in general.

== Background
The failing test was introduced as a part of CASSANDRA-8670 ([https://github.com/aweisberg/cassandra/commit/1b2e4f31871a4d0cadfb3bce0d1e38c711ecdeef]) to verify the fix for native memory leaking through JDK-level thread-local caches for direct buffers. The test sends a few CQL inserts with a large column value (63 MiB) and asserts that the Cassandra server's direct byte buffer memory usage does not increase significantly afterwards.

Observation: [https://bugs.openjdk.org/browse/JDK-8147468] - a JDK-level fix for the issue (adding the -Djdk.nio.MaxCachedBufferSize option) was implemented some time after CASSANDRA-8670 and backported to JDK 8. I have created a ticket to consider whether it makes sense to add -Djdk.nio.MaxCachedBufferSize to the default Cassandra JVM settings: CASSANDRA-19976

== Root cause analysis
I have observed 2 causes for the test failure:
1) Sometimes the test fails with an OutOfMemory error in one of the started Cassandra server nodes. Based on heap dump analysis there is a kind of memory leak through thread locals; more details are captured in CASSANDRA-19942, and the issue has been fixed recently in CASSANDRA-19785.
2) The primary reason for the test failure is hitting a specific corner case of the node overload logic. The test sends a big request message. The v5 protocol is used, so the message is split into multiple frames which are re-aggregated on the server side; more precisely, the logic in the org.apache.cassandra.transport.CQLMessageHandler#processFirstFrameOfLargeMessage method is used to process the first frame of the large request.
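The corner case can be illustrated with a simplified model (this is my own sketch, not the actual Cassandra code; class and function names are invented for illustration):

```python
# Simplified model of the overload corner case described above: when the first
# frame of a large message arrives, the server tries to reserve the full message
# size against the global and per-endpoint in-flight limits. In the pathological
# branch, a failed reservation does NOT reject the message - it is still
# processed, and only an error line is logged (the line the dtest trips over).

class CapacityTracker:
    """Toy stand-in for an in-flight bytes limit."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.in_flight = 0

    def try_acquire(self, size: int) -> bool:
        if self.in_flight + size > self.limit:
            return False
        self.in_flight += size
        return True

def process_first_frame(size: int,
                        global_cap: CapacityTracker,
                        endpoint_cap: CapacityTracker,
                        log: list) -> bool:
    acquired = global_cap.try_acquire(size) and endpoint_cap.try_acquire(size)
    if not acquired:
        # mirrors the real error message (including its "aquire" typo)
        log.append("Could not aquire capacity while processing native protocol message")
    return True  # processed either way - no real back-pressure in this branch
```

The actual logic in CQLMessageHandler is more involved (it also depends on client options negotiated at startup); this only models the branch relevant to the test failure.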
The attached diagram shows it in more detail: [^cassandra_rate_limit.svg]

Observations:
- the test message size is bigger than both the global and per-endpoint limits: the global limit is controlled by the native_transport_max_request_data_in_flight config option and by default it is Xmx / 10. The test is executed with Xmx = 495 MiB, so native_transport_max_request_data_in_flight = 49 MiB. The same applies to the per-endpoint limit, which is controlled by the native_transport_max_request_data_in_flight_per_ip option with a default of Xmx / 40 = 12 MiB. So, both limits are less than the big value sent by the test.
- when we cannot allocate the required capacity for a big message we still process it, and it is not actually classified as an overload. The overload logic was changed when v5 protocol frames were introduced, and later it was adjusted additionally to avoid client starvation in commit [https://github.com/beobal/cassandra/commit/0aae5b4921a3ce6c21a8a0e977624c877b19cd5b] (a part of CASSANDRA-14973).
- the error log message "Could not aquire capacity while processing native protocol message" was added recently as a part of the CASSANDRA-19534 commit, but it was added for the already existing logic branch; I suppose it was done to highlight that we are hitting a pathological case.
- the generic python dtest logic checks for any errors in the logs and fails upon observing this error message.

== Suggested changes
The current test itself was not designed for testing overload use cases, so I suggest the following fix plan:
1) fix the test itself by configuring native_transport_max_request_data_in_flight and native_transport_max_request_data_in_flight_per_ip so that the messages are handled by the normal processing logic. MR for the change: [https://github.com/apache/cassandra-dtest/pull/268]
2) I will create a separate ticket to improve the handling of the pathological overload branch. Currently, we do not have real protection against overloading a Cassandra server with big CQL messages.
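As a sanity check of the limit arithmetic above (values taken from the test setup; the helper function is my own, not part of Cassandra or the dtest):

```python
# Default native transport in-flight data limits derived from the heap size,
# as described above: global = Xmx / 10, per endpoint = Xmx / 40.
def default_limits_mib(xmx_mib: int) -> tuple[int, int]:
    global_limit = xmx_mib // 10   # native_transport_max_request_data_in_flight
    per_ip_limit = xmx_mib // 40   # native_transport_max_request_data_in_flight_per_ip
    return global_limit, per_ip_limit

xmx_mib = 495      # heap size used by the dtest nodes
message_mib = 63   # large column value sent by the test

global_limit, per_ip_limit = default_limits_mib(xmx_mib)
print(global_limit, per_ip_limit)   # 49 12
# the single test message exceeds both limits:
print(message_mib > global_limit, message_mib > per_ip_limit)   # True True
```

This is why a single 63 MiB insert is enough to drive both nodes into the pathological branch with the default configuration.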
> Test failure: largecolumn_test.TestLargeColumn
> ----------------------------------------------
>
>                 Key: CASSANDRA-19702
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19702
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Core
>            Reporter: Brandon Williams
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 4.1.x
>
>         Attachments: cassandra_rate_limit.svg, ci_summary.html
>
>
> https://app.circleci.com/pipelines/github/driftx/cassandra/1671/workflows/980121dd-ac7f-4747-91fa-391d961c9c5b/jobs/93813/tests
> {quote}
> failed on teardown with "Unexpected error found in node logs (see stdout for full details). Errors: [[node1] 'ERROR [epollEventLoopGroup-5-8] 2024-06-13 15:06:46,808 NoSpamLogger.java:111 - Could not aquire capacity while processing native protocol message', [node2] 'ERROR [epollEventLoopGroup-5-4] 2024-06-13 15:06:48,851 NoSpamLogger.java:111 - Could not aquire capacity while processing native protocol message']"
> {quote}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)