Paul Chandler created CASSANDRA-19785:
-----------------------------------------

             Summary: Possible memory leak in BTree.FastBuilder 
                 Key: CASSANDRA-19785
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19785
             Project: Cassandra
          Issue Type: Bug
            Reporter: Paul Chandler
         Attachments: image-2024-07-19-08-44-56-714.png, 
image-2024-07-19-08-45-17-289.png, image-2024-07-19-08-45-33-933.png, 
image-2024-07-19-08-45-50-383.png, image-2024-07-19-08-46-06-919.png, 
image-2024-07-19-08-46-42-979.png, image-2024-07-19-08-46-56-594.png, 
image-2024-07-19-08-47-19-517.png, image-2024-07-19-08-47-34-582.png

We are having a problem with the heap growing in size. This is a large cluster 
(more than 1,000 nodes) spread across a large number of DCs.

 

Each node has a 32GB heap. The amount used continues to grow until it reaches 
30GB, at which point the node struggles with multiple Full GC pauses, as can be 
seen here:



!image-2024-07-19-08-44-56-714.png!

We took 2 heap dumps on one node a few days after it was restarted, and the 
heap had grown by 2.7GB between them.

 

9th July

!image-2024-07-19-08-45-17-289.png!

11th July

!image-2024-07-19-08-45-33-933.png!

This is mainly an increase in the memory used by FastThreadLocalThread, which 
grew from 5.92GB to 8.53GB.

!image-2024-07-19-08-45-50-383.png!

!image-2024-07-19-08-46-06-919.png!

Looking deeper, the growing heap is contained within the threads for the 
MutationStage, Native-Transport-Requests, ReadStage, etc. We would expect the 
memory used within these threads to be short-lived and not to grow over time. 
We recently increased the size of these thread pools, and that has increased 
the size of the problem.

 

Top memory usage for FastThreadLocalThread

9th July



!image-2024-07-19-08-46-42-979.png!

11th July


!image-2024-07-19-08-46-56-594.png!

This has led us to investigate whether there could be a memory leak, and we 
have found the following issues with references retained in BTree.FastBuilder 
objects. The issue appears to stem from the reset() method, which does not 
properly clear all buffers. We are not really sure how BTree.FastBuilder 
works, but this is our analysis of where a leak might occur.

 

Specifically:

Leaf Buffer Not Being Cleared:
When leaf().count is 0, the statement Arrays.fill(leaf().buffer, 0, 
leaf().count, null); does not clear the buffer because the end index is 0. This 
leaves the buffer with references to potentially large objects, preventing 
garbage collection and increasing heap usage.

Branch inUse Property:
If the inUse property of the branch is set to false elsewhere in the code, the 
while loop while (branch != null && branch.inUse) does not execute, resulting 
in uncleared branch buffers and retained references.
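
To illustrate the leaf buffer point in isolation, here is a minimal standalone sketch. It is not Cassandra code: the Leaf class and the helper methods are hypothetical stand-ins, and the assumption that count is set to 0 without nulling the slots is taken from the analysis above. It only shows that Arrays.fill over an empty range leaves every slot untouched.

```java
import java.util.Arrays;

public class EmptyRangeFillDemo {
    // Hypothetical stand-in for a builder leaf: a reusable buffer plus a live-element count.
    static final class Leaf {
        final Object[] buffer = new Object[4];
        int count;
    }

    // Clears only the range [0, count), mirroring the statement quoted above.
    static void clearUpToCount(Leaf leaf) {
        Arrays.fill(leaf.buffer, 0, leaf.count, null);
    }

    // Counts the slots that still hold a reference.
    static int retained(Leaf leaf) {
        int n = 0;
        for (Object o : leaf.buffer) {
            if (o != null) n++;
        }
        return n;
    }

    // Builds a leaf whose slots are populated but whose count is already 0
    // (assumed state: count was reset without nulling the slots).
    static Leaf filledThenDrained() {
        Leaf leaf = new Leaf();
        leaf.buffer[0] = new byte[1024];
        leaf.buffer[1] = new byte[1024];
        leaf.count = 0;
        return leaf;
    }

    public static void main(String[] args) {
        Leaf leaf = filledThenDrained();

        clearUpToCount(leaf); // empty range, nothing is cleared
        System.out.println("retained after clearUpToCount: " + retained(leaf)); // 2

        Arrays.fill(leaf.buffer, null); // clearing the whole array releases the references
        System.out.println("retained after full fill: " + retained(leaf)); // 0
    }
}
```

Whether clearing the whole array is the right fix for BTree.FastBuilder depends on how the builder tracks which slots are live, which we have not verified.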

 

This is based on the following observations:

    Heap Dumps: Analysis of heap dumps shows that leaf().count is often 0, and 
as a result, the buffer is not being cleared, leading to high heap utilization.

!image-2024-07-19-08-47-19-517.png!


    Remote Debugging: Debugging sessions indicate that the drain() method sets 
count to 0, and the inUse flag for the parent branch is set to false, 
preventing the while loop in reset() from clearing the branch buffers.
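
The interaction described in the debugging observation can be modelled with a simplified sketch. The Branch class, its fields, and resetGuarded are hypothetical and only mirror the loop condition quoted earlier; the real BTree.FastBuilder may differ. The sketch assumes some earlier step has already set inUse to false while the buffer still holds references.

```java
import java.util.Arrays;

public class InUseGuardDemo {
    // Hypothetical branch node: a buffer, an inUse flag, and a link to its parent.
    static final class Branch {
        final Object[] buffer = new Object[2];
        boolean inUse;
        Branch parent;
    }

    // Walks up the parent chain clearing buffers, but stops at the first branch
    // that is not marked inUse. Returns how many branches were cleared.
    static int resetGuarded(Branch branch) {
        int cleared = 0;
        while (branch != null && branch.inUse) {
            Arrays.fill(branch.buffer, null);
            branch.inUse = false;
            branch = branch.parent;
            cleared++;
        }
        return cleared;
    }

    // True if any slot of the branch buffer still holds a reference.
    static boolean holdsReferences(Branch branch) {
        for (Object o : branch.buffer) {
            if (o != null) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Branch branch = new Branch();
        branch.buffer[0] = new byte[1024];
        branch.inUse = false; // assumed: already flipped by an earlier step

        int cleared = resetGuarded(branch); // loop body never runs
        System.out.println(cleared + " cleared, still holds references: " + holdsReferences(branch));

        branch.inUse = true; // with the flag set, the same call clears the buffer
        cleared = resetGuarded(branch);
        System.out.println(cleared + " cleared, still holds references: " + holdsReferences(branch));
    }
}
```

In this model the first call clears nothing and the buffer keeps its reference, while the second call clears one branch. It does not prove that the same sequence occurs in Cassandra, only that the guard would behave this way if it did.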



!image-2024-07-19-08-47-34-582.png!

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
