[ 
https://issues.apache.org/jira/browse/CASSANDRA-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andres de la Peña updated CASSANDRA-19336:
------------------------------------------
    Description: 
CASSANDRA-14096 introduced {{repair_session_space}} as a limit on the memory 
used for Merkle tree calculations during repairs. The limit is applied to the 
set of Merkle trees built for a received validation request 
({{VALIDATION_REQ}}), divided by the replication factor so as not to overwhelm 
the repair coordinator, which will have requested RF sets of Merkle trees. That 
way the repair coordinator should only use up to {{repair_session_space}} for 
the RF sets of Merkle trees.
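
For illustration, a minimal sketch of the intended budgeting (the class and 
variable names below are hypothetical, not the code added by CASSANDRA-14096, 
and 256 MiB is just an example value):

{code:java}
// Sketch of the intended repair_session_space budgeting (illustrative only).
public class RepairSessionSpaceIntent
{
    public static void main(String[] args)
    {
        long repairSessionSpaceBytes = 256L * 1024 * 1024; // example repair_session_space
        int rf = 3;                                        // replication factor

        // Each VALIDATION_REQ builds its Merkle trees within 1/RF of the limit...
        long perValidationBudget = repairSessionSpaceBytes / rf;

        // ...so the coordinator, which collects RF sets of trees, should stay within the limit.
        long expectedCoordinatorUsage = rf * perValidationBudget;

        System.out.println("per-validation budget (bytes): " + perValidationBudget);
        System.out.println("expected coordinator usage (bytes): " + expectedCoordinatorUsage);
    }
}
{code}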

However, a repair session without {{-pr}}/{{--partitioner-range}} will send 
RF*RF validation requests: the repair coordinator node has RF-1 other replicas 
of its own range and is itself a replica of RF-1 other nodes, so it repairs RF 
token ranges, each validated by RF replicas. Since all the requests are sent at 
the same time, at some point the repair coordinator can hold up to 
RF*{{repair_session_space}} worth of Merkle trees if none of the validation 
responses is fully processed before the last response arrives.
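
A back-of-the-envelope sketch of that worst case, derived from the numbers 
above (illustrative names and values, not actual Cassandra code):

{code:java}
// Worst case for a full-range repair (no -pr) on a non-vnode cluster (illustrative).
public class WorstCaseWithoutPr
{
    public static void main(String[] args)
    {
        int rf = 3;
        long repairSessionSpaceBytes = 256L * 1024 * 1024; // example value

        int rangesRepaired = rf;                          // coordinator replicates RF token ranges
        int validationRequests = rangesRepaired * rf;     // RF*RF = 9 requests sent at once
        long perResponse = repairSessionSpaceBytes / rf;  // each response capped at 1/RF of the limit

        // If no response is consumed before the last one arrives, all are held simultaneously:
        long worstCaseHeld = validationRequests * perResponse; // RF * repair_session_space
        System.out.println("worst case bytes held: " + worstCaseHeld); // 3x the configured limit
    }
}
{code}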

Even worse, if the cluster uses virtual nodes, many nodes can be replicas of 
the ranges owned by the repair coordinator, and some nodes can be replicas of 
multiple token ranges. This means that the repair coordinator can send even 
more than RF or RF*RF simultaneous validation requests.

For example, in an 11-node cluster with RF=3 and 256 tokens, we have seen a 
repair session involving 44 groups of ranges to be repaired. This produces 
44*3=132 validation requests, contacting every node in the cluster. When the 
responses to all these requests start to arrive at the coordinator, each 
containing up to {{repair_session_space}}/3 worth of Merkle trees, they 
accumulate faster than they are consumed, greatly exceeding 
{{repair_session_space}} and OOMing the node.
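
The arithmetic of that observed session, again as an illustrative sketch 
assuming a 256 MiB {{repair_session_space}}:

{code:java}
// Observed vnode example: 11 nodes, RF=3, 256 tokens, 44 range groups in one session.
public class VnodeRepairExample
{
    public static void main(String[] args)
    {
        int rf = 3;
        int rangeGroups = 44;                               // groups of ranges in the session
        long repairSessionSpaceBytes = 256L * 1024 * 1024;  // example value

        int validationRequests = rangeGroups * rf;          // 44*3 = 132 requests
        long perResponse = repairSessionSpaceBytes / rf;    // up to repair_session_space/3 each

        // If responses accumulate faster than they are consumed, the coordinator can hold
        // close to rangeGroups * repair_session_space worth of Merkle trees (~11 GiB here).
        long worstCaseHeld = validationRequests * perResponse;
        System.out.println(validationRequests + " responses, worst case bytes held: " + worstCaseHeld);
    }
}
{code}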

> Repair causes out of memory
> ---------------------------
>
>                 Key: CASSANDRA-19336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Andres de la Peña
>            Priority: Normal
>


