[ 
https://issues.apache.org/jira/browse/CASSANDRA-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andres de la Peña updated CASSANDRA-19336:
------------------------------------------
    Test and Documentation Plan: 
||Patch||CI||
|[4.0|https://github.com/apache/cassandra/compare/trunk...adelapena:19336-4.0]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/3423/workflows/6dd2bc40-d663-4c38-96d2-1a9d98b531da]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/3423/workflows/d6255d7f-a238-4eb6-93f1-fe373ad567c5]|
|[4.1|https://github.com/apache/cassandra/compare/trunk...adelapena:19336-4.1]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/3424/workflows/7e153df1-c7c3-453d-9003-e1cacaf0d9fb]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/3424/workflows/68503730-3744-4d65-8484-658711d01bf6]|
|[5.0|https://github.com/apache/cassandra/pull/3073]|[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/3417/workflows/9a3ef50e-1616-4bca-b9f9-275eb1ddf5fa]
 
[j17|https://app.circleci.com/pipelines/github/adelapena/cassandra/3417/workflows/43c45c1b-7fa8-48e6-9137-1ed52594b03d]|
|[trunk|https://github.com/apache/cassandra/compare/trunk...adelapena:19336-trunk]|[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/3425/workflows/0a657cb6-f749-4caa-97cd-ed6660736313]
 
[j17|https://app.circleci.com/pipelines/github/adelapena/cassandra/3425/workflows/3c827700-b4a9-4860-94a9-a641dd06dfe1]|

  was:
||PR||CI||
|[5.0|https://github.com/apache/cassandra/pull/3073]|[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/3408/workflows/417096cc-570b-4fb5-b467-08e087c41395]
 
[j17|https://app.circleci.com/pipelines/github/adelapena/cassandra/3408/workflows/273c21bf-42d3-4812-b3f6-d2da64fd5f9a]|


> Repair causes out of memory
> ---------------------------
>
>                 Key: CASSANDRA-19336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Andres de la Peña
>            Assignee: Andres de la Peña
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> CASSANDRA-14096 introduced {{repair_session_space}} as a limit for the memory 
> usage for Merkle tree calculations during repairs. This limit is applied to 
> the set of Merkle trees built for a received validation request 
> ({{VALIDATION_REQ}}), divided by the replication factor so as not to 
> overwhelm the repair coordinator, which will have requested RF sets of Merkle 
> trees. That way the repair coordinator should only use 
> {{repair_session_space}} for the RF Merkle trees.
> However, a repair session without {{-pr}}/{{--partitioner-range}} 
> will send RF*RF validation requests, because the repair coordinator node has 
> RF-1 replicas and is also the replica of RF-1 nodes. Since all the requests 
> are sent at the same time, at some point the repair coordinator can have up 
> to RF*{{repair_session_space}} worth of Merkle trees if none of the 
> validation responses is fully processed before the last response arrives.
> Even worse, if the cluster uses virtual nodes, many nodes can be replicas of 
> the repair coordinator, and some nodes can be replicas of multiple token 
> ranges. This means that the repair coordinator can send more than RF or 
> RF*RF simultaneous validation requests.
> For example, in an 11-node cluster with RF=3 and 256 tokens, we have seen a 
> repair session involving 44 groups of ranges to be repaired. This produces 
> 44*3=132 validation requests contacting all the nodes in the cluster. When 
> the responses for all these requests start to arrive at the coordinator, each 
> containing up to {{repair_session_space}}/3 of Merkle trees, they accumulate 
> faster than they are consumed, greatly exceeding {{repair_session_space}} 
> and OOMing the node.
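
The arithmetic in the description above can be checked with a rough sketch. This is only a back-of-the-envelope upper bound under the assumption stated in the description (no validation response is consumed before the last one arrives); it is not how Cassandra accounts for memory. The function names and the 300 MiB value for {{repair_session_space}} are hypothetical, chosen so the division by RF=3 is exact.

```python
MIB = 1024 * 1024


def per_response_limit(repair_session_space: int, rf: int) -> int:
    """Cap on the Merkle trees built for one validation request.

    Per the description, the limit is divided by the replication factor.
    """
    return repair_session_space // rf


def worst_case_on_coordinator(repair_session_space: int, rf: int,
                              validation_requests: int) -> int:
    """Upper bound on Merkle tree bytes held by the coordinator if every
    response arrives before any of them has been fully processed."""
    return per_response_limit(repair_session_space, rf) * validation_requests


# Hypothetical setting, not a default: 300 MiB for repair_session_space.
space = 300 * MIB
rf = 3

# Repair without -pr: RF*RF simultaneous validation requests.
full_range = worst_case_on_coordinator(space, rf, rf * rf)

# The 11-node, 256-token example: 44 groups of ranges * RF = 132 requests.
vnode_case = worst_case_on_coordinator(space, rf, 44 * rf)

print(full_range // MIB, vnode_case // MIB)  # prints: 900 13200
```

With these numbers the first case reaches RF times the configured limit and the second reaches 44 times the limit, which matches the RF*{{repair_session_space}} bound and the 132-request example in the description.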



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

