[ https://issues.apache.org/jira/browse/CASSANDRA-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812818#comment-17812818 ]
David Capwell edited comment on CASSANDRA-19336 at 1/31/24 5:34 PM:
--------------------------------------------------------------------

bq. I'm not sure if the behaviour proposed by the patch should be done in all cases or if it should depend on an optional flag

I think it should be a new flag, but I am cool with the yaml defining a default (as this is really a cluster problem rather than a single-repair problem).

Skimmed the patch quickly, and it will cause a large performance regression for many users; threads is normally "1" (for this patch that means 1 RepairJob at a time). The patch also only limits the RepairJob and not the RepairSession, so it might make more sense to have higher-level scheduling... we really don't want to wait on RepairJob, as streaming can take a long time; we really just want to wait on Validation, so block concurrent validations.

[~adelapena] I am cool with the general direction of this patch, but think we should make 2 changes:

1) RepairCoordinator should have a "scheduler" to limit RepairJob concurrency (really Validation concurrency).
2) This "scheduler" should be configurable: disabled (the existing behavior), max validations (limits concurrent validations), etc.

I am super flexible here, not saying you must do my design... just that the problem lives higher up than what RepairSession is trying to handle.

bq. I think there isn't much we can do for the case with multiple tables besides repairing the tables sequentially.

Nah, I think we can limit the concurrent validations rather than forcing all tables to be sequential. If we cause a perf regression for "nodetool repair ks", users will just be forced to work around it by creating a coordinator per table (which is an undocumented best practice...), at which point solving this problem gets harder, as you have multiple RepairCoordinators rather than a single one...
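To make the "scheduler" idea concrete, a rough sketch of the shape I have in mind; this is a minimal sketch only, and {{ValidationScheduler}} / {{maxConcurrentValidations}} are made-up names, not existing Cassandra APIs:

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical sketch only -- ValidationScheduler and maxConcurrentValidations
// are invented names, not existing Cassandra APIs. The idea: the coordinator
// acquires a permit per in-flight validation, so only validations are gated
// and streaming is never blocked.
public final class ValidationScheduler
{
    private final Semaphore permits; // null means the feature is disabled

    public ValidationScheduler(int maxConcurrentValidations)
    {
        this.permits = maxConcurrentValidations <= 0 ? null
                                                     : new Semaphore(maxConcurrentValidations);
    }

    // Runs one validation, blocking while too many validations are already in flight.
    public <T> T runValidation(Callable<T> validation) throws Exception
    {
        if (permits == null)
            return validation.call(); // disabled: keep the existing behavior

        permits.acquire();
        try
        {
            return validation.call();
        }
        finally
        {
            permits.release(); // free the slot as soon as this validation finishes
        }
    }
}
{code}

With something like this in RepairCoordinator, "nodetool repair ks" can still fan out across tables while the coordinator caps how many Merkle tree computations are outstanding at once.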
> Repair causes out of memory
> ---------------------------
>
>                 Key: CASSANDRA-19336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Andres de la Peña
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> CASSANDRA-14096 introduced {{repair_session_space}} as a limit on the memory used for Merkle tree calculations during repairs. This limit is applied to the set of Merkle trees built for a received validation request ({{VALIDATION_REQ}}), divided by the replication factor so as not to overwhelm the repair coordinator, which will have requested RF sets of Merkle trees. That way, the repair coordinator should only use {{repair_session_space}} for the RF sets of Merkle trees.
> However, a repair session without {{-pr}}/{{--partitioner-range}} will send RF*RF validation requests, because the repair coordinator node has RF-1 replicas and is also a replica of RF-1 other nodes. Since all the requests are sent at the same time, at some point the repair coordinator can hold up to RF*{{repair_session_space}} worth of Merkle trees if none of the validation responses is fully processed before the last response arrives.
> Even worse, if the cluster uses virtual nodes, many nodes can be replicas of the repair coordinator, and some nodes can be replicas of multiple token ranges, so the repair coordinator can send even more than RF*RF simultaneous validation requests.
> For example, in an 11-node cluster with RF=3 and 256 tokens, we have seen a repair session involving 44 groups of ranges to be repaired. This produces 44*3=132 validation requests contacting all the nodes in the cluster. When the responses to all these requests start to arrive at the coordinator, each containing up to {{repair_session_space}}/3 of Merkle trees, they accumulate faster than they are consumed, greatly exceeding {{repair_session_space}} and OOMing the node.
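> To put numbers on that worst case, a small back-of-the-envelope sketch; the {{repair_session_space}} value below is an assumed example, while the other figures come from the session above:
>
> {code:java}
> // Back-of-the-envelope sketch of the example above: 11 nodes, RF=3, vnodes,
> // 44 range groups. The repair_session_space value is an assumed example.
> public final class MerkleTreeMemoryEstimate
> {
>     public static void main(String[] args)
>     {
>         int rf = 3;
>         int rangeGroups = 44;               // observed in the repair session above
>         long repairSessionSpaceMiB = 256;   // assumed repair_session_space
>
>         int validationRequests = rangeGroups * rf;               // 44 * 3 = 132
>         long perResponseMiB = repairSessionSpaceMiB / rf;        // each response: up to limit/RF = 85 MiB
>         long worstCaseMiB = validationRequests * perResponseMiB; // 132 * 85 = 11,220 MiB
>
>         System.out.printf("%d responses x %d MiB = %d MiB, ~%dx repair_session_space%n",
>                           validationRequests, perResponseMiB, worstCaseMiB,
>                           worstCaseMiB / repairSessionSpaceMiB);
>     }
> }
> {code}
>
> With these numbers, the coordinator can briefly hold roughly 43 times the configured limit, which matches the observed OOM.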