Caleb Rackliffe created CASSANDRA-17324:
-------------------------------------------

             Summary: Allow node to reject internode messages that create work for the MUTATION stage
                 Key: CASSANDRA-17324
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17324
             Project: Cassandra
          Issue Type: Improvement
          Components: Messaging/Internode
            Reporter: Caleb Rackliffe
            Assignee: Caleb Rackliffe


When a node is struggling under a compaction backlog and starts driving up read 
latency for clients, we have two safety valves:

1.) Disabling the native protocol server, which stops the node from 
coordinating reads and writes.
2.) Jacking up the severity on the node, which tells the dynamic snitch to 
avoid the node for reads from other coordinators.

These are useful, but we don’t appear to have any mechanism that would allow us 
to temporarily reject internode hint, batch, and mutation messages that could 
further delay resolution of the compaction backlog. There is a parameter in 
{{cassandra.yaml}} called {{hinted_handoff_throttle}} (formerly 
{{hinted_handoff_throttle_in_kb}}) that allows us to control the rate at which 
we read hints before they are delivered, but how fast that should happen and 
whether it should happen at all are two different questions.
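To make the "how fast" versus "whether at all" distinction concrete, here is a 
minimal Java sketch. It assumes a simple token-bucket throttle in bytes per 
second (standing in for something like {{hinted_handoff_throttle}}; this is not 
how Cassandra's hint dispatcher is actually written) plus an independent on/off 
gate. The class {{HintDeliveryControl}} and all of its methods are hypothetical. 
Time is passed in explicitly so the behaviour is deterministic.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical model of hint delivery control (not a Cassandra class).
 * The token bucket answers "how fast?"; the gate answers "at all?".
 */
final class HintDeliveryControl {
    private final AtomicBoolean deliveryEnabled = new AtomicBoolean(true); // "whether at all"
    private final long bytesPerSecond;                                     // "how fast"
    private double availableBytes; // current bucket level, capped at one second of budget
    private long lastRefillNanos;

    HintDeliveryControl(long bytesPerSecond, long nowNanos) {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;
        this.lastRefillNanos = nowNanos;
    }

    /** The operator-facing switch the issue asks for; independent of the rate. */
    void setDeliveryEnabled(boolean enabled) {
        deliveryEnabled.set(enabled);
    }

    /**
     * Returns true if a hint of hintBytes may be sent at nowNanos.
     * Simplification: a hint larger than one second of budget never passes.
     */
    synchronized boolean tryDeliver(long hintBytes, long nowNanos) {
        if (!deliveryEnabled.get()) {
            return false; // gate closed: nothing is sent, regardless of the rate
        }
        refill(nowNanos);
        if (availableBytes < hintBytes) {
            return false; // gate open, but this hint must wait for the bucket to refill
        }
        availableBytes -= hintBytes;
        return true;
    }

    private void refill(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1e9;
        availableBytes = Math.min(bytesPerSecond, availableBytes + elapsedSeconds * bytesPerSecond);
        lastRefillNanos = nowNanos;
    }
}
```

Keeping the two knobs separate means the throttle can stay at its normal value 
while the gate is closed, and delivery resumes at that rate as soon as the gate 
reopens.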

The proposal here is to add this rejection mechanism, expose it via JMX, and add 
whatever metrics and logging are needed to make its effects visible. (For 
example, hint delivery already has metrics for successes, failures, and 
timeouts, which would help here.) The error-handling paths for hints, writes, 
and batches should already be able to handle one more type of error (i.e. “that 
replica is overloaded”), but some rate-limited (“non-spammy”) logging around it 
probably wouldn’t hurt.
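A minimal sketch of the operator-facing side, assuming a standard-MBean-style 
interface (getters and setters that JMX would expose as attributes) plus 
per-type rejection counters and rate-limited logging. 
{{MutationRejectionControlMBean}}, {{MutationRejectionControl}}, 
{{RejectedKind}}, and every method on them are hypothetical. Registration via 
{{ManagementFactory.getPlatformMBeanServer().registerMBean(...)}} is left out so 
the sketch runs standalone, and a real patch would more likely reuse Cassandra's 
existing {{NoSpamLogger}} and metrics classes than the hand-rolled throttling 
shown here.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Consumer;

/** Hypothetical JMX-facing surface; names follow standard MBean getter/setter conventions. */
interface MutationRejectionControlMBean {
    boolean isRejecting();
    void setRejecting(boolean rejecting);
    long getRejectedHints();
    long getRejectedMutations();
    long getRejectedBatches();
}

/** The message types the issue proposes to reject (hypothetical enum). */
enum RejectedKind { HINT, MUTATION, BATCH }

final class MutationRejectionControl implements MutationRejectionControlMBean {
    private final AtomicBoolean rejecting = new AtomicBoolean(false);
    private final Map<RejectedKind, LongAdder> rejected = new EnumMap<>(RejectedKind.class);
    private final AtomicLong nextLogAtNanos = new AtomicLong(Long.MIN_VALUE);
    private final LongAdder suppressedSinceLastLog = new LongAdder();
    private final long logIntervalNanos;
    private final Consumer<String> log; // stand-in for a real logger

    MutationRejectionControl(long logInterval, TimeUnit unit, Consumer<String> log) {
        for (RejectedKind kind : RejectedKind.values()) {
            rejected.put(kind, new LongAdder());
        }
        this.logIntervalNanos = unit.toNanos(logInterval);
        this.log = log;
    }

    @Override public boolean isRejecting() { return rejecting.get(); }

    @Override public void setRejecting(boolean value) {
        rejecting.set(value);
        // Operator toggles are rare and important, so they are always logged.
        log.accept("Rejection of hint/batch/mutation messages " + (value ? "enabled" : "disabled"));
    }

    @Override public long getRejectedHints() { return rejected.get(RejectedKind.HINT).sum(); }
    @Override public long getRejectedMutations() { return rejected.get(RejectedKind.MUTATION).sum(); }
    @Override public long getRejectedBatches() { return rejected.get(RejectedKind.BATCH).sum(); }

    /** Counts every rejection but emits at most one log line per interval. */
    void recordRejection(RejectedKind kind, long nowNanos) {
        rejected.get(kind).increment();
        long next = nextLogAtNanos.get();
        if (nowNanos >= next && nextLogAtNanos.compareAndSet(next, nowNanos + logIntervalNanos)) {
            // The suppressed count is approximate under concurrency; fine for a log line.
            long suppressed = suppressedSinceLastLog.sumThenReset();
            log.accept("Rejected " + kind + " message: replica overloaded ("
                    + suppressed + " more rejections since the last log line)");
        } else {
            suppressedSinceLastLog.increment();
        }
    }
}
```

Counting every rejection while logging at most once per interval keeps the 
metrics exact even when most log lines are suppressed.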

On the implementation side, one idea that would minimize the amount of surgery 
we need to do is to make the decision about whether to send back a failure 
message directly in {{InboundSink}}. This would avoid duplicating the logic 
across multiple downstream handlers.
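A minimal sketch of the single-choke-point idea, assuming a simplified message 
model rather than Cassandra's actual {{Message}}, {{Verb}}, or {{InboundSink}} 
classes. {{SketchVerb}}, {{SketchMessage}}, and {{OverloadAwareSink}} are 
hypothetical, and the failure reply is just a callback. Because the check sits in 
front of every handler, no individual handler needs to know about the rejection 
switch.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Simplified stand-in for internode message types (hypothetical). */
enum SketchVerb { READ, MUTATION, HINT, BATCH_STORE }

/** Minimal message: a type plus an id a failure response could refer to. */
final class SketchMessage {
    final SketchVerb verb;
    final long id;

    SketchMessage(SketchVerb verb, long id) {
        this.verb = verb;
        this.id = id;
    }
}

/**
 * Hypothetical choke point in the spirit of deciding in InboundSink: every inbound
 * message passes through accept() before any per-verb handler sees it.
 */
final class OverloadAwareSink {
    // The message types the issue proposes to reject while the node is overloaded.
    private static final Set<SketchVerb> REJECTABLE =
            EnumSet.of(SketchVerb.MUTATION, SketchVerb.HINT, SketchVerb.BATCH_STORE);

    private final BooleanSupplier rejecting;            // e.g. backed by a JMX-controlled flag
    private final Consumer<SketchMessage> downstream;   // the normal per-verb handlers
    private final Consumer<SketchMessage> replyFailure; // sends an "overloaded" failure to the sender

    OverloadAwareSink(BooleanSupplier rejecting,
                      Consumer<SketchMessage> downstream,
                      Consumer<SketchMessage> replyFailure) {
        this.rejecting = rejecting;
        this.downstream = downstream;
        this.replyFailure = replyFailure;
    }

    void accept(SketchMessage message) {
        if (rejecting.getAsBoolean() && REJECTABLE.contains(message.verb)) {
            // Reply immediately; the sender's existing failure handling takes over.
            replyFailure.accept(message);
            return;
        }
        downstream.accept(message); // reads and all other traffic are unaffected
    }
}
```

Doing the check before dispatch also means rejected messages never create work 
for the MUTATION stage at all, which is exactly the work the node is trying to 
shed.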



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
