[ 
https://issues.apache.org/jira/browse/CASSANDRA-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-17324:
----------------------------------------
    Change Category: Operability
         Complexity: Normal
      Fix Version/s: 4.x
             Status: Open  (was: Triage Needed)

> Allow node to reject internode messages that create work for the MUTATION 
> stage
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17324
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17324
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Internode
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.x
>
>
> When a node is struggling under the weight of a compaction backlog and 
> becomes a cause of increased read latency for clients, we have two safety 
> valves:
> 1.) Disabling the native protocol server, which stops the node from 
> coordinating reads and writes.
> 2.) Jacking up the severity on the node, which tells the dynamic snitch to 
> avoid the node for reads from other coordinators.
>
> These are useful, but we don’t appear to have any mechanism that would allow 
> us to temporarily reject internode hint, batch, and mutation messages that 
> could further delay resolution of the compaction backlog. There is a 
> parameter in {{cassandra.yaml}} called {{hinted_handoff_throttle}} (formerly 
> {{hinted_handoff_throttle_in_kb}}) that allows us to control the rate at 
> which we read hints before they are delivered, but how fast that should 
> happen and whether it should happen at all are two different questions.
>
> The proposal here is to add this rejection mechanism and publish it via JMX, 
> along with any metrics and logging that would be necessary to make its 
> effects visible. (For example, hint delivery already has metrics for success, 
> failure, and timeouts, which would be helpful here.) The 
> error-handling pathways for hints, writes, and batches should already be 
> capable of handling one more type of error (i.e. “that replica is 
> overloaded”), but some non-spammy logging around that probably wouldn’t hurt.
> On the implementation side, one idea that would minimize the amount of 
> surgery is to decide whether to send back a failure message directly in 
> {{InboundSink}}. This would avoid having to duplicate the logic in multiple 
> downstream handlers.
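
The single-decision-point idea in the description can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual Cassandra code: the Gate class, its rejecting flag, the rejectedCount() counter, and the failureResponder callback are hypothetical names, and the Verb enum is a local stand-in rather than org.apache.cassandra.net.Verb. The real InboundSink has its own structure, which this sketch does not try to reproduce.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

final class RejectingSinkSketch {
    // Local stand-in for the internode verbs; not the real Cassandra enum.
    enum Verb { MUTATION_REQ, HINT_REQ, BATCH_STORE_REQ, READ_REQ }

    // Minimal hypothetical message: just a verb and an opaque payload.
    static final class Message {
        final Verb verb;
        final String payload;

        Message(Verb verb, String payload) {
            this.verb = verb;
            this.payload = payload;
        }
    }

    // Hypothetical gate that sits in front of all downstream handlers.
    static final class Gate implements Consumer<Message> {
        // Verbs assumed (for this sketch) to create work for the MUTATION stage.
        private static final Set<Verb> MUTATION_STAGE_VERBS =
            EnumSet.of(Verb.MUTATION_REQ, Verb.HINT_REQ, Verb.BATCH_STORE_REQ);

        // In the proposal this switch would be exposed via JMX.
        private volatile boolean rejecting;
        // In the proposal this count would back a metric.
        private final AtomicLong rejected = new AtomicLong();
        private final Consumer<Message> downstream;
        private final Consumer<Message> failureResponder;

        Gate(Consumer<Message> downstream, Consumer<Message> failureResponder) {
            this.downstream = downstream;
            this.failureResponder = failureResponder;
        }

        void setRejecting(boolean value) { rejecting = value; }

        long rejectedCount() { return rejected.get(); }

        @Override
        public void accept(Message message) {
            // One check here covers hints, batches, and mutations alike,
            // so no downstream handler needs its own copy of the logic.
            if (rejecting && MUTATION_STAGE_VERBS.contains(message.verb)) {
                rejected.incrementAndGet();
                failureResponder.accept(message); // e.g. reply "replica overloaded"
                return;
            }
            downstream.accept(message);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Gate gate = new Gate(m -> log.add("handled " + m.verb),
                             m -> log.add("rejected " + m.verb));

        gate.accept(new Message(Verb.MUTATION_REQ, "w1")); // passes: gate is open
        gate.setRejecting(true);
        gate.accept(new Message(Verb.HINT_REQ, "h1"));     // rejected
        gate.accept(new Message(Verb.READ_REQ, "r1"));     // reads are unaffected

        System.out.println(log + " rejected=" + gate.rejectedCount());
    }
}
```

Reads still flow while the switch is on, which matches the description's intent of rejecting only hint, batch, and mutation messages; how the coordinator reacts to the failure response is left to the existing error-handling paths.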



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
