[ 
https://issues.apache.org/jira/browse/CASSANDRA-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484003#comment-17484003
 ] 

Caleb Rackliffe commented on CASSANDRA-17324:
---------------------------------------------

tl;dr Disabling gossip is a big hammer.

If we stop gossiping and the node is marked down, we'll cut off schema, 
repair, and read messages entirely. Lumping in reads is probably the most 
problematic part, since it makes read unavailables more likely, and somebody 
might get paged ;)

(i.e., a read against a node trying to dig itself out of compaction problems 
might still be better than no read at all.)
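
To make the contrast with disabling gossip concrete, here is a rough sketch 
of a filter that rejects only the message types that create work for the 
MUTATION stage and lets everything else through. All names in it 
(MessageKind, WriteRejectionFilter, setRejecting, rejectedCount) are 
hypothetical stand-ins for illustration, not existing Cassandra classes or 
APIs, and it assumes a single on/off flag is all an operator needs.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for internode message types; not Cassandra's real Verb enum.
enum MessageKind { MUTATION, HINT, BATCH, READ, SCHEMA, REPAIR, GOSSIP }

// Hypothetical inbound filter: while enabled, it turns away only write-path messages.
final class WriteRejectionFilter {
    // The message types assumed to create work for the MUTATION stage.
    private static final Set<MessageKind> WRITE_PATH =
        EnumSet.of(MessageKind.MUTATION, MessageKind.HINT, MessageKind.BATCH);

    // Assumption: an operator would flip this flag at runtime (e.g. through a JMX setter).
    private volatile boolean rejecting = false;

    // Count of rejected messages; could back a metric so the effect is visible.
    private final AtomicLong rejected = new AtomicLong();

    void setRejecting(boolean value) { rejecting = value; }

    long rejectedCount() { return rejected.get(); }

    // Returns true if the message should be processed normally, or false if
    // the caller should send a failure response back to the sender instead.
    boolean accept(MessageKind kind) {
        if (rejecting && WRITE_PATH.contains(kind)) {
            rejected.incrementAndGet();
            return false;
        }
        return true;
    }
}
```

With rejection turned on, accept(MessageKind.MUTATION) returns false while 
accept(MessageKind.READ) still returns true, so reads, schema, and repair 
traffic keep flowing, which is the part that disabling gossip can't give us.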

> Allow node to reject internode messages that create work for the MUTATION 
> stage
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17324
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17324
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Internode
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.x
>
>
> When a node is struggling under the weight of a compaction backlog and 
> becomes a cause of increased read latency for clients, we have two safety 
> valves:
> 1.) Disabling the native protocol server, which stops the node from 
> coordinating reads and writes.
> 2.) Jacking up the severity on the node, which tells the dynamic snitch to 
> avoid the node for reads from other coordinators.
> These are useful, but we don’t appear to have any mechanism that would allow 
> us to temporarily reject internode hint, batch, and mutation messages that 
> could further delay resolution of the compaction backlog. There is a 
> parameter in {{cassandra.yaml}} called {{hinted_handoff_throttle}} (formerly 
> {{hinted_handoff_throttle_in_kb}}) that allows us to control the rate at 
> which we read hints before they are delivered, but how fast that should 
> happen and whether it should happen at all are two different questions.
> The proposal here is to add this rejection mechanism and expose it via JMX, 
> along with any metrics and logging that would be necessary to make its 
> effects visible. (E.g., hint delivery already has metrics for successes, 
> failures, and timeouts, which would be helpful here.) The 
> error-handling pathways for hints, writes, and batches should already be 
> capable of handling one more type of error (i.e., “that replica is 
> overloaded”), but some non-spammy logging around that probably wouldn’t hurt.
> In implementation space, one idea that would minimize the amount of surgery 
> we need to do is making the decision around whether to send back a failure 
> message directly in {{InboundSink}}. This would avoid having to duplicate the 
> logic in multiple downstream handlers.



