[ https://issues.apache.org/jira/browse/CASSANDRA-17324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484008#comment-17484008 ]

Brandon Williams commented on CASSANDRA-17324:
----------------------------------------------

This makes sense, then. Reads can be avoided since all replicas aren't needed, but writes don't have that luxury.

> Allow node to reject internode messages that create work for the MUTATION
> stage
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17324
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17324
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Internode
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.x
>
>
> When a node is struggling under the weight of a compaction backlog and
> becomes a cause of increased read latency for clients, we have two safety
> valves:
>
> 1.) Disabling the native protocol server, which stops the node from
> coordinating reads and writes.
>
> 2.) Jacking up the severity on the node, which tells the dynamic snitch to
> avoid the node for reads from other coordinators.
>
> These are useful, but we don’t appear to have any mechanism that would allow
> us to temporarily reject internode hint, batch, and mutation messages that
> could further delay resolution of the compaction backlog. There is a
> parameter in {{cassandra.yaml}} called {{hinted_handoff_throttle}} (formerly
> {{hinted_handoff_throttle_in_kb}}) that allows us to control the rate at
> which we read hints before they are delivered, but how fast that should
> happen and whether it should happen at all are two different questions.
>
> The proposal here is to add this rejection mechanism and publish it via JMX,
> along with any metrics and logging that would be necessary to make its
> effects visible. (Ex. Hint delivery already has metrics around success,
> failure, and timeouts, which would be helpful around this.) The
> error-handling pathways for hints, writes, and batches should already be
> capable of handling one more type of error (i.e. “that replica is
> overloaded”), but some non-spammy logging around that probably wouldn’t hurt.
>
> In implementation space, one idea that would minimize the amount of surgery
> we need to do is making the decision around whether to send back a failure
> message directly in {{InboundSink}}. This would avoid having to duplicate the
> logic in multiple downstream handlers.
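
For illustration only, the idea of deciding in one place whether to reject MUTATION-stage work might look roughly like the sketch below. Every name in it (MutationGate, the simplified Verb and Decision enums, setRejecting) is hypothetical and does not correspond to Cassandra's actual InboundSink, Verb, or JMX interfaces; it only shows a single operator-controlled flag gating hint, batch, and mutation messages while letting other traffic through, with a counter that a metric could be built on.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; none of these names are Cassandra's real classes. */
final class MutationGate {
    /** Simplified stand-ins for the internode message types named in the ticket. */
    enum Verb { MUTATION, HINT, BATCH_STORE, READ }

    /** Outcome for one inbound message. */
    enum Decision { ACCEPT, REJECT_OVERLOADED }

    // Message types assumed (for this sketch) to create work for the MUTATION stage.
    private static final Set<Verb> GATED =
            EnumSet.of(Verb.MUTATION, Verb.HINT, Verb.BATCH_STORE);

    // Operator-controlled switch; in a real system this might be exposed via JMX.
    private final AtomicBoolean rejecting = new AtomicBoolean(false);

    // Count of rejected messages, so the effect of the switch stays visible.
    private final AtomicLong rejectedCount = new AtomicLong();

    void setRejecting(boolean value) { rejecting.set(value); }

    boolean isRejecting() { return rejecting.get(); }

    long rejectedCount() { return rejectedCount.get(); }

    /** Single decision point, so downstream handlers need no duplicate checks. */
    Decision onInbound(Verb verb) {
        if (rejecting.get() && GATED.contains(verb)) {
            rejectedCount.incrementAndGet();
            return Decision.REJECT_OVERLOADED;
        }
        return Decision.ACCEPT;
    }
}
```

With the flag off, every message is accepted. With it on, only the gated types are rejected, so reads addressed to the node are unaffected by this switch and remain governed by the existing severity and dynamic snitch mechanism.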