[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15388282#comment-15388282 ]

Sergio Bossa commented on CASSANDRA-9318:
-----------------------------------------

[~Stefania],

bq. We just need to decide if we are happy having a back pressure state for 
each OutboundTcpConnectionPool, even though back pressure may be disabled or 
some strategies may use different metrics, e.g. total in-flight requests or 
memory, or whether the states should be hidden inside the strategy.

As I said, if you move the states inside the strategy, you make the strategy 
"fatter", and I don't like that. I think keeping the states associated with the 
connections (as a surrogate for the actual nodes) makes sense, because that's 
what back-pressure is about; any strategy that doesn't take per-node state into 
account is IMHO not a proper back-pressure strategy, and can be implemented in 
the same way as hints overflowing is implemented (that is, outside the 
back-pressure APIs).

Anyway, I've made the {{BackPressureState}} interface more generic so it can 
react to messages when they're about to be sent, which should make it possible 
to implement the different strategies you mentioned. This should also make the 
current {{RateBased}} implementation clearer (see below), which is nice :)
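
For the sake of discussion, the interface is now roughly shaped like this 
(illustrative sketch only; the names may not match the branch exactly):

{code:java}
// Illustrative sketch, not necessarily the exact signatures in the branch:
// a per-replica state notified at each step of the message lifecycle, so
// strategies other than RateBased can keep whatever bookkeeping they need.
import java.net.InetAddress;
import org.apache.cassandra.net.MessageOut;

public interface BackPressureState
{
    /** Called when an outgoing message is about to be sent to the replica. */
    void onMessageSent(MessageOut<?> message);

    /** Called when a response is received from the replica. */
    void onResponseReceived();

    /** Called when the callback for an outgoing message expires. */
    void onResponseTimeout();

    /** The replica this state refers to. */
    InetAddress getHost();
}
{code}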

bq. Now the back pressure is applied before hinting, or inserting local 
mutations, or sending to remote data centers, but after sending to local 
replicas. I think we may need to rework SP.sendToHintedEndpoints a little bit, 
I think we want to fire the insert local, then block, then send all messages.

Apologies, the {{SP}} method went through several rewrites in our back and 
forth and I didn't pay enough attention during the last one; totally stupid 
mistake. It should be fixed now, but I've kept the back-pressure application 
on top, because if the strategy implementation wants to terminate the request 
straight away (i.e. by throwing an exception), it doesn't make sense to 
partially send messages (that is, either all or nothing is IMHO better).
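
To clarify the ordering, the write path now does roughly the following 
(illustrative sketch, not the actual {{SP.sendToHintedEndpoints}} body; 
{{applyBackPressure}} and {{applyLocally}} are made-up helper names):

{code:java}
// Back pressure is applied before anything is sent, so a strategy that aborts
// the request (e.g. by throwing) never leaves a partially sent mutation behind.
applyBackPressure(remoteTargets);   // may block, or throw to abort: nothing sent yet
applyLocally(mutation, handler);    // fire the local insert
for (InetAddress target : remoteTargets)
    MessagingService.instance().sendRR(mutation.createMessage(), target, handler);
{code}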

bq. There is one more potential issue in the case of non local replicas. We 
send the mutation only to one remote replica, which then forwards it to other 
replicas in that DC. So remote replicas may have the outgoing rate set to zero, 
and therefore the limiter rate set to positive infinity, which means we won't 
throttle at all with FAST flow selected.

The {{RateBased}} implementation doesn't "count" outgoing requests when they're 
sent, which would indeed cause the bug you mentioned; it counts them when the 
response is received or the callback expires, in order to guarantee consistent 
counting between outgoing and incoming messages: if outgoing messages outnumber 
incoming ones, we know that's because of timeouts, not because some requests 
are still in-flight. This makes the algorithm way more stable, and it also 
overcomes the problem you mentioned. The latest changes to the state interface 
should hopefully make this implementation detail clearer.
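
In code, the counting looks roughly like this (illustrative sketch; 
{{outgoing}} and {{incoming}} stand for per-replica counters, and the actual 
{{RateBased}} code differs in detail):

{code:java}
// Outgoing is only bumped once we learn the fate of a request, never at send
// time, so (outgoing - incoming) equals the number of timed-out requests
// rather than the number of requests still in flight.
private final AtomicLong outgoing = new AtomicLong();   // java.util.concurrent.atomic
private final AtomicLong incoming = new AtomicLong();

void onResponseReceived()
{
    outgoing.incrementAndGet();   // the request completed...
    incoming.incrementAndGet();   // ...and the replica actually answered
}

void onResponseTimeout()
{
    outgoing.incrementAndGet();   // the request completed by expiring, no answer
}
{code}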

bq. In fact, we have no way of reliably knowing the status of remote replicas, 
or replicas to which we haven't sent any mutations recently, and we should 
perhaps exclude them.

This can only happen when a new replica is injected into the system, or when 
one comes back alive after being dead; otherwise we always send mutations to 
all replicas, or am I missing something? There's no lock-free way to remove 
such replicas, because "0 outgoing" is a very transient state, and that's 
actually the point: the algorithm is designed to work at steady state, and 
after a single "window" (rpc timeout seconds) the replica state will have 
received enough updates.
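
To put some made-up numbers on it, assuming an rpc timeout of 2 seconds, so 
one window is 2 seconds (purely illustrative, not taken from the patch or any 
test run):

{code:java}
// One window worth of counters for a single replica (numbers are made up).
double windowSeconds = 2.0;                    // one "window" = rpc timeout
double outgoingRate  = 100 / windowSeconds;    // 50 completed requests/sec
double incomingRate  =  80 / windowSeconds;    // 40 acknowledged requests/sec
double ratio = incomingRate / outgoingRate;    // 0.8 -> 20% of requests timed out
// A replica that has just joined (or just come back) sits at 0/0 for at most
// one such window: as soon as mutations start flowing to it again, both
// counters become meaningful and the ratio above is well defined.
{code}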

bq. I think we can start testing once we have sorted the second discussion 
point above, as for the API issues, we'll eventually need to reach consensus 
but we don't need to block the tests for it.

Please have a look at my changes. I will do one more review pass myself, as I 
think some aspects of the rate-based algorithm can be improved; then, if we're 
all happy, I will go ahead with rebasing to trunk and move on to testing.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, 
> limit.btm, no_backpressure.png
>
>
> It's possible to somewhat bound the amount of load accepted into the cluster 
> by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding 
> bytes and requests and if it reaches a high watermark disable read on client 
> connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't 
> introduce other issues.
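
For reference, a minimal sketch of the watermark idea described in the summary 
above (names, thresholds and the Netty wiring are all hypothetical; this is 
not code from the attached patch):

{code:java}
// Hypothetical sketch of the high/low watermark idea: stop reading from client
// connections once too many bytes are in flight, resume once enough drain out.
private static final long HIGH_WATERMARK_BYTES = 64L * 1024 * 1024;
private static final long LOW_WATERMARK_BYTES  = 32L * 1024 * 1024;
private final AtomicLong inFlightBytes = new AtomicLong();

void onRequestDispatched(long bytes, io.netty.channel.Channel clientChannel)
{
    if (inFlightBytes.addAndGet(bytes) >= HIGH_WATERMARK_BYTES)
        clientChannel.config().setAutoRead(false);   // stop reading new requests
}

void onRequestCompleted(long bytes, io.netty.channel.Channel clientChannel)
{
    if (inFlightBytes.addAndGet(-bytes) <= LOW_WATERMARK_BYTES)
        clientChannel.config().setAutoRead(true);    // resume reading
}
{code}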



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
