[ https://issues.apache.org/jira/browse/CASSANDRA-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373624#comment-15373624 ]
Sergio Bossa edited comment on CASSANDRA-9318 at 7/12/16 8:23 PM:
------------------------------------------------------------------

[~Stefania],

bq. if a message expires before it is sent, we consider this negatively for that replica, since we increment the outgoing rate but not the incoming rate when the callback expires, and still it may have nothing to do with the replica if the message was not sent, it may be due to the coordinator dealing with too many messages.

Right, but there isn't much we can do without far more invasive changes. Anyway, I don't think that's actually a problem: if the coordinator is overloaded we'll end up generating too many hints and fail with {{OverloadedException}} (this time with its original meaning), so we should be covered.

bq. I also observe that if a replica has a low rate, then we may block when acquiring the limiter, and this will indirectly throttle all following replicas, even if they were ready to receive mutations sooner.

See my answer at the end.

bq. AbstractWriteResponseHandler sets the start time in the constructor, so the time spent acquiring a rate limiter for slow replicas counts towards the total time before the coordinator throws a write timeout exception.

See my answer at the end.

bq. SP.sendToHintedEndpoints(), we should apply backpressure only if the destination is alive.

I know. I'm holding off on these changes until we settle on a plan for the whole write path (in terms of what to do with the CL, the exception to be thrown, etc.).

bq. Let's use UnavailableException since WriteFailureException indicates a non-timeout failure when processing a mutation, and so it is not appropriate for this case.

For protocol V4 we cannot change UnavailableException, but for V5 we should add a new parameter to it. At the moment it contains <cl><required><alive>; we should add the number of overloaded replicas, so that drivers can treat the two cases differently.
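A minimal encoding sketch of what such an extended V5 payload could look like, with hypothetical names (this is not driver or server code, just the current <cl><required><alive> body plus the proposed overloaded count):

```java
import java.nio.ByteBuffer;

// Sketch of the proposed V5 Unavailable payload (hypothetical layout):
// today's body is <cl><required><alive>; the proposal appends the number of
// overloaded replicas so drivers can distinguish "not enough replicas alive"
// from "replicas alive but backpressured".
public class UnavailablePayloadSketch {

    static ByteBuffer encodeV5(short consistencyCode, int required, int alive, int overloaded) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + 4 + 4);
        buf.putShort(consistencyCode); // <cl>: consistency level code
        buf.putInt(required);          // <required>: replicas required by the CL
        buf.putInt(alive);             // <alive>: replicas known to be alive
        buf.putInt(overloaded);        // proposed: replicas currently backpressured
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        // e.g. QUORUM (code 0x0004) needing 2 replicas, 2 alive but 1 overloaded:
        ByteBuffer body = encodeV5((short) 0x0004, 2, 2, 1);
        System.out.println(body.remaining()); // 14 bytes
    }
}
```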
Does it mean we should advance the protocol version in this issue, or delegate that to a new issue?

bq. Marking messages as throttled would let the replica know if backpressure was enabled, that's true, but it also makes the existing mechanism even more complex.

How so? In implementation terms, it should be literally as easy as:
1) Add a byte parameter to {{MessageOut}}.
2) Read that byte parameter from {{MessageIn}} and, if it is set, skip dropping the message replica-side.
3) If possible (I didn't check), when a "late" response is received on the coordinator, try to cancel the related hint.
Do you see any complexity I'm missing there?

bq. dropping mutations that have been in the queue for longer than the RPC write timeout is done not only to shed load on the replica, but also to avoid wasting resources to perform a mutation when the coordinator has already returned a timeout exception to the client.

This is very true, and that's why I said it's a bit of a wild idea. Obviously, that is true outside of back-pressure too, as even now it is possible to return a write timeout to clients and still have some or all mutations applied. In the end, it might be good to optionally enable such behaviour: the advantage would be increased consistency at the expense of more resource consumption, which is a tradeoff some users might want to make. To be clear, I'm not strictly lobbying to implement it, just trying to reason about pros and cons.

bq. I still have concerns regarding additional write timeout exceptions and whether an overloaded or slow replica can slow everything down.

These are valid concerns of course, and given similar concerns from [~jbellis], I'm working on some changes to avoid write timeouts due to healthy replicas unnaturally throttled by unhealthy ones, and, depending on [~jbellis]'s answer to my last comment above, maybe only actually apply back-pressure if the CL is not met. Stay tuned.
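The "healthy replicas throttled by unhealthy ones" concern can be illustrated with a toy sketch (hypothetical names, not Cassandra's actual rate limiting): if the coordinator blocks on a per-replica limiter that has run dry, every replica after it in the loop is delayed too; a non-blocking acquire lets the coordinator skip (and hint) only the overloaded replica.

```java
import java.util.*;

// Toy sketch: a coordinator iterating over replicas and consulting a
// per-replica rate limiter before each send. Using a non-blocking
// tryAcquire() means an exhausted limiter for one replica never causes
// head-of-line blocking for the healthy replicas that follow it.
public class BackpressureSketch {

    /** Toy token bucket; refills ratePerSec tokens per second (uncapped, for brevity). */
    static final class TokenBucket {
        private final double ratePerSec;
        private double tokens;
        private long lastRefillNanos = System.nanoTime();

        TokenBucket(double ratePerSec, double initialTokens) {
            this.ratePerSec = ratePerSec;
            this.tokens = initialTokens;
        }

        /** Non-blocking: take a token if available, else report failure. */
        synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            tokens += (now - lastRefillNanos) / 1e9 * ratePerSec;
            lastRefillNanos = now;
            if (tokens >= 1.0) { tokens -= 1.0; return true; }
            return false;
        }
    }

    /**
     * Sends to each replica without ever blocking on an exhausted limiter:
     * overloaded replicas are skipped (a hint would be written instead),
     * so healthy replicas are not throttled by unhealthy ones.
     */
    static List<String> sendRound(Map<String, TokenBucket> limiters) {
        List<String> sentTo = new ArrayList<>();
        for (Map.Entry<String, TokenBucket> e : limiters.entrySet()) {
            if (e.getValue().tryAcquire())
                sentTo.add(e.getKey());   // mutation goes out
            // else: skip / hint; do NOT block the whole loop
        }
        return sentTo;
    }

    public static void main(String[] args) {
        Map<String, TokenBucket> limiters = new LinkedHashMap<>();
        limiters.put("replica1", new TokenBucket(1000, 1));   // healthy
        limiters.put("replica2", new TokenBucket(0.001, 0));  // overloaded, no tokens
        limiters.put("replica3", new TokenBucket(1000, 1));   // healthy
        System.out.println(sendRound(limiters)); // [replica1, replica3]
    }
}
```

A blocking acquire() in the same loop would instead make replica3 wait behind replica2's near-empty bucket, which is exactly the indirect throttling described above.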
> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-9318
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9318
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm, no_backpressure.png
>
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests and, if it reaches a high watermark, disable read on client connections until it goes back below some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other issues.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
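The watermark mechanism sketched in the issue description could look roughly like this (hypothetical names, not Cassandra code): track in-flight request bytes, stop reading from client connections above a high watermark, and resume below a low watermark.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the high/low watermark scheme from the issue description.
// The gap between the two watermarks provides hysteresis, so reads do not
// flap on and off around a single threshold. Real code would also need to
// handle concurrent requests racing around the watermarks.
public class InflightLimiter {
    private final long lowWatermark, highWatermark;
    private final AtomicLong inflightBytes = new AtomicLong();
    private volatile boolean readPaused = false;

    InflightLimiter(long low, long high) {
        this.lowWatermark = low;
        this.highWatermark = high;
    }

    /** Called when a request is admitted from a client connection. */
    void onRequestStart(long bytes) {
        if (inflightBytes.addAndGet(bytes) > highWatermark)
            readPaused = true;   // e.g. channel.config().setAutoRead(false) in Netty terms
    }

    /** Called when the response has been flushed back to the client. */
    void onRequestDone(long bytes) {
        if (inflightBytes.addAndGet(-bytes) < lowWatermark)
            readPaused = false;  // resume reading client connections
    }

    boolean isReadPaused() { return readPaused; }
}
```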