[jira] [Commented] (CASSANDRA-15049) Requests blocked at NTR stage should be rejected

2019-04-01 Thread Sumanth Pasupuleti (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806417#comment-16806417
 ] 

Sumanth Pasupuleti commented on CASSANDRA-15049:


FYI, I have submitted a patch on CASSANDRA-15013.

> Requests blocked at NTR stage should be rejected
> 
>
> Key: CASSANDRA-15049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michaël Figuière
> Priority: Normal
>
> CASSANDRA-11363 showed that if the NTR (Native-Transport-Requests) stage's
> thread pool and queue are full, the Netty event loops may block waiting on the
> NTR queue. The fix adopted in CASSANDRA-11363 was to increase the default
> queue size from 128 to 1024. This significantly reduced the number of blocked
> requests observed but did not remove the problem entirely. Whenever a Netty
> event loop is blocked, Cassandra's responsiveness suffers significantly, so it
> seems inappropriate to rely solely on increasing the queue size until
> everything looks fine... at the time the tuning was done.
> In fact, this situation matches the definition of the {{Overloaded}} error in
> the CQL protocol:
> {code:java}
> 0x1001 Overloaded: the request cannot be processed because the
>   coordinator node is overloaded{code}
> Therefore, whenever a request can't make it to the NTR stage, it should be
> rejected with an {{Overloaded}} error sent to the client. This can be done at
> low cost, as we're already in the Netty event loop that owns the channel to
> that client.
> It would then be the client's responsibility to retry with another
> coordinator, which is likely to yield a better P99 latency than blocking on an
> already overlong queue.
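
The rejection proposed in the description can be sketched as follows. This is
only an illustration under stated assumptions, not the Cassandra
implementation: NtrAdmission, ACCEPTED, submit() and pending() are hypothetical
names, and a plain ArrayBlockingQueue stands in for the NTR stage's queue. The
point is that offer() returns immediately on a full queue, whereas put() would
park the calling event loop thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for admission to the NTR stage's bounded queue. */
final class NtrAdmission {
    /** CQL native protocol error code for Overloaded, as quoted in the issue. */
    static final int OVERLOADED = 0x1001;
    /** Illustrative success marker; not a protocol value. */
    static final int ACCEPTED = 0;

    private final BlockingQueue<Runnable> queue;

    NtrAdmission(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Meant to be called from the event loop thread. offer() never blocks:
     * it returns false when the queue is full, and that case is mapped to
     * the Overloaded error code instead of waiting for space.
     */
    int submit(Runnable request) {
        return queue.offer(request) ? ACCEPTED : OVERLOADED;
    }

    /** Number of requests currently waiting in the queue. */
    int pending() {
        return queue.size();
    }
}
```

With a capacity of 2, the first two submit() calls are accepted and the third
returns 0x1001 without blocking; the client can then retry against another
coordinator.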



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15049) Requests blocked at NTR stage should be rejected

2019-03-11 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789989#comment-16789989
 ] 

Michaël Figuière commented on CASSANDRA-15049:
--

[~sumanth.pasupuleti] Thanks for pointing it out. I'm likely to experiment
with sending back an Overloaded error on my side to see how it works with our
clusters, and I'll track the progress on CASSANDRA-15013 before pushing a patch
on this Jira.

> Requests blocked at NTR stage should be rejected
> 
>
> Key: CASSANDRA-15049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michaël Figuière
> Priority: Major






[jira] [Commented] (CASSANDRA-15049) Requests blocked at NTR stage should be rejected

2019-03-11 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789977#comment-16789977
 ] 

Dinesh Joshi commented on CASSANDRA-15049:
--

I think targeting 3.0.x / 3.x deserves a discussion on the dev@ mailing list.

> Requests blocked at NTR stage should be rejected
> 
>
> Key: CASSANDRA-15049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michaël Figuière
> Priority: Major






[jira] [Commented] (CASSANDRA-15049) Requests blocked at NTR stage should be rejected

2019-03-11 Thread Sumanth Pasupuleti (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789978#comment-16789978
 ] 

Sumanth Pasupuleti commented on CASSANDRA-15049:


On a closely related front, I am working on a patch for
https://issues.apache.org/jira/browse/CASSANDRA-15013 (almost done with the
patch, writing unit tests).
It tackles exactly the same issue: preventing event loop threads from blocking
while trying to enqueue on the NTR queue. The patch offers the option to
either throw OverloadedException or put backpressure on the channel. More in
CASSANDRA-15013.
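
The two options mentioned in this comment can be contrasted with a small
sketch. It is not the CASSANDRA-15013 patch: OverloadHandling, Policy,
ClientChannel, dispatch() and runOne() are hypothetical names, the
OverloadedException here is a local stand-in rather than Cassandra's class, and
the channel is reduced to a read toggle (in Netty the analogous knob is
channel.config().setAutoRead(boolean)). The sketch also assumes that, under
backpressure, the caller holds the request that did not fit until reads resume.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of the two overload policies; not the actual patch. */
final class OverloadHandling {
    enum Policy { THROW_ON_OVERLOAD, APPLY_BACKPRESSURE }

    /** Local stand-in; not Cassandra's own exception class. */
    static final class OverloadedException extends RuntimeException {
        OverloadedException(String message) { super(message); }
    }

    /** Hypothetical channel that exposes only a read toggle. */
    static final class ClientChannel {
        private boolean autoRead = true;
        boolean isAutoRead() { return autoRead; }
        void setAutoRead(boolean value) { autoRead = value; }
    }

    private final BlockingQueue<Runnable> queue;
    private final Policy policy;

    OverloadHandling(int capacity, Policy policy) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.policy = policy;
    }

    /**
     * Never blocks. Returns true if the request was queued. On a full queue
     * it either throws, or pauses reads on the channel and returns false.
     */
    boolean dispatch(ClientChannel channel, Runnable request) {
        if (queue.offer(request)) {
            return true;
        }
        if (policy == Policy.THROW_ON_OVERLOAD) {
            throw new OverloadedException("NTR queue is full");
        }
        channel.setAutoRead(false); // stop reading from this client for now
        return false;
    }

    /** Worker side: run one queued request, then resume reads on the channel. */
    boolean runOne(ClientChannel channel) {
        Runnable next = queue.poll();
        if (next == null) {
            return false;
        }
        next.run();
        channel.setAutoRead(true);
        return true;
    }
}
```

Throwing hands the retry decision to the client right away, while pausing
reads keeps the request on the same coordinator and relies on TCP flow control
to slow the client down.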

> Requests blocked at NTR stage should be rejected
> 
>
> Key: CASSANDRA-15049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michaël Figuière
> Priority: Major






[jira] [Commented] (CASSANDRA-15049) Requests blocked at NTR stage should be rejected

2019-03-11 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789975#comment-16789975
 ] 

Michaël Figuière commented on CASSANDRA-15049:
--

[~djoshi3] I can put together a patch for this. Do you think it could target
the 3.0.x / 3.x branches as well? Otherwise, many people would just backport
it manually.

> Requests blocked at NTR stage should be rejected
> 
>
> Key: CASSANDRA-15049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michaël Figuière
> Priority: Major






[jira] [Commented] (CASSANDRA-15049) Requests blocked at NTR stage should be rejected

2019-03-11 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789963#comment-16789963
 ] 

Dinesh Joshi commented on CASSANDRA-15049:
--

Makes sense to add this protection. This is something I was considering adding 
for internode messaging as well. Do you want to send a patch?

> Requests blocked at NTR stage should be rejected
> 
>
> Key: CASSANDRA-15049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15049
> Project: Cassandra
> Issue Type: Bug
> Reporter: Michaël Figuière
> Priority: Major


