[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2020-01-31 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027916#comment-17027916
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

Thanks Dinesh. Applied the most of the changes you suggested. 

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2020-01-31 Thread Dinesh Joshi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027909#comment-17027909
 ] 

Dinesh Joshi commented on CASSANDRA-15350:
--

I think the overall PR looks good. I just had a few minor nits that I've mocked 
[here|https://github.com/yifan-c/cassandra/commit/b9a58b38cf9ce85203021bd48f89d98d0f69acd5].
 Please feel free to cherry pick.

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2020-01-31 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027827#comment-17027827
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

Thanks Alex. 

Using fuzz to reproduce contentions sounds OK to me.

Having deterministic test cases to precisely produce contentions is also 
valuable, IMO. But I am good to not have those test cases being merged as of 
now. 

I rebased my branch with your improvements commit and added/tweaked test cases 
in {{CasWriteTest}}.

Still the same [PR|https://github.com/apache/cassandra/pull/379] and [code 
branch|https://github.com/yifan-c/cassandra/tree/cas-exception-changes]. Test 
result is 
[here.|https://app.circleci.com/github/yifan-c/cassandra/pipelines/458cafcb-117b-443a-ae6c-c089efd6a929/workflows/7742c20d-1a82-47a8-93d0-cc40a542db5f]
 

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2020-01-31 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027710#comment-17027710
 ] 

Alex Petrov commented on CASSANDRA-15350:
-

I've pushed a rebased branch with minor improvements 
[here|https://github.com/ifesdjeen/cassandra/tree/cas-exception-changes], tests 
running 
[here|https://circleci.com/gh/ifesdjeen/cassandra/tree/cas-exception-changes].

I'd propose to use fuzz tests instead of cascades of latches. If we could 
implement a simple test to reproduce (any) contention in this fashion, it'd be 
great. Could you implement it? 

Other than that, I'm +1 and ready to commit when contention fuzz test is added.

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-12-06 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990294#comment-16990294
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

[~ifesdjeen], thanks for the reply. 
h4. Exception naming

Put more thoughts into the exception naming (I am definitely not good at it :/ 
). 
 # The exception is *not* sent to clients, e.g. java driver. What being sent to 
the clients is the error message with the error code defined at 
[native_protocol_v5, Section 9]. 
 # The exception is used mainly to understanding the code or reasoning 
about/debugging server errors from log. 
 # The exception is caused by contention during Paxos
 # The result of the CAS is unknown, and the exception should express the 
status.
 # Timeout implies the result is unknown and the operation exceeds the 
permitted duration, which is not the case here.

The name of the exception should describe what happened (the unknown result) 
and how it happened (due to contention during Paxos). 

{{CasWriteResultUnknownException}} is a reasonable name too. The name indicates 
the result explicitly and how it happened. 

The error code name in protocol spec should be updated to 
{{CAS_WRITE_RESULT_UNKNOWN}} from {{CAS_UNCERTAINTY}} in order to easily 
associate with the exception when reading the code.
h4. The cas write test case

Understand the concern. IMO, the purpose of the test is to validate the specs 
remains correct with the new changes. The tests defines the specs. If someone 
changes the code related to CAS and the tests are broken, it is very likely the 
changes introduce some bug. However, if the implementation of CAS is entirely 
changed, then the tests are not valid anymore and can be removed from the code 
base. 

Paxos has multiple rounds and there are rules in each round. The tests check 
the rules and assert the expected result of each round. Although the test case 
here is not comprehensive, it provides a certain level of coverage for the 
contention case. Fuzzing can surely be more inclusive. 

The maintenance concern might be bigger than I thought. So I am all ear and OK 
to remove those complex test cases if insist. 
h4. Others
 * The test case for encode/decode the error message with the new fields are 
missing. I will add them in the {{ErrorMessageTest}} to test the scenario when 
communicating with a lower version client.
 * Maybe the {{CasWriteTimeoutException}} is not necessary? I take a second 
look and feel the {{contentions#}} sent to clients might not be helpful to 
them. 
 * The {{InterceptFilter}} can clog the message outbound sink. It is because 
the jvm dtest instance run the filters in the same thread. I made the 
{{MessageDeliverySink}} to run in async for each outbound message to avoid 
blocking the evaluation of other filters. 

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-12-03 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986687#comment-16986687
 ] 

Alex Petrov commented on CASSANDRA-15350:
-

To be honest, I think the fact that names {{WriteStalled}} and {{WriteTimeout}} 
are quite close to each other might confuse the user. We need to reflect the 
fact that it's a Paxos round failure or that the reason is that _we do not 
know_ whether the value is going to go through or not.

bq. ErrorMessage is not involved in internode messageing

Err; of course. Sorry about that: was thinking in a different context and 
phrased it wrong. Also, both messages occur on the coordinator, so internode 
doesn't apply. What I should have written is that this logic is used by the 
{{SimpleClient}}.

bq. the scenario was carefully crafted to be deterministic and aims to produce 
the same kind of contention.

This is precisely what I'm concerned about: it is carefully crafted and might 
be difficult to maintain. Everyone who'll be modifying the code in the future 
will have to re-craft the test as well. I think we can relatively easily 
reproduce it with a fuzz test that introduces contention. I think introducing 
latency/partition in the test is a reasonable thing, I'd just make it random 
rather than handcrafted. This will also help us to see how it all behaves when 
contention is higher.

bq. Do you mean rename the method to activate/deactivate,

Right, I'd just call them {{activate}} and {{deactivate}}.

We also need at least a version of the {{SimpleClient}} to be tested with the 
changes. Ideally, we need an accompanying patch for the java-driver, since it 
changes the native protocol.

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-12-02 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986310#comment-16986310
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

[~ifesdjeen] and [~spod], big thanks for reviewing the patch!

Renaming the exception to {{CasWriteStalledException}} and the suggested 
rephrased description for {{CAS_UNCERTAINTY}} sounds good. 
 The meaning of {{CasWriteUncertainty}} is vague, and as being pointed out, the 
WTE indicates that the result is uncertain too. {{CasWriteStalled}} describes 
what happened better. 
 I meant to put _Paxos read_ when writing the description for the 
{{CAS_UNVERTAINTY}} clause. I will update the description with the suggested 
one considered. 

Regarding the cross-version scenarios, I may be wrong, but my current 
understanding is that the ErrorMessage is *not* involved in internode 
messageing. ErrorMessage, derived from 
{{org.apache.cassandra.transport.Message.Response}}, is client-facing. When a 
sub-V5 (i.e. V4) client connects to the V5 server and gets the 
{{CasWriteTimeoutException}}, the server encoding makes sure to produce a 
backward compatible one, so the sub-V5 client is still good to understand the 
server response. 
 The {{decode}} method in {{ErrorMessage}} seems to be only useful for 
{{org.apache.cassandra.transport.SimpleClient/Client}}, which is not started in 
cassandra server.
{quote}Unless you're submitting patches to 2.2, 3.0, and 3.11, let's roll back 
changes to IMessageFilters, since their API has to be binary compatible with 
older versions.
{quote}
The test cases in {{CasWriteTest}} relies on the message intercept function. I 
will back-port the changes to IMessageFilter to the prior versions.
{quote}Should we add timeout tests for responses as well as requests in 
CasWriteTest?
{quote}
Sure. Sound good.
{quote}Is it possible to try and simplify testCasWriteTimeoutDueToContention, 
can we achieve contention with N threads?
{quote}
The test does achieve contention with N threads (1 thread per client). In 
addition, the scenario was carefully crafted to be deterministic and aims to 
produce the same kind of contention.
{quote}both tests peer quite a lot into implementation internals.
{quote}
The test cases mainly manipulate the internode networking to introduce 
latency/partition. In order to produce (and always produce) a rare contention 
scenario, I think those fine-grained control is necessary.
{quote}In ErrorMessage#decode, there are extra brackets around WRITE_TIMEOUT 
clause. You can remove those and fix indentation. Same happens in 
CAS_UNCERTAINTY case.
{quote}
Removing the brackets in the {{switch-case}} statements gives syntax error 
since we are defining the variables with the same name. The brackets help to 
scope the variables.
{quote}If we add comments for activate and deactivate for off/on, maybe it's 
worth to call those off/on?
{quote}
Do you mean rename the method to activate/deactivate, or change the comments to 
on/off? Both sound good to me.

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-11-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985099#comment-16985099
 ] 

Alex Petrov commented on CASSANDRA-15350:
-

 [~yifanc] thank you for the patch. I agree with [~spod] notes on the naming. 
Maybe other potential name could be something like 
{{CASUnknownResultException}}?

Regarding whether you'd like to retry or not. Since contention itself is a 
result of concurrency, backing off and not retrying may actually a good idea. 
In case of uncertainty, I'd say you'd usually retry with another write rather 
than reading and then writing, at least that's how I would imagine application 
logic in such case. In case of a failed subsequent write, you'll also get a 
reason in form of a new value. Similar naming should apply to {{uncertainties}} 
in metrics, etc.

Comments: 
  * In {{ErrorMessage#decode}}, you're unconditionally changing the protocol 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR126].
 If I understand correctly, we can still have a combination of {{WriteTimeout}} 
and {{CAS}} in {{v4}}. So if we receive a message from {{v4}} on a {{v5}} node, 
we won't be able to interpret it correctly. You can try and write in-jvm test 
to test for that, or use in-jvm test facilities to try and encode message on 
one version of the node and decode it on the other. 
  * {{ErrorMessage#encode}} we seem to be aided by the fact that 
{{getBackwardsCompatibleException}} gives us the right exception. However, 
since these two parts of code are disjoint, and there's no guarantee future 
changes will be careful, I'd add an assertion that [this code 
path|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR242]
 should be possible only under {{v5}}.
  * Same logic as above applies for size calculation 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR307]
 and 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-a339339f7a71b38f6a277888f38eb48eR325].
 
  * {{CAS_UNCERTAINTY}} clause is new in protocol {{v5}}. We should not attempt 
to interpret or write it for the older protocol versions. In fact, for older 
protocol version we should still encode it as write timeout to be fully 
compatible.
  * I'd argue that "uncertainties" should not be counted towards timeouts like 
[here|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes#diff-71f06c193f5b5e270cf8ac695164f43aR295].
 But someone with better operational insight might be a better person to answer 
this.
  * Unless you're submitting patches to 2.2, 3.0, and 3.11, let's roll back 
changes to {{IMessageFilters}}, since their API has to be binary compatible 
with older versions.
  * Should we add timeout tests for responses as well as requests in 
{{CasWriteTest}}?
  * Is it possible to try and simplify {{testCasWriteTimeoutDueToContention}}, 
can we achieve contention with {{N}} threads?
  * Similar comment for {{testCasWriteUncertainty_ProposalNotReplayed}}: both 
tests peer quite a lot into implementation internals. If we can't do this, I'd 
just leave the tests out from commit since it might be difficult to maintain 
them. I've also been able to catch a failure of 
{{testCasWriteUncertainty_ProposalNotReplayed}} after several thousand runs. 
Maybe we can use fuzzing to reproduce both scenarios?
  * Can we add a cross-version ser-de test, too?

Some nits: 
  * In {{ErrorMessage#decode}}, there are extra brackets around 
{{WRITE_TIMEOUT}} clause. You can remove those and fix indentation. Same 
happens in {{CAS_UNCERTAINTY}} case. 
  * If we add comments for {{activate}} and {{deactivate}} for off/on, maybe 
it's worth to call those off/on?
  * Maybe it's worth to rename {{Filter}} to {{Sink}} and separate filters from 
interceptors, allowing a common class "sink" between those?
  * We don't really need to rename {{equals}} to {{equals0}} 


> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/brows

[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-11-29 Thread Stefan Podkowinski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985029#comment-16985029
 ] 

Stefan Podkowinski commented on CASSANDRA-15350:


Just want to add some quick notes here.

First of all, the naming for {{CasWriteUncertaintyException}} could be slightly 
confusing to users, since it doesn't tell you what exactly is uncertain. Also 
there are situations where the write could be "uncertain" without receiving 
such an exception, e.g. when the exception could not be successfully delivered 
to the client and you get a client timeout instead. 
 How about renaming it to something like {{CasWriteStalledException}} instead?

>From the protocol change:
 "_CAS_UNCERTAINTY: An exception during contended Compare And Set write/update. 
The exception indicates the CAS operation result may or may not be sucessful. 
Clients receiving the exception can send a read query to confirm._"

First of all, I'm not fully sure why you would want to read, instead of just 
re-issuing the cas write in that case. Maybe we can change the phrasing to 
something along the line of "_write was only partially completed and operation 
may or may not get completed by a following CAS write or SERIAL/LOCAL_SERIAL 
read_".

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-11-10 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971341#comment-16971341
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

The change of calculating the exact size of UTF-8 string has negative 
performance impact. It needs to iterate through the entire string to determine 
the actual size in UTF-8. 

The benchmark setup and the result:
{code:java}
 [java] Benchmark  Mode  Cnt
Score Error  Units
 [java] Utf8StringEncodeBench.writeLongTextavgt6  
552.458 ±   9.141  ns/op
 [java] Utf8StringEncodeBench.writeLongTextWithExactSize   avgt6  
787.676 ± 120.057  ns/op
 [java] Utf8StringEncodeBench.writeShortText   avgt6   
70.311 ±   8.031  ns/op
 [java] Utf8StringEncodeBench.writeShortTextWithExactSize  avgt6   
71.716 ±   4.790  ns/op

{code}
I will revert the change. 

 

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Assignee: Yifan Cai
>Priority: Normal
>  Labels: protocolv5, pull-request-available
> Attachments: Utf8StringEncodeBench.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-10-29 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962289#comment-16962289
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

 
||Branch||Diff||Tests||
|[cas-exception-changes|https://github.com/yifan-c/cassandra/tree/cas-exception-changes]|[diff|https://github.com/apache/cassandra/compare/trunk...yifan-c:cas-exception-changes]|[tests|https://circleci.com/workflow-run/67f7d98f-5c2b-4b2c-9fd7-120862e554e7]|

 Changes:
 # Added {{CasWriteTimeoutException}} and {{CasWriteUncertaintyException}}
 # Added {{encode}}/{{decode}}/{{encodeSize}} for the new exceptions. Test 
cases added in ErrorMessageTest
 # Modified {{MessageFilters}} in dtest to support constructing CAS scenario.
 # Added CasWriteTest.
 # Minor changes
 ** Calculate the exact UTF-8 string byte size in CBUtil to reduce unnecessary 
memory allocation, (over-estimating with utf8MaxBytes)
 ** Corrected {{assertRows}} parameters sequence
 ** Moved {{DatabaseDescriptor}} initialization for dtest to test base class.

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Priority: Normal
>  Labels: client-impacting, protocolv5
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15350) Add CAS “uncertainty” and “contention" messages that are currently propagated as a WriteTimeoutException.

2019-10-29 Thread Yifan Cai (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962278#comment-16962278
 ] 

Yifan Cai commented on CASSANDRA-15350:
---

In the current {{cas}} implementation, WriteTimeoutExceptions 
({{WriteType.CAS}}) are thrown under the following scenarios
 * The overall {{cas}} operation times out.
 * The PREPARE phase times out.
 ** Multiple unsuccessful retries and eventually times out
 ** RPC requests with nodes time out. (networking)
 ** Multiple proposers contend.Each proposer get promise from the majority and 
pre-empt the other proposers from proceeding to PROPOSE phase. When the other 
proposers (thinking they are still the winners, but in fact not) send proposal, 
they gets rejections from *ALL* acceptors. Such contention continues and time 
runs out.
 ** A repair attempt is added in this phase.
 *** Propose to replay the previous accepted update timeouts
 *** Commit the update timeout
 * The PROPOSE phase times out.
 ** RPC requests with nodes time out. (networking)
 ** Send proposal to *ALL* acceptors and wait,
 *** If successful, i.e. majority accepts, we are good.
 *** If *all* acceptors rejects, it is safe for the proposer to re-submit the 
proposal with a higher ballot.
 *** {color:#ff8b00}If some but *not quorum* accepts, the proposal may or may 
not be replayed by new proposers. (Uncertainty){color}
  If the new proposer reaches to the acceptors that accepted the old 
proposal, it replays the proposal when it is the most recent in-progress one.
  If the new proposer does not reach to those acceptors, it is free for the 
new proposer to choose a value and possibly making the earlier proposal to not 
be qualified for replaying.
 * The COMMIT phase times out.
 ** Apply update times out. Note that this is a normal write. The WriteType is 
{{SIMPLE}} instead of {{CAS}}
 ** Only exception is when the timeout is from the repair attempt in the 
PREPARE phase. In this case, the WriteType is overridden to {{CAS}}

Most of the timeouts are genuine in the list, except the one colored in 
{color:#ff8b00}orange{color}.

> Add CAS “uncertainty” and “contention" messages that are currently propagated 
> as a WriteTimeoutException.
> -
>
> Key: CASSANDRA-15350
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15350
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/Lightweight Transactions
>Reporter: Alex Petrov
>Priority: Normal
>  Labels: client-impacting, protocolv5
>
> Right now, CAS uncertainty introduced in 
> https://issues.apache.org/jira/browse/CASSANDRA-6013 is propagating as 
> WriteTimeout. One of this conditions it manifests is when there’s at least 
> one acceptor that has accepted the value, which means that this value _may_ 
> still get accepted during the later round, despite the proposer failure. 
> Similar problem happens with CAS contention, which is also indistinguishable 
> from the “regular” timeout, even though it is visible in metrics correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org