[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-24 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066073#comment-17066073
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/24/20, 7:34 PM:
--

After review from [~blerer], I have made some changes and the latest CI run is 
[here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk]
 as #11. All the unit and JVM dtests ran successfully.


was (Author: newkek):
After review from [~blerer] I have made some changes and the latest CI run is 
[here|https://app.circleci.com/pipelines/github/newkek/cassandra?branch=15543-trunk]
 as #11. All the unit and JVM dtests run successfully.

> flaky test 
> org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
> ---
>
> Key: CASSANDRA-15543
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15543
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest
>Reporter: David Capwell
>Assignee: Kevin Gallardo
>Priority: Normal
> Fix For: 4.0-alpha
>
>
> This fails infrequently, last seen failure was on java 8
> {code}
> junit.framework.AssertionFailedError
>   at 
> org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-17 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061159#comment-17061159
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/17/20, 8:07 PM:
--

Hi [~benedict], I've seen you've done quite a bit of work in 
{{ReadCallback}}; I was wondering if you had some cycles to take a look at the 
patch above (and optionally the discussion going on in this ticket)? Thanks


was (Author: newkek):
Hi [~benedict], I've seen you have done quite a bit of work in 
{{ReadCallback}}, I was wondering if you had some cycles to take a look at the 
patch above? Thanks




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-11 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/11/20, 3:23 PM:
--

Thanks for the reply, Ekaterina, and for confirming the behavior described.

Got some more time to think about this, and it seems the best way to approach 
it, since what I suggest would be a larger behavioral change, would be to 
create a separate ticket for it. 

I think in this ticket I would rather fix the discrepancy between the failure 
count and the failure messages with the immutable map copy, and fix the test 
that expects 2 failures to expect only 1, matching current behavior. 

Then, in the other ticket we can properly advocate/discuss the pros and cons of 
returning cohesive error messages vs. failing fast with an incohesive error 
message.


was (Author: newkek):
Thanks for the reply, Ekaterina. 

Got some more time to think about this, and I think the best way to approach 
it, since what I suggest would be a larger behavioral change, would be to 
create a separate ticket for it.

I think in this ticket I would rather: Fix the discrepancies between the 
failures number and the failure messages with the immutable map copy + fix the 
test that expects 2 failures to only expect 1 to match current behavior. 

Then, in the other ticket we can discuss pros and cons of returning cohesive 
error messages VS fail-fast & incohesive error message.




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-11 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057091#comment-17057091
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/11/20, 3:23 PM:
--

Thanks for the reply, Ekaterina, and for confirming the behavior described.

Got some more time to think about this, and it seems the best way to approach 
it, since what I suggest would be a larger behavioral change, would be to 
create a separate ticket for it. 

I think in this ticket I would rather fix the discrepancy between the failure 
count and the failure messages with the immutable map copy, and fix the test 
that expects 2 failures to expect only 1, matching current behavior. 

Then, in the other ticket we can properly advocate/discuss the pros and cons of 
returning cohesive error messages vs. failing fast with an incohesive error 
message.

Does that make sense?


was (Author: newkek):
Thanks for the reply, Ekaterina, and for confirming the behavior described.

Got some more time to think about this, and it seems the best way to approach 
it, since what I suggest would be a larger behavioral change, would be to 
create a separate ticket for it. 

I think in this ticket I would rather: Fix the discrepancies between the 
failures number and the failure messages with the immutable map copy + fix the 
test that expects 2 failures to only expect 1 to match current behavior. 

Then, in the other ticket we can properly advocate/discuss pros and cons of 
returning cohesive error messages VS fail-fast & incohesive error message.




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-10 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056278#comment-17056278
 ] 

Ekaterina Dimitrova edited comment on CASSANDRA-15543 at 3/10/20, 7:27 PM:
---

Hi Kevin,

"In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent." 

Agree with you. I would personally expect 1 response from node1 where the 
schema was altered and two failures with the two messages. 

"In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response."

Or we can say "as soon as it fails more than 0 times" and be safe?

This is my personal interpretation.

I believe [~ifesdjeen] would be the best person to confirm any details, but for 
the test fix I think this would be enough to make it deterministic. I don't see 
a bug, rather confusing responses to the user (same as you, as we discussed on 
Slack).


was (Author: e.dimitrova):
Hi Kevin,

"In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent." 

Agree with you. I would personally expect 1 response from node1 where the 
schema was altered and two failures with the two messages. 

"In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response."

Or we can say "as soon as it fails more than 0 times" and be safe?

This is my personal interpretation.

I believe  [~ifesdjeen] would be the best person to confirm any details but for 
test fix this would be enough. I don't see a bug but confusing responses to the 
user (same as you, as we discussed on Slack)




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-10 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056278#comment-17056278
 ] 

Ekaterina Dimitrova edited comment on CASSANDRA-15543 at 3/10/20, 7:26 PM:
---

Hi Kevin,

"In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent." 

Agree with you. I would personally expect 1 response from node1 where the 
schema was altered and two failures with the two messages. 

"In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response."

Or we can say "as soon as it fails more than 0 times" and be safe?

This is my personal interpretation.

I believe  [~ifesdjeen] would be the best person to confirm any details but for 
test fix this would be enough. I don't see a bug but confusing responses to the 
user (same as you, as we discussed on Slack)


was (Author: e.dimitrova):
Hi Kevin,

"In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent." 

Agree with you. I would personally expect 1 response from node1 where the 
schema was altered and two failures with the two messages. 

"In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response."

Or we can say "as soon as it fails more than 0 times" and be safe?

This is my personal interpretation.

I believe  [~ifesdjeen] would be the best person to confirm any details.

 




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-09 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/9/20, 2:08 PM:
-

Sounds good, thanks, hope you had a good weekend :)

As a summary, 

In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent.
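As a sketch of that idea (a minimal, hypothetical stand-in rather than 
Cassandra's actual {{ReadFailureException}}; the class and field names here are 
illustrative only):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for Read/WriteFailureException, showing the
// defensive-copy idea; names here are not Cassandra's actual API.
class SnapshotFailureException extends RuntimeException {
    final int failures;
    final Map<String, String> reasons;

    SnapshotFailureException(int failures, Map<String, String> reasons) {
        this.failures = failures;
        // Copying freezes the map contents at construction time, so a
        // concurrent onFailure() mutating the shared map can no longer make
        // the message disagree with the failure count.
        this.reasons = Map.copyOf(reasons);
    }
}

public class CopyOnThrow {
    public static void main(String[] args) {
        Map<String, String> failureReasonByEndpoint = new ConcurrentHashMap<>();
        failureReasonByEndpoint.put("127.0.0.3:7012", "INCOMPATIBLE_SCHEMA");

        SnapshotFailureException e =
                new SnapshotFailureException(1, failureReasonByEndpoint);

        // A later, concurrent failure mutates the shared map...
        failureReasonByEndpoint.put("127.0.0.2:7012", "INCOMPATIBLE_SCHEMA");

        // ...but the exception's snapshot still matches its failure count.
        System.out.println(e.failures + " failure(s), " + e.reasons.size() + " reason(s)");
    }
}
{code}
Guava's {{ImmutableMap.copyOf}} would serve the same purpose as {{Map.copyOf}} here.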

In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response.


was (Author: newkek):
Sounds good.

As a summary, 

In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent.

In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response.




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-09 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054990#comment-17054990
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/9/20, 2:07 PM:
-

Sounds good.

As a summary, 

In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent.

In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that, the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response.


was (Author: newkek):
Sounds good.

As a summary, 

In any case, I believe passing an immutable copy of the 
{{failureReasonByEndpoint}} map to the constructor of 
Read/WriteFailureException would reduce the chances for the {{number of 
failures}} and the failure messages to be inconsistent.

In addition to that, there's the remaining question of the behavior of 
ReadCallback when failures happen (do we fail fast? or do we wait for all 
responses to come back/timeout?). Depending on the outcome of that the test 
that is flaky at the moment would need to be adjusted to expect 1 *or* 2 
failures in the response.




[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-06 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:45 PM:
-

Also, I suggested we wait for all responses because this is what the test 
expects, the test waits for the {{ReadFailureException}} to contain *both* 
failures without accounting for possible race conditions of the 
{{ReadCallback}} returning before all responses have come back.

EDIT: I confused the number of received responses with the number of failures; 
I think what is below is incorrect.

-Also noticed that in the current context, the number of {{failure}} and the 
list of failures can be inconsistent with each other, because we create the 
exception with-
{code:java}
new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, 
resolver.isDataPresent(), failureReasonByEndpoint){code}
-Concurrently, {{received}} can have been incremented but not the map of 
{{failureReasonByEndpoint}} and we could end up with:-
{code:java}
Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA 
from 127.0.0.3:7012{code}
-"2" failures but only 1 error message. (number of {{received}} is irrelevant)-

-Conversely, since the reference to the map is passed to 
{{ReadFailureException}} but {{received}} is passed by value, we can end up in 
the following scenario:-
 - -1 error triggers awaitResults() to unblock-
 - -call to {{new ReadFailureException(received, failureReasonByEndpoint)}}-
 - -value of 1 for {{received}} copied when calling {{ReadFailureException}} 
constructor but only reference to {{failureReasonByEndpoint}} given-
 - -concurrently second failure calls to {{onFailure()}} which updates 
{{failures}} and the map-
 - -{{ReadFailureException}} has the {{1}} value for {{failures}} but the map 
has been updated to have 2 failures and constructs the error message-

-We end up with the error message:-
{code:java}
Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA 
from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}
-"1" failure but 2 error messages. (number of {{received}} is irrelevant)-

-One way to solve these issues would be to pass a copy of the 
{{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.-

-If {{await()}} waited until all messages came back or timeout the 2 issues 
wouldn't be problematic I think.-


was (Author: newkek):
Also, I suggested we wait for all responses because this is what the test 
expects, the test waits for the {{ReadFailureException}} to contain *both* 
failures without accounting for possible race conditions of the 
{{ReadCallback}} returning before all responses have come back.

Also noticed that in the current context, the number of {{failure}} and the 
list of failures can be inconsistent with each other, because we create the 
exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), 
received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of 
{{failureReasonByEndpoint}} and we could end up with:

{code}Operation failed - received 1 responses and 2 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code}

"2" failures but only 1 error message. (number of {{received}} is irrelevant)

Conversely, since the reference to the map is passed to 
{{ReadFailureException}} but {{received}} is passed by value, we can end up in 
the following scenario:

- 1 error triggers awaitResults() to unblock
- call to {{new ReadFailureException(received, failureReasonByEndpoint)}}
- value of 1 for {{received}} copied when calling {{ReadFailureException}} 
constructor but only reference to {{failureReasonByEndpoint}} given
- concurrently second failure calls to {{onFailure()}} which updates 
{{failures}} and the map
- {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has 
been updated to have 2 failures and constructs the error message

We end up with the error message:
{code}Operation failed - received 1 responses and 1 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 
127.0.0.2:7012{code}

"1" failure but 2 error messages. (number of {{received}} is irrelevant)

One way to solve these issues would be to pass a copy of the 
{{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.

If {{await()}} waited until all messages came back or timeout the 2 issues 
wouldn't be problematic I think.


[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-06 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053647#comment-17053647
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:04 PM:
-

Also, I suggested we wait for all responses because this is what the test 
expects, the test waits for the {{ReadFailureException}} to contain *both* 
failures without accounting for possible race conditions of the 
{{ReadCallback}} returning before all responses have come back.

Also noticed that in the current context, the number of {{failure}} and the 
list of failures can be inconsistent with each other, because we create the 
exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), 
received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of 
{{failureReasonByEndpoint}} and we could end up with:

{code}Operation failed - received 1 responses and 2 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code}

"2" failures but only 1 error message. (number of {{received}} is irrelevant)

Conversely, since the reference to the map is passed to 
{{ReadFailureException}} but {{received}} is passed by value, we can end up in 
the following scenario:

- 1 error triggers awaitResults() to unblock
- call to {{new ReadFailureException(received, failureReasonByEndpoint)}}
- value of 1 for {{received}} copied when calling {{ReadFailureException}} 
constructor but only reference to {{failureReasonByEndpoint}} given
- concurrently second failure calls to {{onFailure()}} which updates 
{{failures}} and the map
- {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has 
been updated to have 2 failures and constructs the error message

We end up with the error message:
{code}Operation failed - received 1 responses and 1 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 
127.0.0.2:7012{code}

"1" failure but 2 error messages. (number of {{received}} is irrelevant)

One way to solve these issues would be to pass a copy of the 
{{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.

If {{await()}} waited until all messages came back or timeout the 2 issues 
wouldn't be problematic I think.


was (Author: newkek):
Also, I suggested we wait for all responses because this is what the test 
expects, the test waits for the {{ReadFailureException}} to contain *both* 
failures without accounting for possible race conditions of the 
{{ReadCallback}} returning before all responses have come back.

Also noticed that in the current context, the number of {{failure}} and the 
list of failures can be inconsistent with each other, because we create the 
exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), 
received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of 
{{failureReasonByEndpoint}} and we could end up with:

{code}Operation failed - received 1 responses and 2 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code}

"2" failures but only 1 error message. (number of {{received}} is irrelevant)

Conversely, since the reference to the map is passed to 
{{ReadFailureException}} but {{received}} is passed by value, we can end up in 
the following scenario:

- 1 error triggers awaitResults() to unblock
- call to {{new ReadFailureException(received, failureReasonByEndpoint)}}
- value of 1 for {{received}} copied when calling {{ReadFailureException}} 
constructor but only reference to {{failureReasonByEndpoint}} given
- concurrently second failure calls to {{onFailure()}} which updates 
{{failures}} and the map
- {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has 
been updated to have 2 failures and constructs the error message

We end up with the error message:
{code}Operation failed - received 1 responses and 1 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 
127.0.0.2:7012{code}

"1" failure but 2 error messages. (number of {{received}} is irrelevant)

One way to solve this second issue would be to pass a copy of the 
{{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.

If {{await()}} waited until all messages came back or timeout the 2 issues 
wouldn't be problematic I think.


[jira] [Comment Edited] (CASSANDRA-15543) flaky test org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement

2020-03-06 Thread Kevin Gallardo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053578#comment-17053578
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 4:36 PM:
-

I have been looking into this issue and trying to reproduce it locally. I did 
not have much luck: I re-executed the test about 2500 times (let it run 
overnight) and only managed to reproduce something ~inconsistent in one case.

The test setup is: execute a schema alteration request on 1 node, don't let the 
node propagate the schema changes to other nodes, then execute a select at 
Consistency = ALL on that same node and expect a failure, because the 2 other 
nodes did not get the schema alteration request.

So the message we expect from the Select request would be the following:
{code:java}
Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA 
from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}
Node1's response is successful (because it has the schema changes), but all the 
others fail because they have a stale schema.

The closest I got to reproducing something inconsistent to this was:
{code:java}
Operation failed - received 0 responses and 2 failures: INCOMPATIBLE_SCHEMA 
from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}
But also let's consider the ones Brandon had observed.

I have looked into some of the places where {{ReadFailureException}} are 
generated, and landed in the 
[{{ReadCallback}}|https://github.com/apache/cassandra/blob/aed15bfb01e753f9bac8b149b382c7a7c8d33183/src/java/org/apache/cassandra/service/reads/ReadCallback.java#L162]
 class.

The logic of the code is:
- call {{awaitResults()}}
- in {{awaitResults()}}:
-- block on {{await()}}
-- when {{await()}} exits, check what responses we got and whether errors came 
back

Also looked into {{onFailure()}} method, the logic is the following:
{code:java}
int n = waitingFor(from)
  ? failuresUpdater.incrementAndGet(this)
  : failures;

failureReasonByEndpoint.put(from, failureReason);

if (blockFor + n > replicaPlan().contacts().size()) {
condition.signalAll(); // this signals as soon as 1 failure happens
}
{code}
{{waitingFor(from)}} is supposed to filter out errors coming from non-local 
nodes, it seems. In this test it always returns true, so for simplicity we'll 
assume it does. Similarly, {{blockFor}} and {{replicaPlan().contacts().size()}} 
are both {{3}} for this test, though I suppose they may differ in other use 
cases.

As we can see above once 1 failure happens, the {{condition}} will be signaled 
to unblock waiting callers. When {{awaitResults()}} unblocks, it checks the 
value of {{failures}} to see how many errors happened and report it as a 
{{ReadFailureException}}.

The code is designed so that the {{condition}} is freed when either all 
responses are successful, or one failure happened.

Now my understanding of this is that, in certain circumstances, we end up 
returning an error message as soon as 1 error happens, without waiting for all 
responses to come back, and the values of {{failures}} and {{received}} are not 
guaranteed to be at a certain value, except that in case of a failure we know 
{{failures}} will be > 0.

I am unclear on what the expected behavior should be here. 
From my POV, I would think the most reasonable logic would be that {{await()}} 
only exits once all responses have come back, failed or successful, or it 
times out. Therefore I would replace the {{"SimpleCondition condition"}} with 
a {{CountDownLatch}} initialized to the number of expected responses, 
{{countDown()}} in {{onReceived()}} and {{onFailure()}}, and have {{await()}} 
use that CountDownLatch.
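A minimal sketch of that {{CountDownLatch}} shape (a simplified callback, not 
the real {{ReadCallback}}; the class name and counters are illustrative):
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified callback: the latch starts at the number of expected responses,
// every reply (success or failure) counts down, and await() only returns
// once all replies have arrived or the timeout has expired.
public class LatchCallback {
    private final CountDownLatch latch;
    final AtomicInteger received = new AtomicInteger();
    final AtomicInteger failures = new AtomicInteger();

    LatchCallback(int expectedResponses) {
        latch = new CountDownLatch(expectedResponses);
    }

    void onReceived() { received.incrementAndGet(); latch.countDown(); }

    void onFailure()  { failures.incrementAndGet(); latch.countDown(); }

    // Returns false on timeout, true once all responses are accounted for.
    boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        return latch.await(timeout, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        LatchCallback cb = new LatchCallback(3);
        cb.onReceived(); // node1 answers successfully
        cb.onFailure();  // node2: INCOMPATIBLE_SCHEMA
        // A single failure no longer unblocks the caller:
        System.out.println(cb.await(10, TimeUnit.MILLISECONDS)); // false (times out)
        cb.onFailure();  // node3: INCOMPATIBLE_SCHEMA
        System.out.println(cb.await(10, TimeUnit.MILLISECONDS)); // true (all 3 in)
    }
}
{code}
With this shape, the {{ReadFailureException}} would always be built from the 
final {{failures}}/{{received}} values, at the cost of waiting out slow 
replicas up to the timeout.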

If the behavior of failing as soon as 1 failure happens is preferred, then the 
logic can stay that way, but the test needs to be modified to account for the 
possibility that only 1 error and no successful response may be returned in the 
{{ReadFailureException}}.

[~brandon.williams] [~e.dimitrova] got any insights?


was (Author: newkek):
I have been looking into this issue and trying to reproduce it locally. Did not 
have much luck, re-executed the test about 2500 times (let it run overnight) 
and only managed to reproduce something ~inconsistent in one case.

The test setup is: execute a schema alteration request on 1 node, don't let the 
node propagate the schema changes to other nodes, then execute a select at 
Consistency = ALL on that same node and expect a failure, because the 2 other 
nodes did not get the schema alteration request.

So the message we expect from the Select request would be the following:
{code:java}
Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA 
from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code}
Node1's response is successful (because it has the schema changes), but all the 
others fail because