[ 
https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053647#comment-17053647
 ] 

Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:04 PM:
---------------------------------------------------------------------

Also, I suggested we wait for all responses because this is what the test 
expects, the test waits for the {{ReadFailureException}} to contain *both* 
failures without accounting for possible race conditions of the 
{{ReadCallback}} returning before all responses have come back.

Also noticed that in the current context, the number of {{failure}} and the 
list of failures can be inconsistent with each other, because we create the 
exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), 
received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of 
{{failureReasonByEndpoint}} and we could end up with:

{code}Operation failed - received 1 responses and 2 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code}

"2" failures but only 1 error message. (number of {{received}} is irrelevant)

Conversely, since the reference to the map is passed to 
{{ReadFailureException}} but {{received}} is passed by value, we can end up in 
the following scenario:

- 1 error triggers awaitResults() to unblock
- call to {{new ReadFailureException(received, failureReasonByEndpoint)}}
- value of 1 for {{received}} copied when calling {{ReadFailureException}} 
constructor but only reference to {{failureReasonByEndpoint}} given
- concurrently second failure calls to {{onFailure()}} which updates 
{{failures}} and the map
- {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has 
been updated to have 2 failures and constructs the error message

We end up with the error message:
{code}Operation failed - received 1 responses and 1 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 
127.0.0.2:7012{code}

"1" failure but 2 error messages. (number of {{received}} is irrelevant)

One way to solve these issues would be to pass a copy of the 
{{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.

If {{await()}} waited until all messages came back or timeout the 2 issues 
wouldn't be problematic I think.


was (Author: newkek):
Also, I suggested we wait for all responses because this is what the test 
expects, the test waits for the {{ReadFailureException}} to contain *both* 
failures without accounting for possible race conditions of the 
{{ReadCallback}} returning before all responses have come back.

Also noticed that in the current context, the number of {{failure}} and the 
list of failures can be inconsistent with each other, because we create the 
exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), 
received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. 
Concurrently, {{received}} can have been incremented but not the map of 
{{failureReasonByEndpoint}} and we could end up with:

{code}Operation failed - received 1 responses and 2 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code}

"2" failures but only 1 error message. (number of {{received}} is irrelevant)

Conversely, since the reference to the map is passed to 
{{ReadFailureException}} but {{received}} is passed by value, we can end up in 
the following scenario:

- 1 error triggers awaitResults() to unblock
- call to {{new ReadFailureException(received, failureReasonByEndpoint)}}
- value of 1 for {{received}} copied when calling {{ReadFailureException}} 
constructor but only reference to {{failureReasonByEndpoint}} given
- concurrently second failure calls to {{onFailure()}} which updates 
{{failures}} and the map
- {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has 
been updated to have 2 failures and constructs the error message

We end up with the error message:
{code}Operation failed - received 1 responses and 1 failures: 
INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 
127.0.0.2:7012{code}

"1" failure but 2 error messages. (number of {{received}} is irrelevant)

One way to solve this second issue would be to pass a copy of the 
{{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}.

If {{await()}} waited until all messages came back or timeout the 2 issues 
wouldn't be problematic I think.

> flaky test 
> org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement
> -----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15543
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15543
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest
>            Reporter: David Capwell
>            Assignee: Kevin Gallardo
>            Priority: Normal
>             Fix For: 4.0-alpha
>
>
> This fails infrequently, last seen failure was on java 8
> {code}
> junit.framework.AssertionFailedError
>       at 
> org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to