[ https://issues.apache.org/jira/browse/CASSANDRA-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053647#comment-17053647 ]
Kevin Gallardo edited comment on CASSANDRA-15543 at 3/6/20, 7:04 PM: --------------------------------------------------------------------- Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve these issues would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think. was (Author: newkek): Also, I suggested we wait for all responses because this is what the test expects, the test waits for the {{ReadFailureException}} to contain *both* failures without accounting for possible race conditions of the {{ReadCallback}} returning before all responses have come back. Also noticed that in the current context, the number of {{failure}} and the list of failures can be inconsistent with each other, because we create the exception with {code}new ReadFailureException(replicaPlan().consistencyLevel(), received, blockFor, resolver.isDataPresent(), failureReasonByEndpoint){code}. Concurrently, {{received}} can have been incremented but not the map of {{failureReasonByEndpoint}} and we could end up with: {code}Operation failed - received 1 responses and 2 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012{code} "2" failures but only 1 error message. (number of {{received}} is irrelevant) Conversely, since the reference to the map is passed to {{ReadFailureException}} but {{received}} is passed by value, we can end up in the following scenario: - 1 error triggers awaitResults() to unblock - call to {{new ReadFailureException(received, failureReasonByEndpoint)}} - value of 1 for {{received}} copied when calling {{ReadFailureException}} constructor but only reference to {{failureReasonByEndpoint}} given - concurrently second failure calls to {{onFailure()}} which updates {{failures}} and the map - {{ReadFailureException}} has the {{1}} value for {{failures}} but the map has been updated to have 2 failures and constructs the error message We end up with the error message: {code}Operation failed - received 1 responses and 1 failures: INCOMPATIBLE_SCHEMA from 127.0.0.3:7012, INCOMPATIBLE_SCHEMA from 127.0.0.2:7012{code} "1" failure but 2 error messages. (number of {{received}} is irrelevant) One way to solve this second issue would be to pass a copy of the {{failureReasonByEndpoint}} to the constructor of {{ReadFailureException}}. If {{await()}} waited until all messages came back or timeout the 2 issues wouldn't be problematic I think. > flaky test > org.apache.cassandra.distributed.test.SimpleReadWriteTest.readWithSchemaDisagreement > ----------------------------------------------------------------------------------------------- > > Key: CASSANDRA-15543 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15543 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest > Reporter: David Capwell > Assignee: Kevin Gallardo > Priority: Normal > Fix For: 4.0-alpha > > > This fails infrequently, last seen failure was on java 8 > {code} > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.DistributedReadWritePathTest.readWithSchemaDisagreement(DistributedReadWritePathTest.java:276) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org