[ 
https://issues.apache.org/jira/browse/CASSANDRA-11414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353032#comment-15353032
 ] 

Paulo Motta commented on CASSANDRA-11414:
-----------------------------------------

This test kills streaming at random points, which exposed several errors and 
race conditions that made the test fail. The basic idea here is to improve 
synchronization so these races no longer occur when a node is killed in the 
middle of a streaming session. To that end, I made the following improvements:

* 2.2+
** Add null protection on {{ConnectionHandler.signalCloseDone}}
** The stream session was not being failed on {{SocketException}}, which could 
cause it to hang on broken connections
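
The two 2.2+ changes can be illustrated with a minimal sketch. It is not the actual patch: {{SessionSketch}}, {{MessageStep}}, {{closeSignal}} and {{handle}} are hypothetical names invented for illustration, and only the general pattern (guard against a null handle, fail the session on a socket error) is taken from the description above.

```java
import java.io.IOException;
import java.net.SocketException;

/** Hypothetical stand-in for a stream session; not Cassandra's actual class. */
class SessionSketch {
    enum State { STREAMING, FAILED }

    /** Hypothetical unit of work that reads or writes one stream message. */
    interface MessageStep { void run() throws IOException; }

    private State state = State.STREAMING;
    // May still be null if the node is killed before setup has completed.
    private Runnable closeSignal;

    synchronized State state() { return state; }

    synchronized void setCloseSignal(Runnable signal) { closeSignal = signal; }

    /** Null-protected: signalling close before setup is a no-op, not an NPE. */
    synchronized void signalCloseDone() {
        if (closeSignal != null)
            closeSignal.run();
    }

    /** Marks the session failed and releases anything waiting on close. */
    synchronized void onError(IOException e) {
        state = State.FAILED;
        signalCloseDone();
    }

    /** Runs one step and fails the session on any I/O error. */
    void handle(MessageStep step) {
        try {
            step.run();
        } catch (SocketException e) {
            // Broken connection: fail the session instead of waiting for more data.
            onError(e);
        } catch (IOException e) {
            onError(e);
        }
    }
}
```

In this sketch a killed peer surfaces as a {{SocketException}} inside {{handle}}, which moves the session to {{FAILED}} rather than leaving it waiting on a dead connection.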

* 3.0+
** Synchronize access to transaction on {{StreamReceiveTask}}
** Abort {{SSTableWriter}} if received after {{StreamReceiveTask}} is finished
** Abort {{SSTableWriter}} if there's a failure during finalization on 
{{StreamReceiveTask}}
** Synchronize access to {{StreamSession}} methods: {{prepareReceiving}}, 
{{addTransferFiles}} and {{addTransferRanges}}, so they don't race with 
{{onError}}, since that will try to abort active tasks.
*** Throw an exception if any of these are executed after the stream session 
is finished (added tests on {{StreamReceiveTask}})
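
The 3.0+ locking pattern can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: {{WriterSketch}}, {{ReceiveTaskSketch}} and {{StreamSessionSketch}} are hypothetical stand-ins, and the use of {{IllegalStateException}} is an assumption (the comment above only says that an exception is thrown).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an SSTable writer; only tracks whether it was aborted. */
class WriterSketch {
    boolean aborted;
    void abort() { aborted = true; }
}

/** Hypothetical receive task; all access to its state goes through one lock. */
class ReceiveTaskSketch {
    private final List<WriterSketch> received = new ArrayList<>();
    private boolean done;

    /** Returns true if tracked; a writer arriving after the task finished is aborted. */
    synchronized boolean received(WriterSketch writer) {
        if (done) {
            writer.abort();
            return false;
        }
        received.add(writer);
        return true;
    }

    /** Aborts every writer still in flight and marks the task as finished. */
    synchronized void abort() {
        if (done)
            return;
        done = true;
        for (WriterSketch w : received)
            w.abort();
        received.clear();
    }
}

/** Hypothetical session whose setup methods cannot interleave with onError. */
class StreamSessionSketch {
    private final ReceiveTaskSketch task = new ReceiveTaskSketch();
    private boolean finished;

    synchronized ReceiveTaskSketch prepareReceiving() {
        failIfFinished();
        return task;
    }

    /** Holds the same lock as prepareReceiving, so the two never race. */
    synchronized void onError() {
        finished = true;
        task.abort();
    }

    // Assumption: the exception type is chosen for the sketch only.
    private void failIfFinished() {
        if (finished)
            throw new IllegalStateException("stream session is already finished");
    }
}
```

Because {{prepareReceiving}} and {{onError}} share the session lock, a setup call either completes before the error handler aborts the tasks or is rejected afterwards; it never observes a half-aborted session.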

After these were addressed, the number of failures went down from 28/100 
to 11/100 on this [multiplexer 
job|https://cassci.datastax.com/view/Parameterized/job/parameterized_dtest_multiplexer/149/].

The remaining failures are due to bad timing in the dtest, so I [updated the 
dtest|https://github.com/riptano/cassandra-dtest/pull/1051/commits/51ed5f55c85a3a1c339b265ac4b056137215e5fd]
 to address those and submitted a new multiplexer run (still queued).

Patch and tests available below:
||2.2||3.0||3.9||trunk||dtest||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-11414]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-11414]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.9...pauloricardomg:3.9-11414]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-11414]|[branch|https://github.com/riptano/cassandra-dtest/compare/master...pauloricardomg:11414]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11414-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11414-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.9-11414-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11414-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-11414-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-11414-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.9-11414-dtest/lastCompletedBuild/testReport/]|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-11414-dtest/lastCompletedBuild/testReport/]|

I will set this to Patch Available once the new multiplexer and CI runs look good.

> dtest failure in bootstrap_test.TestBootstrap.resumable_bootstrap_test
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11414
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11414
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Testing
>            Reporter: Philip Thompson
>            Assignee: Paulo Motta
>              Labels: dtest
>             Fix For: 3.x
>
>
> Stress is failing to read back all data. We can see this output from the 
> stress read
> {code}
> java.io.IOException: Operation x0 on key(s) [314c384f304f4c325030]: Data returned was not validated
>       at org.apache.cassandra.stress.Operation.error(Operation.java:138)
>       at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:116)
>       at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
>       at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
>       at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:261)
>       at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:327)
> java.io.IOException: Operation x0 on key(s) [33383438363931353131]: Data returned was not validated
>       at org.apache.cassandra.stress.Operation.error(Operation.java:138)
>       at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:116)
>       at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
>       at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
>       at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:261)
>       at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:327)
> FAILURE
> {code}
> Started happening with build 1075. Does not appear flaky on CI.
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1076/testReport/bootstrap_test/TestBootstrap/resumable_bootstrap_test
> Failed on CassCI build trunk_dtest #1076



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
