[jira] [Commented] (CASSANDRA-10938) test_bulk_round_trip_blogposts is failing occasionally

Stefania (JIRA) Thu, 21 Jan 2016 20:59:52 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-10938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15111916#comment-15111916
 ]


Stefania commented on CASSANDRA-10938:
--------------------------------------

I've looked more closely at the failures on Jenkins since CASSANDRA-9303 was 
committed and noted that:

* They happen less often, and mostly on 2.1.
* The problem is only with COPY TO, not COPY FROM, so reducing the ingest rate 
is not an option.

I've set up an AWS box with the same specs as the ones used by Jenkins 
(m3.2xlarge) and run {{test_bulk_round_trip_blogposts}} 50 times with no 
failures. There must be something else on the Jenkins boxes that causes 
connections to be rejected, but I could not work out what it is.

So I decided to simulate a failed connection by setting 
{{native_transport_max_concurrent_connections}} to limit the number of 
connections accepted by hosts. It doesn't tell us what's happening on Jenkins 
but at least it allows us to test COPY TO in the face of failed connections, 
which is a good thing anyway and should hopefully ensure that the Jenkins 
failures disappear. Note that just stopping replicas would not have easily 
allowed testing this because the code selects only replicas that are up. I've 
also increased the replication factor from 1 to 3 and the nodes from 3 to 5 for 
{{test_bulk_round_trip_blogposts}} to give it more resilience.
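
As a rough sketch of how a test could apply that limit (not the actual dtest 
code): the helper below assumes a ccm-style cluster object that exposes 
{{set_configuration_options(values=...)}}. The names 
{{limit_native_connections}} and {{RecordingCluster}} are hypothetical, and the 
value 12 is arbitrary.

```python
def limit_native_connections(cluster, max_connections):
    """Cap the client connections each node accepts so that some get refused.

    `cluster` is assumed to offer set_configuration_options(values=...),
    as ccm clusters do; this helper itself is hypothetical.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    options = {'native_transport_max_concurrent_connections': max_connections}
    cluster.set_configuration_options(values=options)
    return options


class RecordingCluster(object):
    """Stand-in for a real cluster: it only records the options it is given."""

    def __init__(self):
        self.options = {}

    def set_configuration_options(self, values=None):
        self.options.update(values or {})


cluster = RecordingCluster()
limit_native_connections(cluster, 12)  # 12 is an arbitrary illustrative limit
print(cluster.options)
```

In a real test the limit would need to be low enough, relative to the number of 
COPY worker processes, that some connection attempts are actually refused.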

I've changed the COPY TO connection logic to try multiple replicas one by one 
in case of failure - previously we were giving multiple replicas to the load 
balancing policy but the contact point was only the chosen replica. More 
importantly, if all replicas fail, instead of killing the worker process - 
which would halt the entire export - we return an error for that token - which 
means that the token is tried again later for up to MAXATTEMPTS times.
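
The control flow described above can be sketched roughly as follows. This is a 
simplified, single-process illustration and not the code in the patch: 
{{export_range}}, {{export_all}}, {{make_flaky_connect}} and 
{{ConnectionRefused}} are hypothetical names, {{MAX_ATTEMPTS}} stands in for 
MAXATTEMPTS with an assumed value, and the fake {{connect}} function replaces 
real driver connections.

```python
from collections import deque

MAX_ATTEMPTS = 5  # stands in for MAXATTEMPTS; the value here is an assumption


class ConnectionRefused(Exception):
    """Raised by the fake connect() below when a host rejects a connection."""


def export_range(token_range, replicas, connect):
    """Try each replica in turn and return (rows, None) or (None, error)."""
    errors = []
    for host in replicas:
        try:
            fetch = connect(host)
        except ConnectionRefused as exc:
            errors.append("%s: %s" % (host, exc))
            continue  # fall through to the next replica
        return fetch(token_range), None
    # Every replica failed: report an error instead of killing the worker.
    return None, "all replicas failed for %r (%s)" % (token_range, "; ".join(errors))


def export_all(ranges, connect, max_attempts=MAX_ATTEMPTS):
    """`ranges` maps token_range -> replicas; failed ranges are retried later."""
    pending = deque((token_range, 1) for token_range in ranges)
    rows, failed = [], {}
    while pending:
        token_range, attempt = pending.popleft()
        result, error = export_range(token_range, ranges[token_range], connect)
        if error is None:
            rows.extend(result)
        elif attempt < max_attempts:
            pending.append((token_range, attempt + 1))  # try again later
        else:
            failed[token_range] = error
    return rows, failed


def make_flaky_connect(refusals):
    """Fake connect(): `refusals` maps host -> number of attempts to refuse."""
    remaining = dict(refusals)

    def connect(host):
        if remaining.get(host, 0) > 0:
            remaining[host] -= 1
            raise ConnectionRefused("connection refused")
        return lambda token_range: ["row from %s for %r" % (host, token_range)]

    return connect


ranges = {(0, 100): ['n1', 'n2', 'n3'], (100, 200): ['n2', 'n3', 'n4']}
rows, failed = export_all(ranges, make_flaky_connect({'n1': 1, 'n2': 2, 'n3': 1}))
print(rows, failed)
```

Returning an error for a range, rather than raising, is what lets the caller 
re-queue that range while the rest of the export carries on.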

New test code is 
[here|https://github.com/stef1927/cassandra-dtest/commits/10938].

The [2.1 patch|https://github.com/stef1927/cassandra/commits/10938-2.1] is its 
own patch. The [2.2 patch|https://github.com/stef1927/cassandra/commits/10938-2.2] 
is identical to the 2.1 patch except for a conflict in the imports, and it 
applies cleanly upwards.

CI is still pending:

||2.1||2.2||3.0||3.3||trunk||
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.2-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.0-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.3-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-2.2-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.0-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-3.3-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10938-dtest/]|

[~pauloricardomg] could you review the Python changes? Sylvain has already 
noted above that the change from NBHM to CHM is fine.

> test_bulk_round_trip_blogposts is failing occasionally
> ------------------------------------------------------
>
>                 Key: CASSANDRA-10938
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10938
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Tools
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
>
>         Attachments: 6452.nps, 6452.png, 7300.nps, 7300a.png, 7300b.png, 
> node1_debug.log, node2_debug.log, node3_debug.log, recording_127.0.0.1.jfr
>
>
> We get timeouts occasionally that cause the number of records to be incorrect:
> http://cassci.datastax.com/job/trunk_dtest/858/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
