[ https://issues.apache.org/jira/browse/CASSANDRA-14078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272271#comment-16272271 ]

Jaydeepkumar Chovatia commented on CASSANDRA-14078:
---------------------------------------------------

Hi [~KurtG]

Thanks for the review!

{quote}
My understanding was that the max # connections was configured so that {{COPY 
TO}} would always exceed the max and fail-over.
{quote}
Yes, it is designed to fail over to a peer node, but the configuration 
{{'native_transport_max_concurrent_connections': '12'}} applies to all the 
nodes in the cluster, not just a few. So the client tries to fail over to a 
peer, finds that the peer is also busy, and as a result {{COPY TO}} times out 
on the client side and the test fails.
In this [run|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest/353/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts_with_max_connections/] 
we can see that the {{COPY FROM}} command tries to fail over to a peer node, 
but all the nodes have exhausted their connections, so it cannot fail over; it 
sleeps and retries several times and finally gives up.

{code}
All replicas busy, sleeping for 4 second(s)...
...
All replicas busy, sleeping for 1 second(s)...
...
All replicas busy, sleeping for 23 second(s)...
Replicas too busy, given up
{code}
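To make the failure mode concrete, here is a minimal Python simulation. It is not the actual cqlsh COPY implementation: {{Node}}, {{try_connect}} and {{send_with_failover}} are hypothetical names, and the backoff policy is injected and arbitrary rather than taken from cqlsh. It only models the point above: when every node enforces the same connection cap, failing over to a peer cannot help once all peers are saturated, so the client ends up sleeping, retrying and giving up, as in the log.

{code:python}
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical node that refuses new client connections past a fixed cap
    (a simplified stand-in for native_transport_max_concurrent_connections)."""
    name: str
    max_connections: int
    open_connections: int = 0

    def try_connect(self) -> bool:
        # Reject the connection once this node's cap has been reached.
        if self.open_connections >= self.max_connections:
            return False
        self.open_connections += 1
        return True


def send_with_failover(
    replicas: List[Node],
    max_attempts: int,
    backoff: Callable[[int], int],
    log: List[str],
) -> Optional[Node]:
    """Try each replica in turn; if all are at their cap, record a back-off and retry.

    backoff(attempt) returns how many seconds the client would sleep. Nothing
    actually sleeps here, which keeps the sketch deterministic.
    Returns the node that accepted the connection, or None after giving up.
    """
    for attempt in range(max_attempts):
        for node in replicas:
            if node.try_connect():
                return node  # fail-over succeeded: some replica had spare capacity
        log.append(f"All replicas busy, sleeping for {backoff(attempt)} second(s)...")
    log.append("Replicas too busy, given up")
    return None


# The same cap of 12 applies to every node, and every node is already saturated.
cluster = [Node(f"node{i}", max_connections=12, open_connections=12) for i in (1, 2, 3)]
messages: List[str] = []
print(send_with_failover(cluster, max_attempts=3, backoff=lambda a: 2 ** a, log=messages))  # None
print("\n".join(messages))

# With a single free slot on one peer, fail-over works on the first attempt.
cluster[2].open_connections = 11
print(send_with_failover(cluster, max_attempts=3, backoff=lambda a: 2 ** a, log=[]).name)  # node3
{code}

Raising the cap on one node only would let fail-over succeed; raising it nowhere means every retry hits the same wall.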

{quote}
We've had no record of this particular failure before (at least in JIRA), seems 
like it could actually be something that needs fixing.
{quote}
[This run|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest/353/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts_with_max_connections/] 
on {{trunk}} shows a 7/30 failure rate (about 23%).

{quote}
To some degree all tests kind of depend on what hardware you run, this is the 
nature of the beast with C* and dtests. Can you elaborate on why it's not 
deterministic?
{quote}
I agree that many of the dtests depend on the hardware we run them on, but this 
particular test is both hardware-dependent and *timing-related*. If threads are 
less busy, the test may not need more connections and will pass; but if threads 
are busier, connections pile up and the result is timeouts, etc.
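A back-of-the-envelope way to see why this is timing-related is Little's law: the average number of requests in flight equals the request rate multiplied by the average time each request spends in the system. The sketch below assumes, as a simplification, one open connection per in-flight request (real drivers can multiplex several requests over one connection, so this overstates demand; it only illustrates the scaling). The helper names and the example numbers (3 nodes, 200 requests/s) are illustrative, not taken from the test.

{code:python}
import math


def connections_needed(requests_per_second: float, avg_latency_s: float) -> int:
    """Little's law (L = lambda * W): average requests in flight = arrival rate * time in system.
    Simplifying assumption: each in-flight request holds one connection."""
    return math.ceil(requests_per_second * avg_latency_s)


def fits_under_cap(requests_per_second: float, avg_latency_s: float,
                   nodes: int, per_node_cap: int) -> bool:
    # The cap applies to every node, so the best case is a cluster-wide budget of nodes * cap.
    return connections_needed(requests_per_second, avg_latency_s) <= nodes * per_node_cap


# Illustrative numbers: 3 nodes, each capped at 12 connections (budget of 36).
print(connections_needed(200, 0.125), fits_under_cap(200, 0.125, nodes=3, per_node_cap=12))  # 25 True
print(connections_needed(200, 0.25), fits_under_cap(200, 0.25, nodes=3, per_node_cap=12))    # 50 False
{code}

Doubling the latency (busier threads, a slower host) doubles the connections needed for the same workload, which is how the same configuration can pass on one run and fail on the next.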

We could try tweaking {{INGESTRATE}}, 
{{'native_transport_max_concurrent_connections': '<>'}}, etc. to make this 
test work, but in my opinion it would be difficult to find an ideal tuning 
that always works.
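For reference, the kind of tuning mentioned above would look roughly like the hypothetical sweep below: each combination pairs a cassandra.yaml override (in the same form as the {{'native_transport_max_concurrent_connections': '12'}} setting quoted earlier) with a cqlsh {{COPY FROM}} statement throttled by {{INGESTRATE}} (rows per second). The helper names, table name, file path and the values tried are all placeholders. Every combination would still need to be validated on each CI host, which is the difficulty described above.

{code:python}
from itertools import product
from typing import Dict, Iterable, Iterator, Tuple


def copy_from_statement(table: str, csv_path: str, ingest_rate: int) -> str:
    """Build a cqlsh COPY FROM statement throttled via INGESTRATE (rows per second)."""
    return f"COPY {table} FROM '{csv_path}' WITH INGESTRATE = {ingest_rate}"


def tuning_grid(ingest_rates: Iterable[int],
                connection_caps: Iterable[int]) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (yaml overrides, COPY statement) pairs for every combination.
    The table and CSV path below are hypothetical placeholders."""
    for rate, cap in product(ingest_rates, connection_caps):
        overrides = {"native_transport_max_concurrent_connections": str(cap)}
        yield overrides, copy_from_statement("ks.tbl", "/tmp/rows.csv", rate)


for overrides, stmt in tuning_grid([500, 1000], [12, 24]):
    print(overrides, stmt)
{code}

The grid grows multiplicatively with each knob, and a combination that passes on one machine says little about a slower or busier one.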

Jaydeep

> Fix dTest test_bulk_round_trip_blogposts_with_max_connections
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-14078
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14078
>             Project: Cassandra
>          Issue Type: Test
>          Components: Testing
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Minor
>
> This ticket is regarding the following dTest: 
> {{cqlsh_tests.cqlsh_copy_tests.CqlshCopyTest.test_bulk_round_trip_blogposts_with_max_connections}}
> This test limits the number of client connections and assumes that once the 
> connection limit has been reached, the client will fail over to another node 
> and perform the request there. The problem is that this is not a deterministic 
> test case, as it depends entirely on the hardware you run on, timing, etc.
> For example, if we look at 
> https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-trunk-dtest/353/testReport/cqlsh_tests.cqlsh_copy_tests/CqlshCopyTest/test_bulk_round_trip_blogposts_with_max_connections/
> {quote}
> ...
> Processed: 5000 rows; Rate:    2551 rows/s; Avg. rate:    2551 rows/s
> All replicas busy, sleeping for 4 second(s)...
> Processed: 10000 rows; Rate:    2328 rows/s; Avg. rate:    2307 rows/s
> All replicas busy, sleeping for 1 second(s)...
> Processed: 15000 rows; Rate:    2137 rows/s; Avg. rate:    2173 rows/s
> All replicas busy, sleeping for 11 second(s)...
> Processed: 20000 rows; Rate:    2138 rows/s; Avg. rate:    2164 rows/s
> Processed: 25000 rows; Rate:    2403 rows/s; Avg. rate:    2249 rows/s
> Processed: 30000 rows; Rate:    2582 rows/s; Avg. rate:    2321 rows/s
> Processed: 35000 rows; Rate:    2835 rows/s; Avg. rate:    2406 rows/s
> Processed: 40000 rows; Rate:    2867 rows/s; Avg. rate:    2458 rows/s
> Processed: 45000 rows; Rate:    3163 rows/s; Avg. rate:    2540 rows/s
> Processed: 50000 rows; Rate:    3200 rows/s; Avg. rate:    2596 rows/s
> Processed: 50234 rows; Rate:    2032 rows/s; Avg. rate:    2572 rows/s
> All replicas busy, sleeping for 23 second(s)...
> Replicas too busy, given up
> ...
> {quote}
> Here we can see the request timing out: sometimes it resumes after 1 second, 
> the next time after 11 seconds, and sometimes it doesn't work at all. 
> In my opinion this test is not a good fit for dTest, as dTests should be 
> deterministic.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
