[ 
https://issues.apache.org/jira/browse/CASSANDRA-15338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17055533#comment-17055533
 ] 

Yifan Cai edited comment on CASSANDRA-15338 at 3/10/20, 2:55 AM:
-----------------------------------------------------------------

The test failure was not able to be reproduced when simply running it from my 
laptop. 
 
However, it can be easily reproduced when running in a docker container with 
limited CPUs (i.e., 2). 
 
After multiple runs, the observation was that the test runs only failed when 
testing with LargeMessage. It indicated that the failures were probably related 
with {{LargeMessageDelivery}}. 
 
The following is what I think have happened. 
# When the {{inbound}} just opened and the first message gets queued into the 
{{outbound}}, handshake happens and the execution was deferred once the 
connection was established (executeAgain). 
# Since enqueue is not blocking, the next line, {{unsafeRunOnDelivery}} runs 
immediately. The effect is that the runnable gets registered, but not run yet. 
# Connection is established, so we {{executeAgain()}}. Because the runnable 
{{stopAndRun}} is present, and at this point, the {{inProgress}} flag is still 
false. The test runs the runnable, which counts down {{deliveryDone}} 
unexpectedly. 
# Delivery proceeds to flush the message. In {{LargeMessageDelivery}}, the 
flush is async and race condition can happen.
   ## when the inbound has received message (and countdown receiveDone)
   ## {{LargeMessageDelivery}} is still polling for the completion of flush, so 
not yet release capacity. 

Therefore, the assertion on the pendingCount failed. 
 
There are 2 places in the test flow are (or can go) wrong. See step 3 and step 
4. 

Regarding step 3, the runnable {{stopAndRun}} should not be registered when 
establishing the connection. In production, is there a case that a 
{{stopAndRun}} being registered this early? Probably not.

Regarding step 4, the {{outbound}} has no knowledge about whether the 
{{inbound}} has received any message. Test should register the runnable 
{{stopAndRun}} at the message handler to count down the {{deliveryDone}}. 
Therefore, the runnable can correctly wait for the current delivery to 
complete. Then it runs. 
 
PR is here: https://github.com/apache/cassandra/pull/466

As mentioned, I reproduced using the docker. Here is the bundle that one can 
simply download and run.  [^CASS-15338-Docker.zip] It runs {{ConnectionTest}} 
repeatedly until failures.
I have included the patch within the image too. 

To reproduce, run
{code:bash}
bash build_and_run.sh
{code}

To see the runs with the patch, run
{code:bash}
bash build_and_run.sh patched
{code}


was (Author: yifanc):
The test failure was not able to reproduce when simply running it from my 
laptop. 
 
However, it can be easily reproduced when running in a docker container with 
limited CPUs (i.e., 2). 
 
After multiple runs, the observation was that the test runs only failed when 
testing with LargeMessage. It indicated that the failures were probably related 
with {{LargeMessageDelivery}}. 
 
The following is what I think have happened. 
# When the {{inbound}} just opened and the first message gets queued into the 
{{outbound}}, handshake happens and the execution was deferred once the 
connection was established (executeAgain). 
# Since enqueue is not blocking, the next line, {{unsafeRunOnDelivery}} runs 
immediately. The effect is that the runnable gets registered, but not run yet. 
# Connection is established, so we {{executeAgain()}}. Because the runnable 
{{stopAndRun}} is present, and at this point, the {{inProgress}} flag is still 
false. The test runs the runnable, which counts down {{deliveryDone}} 
unexpectedly. 
# Delivery proceeds to flush the message. In {{LargeMessageDelivery}}, the 
flush is async and race condition can happen.
   ## when the inbound has received message (and countdown receiveDone)
   ## {{LargeMessageDelivery}} is still polling for the completion of flush, so 
not yet release capacity. 

Therefore, the assertion on the pendingCount failed. 
 
There are 2 places in the test flow are (or can go) wrong. See step 3 and step 
4. 

Regarding step 3, the runnable {{stopAndRun}} should not be registered when 
establishing the connection. In production, is there a case that a 
{{stopAndRun}} being registered this early? Probably not.

Regarding step 4, the {{outbound}} has no knowledge about whether the 
{{inbound}} has received any message. Test should register the runnable 
{{stopAndRun}} at the message handler to count down the {{deliveryDone}}. 
Therefore, the runnable can correctly wait for the current delivery to 
complete. Then it runs. 
 
PR is here: https://github.com/apache/cassandra/pull/466

As mentioned, I reproduced using the docker. Here is the bundle that one can 
simply download and run.  [^CASS-15338-Docker.zip] It runs {{ConnectionTest}} 
repeatedly until failures.
I have included the patch within the image too. 

To reproduce, run
{code:bash}
bash build_and_run.sh
{code}

To see the runs with the patch, run
{code:bash}
bash build_and_run.sh patched
{code}

> Fix flakey testMessagePurging - org.apache.cassandra.net.ConnectionTest
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-15338
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15338
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: David Capwell
>            Assignee: Yifan Cai
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0-alpha
>
>         Attachments: CASS-15338-Docker.zip
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Example failure: 
> [https://circleci.com/gh/dcapwell/cassandra/11#artifacts/containers/1]
>   
> {code:java}
> Testcase: testMessagePurging(org.apache.cassandra.net.ConnectionTest):  FAILED
>  expected:<0> but was:<1>
>  junit.framework.AssertionFailedError: expected:<0> but was:<1>
>    at 
> org.apache.cassandra.net.ConnectionTest.lambda$testMessagePurging$38(ConnectionTest.java:625)
>    at 
> org.apache.cassandra.net.ConnectionTest.doTestManual(ConnectionTest.java:258)
>    at 
> org.apache.cassandra.net.ConnectionTest.testManual(ConnectionTest.java:231)
>    at 
> org.apache.cassandra.net.ConnectionTest.testMessagePurging(ConnectionTest.java:584){code}
>   
>  Looking closer at 
> org.apache.cassandra.net.OutboundConnection.Delivery#stopAndRun it seems that 
> the run method is called before 
> org.apache.cassandra.net.OutboundConnection.Delivery#doRun which may lead to 
> a test race condition where the CountDownLatch completes before executing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to