[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero

Jira Thu, 20 May 2021 09:39:08 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348618#comment-17348618
 ]


Andres de la Peña commented on CASSANDRA-16668:
-----------------------------------------------

+1 on both changes.
{quote}
On the {{shutdownTest} failures I agree it is unlikely related to the change. 
Perhaps 100 milliseconds is not long enough for the thread to notice it is 
being shutdown and exit for the join call. Threads should request a nano sleep 
for at most 10ms, but are not guaranteed to be awoken on a busy CI server and 
may exceed the test deadline. What do you think about an increase from 100 
milliseconds up to 1 second to see if it improves things?
{quote}
Indeed [a wait of 1 
second|https://github.com/adelapena/cassandra/commit/87351868a2b09b605fbd7931e4aba7ef26928ecb]
 seems to make {{shutdownTest}} not to fail anymore, at least not in these 20K 
runs:
* 
[j8-j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/475/workflows/71011b7f-3883-4a73-89cf-552099ff5896/jobs/4200]
* 
[j8-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/475/workflows/71011b7f-3883-4a73-89cf-552099ff5896/jobs/4201]
* 
[j11-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/475/workflows/a20f69cd-ec0e-499b-ad7f-0d2e4a92a2fb/jobs/4208]

I think we could include that here, as part of the fix.

I'm leaving a few very minor suggestions in the commit for {{SEPExecutorTest}}.

> Intermittent failure of 
> SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race 
> condition when shrinking maximum pool size to zero
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16668
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16668
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Other
>            Reporter: Matt Fleming
>            Assignee: Matt Fleming
>            Priority: Normal
>             Fix For: 4.0-rc
>
>
> A difficult-to-hit race condition exists in 
> changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool 
> size from 0 -> 4 which results in the test failing like so:
> {{junit.framework.AssertionFailedError: Test tasks did not hit max 
> concurrency goal expected:<true> but 
> was:<false>junit.framework.AssertionFailedError: Test tasks did not hit max 
> concurrency goal expected:<true> but was:<false> at 
> org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198)
>  at 
> org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}}
> I can hit this issue maybe 2/3 times for every 100 invocations of the unit 
> test.
> The issue that causes the failure is that if tasks are still enqueued when 
> the maximum pool size is set to zero and if all of the SEPWorker threads 
> enter the STOP state before the pool size is bumped to 4, then no SEPWorker 
> threads will be spun up to service the task queue. This causes the above 
> error.
> Why don't we spin up SEPWorker threads when enqueing tasks? Because of the 
> guard logic in addTask: 
> [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121]
> In this scenario taskPermits will not be zero (because we have tasks on the 
> queue) so we never call {{maybeStartSpinningWorker()}}.
> A trick to make this issue much easier to hit is to insert a 
> {{Thread.sleep(500)}} immediately after setting the pool size to zero. This 
> has the effect of guaranteeing that all SEPWorker threads will be STOP'd 
> before enqueueing more work.
> Here's a fix that attempts to spin up an SEPWorker whenever we grow the 
> number of work permits: 
> https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero

Reply via email to