[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349431#comment-17349431 ] Ekaterina Dimitrova commented on CASSANDRA-16668: - Committed to 4.0 and propagated to trunk. Thank you all! > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349349#comment-17349349 ] Matt Fleming commented on CASSANDRA-16668: -- [~adelapena] your changes look good! Thanks for doing that. +1 > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349335#comment-17349335 ] Andres de la Peña commented on CASSANDRA-16668: --- [~e.dimitrova] I have the two patches, the suggestions and few additional minor warning fixes in [this branch|https://github.com/adelapena/cassandra/tree/16668-trunk-review], which is the one used for the CircleCI runs above. > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349305#comment-17349305 ] Ekaterina Dimitrova commented on CASSANDRA-16668: - +1 from me too, if it is fine with the others I can address on commit the nits from Andres later today. Thank you! > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348618#comment-17348618 ] Andres de la Peña commented on CASSANDRA-16668: --- +1 on both changes. {quote} On the {{shutdownTest} failures I agree it is unlikely related to the change. Perhaps 100 milliseconds is not long enough for the thread to notice it is being shutdown and exit for the join call. Threads should request a nano sleep for at most 10ms, but are not guaranteed to be awoken on a busy CI server and may exceed the test deadline. What do you think about an increase from 100 milliseconds up to 1 second to see if it improves things? {quote} Indeed [a wait of 1 second|https://github.com/adelapena/cassandra/commit/87351868a2b09b605fbd7931e4aba7ef26928ecb] seems to make {{shutdownTest}} not to fail anymore, at least not in these 20K runs: * [j8-j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/475/workflows/71011b7f-3883-4a73-89cf-552099ff5896/jobs/4200] * [j8-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/475/workflows/71011b7f-3883-4a73-89cf-552099ff5896/jobs/4201] * [j11-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/475/workflows/a20f69cd-ec0e-499b-ad7f-0d2e4a92a2fb/jobs/4208] I think we could include that here, as part of the fix. I'm leaving a few very minor suggestions in the commit for {{SEPExecutorTest}}. > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347196#comment-17347196 ] Jon Meredith commented on CASSANDRA-16668: -- +1 from me on 071516d29e41da9924af24e8002822d3c6af0e01 and b4f43608c9a8db23a622608804d95629616a66da . Thanks so much for fixing it and updating the tests. I think it fixes the issue you found, and I don't think it is possible to create an unbounded number of threads in the shared pool by frequently calling {{setMaximumPoolSize}} as it only calls schedule if there are no spinning threads, and tries to satisfy itself for work from the parked/spinning pools first. I'm slightly concerned about the fixed duration sleep() in the new test given how much variability there is in CI infrastructure timing. Ideally the test would verify that all of the SEPWorkers were parked (and none spinning), but there's not any straightforward way I can see to do that without refactoring. Worst case is the test doesn't always cover the all-SEPWorkers parked case every time, but seems unlikely to fail and create a new flaky test. On the {{shutdownTest} failures I agree it is unlikely related to the change. Perhaps 100 milliseconds is not long enough for the thread to notice it is being shutdown and exit for the join call. Threads should request a nano sleep for at most 10ms, but are not guaranteed to be awoken on a busy CI server and may exceed the test deadline. What do you think about an increase from 100 milliseconds up to 1 second to see if it improves things? > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346892#comment-17346892 ] Matt Fleming commented on CASSANDRA-16668: -- I think that failure is probably unrelated because we saw similar failures for shutdownTest before this patch here [https://app.circleci.com/pipelines/github/adelapena/cassandra/441/workflows/bcf154ff-0b56-48ed-9f82-6b3e395f53ed/jobs/3880/tests#failed-test-0] Btw, I've also written a new unit test to catch this bug in the future: [https://github.com/mfleming/cassandra/commit/b4f43608c9a8db23a622608804d95629616a66da] > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346130#comment-17346130 ] Andres de la Peña commented on CASSANDRA-16668: --- Here are 10K runs of {{SEPExecutorTest}} with the patch, using the [CircleCI multiplexer|https://github.com/apache/cassandra/blob/trunk/doc/source/development/testing.rst#circleci]: * [j8-j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/457/workflows/cb5b1d27-75d4-4b3a-814c-04454fb4f4ef/jobs/4017] * [j8-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/457/workflows/cb5b1d27-75d4-4b3a-814c-04454fb4f4ef/jobs/4015] * [j11-j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/457/workflows/1b82f571-9dd8-4a98-892b-1c0e3704f2d9/jobs/4013] It seems that {{changingMaxWorkersMeetsConcurrencyGoalsTest}} happily survives the three 10K runs, but there are some uncommon failures on {{shutdownTest}}: * [j8-j8 runner 62 iteration 49|https://4017-85817267-gh.circle-artifacts.com/62/stdout/fails/049/testsome-org.apache.cassandra.concurrent.SEPExecutorTest.txt] * [j8-j11 runner 73 iteration 9|https://4015-85817267-gh.circle-artifacts.com/73/stdout/fails/009/testsome-org.apache.cassandra.concurrent.SEPExecutorTest.txt] * [j11-j11 runner 68 iteration 51|https://4013-85817267-gh.circle-artifacts.com/68/stdout/fails/051/testsome-org.apache.cassandra.concurrent.SEPExecutorTest.txt] Not sure whether that is related or an independent failure. > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16668) Intermittent failure of SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race condition when shrinking maximum pool size to zero
[ https://issues.apache.org/jira/browse/CASSANDRA-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344146#comment-17344146 ] Ekaterina Dimitrova commented on CASSANDRA-16668: - Jenkins [CI run | https://jenkins-cm4.apache.org/job/Cassandra-devbranch/776/] > Intermittent failure of > SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest caused by race > condition when shrinking maximum pool size to zero > - > > Key: CASSANDRA-16668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16668 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Matt Fleming >Assignee: Matt Fleming >Priority: Normal > Fix For: 4.0-rc > > > A difficult-to-hit race condition exists in > changingMaxWorkersMeetsConcurrencyGoalsTest when changing the maximum pool > size from 0 -> 4 which results in the test failing like so: > {{junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but > was:junit.framework.AssertionFailedError: Test tasks did not hit max > concurrency goal expected: but was: at > org.apache.cassandra.concurrent.SEPExecutorTest.assertMaxTaskConcurrency(SEPExecutorTest.java:198) > at > org.apache.cassandra.concurrent.SEPExecutorTest.changingMaxWorkersMeetsConcurrencyGoalsTest(SEPExecutorTest.java:132)}} > I can hit this issue maybe 2/3 times for every 100 invocations of the unit > test. > The issue that causes the failure is that if tasks are still enqueued when > the maximum pool size is set to zero and if all of the SEPWorker threads > enter the STOP state before the pool size is bumped to 4, then no SEPWorker > threads will be spun up to service the task queue. This causes the above > error. > Why don't we spin up SEPWorker threads when enqueing tasks? Because of the > guard logic in addTask: > [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/concurrent/SEPExecutor.java#L113,L121] > In this scenario taskPermits will not be zero (because we have tasks on the > queue) so we never call {{maybeStartSpinningWorker()}}. > A trick to make this issue much easier to hit is to insert a > {{Thread.sleep(500)}} immediately after setting the pool size to zero. This > has the effect of guaranteeing that all SEPWorker threads will be STOP'd > before enqueueing more work. > Here's a fix that attempts to spin up an SEPWorker whenever we grow the > number of work permits: > https://github.com/mfleming/cassandra/commit/071516d29e41da9924af24e8002822d3c6af0e01 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org