[
https://issues.apache.org/jira/browse/FLINK-38439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023967#comment-18023967
]
Mate Czagany commented on FLINK-38439:
--------------------------------------
I could not find any recent commits that could have caused this. I think it's a
race condition that's pretty hard to track down. In the test method, we fill the
thread pool up to its maximum size (500 threads), assert that no new operation
can be submitted in this state, then wait for the first task to finish and try
to submit another task, which should succeed.
This last step fails in rare cases, and I can occasionally reproduce it
locally. As the stack trace suggests, the internal `ThreadPoolExecutor` has no
free threads at that moment, which is why the submission fails. I think there's
a very small window between the task reporting that it's done via `notifyAll`
in `OperationManager` and the worker thread actually being freed up; if we try
to submit the operation inside that window, it fails. It's a pretty rare case,
and I don't think it's easy to reproduce outside a test environment.
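
To illustrate the suspected window, here is a minimal standalone sketch. It is
not the Flink test or the `OperationManager` code. It assumes the pool hands
tasks directly to workers through a `SynchronousQueue` (I have not verified
that this matches the gateway's executor), uses a pool of size 1 instead of
500, and replaces `notifyAll` with a `CountDownLatch`. The class name
`RaceWindowSketch` and the second latch, which artificially holds the window
open so the failure becomes deterministic, are made up for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class RaceWindowSketch {

    /**
     * Returns true if a submission made after the task has reported completion,
     * but before its worker thread is idle again, is rejected.
     */
    static boolean rejectedInsideWindow() {
        // Assumption: one worker and a SynchronousQueue, so a submission is
        // accepted only if a worker is idle and waiting for work.
        ThreadPoolExecutor pool =
                new ThreadPoolExecutor(
                        1, 1, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
        CountDownLatch reportedDone = new CountDownLatch(1);  // stands in for notifyAll
        CountDownLatch releaseWorker = new CountDownLatch(1); // hypothetical: widens the window

        try {
            pool.execute(
                    () -> {
                        // The task tells the waiting caller that it is done ...
                        reportedDone.countDown();
                        // ... but the worker thread is still busy. In the real
                        // race this gap is tiny; here it is held open on purpose.
                        try {
                            releaseWorker.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });

            // The caller wakes up as soon as completion is reported.
            reportedDone.await();

            try {
                // Submitting now hits a pool whose only thread is still occupied.
                pool.execute(() -> {});
                return false;
            } catch (RejectedExecutionException e) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            releaseWorker.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected inside window: " + rejectedInsideWindow());
    }
}
```

In the sketch the second submission is always rejected because the window is
held open until the `finally` block runs. In the actual test nothing holds it
open, which would explain why the failure shows up only rarely.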
> SqlGatewayServiceITCase failed in test_cron_azure table
> -------------------------------------------------------
>
> Key: FLINK-38439
> URL: https://issues.apache.org/jira/browse/FLINK-38439
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Ruan Hang
> Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69891&view=logs&j=a9db68b9-a7e0-54b6-0f98-010e0aff39e2&t=feeea8c8-fb6c-541e-ab85-af75c9efb8e4