[ 
https://issues.apache.org/jira/browse/FLINK-38439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18023967#comment-18023967
 ] 

Mate Czagany commented on FLINK-38439:
--------------------------------------

I could not find any recent commit that would cause this. I think it's a race
condition that's pretty hard to track down. In the test method, we fill the
thread pool up to its maximum size (500 threads), assert that no new operation
can be submitted in this state, then wait for the first task to finish and try
to submit another task, which should succeed.

This last step fails in rare cases, but I can occasionally reproduce it
locally. As the stack trace suggests, it fails because the internal
`ThreadPoolExecutor` has no free threads. I think there's a very small window
between the task reporting that it's done via `notifyAll` in
`OperationManager` and its thread actually being freed up, and the test
sometimes submits the new operation within that window. It's a pretty rare
case, and I don't think it's easy to reproduce outside a test environment.
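
To make the suspected gap easier to see, here is a minimal standalone sketch,
not the actual Flink test. It assumes the pool hands tasks directly to its
threads without queueing (modelled here with a `SynchronousQueue`; I have not
confirmed that this matches the gateway's configuration), shrinks the pool to
a single thread, uses a `CountDownLatch` in place of the `notifyAll` signal,
and adds an artificial 200 ms delay after the signal to widen the gap. The
class and method names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolRaceSketch {

    // Hypothetical stand-in for the gateway's pool: one thread, no task queueing.
    static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
    }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs a task that signals "done" and then lingers, so the worker thread
    // stays busy for a while after the signal. Returns once the signal is seen.
    static void occupyUntilSignalled(ThreadPoolExecutor pool) {
        CountDownLatch done = new CountDownLatch(1);
        pool.execute(() -> {
            done.countDown(); // the task reports completion here...
            sleep(200);       // ...but its thread is not back in the pool yet
        });
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Returns true if the pool accepted a no-op task, false if it rejected it.
    static boolean trySubmit(ThreadPoolExecutor pool) {
        try {
            pool.execute(() -> {});
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Submitting right after the signal: the only worker is still busy.
    static boolean immediateSubmitIsRejected() {
        ThreadPoolExecutor pool = newPool();
        try {
            occupyUntilSignalled(pool);
            return !trySubmit(pool);
        } finally {
            pool.shutdownNow();
        }
    }

    // Retrying for a bounded time: the worker eventually becomes idle again.
    static boolean retriedSubmitIsAccepted() {
        ThreadPoolExecutor pool = newPool();
        try {
            occupyUntilSignalled(pool);
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (System.nanoTime() < deadline) {
                if (trySubmit(pool)) {
                    return true;
                }
                sleep(10);
            }
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected right after signal: " + immediateSubmitIsRejected());
        System.out.println("accepted after retrying:     " + retriedSubmitIsAccepted());
    }
}
```

In the real test the gap would be far shorter than 200 ms, which would fit
with how rarely the failure shows up. The retry loop is only there to show
that the thread does become available shortly afterwards; it is not a
proposed fix.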

> SqlGatewayServiceITCase failed in test_cron_azure table
> -------------------------------------------------------
>
>                 Key: FLINK-38439
>                 URL: https://issues.apache.org/jira/browse/FLINK-38439
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.0
>            Reporter: Ruan Hang
>            Priority: Major
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=69891&view=logs&j=a9db68b9-a7e0-54b6-0f98-010e0aff39e2&t=feeea8c8-fb6c-541e-ab85-af75c9efb8e4



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
