Ilya Soin created FLINK-37589:
---------------------------------
Summary: Job submission via REST API is not thread safe
Key: FLINK-37589
URL: https://issues.apache.org/jira/browse/FLINK-37589
Project: Flink
Issue Type: Bug
Components: Client / Job Submission
Affects Versions: 1.20.1, 1.20.0
Reporter: Ilya Soin
Sometimes, when the Flink K8S Operator deploys more than one job in parallel,
some jobs are deployed two, three, or more times. For example, if 5 jobs are
deployed at the same time, instead of jobs 1,2,3,4,5 the cluster can end up
with jobs 1,1,3,4,5 or 1,2,2,3,5 or even 1,2,2,2,5, and so on. It happens all
the time with Python jobs and rarely with other types of jobs. The easiest way
to reproduce it is to deploy 2-3 Python jobs in parallel on a standalone Flink
cluster.
The issue is definitely not in the Operator, as has been discussed here and
here. I was able to fix it by introducing a synchronized lock in the
[JarRunHandler|https://github.com/apache/flink/blob/release-1.20/flink-runtime-web/src/main/java/org/apache/flink/runtime/webmonitor/handlers/JarRunHandler.java#L108]
like this:
{code:java}
() -> {
    synchronized (runLock) {
        return applicationRunner.run(gateway, program, effectiveConfiguration);
    }
},
{code}
I'm not sure this is the best solution to the problem and would welcome
pointers or discussion. I suspect that we see it mostly with Python jobs
because they take longer to deploy, which leaves more time for concurrent
submissions to "overlap".
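
To illustrate the overlap hypothesis outside of Flink, here is a minimal standalone sketch. Nothing in it is Flink code: SerializedRunDemo, fakeRun and maxOverlap are hypothetical names, and fakeRun only stands in for applicationRunner.run by sleeping and counting how many calls are in flight at once. It shows that wrapping the call in a shared synchronized block keeps the slow step from overlapping. It does not reproduce the duplicate-job behaviour itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class SerializedRunDemo {
    // Hypothetical shared lock, playing the role of runLock in the snippet above.
    private static final Object runLock = new Object();

    private static final AtomicInteger active = new AtomicInteger();
    private static final AtomicInteger maxActive = new AtomicInteger();

    // Stand-in for applicationRunner.run(...): tracks how many calls overlap.
    static String fakeRun(String jobName) {
        int now = active.incrementAndGet();
        maxActive.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(20); // simulate a slow submission
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            active.decrementAndGet();
        }
        return jobName;
    }

    // Submits `jobs` tasks in parallel and returns the highest number of
    // fakeRun calls that were in flight at the same moment.
    static int maxOverlap(int jobs, boolean serialize) {
        active.set(0);
        maxActive.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(jobs);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                String name = "job-" + i;
                Supplier<String> task = serialize
                        ? () -> {
                            synchronized (runLock) {
                                return fakeRun(name);
                            }
                        }
                        : () -> fakeRun(name);
                futures.add(CompletableFuture.supplyAsync(task, pool));
            }
            futures.forEach(CompletableFuture::join);
            return maxActive.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("max overlap with lock: " + maxOverlap(5, true));
        System.out.println("max overlap without lock: " + maxOverlap(5, false));
    }
}
```

With the lock, at most one fakeRun call is ever in flight. Without it, the calls will usually overlap, though the exact number depends on thread scheduling. The trade-off is that a single lock serializes every submission, so one slow job delays all the others queued behind it.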