Sounds like a pretty bad bug / race condition. It would be great if you could open a JIRA ticket with this information, attach the operator logs, and maybe include a slightly smaller repro case (for example, one with 2 concurrent jobs, if the issue still comes up).
The next step will be to dive into the operator logic to try to figure out what's going on. It's hard to say how to mitigate it until we know what causes it.

Cheers,
Gyula

On Wed, Sep 10, 2025 at 5:25 PM Nikola Milutinovic <[email protected]> wrote:

> Hi all.
>
> It seems that we have problems when we try to create a group of Flink
> Session Jobs. The Operator first runs into some timeouts, and most jobs go
> into the RECONCILING state. They do eventually reach the RUNNING state,
> but when we inspect the Flink UI, we can see that some of the jobs are
> duplicates. The K8s Operator mixed them up.
>
> So, for instance, we would see:
>
> % kubectl get FlinkSessionJobs
> NAME                                    JOB STATUS   LIFECYCLE STATE
> cache-cdc-dynamic-ims-config-entities   RUNNING      STABLE
> cache-cdc-equipment                     RUNNING      STABLE
> cache-cdc-floc                          RUNNING      STABLE
> cache-cdc-maintenance-entities          RUNNING      STABLE
> cache-cdc-notification-entities         RUNNING      STABLE
> cache-cdc-schedule                      RUNNING      STABLE
> cache-cdc-static-entities               RUNNING      STABLE
> cache-cdc-task-list-entities            RUNNING      STABLE
> cache-cdc-work-order-entities           RUNNING      STABLE
>
> But listing the jobs shows duplicates.
>
> # flink list
> ------------------ Running/Restarting Jobs -------------------
> 10.09.2025 09:36:26 : d0f18a972db81a49763dfddc59eff21d : CACHE - CDC Static Entities (RUNNING)
> 10.09.2025 09:36:28 : b045747e9f83a5dfb99f306838d23346 : CACHE - CDC Equipment (RUNNING)
> 10.09.2025 09:36:42 : 57874e342a9386ad5c022d15da947508 : CACHE - CDC Notification entities (RUNNING)
> 10.09.2025 09:36:49 : 8dce7b2ffb701b619938ab4989f18aba : CACHE - CDC Maintenance entities (RUNNING)
> 10.09.2025 09:36:51 : bac0c76a4c48e207b065576f80750e8d : CACHE - CDC FLOC (RUNNING)
> 10.09.2025 09:37:21 : 4e065fa31fcd640c18b8d2f9d832a9ea : CACHE - CDC Maintenance entities (RUNNING)
> 10.09.2025 09:37:30 : 3ff775f7bda5b83d26b43ecfaddbf030 : CACHE - CDC Schedule (RUNNING)
> 10.09.2025 09:38:02 : cc7c25a0ed3c1dbf4ea28c9b31ae328d : CACHE - CDC Schedule (RUNNING)
> 10.09.2025 09:38:42 : 7ea9c0ff50499003e669e27d9d15767f : CACHE - CDC Work Order entities (RUNNING)
> 10.09.2025 09:39:07 : 119ef8358eebfbe7ff0fa7223ab2e22d : CACHE - CDC Work Order entities (RUNNING)
> 10.09.2025 09:39:17 : c00161bf832557f7afac66cd8d32b3ac : *CACHE - CDC Work Order entities (RUNNING)*
> --------------------------------------------------------------
> No scheduled jobs.
>
> So, the job whose name is bolded IS the correct one (the K8s resource and
> the Flink job match); the other two jobs are attached to the wrong K8s
> resources. For instance, FlinkSessionJob/cache-cdc-schedule references the
> correct parameters in its "spec" section, but its "status" section shows
> another job entirely. There is some sort of mixup.
>
> What can we do to at least mitigate this? It seems that creating the K8s
> resources one-by-one is OK. Can we set the operator config option
> "kubernetes.operator.reconcile.parallelism" to 1 to force one-by-one
> launching?
>
> Nikola.
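Regarding the last question: `kubernetes.operator.reconcile.parallelism` is a real operator option, and setting it to 1 should serialize reconciliation as a workaround, at the cost of much slower reconciles across all resources. A sketch of how it could be set through the operator Helm chart's values, assuming the chart's `defaultConfiguration` mechanism; please verify the exact keys against the docs for your operator version:

```yaml
# Helm values fragment for the flink-kubernetes-operator chart (sketch).
defaultConfiguration:
  create: true
  append: true
  flink-conf.yaml: |+
    # Reconcile one resource at a time as a workaround for the mixup.
    kubernetes.operator.reconcile.parallelism: 1
```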
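For monitoring while this is being investigated: duplicate submissions like the ones above can be spotted quickly by extracting the job names from the `flink list` output and keeping only names that occur more than once. A small sketch (the `dup_jobs` name is illustrative, and it assumes the three-column `date : jobid : name (RUNNING)` format shown in the quoted output):

```shell
# dup_jobs: read `flink list` output on stdin and print any job name
# that appears more than once among the RUNNING jobs.
dup_jobs() {
  awk -F' : ' '/\(RUNNING\)\*?$/ {
    sub(/ \(RUNNING\)\*?$/, "", $3)   # strip the trailing status marker
    sub(/^\*/, "", $3)                # strip a leading bold asterisk, if any
    print $3
  }' | sort | uniq -d                 # uniq -d keeps only duplicated names
}
# Usage (requires a Flink client pointed at the session cluster):
#   flink list | dup_jobs
```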
