Sounds like a pretty bad bug / race condition. It would be great if you could open a JIRA ticket with this information, attach the operator logs, and maybe include a slightly smaller repro case (for example, one with 2 concurrent jobs, if the issue still comes up).
The next step will be to dive into the operator logic to try to figure out what's going on. It's hard to say how to mitigate it until we know what causes it.

Cheers,
Gyula

On Wed, Sep 10, 2025 at 5:25 PM Nikola Milutinovic <[email protected]> wrote:

> Hi all.
>
> It seems that we have problems when we try to create a group of Flink
> Session Jobs. The Operator first runs into some timeouts, and most jobs go
> into the RECONCILING state. They do eventually reach the RUNNING state,
> but when we inspect the Flink UI, we can see that some of the jobs are
> duplicates. The K8s Operator mixed them up.
>
> So, for instance, we would see:
>
> % kubectl get FlinkSessionJobs
> NAME                                    JOB STATUS   LIFECYCLE STATE
> cache-cdc-dynamic-ims-config-entities   RUNNING      STABLE
> cache-cdc-equipment                     RUNNING      STABLE
> cache-cdc-floc                          RUNNING      STABLE
> cache-cdc-maintenance-entities          RUNNING      STABLE
> cache-cdc-notification-entities         RUNNING      STABLE
> cache-cdc-schedule                      RUNNING      STABLE
> cache-cdc-static-entities               RUNNING      STABLE
> cache-cdc-task-list-entities            RUNNING      STABLE
> cache-cdc-work-order-entities           RUNNING      STABLE
>
> But listing the jobs shows duplicates.
>
> # flink list
> ------------------ Running/Restarting Jobs -------------------
> 10.09.2025 09:36:26 : d0f18a972db81a49763dfddc59eff21d : CACHE - CDC Static Entities (RUNNING)
> 10.09.2025 09:36:28 : b045747e9f83a5dfb99f306838d23346 : CACHE - CDC Equipment (RUNNING)
> 10.09.2025 09:36:42 : 57874e342a9386ad5c022d15da947508 : CACHE - CDC Notification entities (RUNNING)
> 10.09.2025 09:36:49 : 8dce7b2ffb701b619938ab4989f18aba : CACHE - CDC Maintenance entities (RUNNING)
> 10.09.2025 09:36:51 : bac0c76a4c48e207b065576f80750e8d : CACHE - CDC FLOC (RUNNING)
> 10.09.2025 09:37:21 : 4e065fa31fcd640c18b8d2f9d832a9ea : CACHE - CDC Maintenance entities (RUNNING)
> 10.09.2025 09:37:30 : 3ff775f7bda5b83d26b43ecfaddbf030 : CACHE - CDC Schedule (RUNNING)
> 10.09.2025 09:38:02 : cc7c25a0ed3c1dbf4ea28c9b31ae328d : CACHE - CDC Schedule (RUNNING)
> 10.09.2025 09:38:42 : 7ea9c0ff50499003e669e27d9d15767f : CACHE - CDC Work Order entities (RUNNING)
> 10.09.2025 09:39:07 : 119ef8358eebfbe7ff0fa7223ab2e22d : CACHE - CDC Work Order entities (RUNNING)
> 10.09.2025 09:39:17 : c00161bf832557f7afac66cd8d32b3ac : *CACHE - CDC Work Order entities (RUNNING)*
> --------------------------------------------------------------
> No scheduled jobs.
>
> So, the job whose name is bolded IS the correct one (the K8s resource and
> the Flink job match); the other two jobs are attached to the wrong K8s
> resources. For instance, FlinkSessionJob/cache-cdc-schedule references the
> correct parameters in its "spec" section, but its "status" section shows
> another job entirely. There is some sort of mixup.
>
> What can we do to at least mitigate this? It seems that creating the K8s
> resources one-by-one is OK. Can we set the operator config option
> "kubernetes.operator.reconcile.parallelism" to 1 to force one-by-one
> launching?
>
> Nikola.
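Regarding the last question: `kubernetes.operator.reconcile.parallelism` is a real operator option, and setting it to 1 should serialize reconciliation as a workaround, at the cost of much slower reconciles across all resources. A sketch of how it could be set through the operator Helm chart's values, assuming the chart's `defaultConfiguration` mechanism; please verify the exact keys against the docs for your operator version:

```yaml
# Helm values fragment for the flink-kubernetes-operator chart (sketch).
defaultConfiguration:
  create: true
  append: true
  flink-conf.yaml: |+
    # Reconcile one resource at a time as a workaround for the mixup.
    kubernetes.operator.reconcile.parallelism: 1
```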
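For monitoring while this is being investigated: duplicate submissions like the ones above can be spotted quickly by extracting the job names from the `flink list` output and keeping only names that occur more than once. A small sketch (the `dup_jobs` name is illustrative, and it assumes the three-column `date : jobid : name (RUNNING)` format shown in the quoted output):

```shell
# dup_jobs: read `flink list` output on stdin and print any job name
# that appears more than once among the RUNNING jobs.
dup_jobs() {
  awk -F' : ' '/\(RUNNING\)\*?$/ {
    sub(/ \(RUNNING\)\*?$/, "", $3)   # strip the trailing status marker
    sub(/^\*/, "", $3)                # strip a leading bold asterisk, if any
    print $3
  }' | sort | uniq -d                 # uniq -d keeps only duplicated names
}
# Usage (requires a Flink client pointed at the session cluster):
#   flink list | dup_jobs
```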
