Hello,
I am troubleshooting a job that started failing in one of our environments.
We are using Flink 2.1.1 in a native Kubernetes deployment managed by the
operator. The job was running fine until we performed a controlled reboot of
the Kubernetes cluster, which rescheduled all pods as the nodes were drained.
Afterwards, the job kept crashing with:
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 363904b4cdd4faecca1f0ac2002a40dc#0@9a9f2bc5959928aa147462296a9909ed_ca1f0ac1d6408ae5363904b4a3eb44f5_0_87 not found.
    at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:121)
    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartitions(LocalInputChannel.java:144)
    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:193)
    at java.base/java.util.TimerThread.mainLoop(Unknown Source)
    at java.base/java.util.TimerThread.run(Unknown Source)
I eventually noticed that an upstream task was stuck in DEPLOYING. The
JobManager logs show the task being deployed:
2026-04-20 09:37:10,306 INFO [flink-pekko.actor.default-dispatcher-13] org.apache.flink.runtime.executiongraph.ExecutionGraph:609 - Deploying ReduceWindow15Seconds (1/1) (attempt #52) with attempt id 9a9f2bc5959928aa147462296a9909ed_ca1f0ac1d6408ae5363904b4a3eb44f5_0_52 and vertex id ca1f0ac1d6408ae5363904b4a3eb44f5_0 to mbcs-flink-taskmanager-2-3 @ 10.81.25.192 (dataPort=35861) with allocation id b8e6f924d0fc4f0a525279d7d7e05f4d
but the logs of that TaskManager show nothing related to
ReduceWindow15Seconds.
Strangely, I also see the following slot report in the JobManager logs:
2026-04-20 10:26:42,679 DEBUG [flink-pekko.actor.default-dispatcher-15] org.apache.flink.runtime.resourcemanager.slotmanager.DefaultSlotStatusSyncer:254 - Received slot report from instance 4af61c2d44a4fcba3047be14592e646e: SlotReport{
  SlotStatus{slotID=mbcs-taskmanager-2-3_0, allocationID=null, jobID=null, resourceProfile=ResourceProfile{cpuCores=2, taskHeapMemory=911.500mb (955777024 bytes), taskOffHeapMemory=0 bytes, managedMemory=474.500mb (497549312 bytes), networkMemory=256.000mb (268435456 bytes)}}
  SlotStatus{slotID=mbcs-taskmanager-2-3_1, allocationID=cf15891b1e643c2c6973921c38ebe82e, jobID=e7ca599d4cf2125165b39095454a5b70, resourceProfile=ResourceProfile{cpuCores=2, taskHeapMemory=911.500mb (955777024 bytes), taskOffHeapMemory=0 bytes, managedMemory=474.500mb (497549312 bytes), networkMemory=256.000mb (268435456 bytes)}}}.
but that report shows two slots, even though the job has always explicitly
set numberOfTaskSlots: 1 in its configuration, along with parallelism: 1.
The topology of the job is quite complex, but the failing part of the graph
essentially consists of multiple windows whose outputs are unioned, connected
to a KeyedCoProcessFunction that receives a control stream on its second
input, and then written to a Kafka sink.
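In case it helps, below is a heavily trimmed sketch of that part of the
graph. All class names, keys, topics, and addresses are placeholders, and the
real job uses our own POJOs and reduce/process logic rather than plain
strings:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.util.Collector;

public class SimplifiedTopology {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // placeholder sources; the real job reads both streams from Kafka
        DataStream<String> events = env
            .fromData("a", "b", "c")
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<String>forMonotonousTimestamps()
                    .withTimestampAssigner((element, ts) -> System.currentTimeMillis()));
        DataStream<String> controls = env.fromData("control");

        // several keyed reduce windows over the same input ...
        DataStream<String> reduce15s = events
            .keyBy(value -> value)
            .window(TumblingEventTimeWindows.of(Duration.ofSeconds(15)))
            .reduce((a, b) -> a + b); // the stuck ReduceWindow15Seconds lives here
        DataStream<String> reduce60s = events
            .keyBy(value -> value)
            .window(TumblingEventTimeWindows.of(Duration.ofSeconds(60)))
            .reduce((a, b) -> a + b);

        // ... whose outputs are unioned and connected with the control stream
        DataStream<String> processed = reduce15s
            .union(reduce60s)
            .connect(controls)
            .keyBy(value -> value, control -> control)
            .process(new KeyedCoProcessFunction<String, String, String, String>() {
                @Override
                public void processElement1(String value, Context ctx, Collector<String> out) {
                    out.collect(value); // real job applies logic driven by keyed state
                }

                @Override
                public void processElement2(String control, Context ctx, Collector<String> out) {
                    // real job updates keyed state from the control messages
                }
            });

        // finally written to a Kafka sink
        processed.sinkTo(
            KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(
                    KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build());

        env.execute("simplified-topology");
    }
}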
Does anyone know what could be causing this? The job's configuration did not
change when the problems started, and a checkpoint exists and is used to try
to restore the pipeline to its previous state.
Regards,
Alexis.