Hello,

Setup

I am running my Flink streaming jobs (upgradeMode = stateless) on an AWS
EKS cluster. The pods of the streaming jobs are scheduled on a node group
that is backed by an AWS ASG (Auto Scaling group).
The streaming jobs are FlinkDeployments managed by the
flink-k8s-operator (1.8), and I have enabled the job autoscaler.

Scenario

When the Flink autoscaler scales up a Flink streaming job, new Flink TMs
are first added onto any existing nodes with available resources. If there
are not enough resources to schedule all the TM pods, the ASG adds new
nodes to the EKS cluster and the rest of the TM pods are scheduled on these
new nodes.

Issue

After the scale-up, the TM pods scheduled on the existing nodes with
available resources successfully read the checkpoint from S3. However, the
TM pods scheduled on the new nodes added by the ASG run into 403 (Access
Denied) errors while reading the same checkpoint file from the checkpoint
location in S3.
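
For what it's worth, a check along these lines (run from inside an affected
TM pod) should reproduce the read outside of Flink; the bucket and key below
are placeholders, not my actual checkpoint path:

# Minimal boto3 sketch to reproduce the checkpoint read from inside a TM pod.
# Bucket and key are placeholders, not the actual checkpoint location.
import boto3
from botocore.exceptions import ClientError

CHECKPOINT_BUCKET = "my-checkpoint-bucket"                      # placeholder
CHECKPOINT_KEY = "flink-checkpoints/<job-id>/chk-11/_metadata"  # placeholder

s3 = boto3.client("s3")
try:
    # HeadObject exercises the same s3:GetObject permission the TM needs.
    resp = s3.head_object(Bucket=CHECKPOINT_BUCKET, Key=CHECKPOINT_KEY)
    print("OK, object found, size:", resp["ContentLength"])
except ClientError as e:
    # On an affected pod this should surface the same 403 / AccessDenied.
    print("Failed:", e.response["Error"]["Code"], e.response["Error"]["Message"])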

Just FYI: I have disabled memory auto-tuning, so the auto-scaling events
still take place.

1. The IAM role associated with the service account used by the
FlinkDeployment is as expected for the new pods (see the sketch after this
list for how this can be checked from inside a pod).
2. I am able to reproduce this issue every single time there is a scale-up
that requires the ASG to add new nodes to the cluster.
3. If I delete the FlinkDeployment and allow the operator to restart it, the
job starts up and no longer throws 403.
4. I am also observing some 404 (Not Found) errors reported by certain newly
added TM pods. They are looking for an older checkpoint (for example,
looking for chk10 while chk11 has already been created in S3 and chk10
would have been subsumed by chk11).
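
Regarding point 1, this is roughly the kind of check that can be run inside
a TM pod to confirm which identity the credentials resolve to (the ARN
patterns in the comments are illustrative):

# Sketch: confirm which IAM identity the pod's credentials resolve to.
# Worth running on a TM pod on a new node and on an existing node, then comparing.
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()

# With IRSA working, the Arn should be the assumed service-account role, e.g.
#   arn:aws:sts::<account>:assumed-role/<irsa-role>/<session-name>
# If it shows the node group's instance-profile role instead, the pods on the
# new nodes are falling back to node credentials, which could explain the 403.
print("Account:", identity["Account"])
print("Arn:    ", identity["Arn"])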

I would appreciate any pointers on how to debug this further.

Let me know if you need more information.

Thank you
Chetas
