Flink autoscaler with AWS ASG: checkpoint access issue

2024-05-13 Thread Chetas Joshi
Hello,

Set up

I am running my Flink streaming jobs (upgradeMode = stateless) on an AWS
EKS cluster. The pods of the streaming jobs run on a node type that belongs
to a node group backed by an AWS ASG (auto scaling group).
The streaming jobs are FlinkDeployments managed by the flink-k8s-operator
(1.8), and I have enabled the job autoscaler.
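
For context, the deployment spec looks roughly like the sketch below; the
image, jar, bucket, and resource values are illustrative placeholders, not
my actual setup.

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: my-streaming-job                 # placeholder name
  spec:
    image: <my-flink-image>                # placeholder
    flinkVersion: v1_18                    # illustrative
    serviceAccount: flink                  # SA annotated with the IRSA role
    flinkConfiguration:
      job.autoscaler.enabled: "true"
      state.checkpoints.dir: s3://<my-bucket>/checkpoints   # placeholder bucket
    jobManager:
      resource:
        memory: "2048m"
        cpu: 1
    taskManager:
      resource:
        memory: "4096m"
        cpu: 2
    job:
      jarURI: local:///opt/flink/usrlib/job.jar   # placeholder
      parallelism: 2
      upgradeMode: stateless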

Scenario

When the Flink autoscaler scales up a Flink streaming job, new Flink TMs
are first added onto any existing nodes with available resources. If there
are not enough resources to schedule all the TM pods, the ASG adds new
nodes to the EKS cluster and the rest of the TM pods are scheduled on these
new nodes.

Issue

After the scale-up, the TM pods scheduled on the existing nodes with
available resources successfully read the checkpoint from S3; however, the
TM pods scheduled on the new nodes added by the ASG run into 403 (access
denied) errors while reading the same checkpoint file from the checkpoint
location in S3.

Just FYI: I have disabled memory auto-tuning, so only the auto-scaling
events are in play.

1. The IAM role associated with the service account used by the
FlinkDeployment is as expected for the new pods (see the sketch after this
list).
2. I can reproduce this issue every single time a scale-up requires the ASG
to add new nodes to the cluster.
3. If I delete the FlinkDeployment and let the operator restart it, the job
starts up and no longer throws 403.
4. I am also observing some 404 (not found) errors reported by certain
newly added TM pods. They are looking for an older checkpoint (for example,
looking for chk10 while chk11 has already been created in S3 and chk10
would have been subsumed by chk11).
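
Regarding point 1, the service account wiring looks roughly like the sketch
below; the names and the role ARN are placeholders, not my actual values.

  # ServiceAccount referenced by spec.serviceAccount in the FlinkDeployment
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    name: flink                            # placeholder
    annotations:
      # placeholder ARN; the role's policy grants s3:GetObject, s3:PutObject,
      # s3:DeleteObject and s3:ListBucket on the checkpoint bucket
      eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<flink-checkpoint-role>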

I would appreciate any pointers on how to debug this further.

Let me know if you need more information.

Thank you
Chetas


Re: Flink autoscaler with AWS ASG: checkpoint access issue

2024-05-20 Thread Chetas Joshi
Hello,

After digging into the 403 issue a bit, I figured out that after the
scale-up event, flink-s3-fs-presto uses the node's instance profile instead
of IRSA (IAM Roles for Service Accounts) on some of the newly created TM
pods.
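
For anyone comparing pods the same way: on the TM pods where IRSA is picked
up correctly, the EKS pod identity webhook has injected the web-identity
settings into the container, roughly like the sketch below (the role ARN is
a placeholder). I am comparing these against the failing pods.

  env:
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::<account-id>:role/<flink-checkpoint-role>   # placeholder
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  # plus a projected service-account token volume mounted at that path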

1. Has anyone else experienced this as well?
2. I verified that this is an issue with the flink-s3-fs-presto plugin: if
I switch to the hadoop plugin, I don't run into 403 errors after the
scale-up events (see the sketch after this list).
3. Why is the presto plugin recommended over the hadoop plugin for working
with checkpoint files in S3?
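
On point 2, this is roughly how I swap between the two plugins on the
official Flink image via its ENABLE_BUILT_IN_PLUGINS mechanism; the version
in the jar name is just an example, not necessarily the one I run.

  # env on the Flink containers (e.g. in the pod template)
  env:
    - name: ENABLE_BUILT_IN_PLUGINS
      # was flink-s3-fs-presto-1.18.1.jar; the hadoop plugin avoids the 403s
      value: flink-s3-fs-hadoop-1.18.1.jar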

Thank you
Chetas
