SameerMesiah97 opened a new issue, #62301:
URL: https://github.com/apache/airflow/issues/62301

   ### Apache Airflow Provider(s)
   
   google
   
   ### Versions of Apache Airflow Providers
   
   `apache-airflow-providers-google>=20.0.0rc1`
   
   ### Apache Airflow version
   
   main
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When using `GKECreateClusterOperator`, a GKE cluster may be successfully 
created even when the GCP service account has partial GKE permissions, for 
example lacking `container.operations.get`.
   
   In this scenario, the operator successfully calls `create_cluster` and the 
GKE cluster begins provisioning in GCP. However, subsequent steps—such as 
polling the operation in non-deferrable mode—fail due to insufficient 
permissions.
   
   The Airflow task then fails, but the GKE cluster continues provisioning or 
remains active in GCP, resulting in leaked infrastructure and ongoing cost.
   
   This can occur, for example, when the service account's role grants `container.clusters.create` but omits (or an IAM deny policy blocks) `container.operations.get`, which is required to monitor the long-running create operation.
   
   ### What you think should happen instead
   
   If the operator fails after successfully initiating cluster creation (for 
example due to missing `container.operations.get` or other follow-up 
permissions), it should make a best-effort attempt to clean up the partially 
created resource by deleting the cluster.
   
   Cleanup should be attempted opportunistically (i.e. only if the cluster name 
is known and deletion permissions are available), and failure to clean up 
should not mask or replace the original exception.
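
   A minimal sketch of that behaviour (illustrative only, not the operator's current code; the `GKEHook` method names and signatures are assumptions):

   ```python
   import logging

   from airflow.providers.google.cloud.hooks.kubernetes_engine import GKEHook

   log = logging.getLogger(__name__)


   def create_cluster_with_cleanup(hook: GKEHook, body: dict, project_id: str):
       """Create a cluster; on failure, attempt cleanup without masking the error."""
       try:
           # In non-deferrable mode the hook polls the operation internally,
           # which is where PermissionDenied on container.operations.get surfaces.
           return hook.create_cluster(cluster=body, project_id=project_id)
       except Exception:
           try:
               # Opportunistic: the name is known from the request body; the
               # delete itself may also be denied by IAM, which is tolerated.
               hook.delete_cluster(name=body["name"], project_id=project_id)
           except Exception:
               log.warning("Best-effort cleanup of %r failed; cluster may be leaked.", body["name"])
           raise  # re-raise the original exception, never the cleanup failure
   ```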
   
   ### How to reproduce
   
   1. Create a custom IAM role that grants `container.clusters.create` but omits `container.operations.get` (or block that permission with an IAM deny policy).
   
   2. Create a service account and attach this custom role.
   
   3. Create a GCP connection in Airflow (for example `gcp_cloud_default`) that uses this service account, e.g. as in the sketch below.
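
   With Airflow 2.3+ the connection can be supplied as a JSON environment variable (the key path below is a hypothetical placeholder):

   ```python
   import json
   import os

   # Normally exported in the scheduler/worker environment rather than set in code.
   os.environ["AIRFLOW_CONN_GCP_CLOUD_DEFAULT"] = json.dumps(
       {
           "conn_type": "google_cloud_platform",
           "extra": {
               "key_path": "/path/to/limited-sa-key.json",  # hypothetical path
               "project": "<PROJECT_ID>",
           },
       }
   )
   ```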
   
   4. Run the following DAG, replacing `<PROJECT_ID>` and `<REGION>` with your GCP project ID and a valid region:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.providers.google.cloud.operators.kubernetes_engine import GKECreateClusterOperator
   
   with DAG(
       dag_id="gke_partial_auth_cluster_leak_repro",
       start_date=datetime(2025, 1, 1),
       schedule=None,
       catchup=False,
   ) as dag:
   
       create_cluster = GKECreateClusterOperator(
           task_id="create_gke_cluster",
           project_id="<PROJECT_ID>",  # placeholders must be quoted strings
           location="<REGION>",
           body={
               "name": "leaky-gke-cluster",
               "initial_node_count": 1,
           },
           gcp_conn_id="gcp_cloud_default",
           deferrable=False,  # triggers polling via operations.get
       )
   ```
   
   5. Trigger the DAG.
   
   **Observed Behaviour**
   
   The task fails with:
   
   `PermissionDenied: Required "container.operations.get" permission(s)`
   
   However, the GKE cluster continues to provision in the background.
   
   ### Anything else
   
   GKE clusters begin provisioning immediately once creation is initiated. Even 
if the Airflow task fails shortly after, the cluster may continue creating and 
eventually become active.
   
   When failures occur after a successful create call (for example, due to 
partially scoped IAM permissions), leaked clusters can result in unnecessary 
cost and manual cleanup effort. This pattern is not novel in Airflow: similar 
behaviour has been accepted in AWS resource-creation operators, for example 
Amazon Redshift cluster creation (see PR #61333), where infrastructure can be 
created successfully but leak if subsequent steps fail. Adding best-effort 
cleanup to the GKE operator would therefore not set a new behavioural 
precedent; it would bring the operator in line with existing provider patterns.
   
   **Relying solely on teardown tasks is not sufficient, as that shifts 
responsibility for preventing resource leaks onto DAG authors**. Operators that 
create infrastructure should make reasonable best-effort attempts to clean up 
resources they successfully create, even if later steps fail.
   
   While the GKE API does not always accept deletion requests during 
`PROVISIONING`, that limitation does not preclude best-effort cleanup logic 
(e.g. retrying deletion or attempting deletion once the cluster becomes 
deletable).
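
   A minimal sketch of that retry-until-deletable idea (a hypothetical helper; the `GKEHook` methods and the `Cluster.Status` enum are assumptions):

   ```python
   import time

   from airflow.providers.google.cloud.hooks.kubernetes_engine import GKEHook
   from google.cloud.container_v1.types import Cluster


   def delete_when_deletable(hook: GKEHook, name: str, project_id: str, attempts: int = 30) -> None:
       """Poll until the cluster leaves PROVISIONING, then attempt deletion."""
       for _ in range(attempts):
           cluster = hook.get_cluster(name=name, project_id=project_id)
           if cluster.status != Cluster.Status.PROVISIONING:
               # GKE generally rejects deletion while PROVISIONING; delete once
               # the cluster reaches a deletable state.
               hook.delete_cluster(name=name, project_id=project_id)
               return
           time.sleep(30)
   ```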
   
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

