SameerMesiah97 opened a new pull request, #62302:
URL: https://github.com/apache/airflow/pull/62302

   **Description**
   
   This change adds bounded best-effort cleanup to GKE cluster creation to 
prevent clusters from being orphaned when failures occur after creation has 
been initiated. The cleanup behavior is configurable and enabled by default, 
and is implemented via a helper method called `_attempt_cleanup_with_retry`.
   
   Previously, `create_cluster` could successfully start provisioning, but the 
operator could fail during post-creation steps in non-deferrable mode if the 
execution identity lacked permissions such as `container.operations.get`. In 
those cases, the task failed while the cluster continued provisioning. Now, if 
a `PermissionDenied` error occurs after creation has started, the operator 
attempts deletion with `wait_to_complete=False`, retrying on 
`FailedPrecondition` while an active operation exists. Cleanup is bounded by a 
default 600-second timeout (configurable via `cleanup_timeout_seconds`), after 
which it is abandoned. Cleanup failures are logged and do not mask the original 
exception.
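
   The bounded retry described above can be sketched roughly as follows. This 
is a minimal illustration, not the code in this PR: the function name, its 
signature, the `interval_seconds` parameter, and the injectable `clock`/`sleep` 
arguments are assumptions, and `FailedPrecondition` is a local stand-in for 
`google.api_core.exceptions.FailedPrecondition` so the snippet runs without 
extra dependencies.

```python
import time


class FailedPrecondition(Exception):
    """Local stand-in for google.api_core.exceptions.FailedPrecondition."""


def attempt_cleanup_with_retry(
    delete,
    timeout_seconds=600.0,
    interval_seconds=10.0,  # hypothetical; the PR may use a different backoff
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Call delete() until it is accepted or the time budget is spent.

    Returns True if deletion was accepted, False if cleanup was abandoned.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            delete()
            return True
        except FailedPrecondition:
            # Another operation (e.g. provisioning) is still active.
            remaining = deadline - clock()
            if remaining <= 0:
                return False
            # Never sleep past the deadline.
            sleep(min(interval_seconds, remaining))
```

   Only `FailedPrecondition` is retried in this sketch; any other error from 
`delete()` propagates to the caller, which is expected to log it rather than 
let it replace the original failure.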
   
   
   **Rationale**
   
   GKE cluster creation is asynchronous. `create_cluster` can successfully 
initiate provisioning while subsequent polling or operation retrieval fails due 
to partially scoped IAM permissions (for example, allowing 
`container.clusters.create` but denying `container.operations.get`). Although 
the GKE API does not allow deletion during active provisioning, best-effort 
cleanup can still be attempted to avoid orphaned infrastructure and unnecessary 
cost.
   
   Deletion may initially fail with `FailedPrecondition` if another operation 
is active, so the operator retries deletion on that specific error within a 
bounded window (`cleanup_timeout_seconds`). If cleanup does not succeed within 
that window, it is abandoned and the original exception is still raised. This 
mirrors the existing pattern used in several AWS operators, keeping behavior 
consistent across providers.
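
   As a rough illustration of how the original exception survives a failed 
cleanup, consider the sketch below. It is not the operator's actual code: 
`create_and_wait` and its callable arguments are hypothetical, and 
`PermissionDenied` is a local stand-in for 
`google.api_core.exceptions.PermissionDenied`.

```python
import logging

log = logging.getLogger(__name__)


class PermissionDenied(Exception):
    """Local stand-in for google.api_core.exceptions.PermissionDenied."""


def create_and_wait(create, wait, cleanup, delete_cluster_on_failure=True):
    """Start creation, wait for it, and clean up if waiting is denied."""
    create()  # provisioning has started once this returns
    try:
        return wait()
    except PermissionDenied:
        if delete_cluster_on_failure:
            try:
                cleanup()
            except Exception:
                # Log and swallow: cleanup is best-effort only.
                log.exception("Cleanup failed; re-raising the original error")
        # A bare raise re-raises the PermissionDenied caught above.
        raise
```

   Because the cleanup error is caught and logged inside the handler, the bare 
`raise` always surfaces the `PermissionDenied` that caused the task to fail.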
   
   **Tests**
   
   Added unit tests that verify:
   
   * cluster deletion is attempted when `PermissionDenied` occurs after cluster 
creation has been initiated.
   * cleanup failures do not mask or replace the original exception.
   * deletion is retried when `FailedPrecondition` indicates an active cluster 
operation, within the configured timeout window.
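
   The general shape of such tests can be sketched with `unittest.mock`. This 
is an assumption-laden illustration rather than the tests in this PR: 
`run_create` is a tiny stand-in for the operator's execute path, it bounds 
retries by attempt count instead of a timeout for brevity, the exception 
classes are local stand-ins for their `google.api_core.exceptions` 
counterparts, and the hook is a plain `Mock`.

```python
from unittest import mock


class PermissionDenied(Exception):
    """Local stand-in for google.api_core.exceptions.PermissionDenied."""


class FailedPrecondition(Exception):
    """Local stand-in for google.api_core.exceptions.FailedPrecondition."""


def run_create(hook, max_attempts=3):
    """Hypothetical stand-in for the operator logic under test."""
    hook.create_cluster()
    try:
        hook.wait_for_operation()
    except PermissionDenied:
        for _ in range(max_attempts):
            try:
                hook.delete_cluster(wait_to_complete=False)
                break
            except FailedPrecondition:
                continue  # active operation: try again
            except Exception:
                break  # cleanup failure must not mask the original error
        raise


def expect_permission_denied(hook):
    try:
        run_create(hook)
    except PermissionDenied:
        return
    raise AssertionError("expected PermissionDenied")


def test_delete_is_retried_on_failed_precondition():
    hook = mock.Mock()
    hook.wait_for_operation.side_effect = PermissionDenied("denied")
    hook.delete_cluster.side_effect = [FailedPrecondition("busy"), None]
    expect_permission_denied(hook)
    assert hook.delete_cluster.call_count == 2
    hook.delete_cluster.assert_called_with(wait_to_complete=False)


def test_cleanup_failure_does_not_mask_original():
    hook = mock.Mock()
    hook.wait_for_operation.side_effect = PermissionDenied("denied")
    hook.delete_cluster.side_effect = RuntimeError("delete failed")
    expect_permission_denied(hook)
    assert hook.delete_cluster.call_count == 1
```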
   
   System tests are not practical for this change because the behavior depends 
on specific IAM permission combinations (e.g., allowing 
`container.clusters.create` while denying `container.operations.get`). 
Reproducing this reliably would require tightly controlled external IAM 
configuration to assert the invariant, so the behavior is validated via unit 
tests instead.
   
   **Documentation**
   
   The docstring for `GKECreateClusterOperator` has been updated to document 
both `delete_cluster_on_failure` and `cleanup_timeout_seconds`, including their 
defaults and behavior.
   
   **Backwards Compatibility**
   
   Two new optional parameters have been added to `GKECreateClusterOperator`: 
`delete_cluster_on_failure` (defaulting to `True`) and 
`cleanup_timeout_seconds` (defaulting to 600 seconds). These introduce 
configurable best-effort cleanup behavior without changing existing error 
semantics. Existing DAGs otherwise behave as before; the only difference is 
that cleanup is enabled by default in non-deferrable mode when applicable.
   
   Closes: #62301

