SameerMesiah97 opened a new pull request, #62302: URL: https://github.com/apache/airflow/pull/62302
**Description**

This change adds bounded best-effort cleanup to GKE cluster creation to prevent clusters from being orphaned when failures occur after creation has been initiated. The cleanup behavior is configurable and enabled by default, and is implemented via a helper method called `_attempt_cleanup_with_retry`.

Previously, `create_cluster` could successfully start provisioning, but the operator could fail during post-creation steps in non-deferrable mode if the execution identity lacked permissions such as `container.operations.get`. In those cases, the task failed while the cluster continued provisioning.

Now, if a `PermissionDenied` error occurs after creation has started, the operator attempts deletion with `wait_to_complete=False`, retrying on `FailedPrecondition` while an active operation exists. Cleanup is bounded by a default 600-second timeout (configurable via `cleanup_timeout_seconds`), after which it is abandoned. Cleanup failures are logged and do not mask the original exception.

**Rationale**

GKE cluster creation is asynchronous. `create_cluster` can successfully initiate provisioning while subsequent polling or operation retrieval fails due to partially scoped IAM permissions (for example, allowing `container.clusters.create` but denying `container.operations.get`).

Although the GKE API does not allow deletion during active provisioning, best-effort cleanup can still be attempted to avoid orphaned infrastructure and unnecessary cost. Deletion may initially fail with `FailedPrecondition` if another operation is active, so the operator performs semantic retries within a bounded window (`cleanup_timeout_seconds`). If cleanup does not succeed within that window, the original exception is preserved.

This mirrors the existing pattern used in several AWS operators, ensuring consistent behavior across providers.
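
The bounded retry described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the exception classes are local stand-ins for `google.api_core.exceptions.PermissionDenied` and `FailedPrecondition`, and the function names, the `retry_interval` parameter, and the injectable `sleep`/`clock` arguments are hypothetical choices made to keep the example self-contained.

```python
import logging
import time

log = logging.getLogger(__name__)


class PermissionDenied(Exception):
    """Local stand-in for google.api_core.exceptions.PermissionDenied."""


class FailedPrecondition(Exception):
    """Local stand-in for google.api_core.exceptions.FailedPrecondition."""


def attempt_cleanup_with_retry(
    delete_fn,
    timeout_seconds=600.0,
    retry_interval=10.0,  # hypothetical parameter, not named in the PR
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call delete_fn until it succeeds or the timeout elapses.

    Returns True if deletion was accepted, False if cleanup was abandoned.
    Never raises, so the caller's original exception is not masked.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            delete_fn()
            return True
        except FailedPrecondition:
            # Another operation (e.g. provisioning) is still active.
            if clock() >= deadline:
                log.warning("Cleanup abandoned after %s seconds", timeout_seconds)
                return False
            sleep(retry_interval)
        except Exception:
            # Any other failure is logged and swallowed.
            log.exception("Cleanup failed")
            return False


def create_with_cleanup(
    create_fn,
    wait_fn,
    delete_fn,
    delete_cluster_on_failure=True,
    cleanup_timeout_seconds=600.0,
    **retry_kwargs,
):
    """Start creation, then clean up if a post-creation step is denied."""
    create_fn()  # creation has been initiated from this point on
    try:
        wait_fn()
    except PermissionDenied:
        if delete_cluster_on_failure:
            attempt_cleanup_with_retry(
                delete_fn, cleanup_timeout_seconds, **retry_kwargs
            )
        raise  # re-raise the original exception unchanged
```

The helper deliberately returns a boolean instead of raising, which is one simple way to guarantee that a cleanup failure cannot replace the `PermissionDenied` error the task ultimately fails with.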
**Tests**

Added unit tests that verify:

* cluster deletion is attempted when `PermissionDenied` occurs after cluster creation has been initiated.
* cleanup failures do not mask or replace the original exception.
* deletion is retried when `FailedPrecondition` indicates an active cluster operation, within the configured timeout window.

System tests are not practical for this change because the behavior depends on specific IAM permission combinations (e.g., allowing `container.clusters.create` while denying `container.operations.get`). Reproducing this reliably would require tightly controlled external IAM configuration, so the behavior is validated via unit tests instead.

**Documentation**

The docstring for `GKECreateClusterOperator` has been updated to document both `delete_cluster_on_failure` and `cleanup_timeout_seconds`, including their defaults and behavior.

**Backwards Compatibility**

Two new optional parameters have been added to `GKECreateClusterOperator`: `delete_cluster_on_failure` (defaulting to `True`) and `cleanup_timeout_seconds` (defaulting to 600 seconds). These introduce configurable best-effort cleanup behavior without changing existing error semantics. Existing DAGs otherwise continue to function as before, with cleanup enabled by default in non-deferrable mode when applicable.

Closes: #62301
