[PR] RedshiftCreateClusterOperator could leave clusters running after failure [airflow]

via GitHub Sun, 01 Feb 2026 10:51:54 -0800


SameerMesiah97 opened a new pull request, #61333:
URL: https://github.com/apache/airflow/pull/61333


   **Description**
   
   Added best-effort cleanup for Redshift cluster creation to ensure clusters 
are deleted when failures occur after a cluster has been successfully created. 
Cleanup behavior is **guarded by a flag** and is **enabled by default**.
   
   Previously, Redshift cluster creation could succeed via `create_cluster`, 
but the operator could then fail during post-creation steps when 
`wait_for_completion=True` and the IAM role lacked `redshift:DescribeClusters` 
permissions. In these cases, the Airflow task failed while the Redshift cluster 
continued provisioning or remained active in AWS, resulting in leaked 
infrastructure.
   
   Cleanup has now been implemented for `RedshiftCreateClusterOperator`. If a 
`WaiterError` is raised **after cluster creation has been initiated**, the 
operator attempts a best-effort deletion of the cluster. Cleanup failures are 
logged but do not mask or replace the original exception.
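   
   The cleanup flow described above can be sketched roughly as follows. This 
is a minimal illustration, not the code from this PR: the function name 
`create_cluster_with_cleanup`, the `FakeClient` class, and the local 
`WaiterError` stand-in are all hypothetical and exist only so the sketch runs 
without AWS access. The client method names mirror the boto3 Redshift client, 
but real calls need more arguments than shown here.

```python
import logging

log = logging.getLogger(__name__)


class WaiterError(Exception):
    """Stand-in for botocore.exceptions.WaiterError so the sketch runs offline."""


def create_cluster_with_cleanup(client, cluster_identifier, delete_cluster_on_failure=True):
    """Create a cluster, wait for it, and attempt deletion if the wait fails."""
    # A real create_cluster call needs more arguments (node type, credentials, ...).
    client.create_cluster(ClusterIdentifier=cluster_identifier)
    try:
        client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_identifier)
    except WaiterError:
        if delete_cluster_on_failure:
            try:
                client.delete_cluster(
                    ClusterIdentifier=cluster_identifier,
                    SkipFinalClusterSnapshot=True,
                )
            except Exception:
                # Cleanup is best effort: log the failure, keep the original error.
                log.exception("Could not delete cluster %s", cluster_identifier)
        raise  # re-raises the original WaiterError


class FakeClient:
    """Hypothetical in-memory client whose waiter always fails."""

    def __init__(self, delete_fails=False):
        self.delete_fails = delete_fails
        self.deleted = []

    def create_cluster(self, **kwargs):
        return {}

    def get_waiter(self, name):
        # The fake acts as its own waiter to keep the sketch short.
        return self

    def wait(self, **kwargs):
        raise WaiterError("describe call denied")

    def delete_cluster(self, ClusterIdentifier, **kwargs):
        if self.delete_fails:
            raise RuntimeError("delete denied")
        self.deleted.append(ClusterIdentifier)
```

   Calling `create_cluster_with_cleanup(FakeClient(), "demo")` raises the 
stand-in `WaiterError` after recording a deletion attempt for `"demo"`. With 
`FakeClient(delete_fails=True)` the same `WaiterError` still reaches the 
caller, and the deletion failure is only logged.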
   
   **Rationale**
   
   Redshift cluster creation can succeed while post-creation steps fail. This 
commonly occurs with partially scoped IAM roles, for example, allowing 
`redshift:CreateCluster` but denying `redshift:DescribeClusters`, which is 
required by the availability waiter.
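   
   To make the failure mode concrete, a partially scoped policy of this kind 
might look like the following. The policy is a hypothetical example written 
as a Python dict. It is not taken from this PR or the linked issue, and the 
wildcard `Resource` is used only to keep it short.

```python
import json

# Hypothetical IAM policy, for illustration only: cluster creation is allowed,
# but the describe call needed by the availability waiter is explicitly denied.
partially_scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["redshift:CreateCluster"],
            "Resource": "*",
        },
        {
            "Effect": "Deny",
            "Action": ["redshift:DescribeClusters"],
            "Resource": "*",
        },
    ],
}

# Render the dict as the JSON document IAM expects.
print(json.dumps(partially_scoped_policy, indent=2))
```

   Under this exact policy a cleanup attempt would also be rejected, because 
`redshift:DeleteCluster` is not allowed either. The role needs that permission 
for the best-effort deletion to succeed.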
   
   In these scenarios, the Airflow task fails while the cluster continues 
provisioning or running in AWS, leading to leaked infrastructure and ongoing 
cost. This change ensures that when a cluster has been started by the operator, 
failures during post-creation steps trigger a best-effort cleanup without 
altering error semantics or impacting unrelated resources.
   
   **Tests**
   
   * Added a unit test verifying that cluster deletion is attempted when a 
`WaiterError` occurs during the wait phase after successful cluster creation.
   * Added a unit test ensuring that failures during cleanup do not mask or 
override the original exception raised by the waiter.
   
   **Documentation**
   
   The docstring for `RedshiftCreateClusterOperator` has been updated to 
document the new flag `delete_cluster_on_failure` and its default behavior.
   
   **Backwards Compatibility**
   
   A new flag called `delete_cluster_on_failure` has been added to 
`RedshiftCreateClusterOperator` with a default value of `True`. Best-effort 
cleanup will now be attempted if a post-creation failure (including 
`WaiterError`) occurs after the cluster has been successfully created.
   
   Closes: #61324


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
