Joel Croteau created AIRFLOW-5281:
-------------------------------------

             Summary: GCP transfer operators do not detect previous successful runs if interrupted
                 Key: AIRFLOW-5281
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5281
             Project: Apache Airflow
          Issue Type: Bug
          Components: contrib, gcp, operators
    Affects Versions: 1.10.3
            Reporter: Joel Croteau
Operators that rely on the GCS/BigQuery transfer service work by creating a transfer job, periodically polling the transfer service for the job's status, and reporting success or failure when the job completes. This causes problems if a task instance is terminated by the cluster, e.g. for lack of resources: retries will create a new transfer job, and if the previous job succeeded, the new job will usually fail because files already exist at its destination. The task then fails overall despite having actually succeeded in transferring what it needed to transfer.

I have noticed this in particular using {{S3ToGoogleCloudStorageTransferOperator}} and {{GoogleCloudStorageToBigQueryOperator}}, but I imagine it exists in other operators as well. I know that the documentation for at least some of these operators includes a big warning that they are not idempotent and that multiple runs will create multiple transfer jobs, but that doesn't actually help fix the problem. What they should do is set a Variable or XCom upon creation of the job, and retries should check for an existing job before starting a new one.

I've also noticed the same thing using {{PostgresOperator}} to execute an {{UNLOAD}} query on Redshift (I didn't use {{RedshiftToS3Transfer}} because that operator has zero customization options, and I wanted to actually control which columns I exported).

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
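A minimal sketch of the proposed fix: record the created transfer job's name in XCom, and on retry look that job up instead of creating a duplicate. The {{TaskInstanceStub}} and {{TransferHookStub}} classes here are hypothetical stand-ins for Airflow's TaskInstance XCom API and a GCP transfer hook, not the real operator code.

```python
class TaskInstanceStub:
    """Minimal XCom store, standing in for Airflow's TaskInstance."""
    def __init__(self):
        self._xcom = {}

    def xcom_push(self, key, value):
        self._xcom[key] = value

    def xcom_pull(self, key):
        return self._xcom.get(key)


class TransferHookStub:
    """Pretend transfer service (stand-in for a GCP transfer hook).

    Counts how many jobs are created so a test can confirm that a
    retry does not create a second, destined-to-fail job.
    """
    def __init__(self):
        self.jobs = {}
        self.created = 0

    def create_transfer_job(self, spec):
        self.created += 1
        name = "transferJobs/%d" % self.created
        self.jobs[name] = {"name": name, "spec": spec}
        return self.jobs[name]

    def get_transfer_job(self, name):
        return self.jobs.get(name)


def ensure_transfer_job(ti, hook, spec):
    """Create the transfer job once; reuse it on retries.

    A prior (interrupted) attempt may already have created and even
    completed the job, so check XCom for a recorded job name before
    creating a new one.
    """
    existing = ti.xcom_pull(key="transfer_job_name")
    if existing is not None:
        job = hook.get_transfer_job(existing)
        if job is not None:
            return job  # retry resumes polling the old job
    # First attempt (or the recorded job is gone): create a new job
    # and record its name so a later retry can find it.
    job = hook.create_transfer_job(spec)
    ti.xcom_push(key="transfer_job_name", value=job["name"])
    return job
```

With this shape, a retry after the cluster kills the task instance picks up the original job and simply resumes polling it, rather than submitting a second job that fails on already-present destination files.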