Hi everyone, I have a question about backfills and hope you can help shed some light on it.
For the last few years I have always reprocessed DAGs with the built-in "Clear task" feature and never had any complaints. That changed recently, when I joined a project that had its own "Backfill plugin". In the UI, we had to select the DAG name, the tasks to backfill (as a regex) and the time range. Under the hood, it called the *airflow trigger_dag* command with all the specified parameters. Apparently, the plugin was there to execute the backfill faster and to avoid interfering with the "normal" runs of the DAG, whose new runs were still scheduled on time.

That left me somewhat confused, and hoping to learn something new about this practice. My questions are:

1. Is using a custom backfill mechanism a popular and good practice? Do you use a similar solution, or do you prefer to connect to the worker and run *airflow backfill* instead of using the "Clear task" feature? If yes, why?

2. Unfortunately, I can't follow all the technical discussions in the group, so sorry if this has already been discussed. When I looked for an explanation, I only found this JIRA issue, https://issues.apache.org/jira/browse/AIRFLOW-4913, so maybe it's something new. Anyway, the question is: if, for whatever reason, *trigger_dag*, *backfill* or any other custom solution based on Airflow's API has (or can have) better performance than the built-in "Clear task" mechanism, do you think there is a way to redesign the "Clear task" feature so that it uses all or part of this more optimized method (CLI: trigger_dag, backfill, ...)?

Thank you very much for your help!

Best,
Bartosz.

--
Bartosz Konieczny
data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode
