Using a custom plugin for backfills, is it a good practice?

Bartosz Konieczny Wed, 03 Mar 2021 11:03:09 -0800

Hi everyone,

I have a question about backfills and hope you can help me shed some
light on it.


For the last few years I have always reprocessed DAGs with the built-in
"Clear task" feature and never had a reason to complain. That changed
recently, when I joined a project that had its own "Backfill plugin". In
the UI, we had to select the DAG name, the tasks to backfill (as a regex),
and the time range. Under the hood, it called the *airflow trigger_dag*
command with all the specified parameters.
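
To make the mechanism concrete, here is a minimal sketch of what such a
plugin might do. It is not the project's actual code. It assumes the
Airflow 1.10-style CLI (*airflow trigger_dag* with -e, -r and -c), a daily
schedule, and a hypothetical "task_regex" key passed through the run conf,
since *trigger_dag* has no task filter of its own. The function name and
the run_id format are also made up for illustration.

```python
import json
from datetime import date, timedelta


def build_trigger_commands(dag_id, task_regex, start, end):
    """Return one `airflow trigger_dag` argument list per day in [start, end]."""
    commands = []
    day = start
    while day <= end:
        # Assumption: the DAG reads "task_regex" from its run conf to decide
        # which tasks to execute; trigger_dag itself cannot filter tasks.
        conf = json.dumps({"task_regex": task_regex})
        commands.append([
            "airflow", "trigger_dag",
            "-e", day.isoformat(),                  # execution date of the run
            "-r", "backfill__" + day.isoformat(),   # hypothetical run_id format
            "-c", conf,                             # JSON conf for the DAG run
            dag_id,
        ])
        day += timedelta(days=1)
    return commands


if __name__ == "__main__":
    for cmd in build_trigger_commands(
        "my_dag", "load_.*", date(2021, 1, 1), date(2021, 1, 3)
    ):
        print(" ".join(cmd))
```

The sketch only builds the argument lists; a real plugin would still have
to run them (for example with subprocess) and handle failures.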

Apparently, the plugin was there to execute the backfill faster without
interfering with the "normal" runs of the DAG, whose new runs were still
scheduled on time. That left me somewhat confused, and it brought me here
hoping to learn something new about this practice.

The questions are:
1. Is using a custom backfill mechanism a popular and good practice? Do
you use a similar solution, or do you prefer to connect to the worker and
run *airflow backfill* instead of using the "Clear task" feature? If so, why?
2. Unfortunately, I can't follow all the technical discussions in the group,
so sorry if this has already been discussed. Also, when I looked for an
explanation, I only found this JIRA issue:
https://issues.apache.org/jira/browse/AIRFLOW-4913, so maybe it's something
new. Anyway, the question is:
If, for whatever reason, *trigger_dag*, *backfill*, or any other custom
solution based on Airflow's API performs (or can perform) better than
the built-in "Clear task" mechanism, do you think there is a way to
redesign the "Clear task" feature so that it uses this more optimized
method, or a part of it (CLI: trigger_dag, backfill, ...)?

Thank you very much for your help!

Best,
Bartosz.
-- 
Bartosz Konieczny
data engineer
https://www.waitingforcode.com
https://github.com/bartosz25/
https://twitter.com/waitingforcode
