> For example, say we want to support retention in some Airflow tables such
> as task_instance, dag_run and log, it seems reasonable to me to create a
> DAG to periodically clean up the tables

I guess you mean something like
https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/db-cleanup
but just shipped with Airflow?

I don't think this is best done as a DAG, because we can do it better and
more cleanly if it is core to Airflow:

- We don't need to speculatively run a DAG that does nothing
- We don't need to "waste" an executor slot
- It could automatically be done before/after running another task in the DAG
- We don't create extra task instance rows/logs that we then have to clean up
too.

That is why I don't think this sort of built-in functionality should be a
DAG if it is shipped _with_ Airflow.
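
As a rough sketch of the kind of retention cleanup being discussed, here is
a plain function that deletes old rows without needing a DAG, a task
instance, or an executor slot. It is only an illustration: the function
name, the timestamp column names, the 30-day default and the in-memory
SQLite database are all assumptions made for this example, not Airflow's
actual schema or API.

```python
import sqlite3
from datetime import datetime, timedelta

# Table -> timestamp column. The table names come from this thread; the
# column names are assumptions made for this sketch.
RETENTION_TARGETS = {
    "task_instance": "execution_date",
    "dag_run": "execution_date",
    "log": "dttm",
}


def purge_old_rows(conn, now, max_age_days=30):
    """Delete rows older than max_age_days; return rows deleted per table."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    deleted = {}
    for table, column in RETENTION_TARGETS.items():
        # Identifiers come from the fixed mapping above, never from user input.
        cur = conn.execute(f"DELETE FROM {table} WHERE {column} < ?", (cutoff,))
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


def demo():
    """Build a throwaway database with rows of mixed age, then purge it."""
    conn = sqlite3.connect(":memory:")
    now = datetime(2019, 3, 31)
    for table, column in RETENTION_TARGETS.items():
        conn.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {column} TEXT)"
        )
        # Rows aged 1, 10, 45 and 90 days relative to `now`.
        conn.executemany(
            f"INSERT INTO {table} ({column}) VALUES (?)",
            [((now - timedelta(days=d)).isoformat(),) for d in (1, 10, 45, 90)],
        )
    return purge_old_rows(conn, now), conn
```

Something like this could be called from a periodic internal loop or after
a task finishes, which is the point of the list above: no extra rows or
logs are created just to do the cleanup.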

-ash


> On 31 Mar 2019, at 20:42, Chao-Han Tsai <milton0...@gmail.com> wrote:
> 
> Thanks Ash and Kevin for the feedback.
> 
> I think there are some utilities that can be solved easily with a DAG
> without introducing more logic to complicate the scheduler code. Also,
> these utilities may run periodically and can be abstracted out with a DAG.
> For example, say we want to support retention in some Airflow tables such
> as task_instance, dag_run and log, it seems reasonable to me to create a
> DAG to periodically clean up the tables.
> 
> Would like to learn more about the concerns about introducing these utility
> DAGs.
> 
> On Sun, Mar 31, 2019 at 1:17 AM Kevin Yang <yrql...@gmail.com> wrote:
> 
>> Agree on having core Airflow-related stuff built into Airflow (like
>> schedule delay instrumentation) and leaving the rest to the cluster
>> maintainer to set up (like log retention). How people handle log retention
>> might be quite different depending on the logging backend. E.g. we use
>> ElasticSearch and we don't even manage the log retention ourselves. Same
>> for stuff like metrics/alert submitting.
>> 
>> Just my $0.02
>> 
>> Cheers,
>> Kevin Y
>> 
>> On Sun, Mar 31, 2019 at 12:48 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>> 
>>> Do these need to be DAGs if they are built in to Airflow, or could/should
>>> they be just handled internally by the scheduler?
>>> 
>>> -a
>>> 
>>> On 31 March 2019 03:57:08 BST, Chao-Han Tsai <milton0...@gmail.com> wrote:
>>>> Hi all,
>>>> 
>>>> I have been thinking about adding some DAGs for Airflow cluster
>>>> operation, for instance DAG schedule delay instrumentation and log
>>>> retention. Currently we have example_dags; should we add another
>>>> directory, utility_dags, in the repo? We can have a flag in airflow.cfg
>>>> to let the user decide whether to load the utility_dags (just like
>>>> what we did for example_dags).
>>>> 
>>>> What do you think?
>>>> 
>>>> --
>>>> Chao-Han Tsai
>>> 
>> 
> 
> 
> -- 
> 
> Chao-Han Tsai
