Hello all,

I would like to resume discussion on this topic.

I am also uncertain about using a DAG as the solution in this case. There
are a couple of issues with the DAG approach if you are also using the
CeleryExecutor:

   - there is no dynamic way of telling how many nodes a given instance
   uses (none that I am aware of, at least), so the number of log-cleaning
   tasks has to be specified explicitly
   - without additionally specifying a queue for every log-cleaning task
   (unique to each node), there is no guarantee that every task will be
   executed on a different node. So you would have to modify your
   airflow-worker setup, possibly also choosing one worker responsible for
   log cleaning, if you run multiple workers on one node
   - if there is no airflow-worker running on the node where the
   airflow-scheduler runs, then logs would not be cleaned on that node at
   all. This is also why cleaning should not be part of either the airflow
   worker or the scheduler, as these can run on separate nodes
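To make the queue issue from the second bullet concrete, here is a sketch of the kind of setup it would require. The queue names and the two-node layout are hypothetical, purely for illustration:

```shell
# Each worker listens on a node-specific queue in addition to the default
# one, so a cleanup task pinned to that queue is forced onto that node.

# on node1:
airflow worker --queues default,cleanup_node1

# on node2:
airflow worker --queues default,cleanup_node2

# The DAG then needs one cleanup task per node, each created with
# queue='cleanup_nodeN' -- with the list of nodes hard-coded in the DAG file,
# since nothing tells the DAG how many nodes exist.
```

Adding or removing a node means editing both the worker invocation and the DAG, which is exactly the maintenance burden described above.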


Personally, I think the best solution is to create a new command-line
sub-command responsible for log and/or database cleaning. Users could then
come up with their own mechanism for running it on every node they use
(triggering it when disk usage crosses a threshold, or simply periodically
via cron).

Best regards,
Kamil Olszewski

On 2019/04/01 09:54:15, Ash Berlin-Taylor <a...@apache.org> wrote:
> > For example, say we want to support retention in some Airflow tables such
> > as task_instance, dag_run and log, it seems reasonable to me to create a
> > DAG to periodically clean up the tables
>
> I guess you mean something like
> https://github.com/teamclairvoyant/airflow-maintenance-dags/tree/master/db-cleanup
> but just shipped with Airflow?
>
> The reason I don't think this would be best as a DAG in Airflow is that
> we can do it better/cleaner if it is core to Airflow:
>
> - We don't need to speculatively run a DAG that does nothing
> - We don't need to "waste" an executor slot
> - It could automatically be done before/after running another task in the
> dag
> - We don't create extra task instance rows/logs that we then have to
> clean up too.
>
> That is my thinking of why I don't think this sort of built-in
> functionality should be a DAG if it is shipped _with_ Airflow.
>
> -ash
>
>
> > On 31 Mar 2019, at 20:42, Chao-Han Tsai <mi...@gmail.com> wrote:
> >
> > Thanks Ash and Kevin for the feedback.
> >
> > I think there are some utilities that can be solved easily with a DAG
> > without introducing more logic to complicate the scheduler code. Also,
> > these utilities may run periodically and can be abstracted out with a DAG.
> > For example, say we want to support retention in some Airflow tables such
> > as task_instance, dag_run and log, it seems reasonable to me to create a
> > DAG to periodically clean up the tables.
> >
> > Would like to learn more about the concerns about introducing these
> > utility DAGs.
> >
> > On Sun, Mar 31, 2019 at 1:17 AM Kevin Yang <yr...@gmail.com> wrote:
> >
> >> Agree on having core airflow related stuff built into airflow (like
> >> schedule delay instrumentation) and leaving the others to the cluster
> >> maintainer to set up (like log retention). How people handle log
> >> retention might be quite different depending on the logging backend.
> >> E.g. we use ElasticSearch and we don't even manage the log retention
> >> ourselves. Same for stuff like metrics/alert submitting.
> >>
> >> Just my $0.02
> >>
> >> Cheers,
> >> Kevin Y
> >>
> >> On Sun, Mar 31, 2019 at 12:48 AM Ash Berlin-Taylor <as...@apache.org> wrote:
> >>
> >>> Do these need to be dags if they are built in to Airflow, or
> >>> could/should they be just handled internally by the scheduler?
> >>>
> >>> -a
> >>>
> >>> On 31 March 2019 03:57:08 BST, Chao-Han Tsai <mi...@gmail.com> wrote:
> >>>> Hi all,
> >>>>
> >>>> I have been thinking about adding some DAGs that are for the purpose of
> >>>> Airflow cluster operation, DAG schedule delay instrumentation and log
> >>>> retention for instance. Currently we have example_dags; should we add
> >>>> another directory utility_dags in the repo? We can have a flag in
> >>>> airflow.cfg to let the user decide whether to load the utility_dags
> >>>> (just like what we did for example_dags).
> >>>>
> >>>> What do you think?
> >>>>
> >>>> --
> >>>> Chao-Han Tsai
> >>>
> >>
> >
> >
> > --
> >
> > Chao-Han Tsai
>
>
