To add to Siddharth's pretty extensive list (in particular, the "delete a
DAG from the code that makes up the dag bag folder, and it still shows up
with a ! icon until you manually set is_active = f" issue, which I didn't
see in 1.8.0-RC4 but started seeing in 1.8.0-RC5, the RC that became
1.8.0):

How does XCom data get cleaned up? It would be nice to either let tasks
consume the data (so that it goes away from the backing db after an ack
or something) or, at the very least, expire it with a TTL after a set
interval.
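
For what it's worth, here is a rough sketch of the TTL idea. It is not an
existing Airflow feature. It assumes the metadata db has an `xcom` table
with a `timestamp` column, and it uses an in-memory SQLite db as a stand-in
so that it runs on its own. `purge_old_xcom` and the demo dag/task names
are made up for illustration.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_xcom(conn, ttl, now=None):
    """Delete xcom rows older than `ttl` and return how many were removed.

    Hypothetical helper: assumes an `xcom` table whose `timestamp` column
    holds naive ISO-formatted strings ("YYYY-MM-DD HH:MM:SS").
    """
    now = now or datetime.now()
    cutoff = (now - ttl).isoformat(sep=" ")
    cur = conn.execute("DELETE FROM xcom WHERE timestamp < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


# Stand-in for the metadata db; the schema here is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE xcom (
        id INTEGER PRIMARY KEY,
        "key" TEXT,
        value TEXT,
        timestamp TEXT,
        dag_id TEXT,
        task_id TEXT
    )
    """
)

now = datetime(2017, 4, 5, 12, 0, 0)
rows = [
    ("k1", "old", now - timedelta(days=10)),  # older than the TTL
    ("k2", "new", now - timedelta(days=1)),   # still within the TTL
]
conn.executemany(
    """
    INSERT INTO xcom ("key", value, timestamp, dag_id, task_id)
    VALUES (?, ?, ?, 'demo_dag', 'demo_task')
    """,
    [(k, v, ts.isoformat(sep=" ")) for k, v, ts in rows],
)

# Expire everything older than 7 days.
removed = purge_old_xcom(conn, timedelta(days=7), now=now)
remaining = [r[0] for r in conn.execute('SELECT "key" FROM xcom ORDER BY id')]
```

Something like this could run as a daily maintenance job, which is the
same pattern Max describes for archiving old rows. The consume-and-ack
variant would need support from the tasks themselves, so I haven't tried
to sketch it.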



On Wed, Apr 5, 2017 at 7:46 PM, siddharth anand <san...@apache.org> wrote:

> Edgardo,
> This is a great question and something that requires functionality to
> address. As Airflow starts getting used for bigger workloads, we need a way
> to clean up defunct resources.
>
>    - How do we delete a dag and its related resources?
>       - Until the recent release, the way that I stopped having a defunct
>       (retired) dag show up in the UI was to move the DAG file out of the
>       dag_folder or just delete it from Git. Our dag folders are just
>       symlinks to tagged Git repos.
>       - This no longer works -- the UI will display the dag list based on
>       entries in the dag table in the airflow metadata db -- but will no
>       longer have code to back that dag table entry. I currently manually
>       delete a row from the dag table, but that is surely not the right
>       thing to do.
>       - How do we retire entries from the *task_instance*, *job*, *log*,
>       *xcom*, *sla_miss*, *dag_stats*, and *dag_run* tables for dags that
>       are deleted? (I can surely clean these up manually as well, but we
>       need a UI control).
>          - The *task_instance*, *job*, *log*, and *dag_run* tables grow
>          faster than the others.
>       - How does one track if variables, connections, or pools are no
>       longer referenced because all of the DAGs that use them are gone?
>          - It would be nice here to have reference counts & links to DAGs
>          that reference a Pool, Connection, or Variable. The reference
>          counts can be broken down into paused & unpaused.
>
> It's time we added some functionality to the API/CLI/UI to address these
> gaps.
>
> -s
>
> On Tue, Apr 4, 2017 at 10:25 AM, Edgardo Vega <edgardo.v...@gmail.com>
> wrote:
>
> > Max,
> >
> > Thanks for the reply, it is much appreciated.  I am currently running
> > ~10k tasks a day in our test environment.
> >
> > It is good to know where the archive point is and that I shouldn't have
> > any issues for a long time.
> >
> > I was just thinking ahead as we get airflow into a production
> > environment. Maybe in this case way too far ahead.
> >
> >
> > Cheers,
> >
> > Edgardo
> >
> > On Tue, Apr 4, 2017 at 11:58 AM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > We run ~50k tasks a day at Airbnb. How many tasks/day are you planning
> > > on running?
> > >
> > > You can archive the `task_instance` and `job` tables down the line,
> > > but that shouldn't be a concern until you hit tens of millions of
> > > entries. Then you can set up a daily Airflow job that archives some of
> > > these entries. I believe we do it based on `start_date` and move rows
> > > to some other table in the same db.
> > >
> > > Max
> > >
> > > On Mon, Apr 3, 2017 at 5:30 PM, Edgardo Vega <edgardo.v...@gmail.com>
> > > wrote:
> > >
> > > > I have been playing with airflow for a few days and it's not obvious
> > > > what will happen down the road when we have lots of dags over a long
> > > > period of time. I set a fake dag to run once a minute for a few days
> > > > and everything seems okay except the graph view dropdown, which works
> > > > but takes a few seconds to show up.
> > > >
> > > > Is there a way to roll older data out of the system in order to clean
> > > > things up visually and keep the database at a smallish size?
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > Edgardo
> > > >
> > >
> >
> >
> >
> > --
> > Cheers,
> >
> > Edgardo
> >
>
