Hey Vardan, I also run a system with a large number of DAGs.

Regarding the slowness in the UI: there are a few fixes that went into 1.10.7 which reduced the number of DAGs Airflow loads when browsing, and there are a couple more changes going into the next release (I hope!) which should speed it up further.

Regarding pruning the metadata, there's a repository with some examples here: https://github.com/teamclairvoyant/airflow-maintenance-dags

Be very careful pruning DagRuns, as you can end up with DagRuns getting re-scheduled by the scheduler if catchup=True.
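To make that caveat concrete, here is a minimal sketch of the kind of pruning DAG that repository contains. This is my own illustration rather than code from the repo; the 30-day window and the DAG/task names are placeholders, and it assumes 1.10.x import paths:

    # Sketch only: delete dag_run rows older than a retention window.
    # RETENTION_DAYS is a made-up knob, not an Airflow setting. Only safe
    # for DAGs with catchup=False -- otherwise the scheduler will happily
    # backfill (re-schedule) the runs you just deleted.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.models import DagRun
    from airflow.operators.python_operator import PythonOperator
    from airflow.utils.db import provide_session

    RETENTION_DAYS = 30

    @provide_session
    def prune_old_dag_runs(session=None):
        cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
        deleted = (
            session.query(DagRun)
            .filter(DagRun.execution_date < cutoff)
            .delete(synchronize_session=False)
        )
        print("Deleted %d dag_run rows older than %s" % (deleted, cutoff))

    dag = DAG(
        dag_id="prune_dag_runs_sketch",
        schedule_interval="@daily",
        start_date=datetime(2020, 1, 1),
        catchup=False,
    )

    PythonOperator(
        task_id="prune_dag_runs",
        python_callable=prune_old_dag_runs,
        dag=dag,
    )

I believe the real cleanup DAGs in that repo also cover the other big tables (task_instance, log, xcom, and so on), so treat the above purely as a starting point.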
R

On Thu, 23 Jan 2020 at 12:54, Vardan Gupta <[email protected]> wrote:
>
> Hi Devs,
>
> Just wanted to share a production scenario that our team is trying to solve here in our organization.
>
> A little background: we have 20+ Airflow (v1.10.2, with Kubernetes Executor and MySQL as meta-store) clusters in our organization, with 10k active workflows and 100k daily runs. Each solution schedules workflows differently: some run on a schedule, others are triggered on demand. The minimum scheduling interval that one can configure is one minute, so the maximum number of runs such a workflow can accumulate is 1440/day and 43200/month. If such workflows are not deleted for 2-3 months, all their previous run details remain in the meta-store, and the problem is amplified by ad-hoc triggers, where run counts can grow even larger. That's where we start hitting performance issues in the Airflow UI, and perhaps in the scheduler too, because of slow result retrieval from MySQL (there are a few bad queries which get formulated to use an IN clause).
>
> We were thinking of exposing a policy at the workflow/cluster level which can restrict the number of runs preserved in the meta-store for a workflow. A couple of things will be required for this:
>
> 1. *Define what counts as an older run* – it could be time bound (by exposing a variable like maxOlderRunsInDays), or capped by run count (by exposing a variable like keepMaxTotalRuns, after which each new dag_run creation archives the oldest run), or a function of both. I guess the run-count cap is the most important, but the other could be sensible too depending upon the use case.
> 2. *Archive older runs' data*, maybe in the same meta-store, with archival tables having the same schema as the models but with flexibility in constraints.
>
> For time-bound archival, say a 30-day history policy, we would need a process within Airflow or outside it (maybe a DAG with periodic runs which archives older runs); if we instead cap by run count, it would perhaps be easier to handle it in the Airflow code, in the dag_run creation block.
>
> Though we haven't done formal benchmarking of the slowness observed, we plan to, as it will help us know the limits we want to apply to the system.
>
> Would be happy to hear from the community about how they feel about this problem.
>
>
> Regards,
>
> Vardan Gupta
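P.S. For the "archive older runs" idea in your second point, here is a rough sketch of what that could look like against the MySQL meta-store. The dag_run_archive table name and the 30-day window are placeholders of mine, not anything Airflow provides, and a real version would also need to deal with the task_instance, log, and xcom rows that reference those runs (and with relaxing the copied constraints, since CREATE TABLE ... LIKE copies indexes as-is):

    # Sketch only: copy old dag_run rows into a same-schema archive table,
    # then delete them from the live table. Assumes a MySQL meta-store;
    # dag_run_archive is a hypothetical table, not part of Airflow's schema.
    from datetime import datetime, timedelta

    from airflow.utils.db import provide_session

    @provide_session
    def archive_old_dag_runs(days=30, session=None):
        cutoff = datetime.utcnow() - timedelta(days=days)
        session.execute(
            "CREATE TABLE IF NOT EXISTS dag_run_archive LIKE dag_run"
        )
        session.execute(
            "INSERT INTO dag_run_archive "
            "SELECT * FROM dag_run WHERE execution_date < :cutoff",
            {"cutoff": cutoff},
        )
        session.execute(
            "DELETE FROM dag_run WHERE execution_date < :cutoff",
            {"cutoff": cutoff},
        )

The same catchup warning applies here: if a DAG has catchup=True, archiving its runs out of dag_run will just make the scheduler recreate them.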
