Hi Devs,


Just wanted to share a production scenario that our team is trying to solve
in our organization.



A little background: we have 20+ Airflow clusters (v1.10.2, with the
Kubernetes Executor and MySQL as the meta-store) in our organization, with
10k active workflows and 100k daily runs. Each solution schedules its
workflows differently: some run them on a fixed schedule while others
trigger them on demand. The minimum scheduling interval one can configure
is at the minute level, so such a workflow can produce up to 1440 runs/day
and 43200 runs/month. If such workflows are not deleted for 2-3 months, all
of those previous run details stay in the meta-store, and the problem is
amplified by ad-hoc triggers, where the run count can grow even larger.
That's where we start hitting performance issues on the Airflow UI, and
perhaps in scheduling too, because of slow result retrieval from MySQL
(a few badly formulated queries end up using large IN clauses).



We are thinking of exposing a policy at the workflow/cluster level which
can restrict the number of runs preserved in the meta-store for a workflow.
A couple of things will be required for this:



   1. *Define what counts as an older run* – Should it be time bound (by
   exposing a variable like maxOlderRunsInDays), a cap on the number of
   runs (by exposing a variable like keepMaxTotalRuns, after which each new
   dag_run creation archives the oldest run, in ascending order), or a
   function of both? We suspect controlling the number of runs matters
   most, but the other could be sensible too depending on the use case.
   2. *Archive the older runs' data*, perhaps in the same meta-store, with
   archival tables having the same schema as the models but more
   flexibility in constraints. A rough sketch of both points follows below.
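
To make this concrete, here is a minimal sketch of the keepMaxTotalRuns
idea, assuming a hypothetical dag_run_archive table with the same columns
as dag_run (e.g. created in MySQL via CREATE TABLE dag_run_archive LIKE
dag_run, with constraints relaxed as needed). Names like
KEEP_MAX_TOTAL_RUNS and archive_oldest_runs are illustrative, not existing
Airflow APIs:

    from airflow.models import DagRun
    from airflow.settings import Session
    from sqlalchemy import text

    # Hypothetical per-workflow cap; in the proposal this would come from
    # a policy variable like keepMaxTotalRuns.
    KEEP_MAX_TOTAL_RUNS = 1000

    def archive_oldest_runs(dag_id):
        """Move dag_run rows beyond the cap into dag_run_archive."""
        session = Session()
        try:
            total = (session.query(DagRun)
                     .filter(DagRun.dag_id == dag_id)
                     .count())
            overflow = total - KEEP_MAX_TOTAL_RUNS
            if overflow <= 0:
                return
            # Oldest runs first (ascending execution_date), per point 1.
            oldest = (session.query(DagRun)
                      .filter(DagRun.dag_id == dag_id)
                      .order_by(DagRun.execution_date.asc())
                      .limit(overflow)
                      .all())
            for run in oldest:
                # Copy the row into the archive table, then delete the
                # original; done row by row for clarity, though a single
                # INSERT ... SELECT over the overflow set would be more
                # efficient.
                session.execute(
                    text("INSERT INTO dag_run_archive "
                         "SELECT * FROM dag_run WHERE id = :id"),
                    {"id": run.id},
                )
                session.delete(run)
            session.commit()
        finally:
            session.close()

A hook like this could be called from the dag_run creation block, so the
cap would be enforced as new runs are created.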



For time-bound archival, say a 30-day history policy, we would need a
process either within Airflow or outside it (perhaps a DAG with periodic
runs that archives older runs; a sketch of such a DAG follows below). If
we instead restrict at the number-of-runs level, it will probably be
easier to make provision in the Airflow code to handle it inside the
dag_run creation block.
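
A rough sketch of such a maintenance DAG, under the same assumptions as
above (the dag_run_archive table and all names here are hypothetical, and
MAX_OLDER_RUNS_IN_DAYS maps to the proposed maxOlderRunsInDays):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.settings import Session
    from sqlalchemy import text

    # Hypothetical retention window from the 30-day example above.
    MAX_OLDER_RUNS_IN_DAYS = 30

    def archive_expired_runs():
        cutoff = datetime.utcnow() - timedelta(days=MAX_OLDER_RUNS_IN_DAYS)
        session = Session()
        try:
            # Copy expired rows into the archive table, then remove them
            # from the live table.
            session.execute(
                text("INSERT INTO dag_run_archive "
                     "SELECT * FROM dag_run WHERE execution_date < :cutoff"),
                {"cutoff": cutoff},
            )
            session.execute(
                text("DELETE FROM dag_run WHERE execution_date < :cutoff"),
                {"cutoff": cutoff},
            )
            session.commit()
        finally:
            session.close()

    dag = DAG(
        dag_id="meta_store_archival",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )

    archive_task = PythonOperator(
        task_id="archive_older_dag_runs",
        python_callable=archive_expired_runs,
        dag=dag,
    )

One caveat: if the task fails between the INSERT and the DELETE, a retry
would insert duplicates; the relaxed constraints on the archive table (or
MySQL's INSERT IGNORE) could absorb that.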



We haven't done formal benchmarking of the observed slowness yet, but we
plan to, since it will help us decide what limits to apply to the system.



Would be happy to hear how the community feels about this problem.


Regards,

Vardan Gupta
