Hi Kevin,

Have you looked into the KubernetesExecutor? We achieve fault tolerance using the Kubernetes resourceVersion to ensure that all state is reproducible.
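
To make the resourceVersion idea concrete, here is a minimal sketch of the general pattern, not the executor's actual code: a watcher saves the last resourceVersion it handled, and after a restart it asks only for events newer than that checkpoint. Everything named here (`Checkpoint`, `events_since`, `process`, the in-memory `log`) is hypothetical and stands in for the Kubernetes watch API and for durable storage. Real resourceVersions are opaque strings handed back to the API server; integers are used only so the fake log can be filtered locally.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical event shape: (resource_version, pod_name, phase).
# Integers are an illustration only; real resourceVersions are opaque strings.
Event = Tuple[int, str, str]


@dataclass
class Checkpoint:
    """Stands in for durable storage (for example a database row)."""
    resource_version: int = 0
    pod_phase: Dict[str, str] = field(default_factory=dict)


def events_since(log: List[Event], resource_version: int) -> List[Event]:
    """Fake 'watch': return only events newer than the given version."""
    return [event for event in log if event[0] > resource_version]


def process(log: List[Event], checkpoint: Checkpoint,
            crash_after: Optional[int] = None) -> bool:
    """Apply events and advance the checkpoint after each one.

    Returns False if a simulated crash interrupted the run, True otherwise.
    """
    handled = 0
    for version, pod, phase in events_since(log, checkpoint.resource_version):
        if crash_after is not None and handled == crash_after:
            return False  # simulated crash before handling this event
        checkpoint.pod_phase[pod] = phase
        checkpoint.resource_version = version
        handled += 1
    return True


log = [
    (101, "task-a", "Pending"),
    (102, "task-a", "Running"),
    (103, "task-b", "Pending"),
    (104, "task-a", "Succeeded"),
]

checkpoint = Checkpoint()
process(log, checkpoint, crash_after=2)  # watcher dies after two events
process(log, checkpoint)                 # restart resumes from version 102
```

Because the checkpoint moves forward only after an event has been applied, a restarted watcher neither skips nor re-applies events, which is what lets the scheduler go down mid-DAG and pick up where it left off.
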

On Wed, Sep 12, 2018 at 1:08 PM Kevin Lam <ke...@fathomhealth.co> wrote:

> Hi all,
>
> We currently run Airflow as a Deployment in a Kubernetes cluster. We also
> use a variant of KubernetesOperator to run our DAGs.
>
> We are investigating how to best make Airflow fault-tolerant, in part due
> to investigating the use of preemptible VMs [1]. *Has there been much
> discussion about how to deploy Airflow in a fault-tolerant way? Are
> there any best practices? Ideally we'd like our Kubernetes-hosted Airflow
> to support rolling updates for Docker image updates and also recover from
> components (worker, scheduler, web) going down temporarily, including when
> DAGs are in flight.*
>
> Any advice, ideas and/or feedback appreciated!
>
> [1] https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms