Friendly ping :). Do you think you could elaborate on the fault tolerance a bit, Daniel? Thanks for your help!
On Wed, Sep 12, 2018 at 5:35 PM Kevin Lam <ke...@fathomhealth.co> wrote:

> Hi Daniel,
>
> Thanks for the reply!
>
> No, we haven't looked too deeply into it. Can you elaborate a bit on how
> that works? With the KubernetesExecutor, if a DAG is in flight and part of
> Airflow goes down, will it be able to recover? How do Airflow workers
> reconnect to Pods that were in flight?
>
> On Wed, Sep 12, 2018 at 4:59 PM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Hi Kevin,
>>
>> Have you looked into the KubernetesExecutor? We achieve fault tolerance
>> using the Kubernetes resourceVersion to ensure that all state is
>> reproducible.
>>
>> On Wed, Sep 12, 2018 at 1:08 PM Kevin Lam <ke...@fathomhealth.co> wrote:
>>
>> > Hi all,
>> >
>> > We currently run Airflow as a Deployment in a Kubernetes cluster. We also
>> > use a variant of KubernetesOperator to run our DAGs.
>> >
>> > We are investigating how best to make Airflow fault-tolerant, in part due
>> > to investigating the use of preemptible VMs [1]. *Has there been much
>> > discussion about how to deploy Airflow in a fault-tolerant way? Are
>> > there any best practices? Ideally we'd like our Kubernetes-hosted Airflow
>> > to support rolling updates for Docker image updates and also recover from
>> > components (worker, scheduler, web) going down temporarily, including when
>> > DAGs are in flight.*
>> >
>> > Any advice, ideas and/or feedback appreciated!
>> >
>> > [1] https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms