When you see the task being launched again on startup of the executor, does
the second launched task get killed when it tries to start up, or does it
actually run the `execute` task instance method (and try to launch a pod on
Kubernetes)?

I ask because the latter should not happen if Postgres is healthy (due to
row-level locking on the task instance (ti) table).
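
To make that concrete, here is a rough sketch of the idea, assuming a
Postgres backend; the function name and DSN below are illustrative, not the
actual TaskInstance code (I believe the real check goes through
refresh_from_db(lock_for_update=True) before execution):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# Illustrative DSN -- use your own sql_alchemy_conn in practice.
engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
Session = sessionmaker(bind=engine)

def try_to_claim_ti(dag_id, task_id, execution_date):
    """Return True only if this process wins the task_instance row."""
    session = Session()
    # SELECT ... FOR UPDATE takes a row-level lock, so a second task trying
    # to start the same TI blocks here until the first one commits.
    row = session.execute(
        text(
            "SELECT state FROM task_instance "
            "WHERE dag_id = :d AND task_id = :t AND execution_date = :e "
            "FOR UPDATE"
        ),
        {"d": dag_id, "t": task_id, "e": execution_date},
    ).fetchone()
    if row is None or row.state == "running":
        session.rollback()
        return False  # another process already owns this task instance
    session.execute(
        text(
            "UPDATE task_instance SET state = 'running' "
            "WHERE dag_id = :d AND task_id = :t AND execution_date = :e"
        ),
        {"d": dag_id, "t": task_id, "e": execution_date},
    )
    session.commit()
    return True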

The Kubernetes executor work is trying to iron out crash-safety issues, as
it's super important to be able to tolerate failures when running the
scheduler in a container orchestrator. So surfacing these edge cases would
be super useful.

On Dec 17, 2017 2:38 PM, "Christopher Bockman" <ch...@fathomhealth.co>
wrote:

> Hmm, perhaps we've just had a couple of bad/unlucky runs, but in general
> the underlying task-kill process doesn't really seem to work, from what
> we've seen.  I would guess this is related to
> https://issues.apache.org/jira/browse/AIRFLOW-1623.
>
>
>
> On Sun, Dec 17, 2017 at 12:22 PM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
>
> > Use shorter heartbeats. You might still have some tasks being scheduled
> > nevertheless, due to the time window. However, if the task detects it is
> > running somewhere else, it should also terminate itself.
> >
> > [scheduler]
> > # Task instances listen for external kill signal (when you clear tasks
> > # from the CLI or the UI), this defines the frequency at which they
> > # should listen (in seconds).
> > job_heartbeat_sec = 5
> >
> > Bolke.
> >
> >
> > > On 17 Dec 2017, at 20:59, Christopher Bockman <ch...@fathomhealth.co>
> > > wrote:
> > >
> > >> P.S. I am assuming that you are talking about your scheduler going
> > >> down, not workers
> > >
> > > Correct (and, in some unfortunate scenarios, everything else...)
> > >
> > >> Normally a task will detect (on the heartbeat interval) whether its
> > >> state was changed externally and will terminate itself.
> > >
> > > Hmm, that would be an acceptable solution, but this doesn't
> > > (automatically, in our current configuration) occur.  How can we
> > > encourage this behavior to happen?
> > >
> > >
> > > On Sun, Dec 17, 2017 at 11:47 AM, Bolke de Bruin <bdbr...@gmail.com>
> > > wrote:
> > >
> > >> Quite important to know is that Airflow’s executors do not keep state
> > >> after a restart. This particularly affects distributed executors
> > >> (celery, dask) as the workers are independent from the scheduler. Thus
> > >> at restart we reset all the tasks in the queued state that the executor
> > >> does not know about, which means all of them at the moment. Due to the
> > >> distributed nature of the executors, tasks can still be running.
> > >> Normally a task will detect (on the heartbeat interval) whether its
> > >> state was changed externally and will terminate itself.
> > >>
> > >> I did some work some months ago to make the executor keep state over
> > >> restarts, but never got around to finishing it.
> > >>
> > >> So at the moment, to prevent requeuing, you need to make the Airflow
> > >> scheduler not go down (as much).
> > >>
> > >> Bolke.
> > >>
> > >> P.S. I am assuming that you are talking about your scheduler going
> > >> down, not workers
> > >>
> > >>> On 17 Dec 2017, at 20:07, Christopher Bockman <ch...@fathomhealth.co>
> > >>> wrote:
> > >>>
> > >>> Upon further internal discussion, we might be seeing the task cloning
> > >>> because the postgres DB is getting into a corrupted state...but
> > >>> unclear.  If the consensus is we *shouldn't* be seeing this behavior,
> > >>> even as-is, we'll push more on that angle.
> > >>>
> > >>> On Sun, Dec 17, 2017 at 10:45 AM, Christopher Bockman
> > >>> <ch...@fathomhealth.co> wrote:
> > >>>
> > >>>> Hi all,
> > >>>>
> > >>>> We run DAGs, and sometimes Airflow crashes (for whatever
> > >>>> reason--maybe something as simple as the underlying infrastructure
> > >>>> going down).
> > >>>>
> > >>>> Currently, we run everything on Kubernetes (including Airflow), so
> > >>>> crashes of the Airflow pods will generally be detected, and then the
> > >>>> pods will restart.
> > >>>>
> > >>>> However, if we have, e.g., a DAG that is running task X when it
> > >>>> crashes, when Airflow comes back up, it apparently sees that task X
> > >>>> didn't complete, so it restarts the task (which, in this case, means
> > >>>> it spins up an entirely new instance/pod).  Thus, both runs "X_1" and
> > >>>> "X_2" are fired off simultaneously.
> > >>>>
> > >>>> Is there any (out of the box) way to better connect up state between
> > >>>> tasks and Airflow to prevent this?
> > >>>>
> > >>>> (For additional context, we currently execute Kubernetes jobs via a
> > >>>> custom operator that basically layers on top of
> > >>>> BashOperator...perhaps the new Kubernetes operator will help address
> > >>>> this?)
> > >>>>
> > >>>> Thank you in advance for any thoughts,
> > >>>>
> > >>>> Chris
> > >>>>
> > >>
> > >>
> >
> >
>
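
For reference, the self-termination Bolke describes above works roughly like
the following. This is only a hedged sketch of the heartbeat-side check; the
function and variable names are illustrative, not the actual LocalTaskJob
code:

import os
import signal
import socket
import time

JOB_HEARTBEAT_SEC = 5  # mirrors [scheduler] job_heartbeat_sec

def heartbeat_loop(session, ti, task_pid):
    # On every heartbeat, re-read the task_instance row and compare it to
    # what this worker thinks it is doing.
    while True:
        time.sleep(JOB_HEARTBEAT_SEC)
        session.refresh(ti)  # pull the current state from the metadata DB
        # If the state was changed externally (cleared from the UI/CLI,
        # marked failed, or another host took over the run), kill the
        # locally running task process instead of letting two copies race.
        if ti.state != "running" or ti.hostname != socket.gethostname():
            os.kill(task_pid, signal.SIGTERM)
            return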
