Re: Make Scheduler More Centralized

2017-03-15 Thread Maxime Beauchemin
A few related thoughts about the scheduler.

The scheduler is growing to take on much more than just scheduling, so much
so that "supervisor" would be a better name for it. It includes:
* parsing DAGs (eventually it may serialize their metadata to the database
to help make the web server stateless)
* scheduling (duh.)
* monitoring heartbeats and handling related failures (email,
on_failure_callback, ...)
* buffering + prioritization while applying pool constraints
* ?+[new] handling all task failures?

One approach would be to break down this workload and distribute it, to a
point where it really doesn't matter who's doing the "supervising" work. A
worker slot could be assigned to run a single "supervisor cycle" for a
single DAG for instance. Finishing a task run could also trigger an attempt
to schedule the dependent tasks (to allow for very low latency scheduling!)
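
A minimal Python sketch of the "finishing a task run triggers scheduling of its dependents" idea. The DAG layout, the dictionaries and the function name `ready_downstream` are made up for illustration and are not Airflow APIs. It only assumes that a task becomes schedulable once all of its upstream tasks have succeeded.

```python
def ready_downstream(finished_task, downstream, upstream, succeeded):
    """Return the direct children of finished_task whose upstream tasks have all succeeded."""
    ready = []
    for child in downstream.get(finished_task, []):
        if all(parent in succeeded for parent in upstream[child]):
            ready.append(child)
    return ready


# Hypothetical DAG: extract -> transform -> load, plus extract -> load
upstream = {"extract": [], "transform": ["extract"], "load": ["extract", "transform"]}
downstream = {"extract": ["transform", "load"], "transform": ["load"], "load": []}

# "extract" just finished: only "transform" is ready, "load" still waits on "transform"
print(ready_downstream("extract", downstream, upstream, {"extract"}))  # ['transform']
```

Only the direct children of the finished task are examined, so the check stays cheap no matter how many DAGs exist, which is where the low latency would come from.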

The challenge with this approach is that the logs are all over the place
(until we figure out a good distributed logging solution), and that there's a
need for a buffered window to handle prioritization. In many cases
prioritization isn't used or important, and low latency is much more
important. I would tend to vote down changes and designs that take us
further away from low-latency scheduling.

I understand this email doesn't clarify things, but wanted to get this out.

Max

On Wed, Mar 15, 2017 at 2:14 PM, Bolke de Bruin  wrote:

> Hi Rui,
>
> We have been discussing this during the hackathon at Airbnb as well.
> Besides the reservations Gerard is documenting, I am also not enthusiastic
> about this design. Currently, the scheduler is our main issue in scaling.
> Scheduler runs will take longer and longer with more DAGs and more complex
> DAGs (i.e. more tasks in a DAG). Moving more things into the scheduler
> makes it more difficult to move things out again, and that is required when
> we want to move to an event-driven / snowballing scheduler.
>
> I would suggest documenting and enforcing the contracts between the
> scheduler, executor, and task instance. We are lax in that respect and this
> is where a lot of issues stem from. Also the executor is the weak point here
> as it doesn’t do anything with the task state, but it does handle them. The
> points Gerard makes are very valid and we should improve our assumptions of
> the underlying bus.
>
> Cheers
> Bolke
>
> > On 14 Mar 2017, at 15:08, Rui Wang  wrote:
> >
> > Hi,
> > The design doc below I created is trying to make airflow scheduler more
> > centralized. Briefly speaking, I propose moving state change of
> > TaskInstance to scheduler. You can see the reasons for this change below.
> >
> >
> > Could you take a look and comment if you see anything does not make
> sense?
> >
> > -Rui
> >
> > 
> --
> > Current
> > The state of a TaskInstance is changed by both the scheduler and the
> > worker. On the worker side, the worker monitors the TaskInstance and
> > changes its state to RUNNING, then to SUCCESS if the task succeeds, or to
> > UP_FOR_RETRY or FAILED if the task fails. The worker also handles the
> > failure email logic and the failure callback logic.
> >
> > Proposal
> > The general idea is to make a centralized scheduler and make workers dumb.
> > A worker should not change the state of a TaskInstance, but just execute
> > what it is assigned and report the result of the task. Instead, the
> > scheduler should make the decision on TaskInstance state changes. Ideally,
> > workers should not even handle the failure emails and callbacks unless the
> > scheduler asks them to do so.
> >
> > Why
> > A worker does not have as much information as the scheduler has. Bugs have
> > been observed where a worker got into trouble but could not decide to
> > change the task state due to lack of information. Although there is the
> > airflow metadata DB, it is still not easy to share all the information
> > that the scheduler has with workers.
> >
> > We can also ensure a consistent environment. There are slight differences
> > in the chef recipes for the different workers which can cause strange
> > issues when DAGs parse on one but not the other.
> >
> > In the meantime, moving state changes to the scheduler can reduce the
> > complexity of airflow. It especially helps when airflow needs to move to
> > distributed schedulers. In that case, state changes made everywhere by
> > both schedulers and workers are harder to maintain.
> >
> > How to change
> > After lots of discussion, the following steps will be done:
> >
> > 1. Add a new column to TaskInstance table. Worker will fill this column
> > with the task process exit code.
> >
> > 2. Worker will only set TaskInstance state to RUNNING when it is ready to
> > run task. There was debate on moving RUNNING to scheduler as well. If
> > moving RUNNING to scheduler, either scheduler marks TaskInstance RUNNING
> > before it gets into queue, or scheduler checks the status code in column
> > above, which is updated 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Dan Davydov
The only thing is that this is a change in semantics and changing semantics
(breaking some DAGs) and then changing them back (and breaking things
again) isn't great.

On Wed, Mar 15, 2017 at 7:02 PM, Bolke de Bruin  wrote:

> Indeed that could be the case. Let's get 1.8.0 out the door so we can
> focus on these bug fixes for 1.8.1.
>
> Bolke
>
> Sent from my iPhone
>
> > On 15 Mar 2017, at 18:25, Dan Davydov 
> wrote:
> >
> > Another issue we are seeing is
> > https://issues.apache.org/jira/browse/AIRFLOW-992 - tasks that have both
> > skipped children and successful children are run instead of skipped. Not
> > blocking the release on this just letting you guys know for the release
> bug
> > notes. We will be cherrypicking a fix for this onto our production when
> we
> > release 1.8 once we come up with one.
> >
> > It's possibly, though not necessarily, related to an incomplete/incorrect
> > fix of https://issues.apache.org/jira/browse/AIRFLOW-719 .
> >
> >> On Wed, Mar 15, 2017 at 4:53 PM, siddharth anand 
> wrote:
> >>
> >> Confirmed that Bolke's PR above fixes the issue.
> >>
> >> Also, I agree this is not a blocker for the current airflow release, so
> my
> >> +1 (binding) stands.
> >> -s
> >>
> >>> On Wed, Mar 15, 2017 at 3:11 PM, Bolke de Bruin 
> wrote:
> >>>
> >>> PR is available: https://github.com/apache/incubator-airflow/pull/2154
> >>>
> >>> But marked for 1.8.1.
> >>>
> >>> - Bolke
> >>>
>  On 15 Mar 2017, at 14:37, Bolke de Bruin  wrote:
> 
>  On second thought I do consider it a bug and can have a fix out pretty
> >>> quickly, but I don’t consider it a blocker.
> 
>  - B.
> 
> > On 15 Mar 2017, at 14:21, Bolke de Bruin  wrote:
> >
> > Just to be clear: Also in 1.7.1 the DagRun was marked successful, but
> > its tasks continued to be scheduled. So one could also consider 1.7.1
> > behaviour a bug. I am not sure here, but I think it kind of makes sense
> > to consider the behaviour of 1.7.1 a bug. It has been present throughout
> > all the 1.8 rc/beta/alpha series.
> >
> > So yes, it is a change in behaviour; whether it is a regression or an
> > integrity improvement is up for discussion. Either way I don’t consider
> > it a blocker.
> >
> > Bolke.
> >
> >> On 15 Mar 2017, at 14:06, siddharth anand 
> wrote:
> >>
> >> Here's the JIRA :
> >> https://issues.apache.org/jira/browse/AIRFLOW-989
> >>
> >> I confirmed it is a regression from 1.7.1.3, which I installed via
> >> pip
> >>> and
> >> tested against the same DAG in the JIRA.
> >>
> >> The issue occurs if a leaf / last / terminal downstream task is not
> >> cleared. You won't see this issue if you clear the entire DAG Run or
> >>> clear
> >> a task and all of its downstream tasks. If you truly want to only
> >>> clear and
> >> rerun a task, but not its downstream tasks, you can use the CLI to
> >>> execute
> >> a specific task (e.g. via `airflow run`).
> >>
> >> This is a change in behavior -- if we do go ahead with the release,
> >>> then
> >> this JIRA should be in a list of JIRAs of known issues related to
> the
> >>> new
> >> version.
> >> -s
> >>
> >> On Wed, Mar 15, 2017 at 9:17 AM, Chris Riccomini <
> >>> criccom...@apache.org>
> >> wrote:
> >>
> >>> @Sid, does this happen if you clear downstream as well?
> >>>
> >>> On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini <
> >>> criccom...@apache.org>
> >>> wrote:
> >>>
>  Has anyone been able to reproduce Sid's issue?
> 
>  On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin <
> >> bdbr...@gmail.com>
>  wrote:
> 
> > That is not an airflow error, but a Kerberos error. Try executing
> >>> the
> > kinit command on the command line by yourself.
> >
> > Bolke
> >
> > Sent from my iPhone
> >
> >> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov <
> >> dautkha...@gmail.com>
> > wrote:
> >>
> >> `airflow kerberos` is broken in 1.8-rc5
> >> https://issues.apache.org/jira/browse/AIRFLOW-987
> >> Hopefully fix can be part of the 1.8 release.
> >>
> >>
> >>
> >> --
> >> Ruslan Dautkhanov
> >>
> >>> On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand <
> >>> san...@apache.org>
> > wrote:
> >>>
> >>> FYI,
> >>> I've just hit a major bug in the release candidate related to
> >>> "clear
> > task"
> >>> behavior.
> >>>
> >>> I've been running airflow in both stage and prod since
> yesterday
> >>> on
> > rc5 and
> >>> have reproduced this in both environments. I will file a JIRA
> >> for
> >>> this
> >>> 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread siddharth anand
Confirmed that Bolke's PR above fixes the issue.

Also, I agree this is not a blocker for the current airflow release, so my
+1 (binding) stands.
-s

On Wed, Mar 15, 2017 at 3:11 PM, Bolke de Bruin  wrote:

> PR is available: https://github.com/apache/incubator-airflow/pull/2154
>
> But marked for 1.8.1.
>
> - Bolke
>
> > On 15 Mar 2017, at 14:37, Bolke de Bruin  wrote:
> >
> > On second thought I do consider it a bug and can have a fix out pretty
> quickly, but I don’t consider it a blocker.
> >
> > - B.
> >
> >> On 15 Mar 2017, at 14:21, Bolke de Bruin  wrote:
> >>
> >> Just to be clear: Also in 1.7.1 the DagRun was marked successful, but
> >> its tasks continued to be scheduled. So one could also consider 1.7.1
> >> behaviour a bug. I am not sure here, but I think it kind of makes sense to
> >> consider the behaviour of 1.7.1 a bug. It has been present throughout all
> >> the 1.8 rc/beta/alpha series.
> >>
> >> So yes, it is a change in behaviour; whether it is a regression or an
> >> integrity improvement is up for discussion. Either way I don’t consider it
> >> a blocker.
> >>
> >> Bolke.
> >>
> >>> On 15 Mar 2017, at 14:06, siddharth anand  wrote:
> >>>
> >>> Here's the JIRA :
> >>> https://issues.apache.org/jira/browse/AIRFLOW-989
> >>>
> >>> I confirmed it is a regression from 1.7.1.3, which I installed via pip
> and
> >>> tested against the same DAG in the JIRA.
> >>>
> >>> The issue occurs if a leaf / last / terminal downstream task is not
> >>> cleared. You won't see this issue if you clear the entire DAG Run or
> clear
> >>> a task and all of its downstream tasks. If you truly want to only
> clear and
> >>> rerun a task, but not its downstream tasks, you can use the CLI to
> execute
> >>> a specific task (e.g. via `airflow run`).
> >>>
> >>> This is a change in behavior -- if we do go ahead with the release,
> then
> >>> this JIRA should be in a list of JIRAs of known issues related to the
> new
> >>> version.
> >>> -s
> >>>
> >>> On Wed, Mar 15, 2017 at 9:17 AM, Chris Riccomini <
> criccom...@apache.org>
> >>> wrote:
> >>>
>  @Sid, does this happen if you clear downstream as well?
> 
>  On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini <
> criccom...@apache.org>
>  wrote:
> 
> > Has anyone been able to reproduce Sid's issue?
> >
> > On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
> > wrote:
> >
> >> That is not an airflow error, but a Kerberos error. Try executing
> the
> >> kinit command on the command line by yourself.
> >>
> >> Bolke
> >>
> >> Sent from my iPhone
> >>
> >>> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
> >> wrote:
> >>>
> >>> `airflow kerberos` is broken in 1.8-rc5
> >>> https://issues.apache.org/jira/browse/AIRFLOW-987
> >>> Hopefully fix can be part of the 1.8 release.
> >>>
> >>>
> >>>
> >>> --
> >>> Ruslan Dautkhanov
> >>>
>  On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand <
> san...@apache.org>
> >> wrote:
> 
>  FYI,
>  I've just hit a major bug in the release candidate related to
> "clear
> >> task"
>  behavior.
> 
>  I've been running airflow in both stage and prod since yesterday
> on
> >> rc5 and
>  have reproduced this in both environments. I will file a JIRA for
>  this
>  tonight, but wanted to send a note over email as well.
> 
>  In my example, I have a 2 task DAG. For a given DAG run that has
> >> completed
>  successfully, if I
>  1) clear task2 (leaf task in this case), the previously-successful
>  DAG
> >> Run
>  goes back to Running, requeues, and executes the task
> successfully.
> >> The DAG
>  Run then returns from Running to Success.
>  2) clear task1 (root task in this case), the previously-successful
>  DAG
> >> Run
>  goes back to Running, DOES NOT requeue or execute the task at all.
>  The
> >> DAG
>  Run then returns from Running to Success though it never ran the
> task.
> 
>  1) is expected and previous behavior. 2) is a regression.
> 
>  The only workaround is to use the CLI to run the task cleared.
> Here
>  are
>  some images :
>  *After Clearing the Tasks*
>  https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
>  202017-03-14%2014.09.34.png?dl=0
> 
>  *After DAG Runs return to Success*
>  https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
>  202017-03-14%2014.09.49.png?dl=0
> 
>  This is a major regression because it will force everyone to use
> the
> >> CLI
>  for things that they would normally use the UI for.
> 
>  -s
> 
> 
>  -s
> 
> 
> > On Tue, 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Bolke de Bruin
PR is available: https://github.com/apache/incubator-airflow/pull/2154

But marked for 1.8.1.

- Bolke

> On 15 Mar 2017, at 14:37, Bolke de Bruin  wrote:
> 
> On second thought I do consider it a bug and can have a fix out pretty 
> quickly, but I don’t consider it a blocker.
> 
> - B.
> 
>> On 15 Mar 2017, at 14:21, Bolke de Bruin  wrote:
>> 
>> Just to be clear: Also in 1.7.1 the DagRun was marked successful, but its 
>> tasks continued to be scheduled. So one could also consider 1.7.1 behaviour 
>> a bug. I am not sure here, but I think it kind of makes sense to consider 
>> the behaviour of 1.7.1 a bug. It has been present throughout all the 1.8 
>> rc/beta/alpha series.
>> 
>> So yes, it is a change in behaviour; whether it is a regression or an 
>> integrity improvement is up for discussion. Either way I don’t consider it a 
>> blocker.
>> 
>> Bolke.
>> 
>>> On 15 Mar 2017, at 14:06, siddharth anand  wrote:
>>> 
>>> Here's the JIRA :
>>> https://issues.apache.org/jira/browse/AIRFLOW-989
>>> 
>>> I confirmed it is a regression from 1.7.1.3, which I installed via pip and
>>> tested against the same DAG in the JIRA.
>>> 
>>> The issue occurs if a leaf / last / terminal downstream task is not
>>> cleared. You won't see this issue if you clear the entire DAG Run or clear
>>> a task and all of its downstream tasks. If you truly want to only clear and
>>> rerun a task, but not its downstream tasks, you can use the CLI to execute
>>> a specific task (e.g. via `airflow run`).
>>> 
>>> This is a change in behavior -- if we do go ahead with the release, then
>>> this JIRA should be in a list of JIRAs of known issues related to the new
>>> version.
>>> -s
>>> 
>>> On Wed, Mar 15, 2017 at 9:17 AM, Chris Riccomini 
>>> wrote:
>>> 
 @Sid, does this happen if you clear downstream as well?
 
 On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini 
 wrote:
 
> Has anyone been able to reproduce Sid's issue?
> 
> On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
> wrote:
> 
>> That is not an airflow error, but a Kerberos error. Try executing the
>> kinit command on the command line by yourself.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
>> wrote:
>>> 
>>> `airflow kerberos` is broken in 1.8-rc5
>>> https://issues.apache.org/jira/browse/AIRFLOW-987
>>> Hopefully fix can be part of the 1.8 release.
>>> 
>>> 
>>> 
>>> --
>>> Ruslan Dautkhanov
>>> 
 On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
>> wrote:
 
 FYI,
 I've just hit a major bug in the release candidate related to "clear
>> task"
 behavior.
 
 I've been running airflow in both stage and prod since yesterday on
>> rc5 and
 have reproduced this in both environments. I will file a JIRA for
 this
 tonight, but wanted to send a note over email as well.
 
 In my example, I have a 2 task DAG. For a given DAG run that has
>> completed
 successfully, if I
 1) clear task2 (leaf task in this case), the previously-successful
 DAG
>> Run
 goes back to Running, requeues, and executes the task successfully.
>> The DAG
 Run then returns from Running to Success.
 2) clear task1 (root task in this case), the previously-successful
 DAG
>> Run
 goes back to Running, DOES NOT requeue or execute the task at all.
 The
>> DAG
 Run then returns from Running to Success though it never ran the task.
 
 1) is expected and previous behavior. 2) is a regression.
 
 The only workaround is to use the CLI to run the task cleared. Here
 are
 some images :
 *After Clearing the Tasks*
 https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
 202017-03-14%2014.09.34.png?dl=0
 
 *After DAG Runs return to Success*
 https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
 202017-03-14%2014.09.49.png?dl=0
 
 This is a major regression because it will force everyone to use the
>> CLI
 for things that they would normally use the UI for.
 
 -s
 
 
 -s
 
 
> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
>> wrote:
> 
> +1 (non-binding)!
> 
> On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand <
 san...@apache.org>
> wrote:
> 
>> +1 (binding)
>> 
>> 
>> On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>> 
>>> +1 (binding)
>>> 
>>> On 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Bolke de Bruin
On second thought I do consider it a bug and can have a fix out pretty quickly, 
but I don’t consider it a blocker.

- B.

> On 15 Mar 2017, at 14:21, Bolke de Bruin  wrote:
> 
> Just to be clear: Also in 1.7.1 the DagRun was marked successful, but its 
> tasks continued to be scheduled. So one could also consider 1.7.1 behaviour a 
> bug. I am not sure here, but I think it kind of makes sense to consider the 
> behaviour of 1.7.1 a bug. It has been present throughout all the 1.8 
> rc/beta/alpha series.
> 
> So yes, it is a change in behaviour; whether it is a regression or an integrity 
> improvement is up for discussion. Either way I don’t consider it a blocker.
> 
> Bolke.
> 
>> On 15 Mar 2017, at 14:06, siddharth anand  wrote:
>> 
>> Here's the JIRA :
>> https://issues.apache.org/jira/browse/AIRFLOW-989
>> 
>> I confirmed it is a regression from 1.7.1.3, which I installed via pip and
>> tested against the same DAG in the JIRA.
>> 
>> The issue occurs if a leaf / last / terminal downstream task is not
>> cleared. You won't see this issue if you clear the entire DAG Run or clear
>> a task and all of its downstream tasks. If you truly want to only clear and
>> rerun a task, but not its downstream tasks, you can use the CLI to execute
>> a specific task (e.g. via `airflow run`).
>> 
>> This is a change in behavior -- if we do go ahead with the release, then
>> this JIRA should be in a list of JIRAs of known issues related to the new
>> version.
>> -s
>> 
>> On Wed, Mar 15, 2017 at 9:17 AM, Chris Riccomini 
>> wrote:
>> 
>>> @Sid, does this happen if you clear downstream as well?
>>> 
>>> On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini 
>>> wrote:
>>> 
 Has anyone been able to reproduce Sid's issue?
 
 On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
 wrote:
 
> That is not an airflow error, but a Kerberos error. Try executing the
> kinit command on the command line by yourself.
> 
> Bolke
> 
> Sent from my iPhone
> 
>> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
> wrote:
>> 
>> `airflow kerberos` is broken in 1.8-rc5
>> https://issues.apache.org/jira/browse/AIRFLOW-987
>> Hopefully fix can be part of the 1.8 release.
>> 
>> 
>> 
>> --
>> Ruslan Dautkhanov
>> 
>>> On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
> wrote:
>>> 
>>> FYI,
>>> I've just hit a major bug in the release candidate related to "clear
> task"
>>> behavior.
>>> 
>>> I've been running airflow in both stage and prod since yesterday on
> rc5 and
>>> have reproduced this in both environments. I will file a JIRA for
>>> this
>>> tonight, but wanted to send a note over email as well.
>>> 
>>> In my example, I have a 2 task DAG. For a given DAG run that has
> completed
>>> successfully, if I
>>> 1) clear task2 (leaf task in this case), the previously-successful
>>> DAG
> Run
>>> goes back to Running, requeues, and executes the task successfully.
> The DAG
>>> Run then returns from Running to Success.
>>> 2) clear task1 (root task in this case), the previously-successful
>>> DAG
> Run
>>> goes back to Running, DOES NOT requeue or execute the task at all.
>>> The
> DAG
>>> Run then returns from Running to Success though it never ran the task.
>>> 
>>> 1) is expected and previous behavior. 2) is a regression.
>>> 
>>> The only workaround is to use the CLI to run the task cleared. Here
>>> are
>>> some images :
>>> *After Clearing the Tasks*
>>> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
>>> 202017-03-14%2014.09.34.png?dl=0
>>> 
>>> *After DAG Runs return to Success*
>>> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
>>> 202017-03-14%2014.09.49.png?dl=0
>>> 
>>> This is a major regression because it will force everyone to use the
> CLI
>>> for things that they would normally use the UI for.
>>> 
>>> -s
>>> 
>>> 
>>> -s
>>> 
>>> 
 On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
> wrote:
 
 +1 (non-binding)!
 
 On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand <
>>> san...@apache.org>
 wrote:
 
> +1 (binding)
> 
> 
> On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> 
>> +1 (binding)
>> 
>> On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel  wrote:
>> 
>>> +1 (binding)
>>> 
>>> Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
>>> rules where the parent nodes were joining with a SKIP. But I kind of
>>> should

Re: Make Scheduler More Centralized

2017-03-15 Thread Bolke de Bruin
Hi Rui,

We have been discussing this during the hackathon at Airbnb as well. Besides 
the reservations Gerard is documenting, I am also not enthusiastic about this 
design. Currently, the scheduler is our main issue in scaling. Scheduler runs 
will take longer and longer with more DAGs and more complex DAGs (i.e. more 
tasks in a DAG). Moving more things into the scheduler makes it more difficult 
to move things out again, and that is required when we want to move to an 
event-driven / snowballing scheduler.

I would suggest documenting and enforcing the contracts between the scheduler, 
executor, and task instance. We are lax in that respect and this is where a lot 
of issues stem from. Also the executor is the weak point here as it doesn’t do 
anything with the task state, but it does handle them. The points Gerard makes 
are very valid and we should improve our assumptions of the underlying bus.

Cheers
Bolke

> On 14 Mar 2017, at 15:08, Rui Wang  wrote:
> 
> Hi,
> The design doc below I created is trying to make airflow scheduler more
> centralized. Briefly speaking, I propose moving state change of
> TaskInstance to scheduler. You can see the reasons for this change below.
> 
> 
> Could you take a look and comment if you see anything does not make sense?
> 
> -Rui
> 
> --
> Current
> The state of a TaskInstance is changed by both the scheduler and the worker.
> On the worker side, the worker monitors the TaskInstance and changes its
> state to RUNNING, then to SUCCESS if the task succeeds, or to UP_FOR_RETRY
> or FAILED if the task fails. The worker also handles the failure email
> logic and the failure callback logic.
>
> Proposal
> The general idea is to make a centralized scheduler and make workers dumb.
> A worker should not change the state of a TaskInstance, but just execute
> what it is assigned and report the result of the task. Instead, the
> scheduler should make the decision on TaskInstance state changes. Ideally,
> workers should not even handle the failure emails and callbacks unless the
> scheduler asks them to do so.
>
> Why
> A worker does not have as much information as the scheduler has. Bugs have
> been observed where a worker got into trouble but could not decide to
> change the task state due to lack of information. Although there is the
> airflow metadata DB, it is still not easy to share all the information
> that the scheduler has with workers.
> 
> We can also ensure a consistent environment. There are slight differences
> in the chef recipes for the different workers which can cause strange
> issues when DAGs parse on one but not the other.
> 
> In the meantime, moving state changes to the scheduler can reduce the
> complexity of airflow. It especially helps when airflow needs to move to
> distributed schedulers. In that case, state changes made everywhere by both
> schedulers and workers are harder to maintain.
>
> How to change
> After lots of discussion, the following steps will be done:
> 
> 1. Add a new column to the TaskInstance table. The worker will fill this
> column with the task process exit code.
> 
> 2. The worker will only set the TaskInstance state to RUNNING when it is
> ready to run the task. There was debate on moving RUNNING to the scheduler
> as well. If RUNNING moved to the scheduler, either the scheduler marks the
> TaskInstance RUNNING before it gets into the queue, or the scheduler checks
> the status code in the column above, which is updated by the worker when it
> is ready to run the task. In the former case, from the user's perspective,
> it is bad to mark a TaskInstance as RUNNING when the worker is not ready to
> run it; the user could be confused. In the latter case, the scheduler could
> mark the task as RUNNING late due to the schedule interval, which is still
> not a good user experience. Since only the worker knows when it is ready to
> run the task, the worker should still deliver this message to the user by
> setting the RUNNING state.
> 
> 3. In all other cases, the worker should not change the state of the
> TaskInstance, but save a defined status code into the column above.
> 
> 4. The worker still handles failure emails and callbacks because there was
> concern that the scheduler could use too many resources running failure
> callbacks given unpredictable callback sizes. (I think ideally the scheduler
> should treat failure callbacks and emails as tasks, and assign such tasks to
> workers after the TaskInstance state changes correspondingly.) Eventually
> this logic will be moved to the workers once there is support for multiple
> distributed schedulers.
> 
> 5. In the scheduler's loop, the scheduler should check the TaskInstance
> status code, then change the state and retry/fail the TaskInstance
> accordingly.
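
The steps in the quoted proposal could be sketched roughly as follows. This is an illustration only, not Airflow code: `TaskInstanceRow`, the `exit_code` column name, `worker_run` and `scheduler_reconcile` are hypothetical, and the retry rule (retry while `try_number < max_tries`) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional

RUNNING = "running"
SUCCESS = "success"
UP_FOR_RETRY = "up_for_retry"
FAILED = "failed"


@dataclass
class TaskInstanceRow:
    """Stand-in for one row of the TaskInstance table (hypothetical)."""
    state: str = "queued"
    exit_code: Optional[int] = None  # the proposed new column (name assumed)
    try_number: int = 1
    max_tries: int = 1


def worker_run(ti: TaskInstanceRow, run_task: Callable[[], int]) -> None:
    """Worker side: set RUNNING (step 2), then only record the exit code (steps 1 and 3)."""
    ti.state = RUNNING
    ti.exit_code = run_task()


def scheduler_reconcile(ti: TaskInstanceRow) -> str:
    """Scheduler loop (step 5): turn the recorded exit code into a state change."""
    if ti.state != RUNNING or ti.exit_code is None:
        return ti.state  # nothing reported yet
    if ti.exit_code == 0:
        ti.state = SUCCESS
    elif ti.try_number < ti.max_tries:
        ti.try_number += 1
        ti.exit_code = None  # reset the column for the next attempt
        ti.state = UP_FOR_RETRY
    else:
        ti.state = FAILED
    return ti.state


ti = TaskInstanceRow(max_tries=2)
worker_run(ti, lambda: 1)        # first attempt exits non-zero
print(scheduler_reconcile(ti))   # up_for_retry
worker_run(ti, lambda: 1)        # second attempt also fails
print(scheduler_reconcile(ti))   # failed
```

In this sketch the worker never writes SUCCESS, UP_FOR_RETRY or FAILED itself. The cost is that those states only appear after the next scheduler loop, which is the latency concern raised elsewhere in this thread.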



Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread siddharth anand
Here's the JIRA :
https://issues.apache.org/jira/browse/AIRFLOW-989

I confirmed it is a regression from 1.7.1.3, which I installed via pip and
tested against the same DAG in the JIRA.

The issue occurs if a leaf / last / terminal downstream task is not
cleared. You won't see this issue if you clear the entire DAG Run or clear
a task and all of its downstream tasks. If you truly want to only clear and
rerun a task, but not its downstream tasks, you can use the CLI to execute
a specific task (e.g. via `airflow run`).

This is a change in behavior -- if we do go ahead with the release, then
this JIRA should be in a list of JIRAs of known issues related to the new
version.
-s

On Wed, Mar 15, 2017 at 9:17 AM, Chris Riccomini 
wrote:

> @Sid, does this happen if you clear downstream as well?
>
> On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini 
> wrote:
>
> > Has anyone been able to reproduce Sid's issue?
> >
> > On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
> > wrote:
> >
> >> That is not an airflow error, but a Kerberos error. Try executing the
> >> kinit command on the command line by yourself.
> >>
> >> Bolke
> >>
> >> Sent from my iPhone
> >>
> >> > On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
> >> wrote:
> >> >
> >> > `airflow kerberos` is broken in 1.8-rc5
> >> > https://issues.apache.org/jira/browse/AIRFLOW-987
> >> > Hopefully fix can be part of the 1.8 release.
> >> >
> >> >
> >> >
> >> > --
> >> > Ruslan Dautkhanov
> >> >
> >> >> On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
> >> wrote:
> >> >>
> >> >> FYI,
> >> >> I've just hit a major bug in the release candidate related to "clear
> >> task"
> >> >> behavior.
> >> >>
> >> >> I've been running airflow in both stage and prod since yesterday on
> >> rc5 and
> >> >> have reproduced this in both environments. I will file a JIRA for
> this
> >> >> tonight, but wanted to send a note over email as well.
> >> >>
> >> >> In my example, I have a 2 task DAG. For a given DAG run that has
> >> completed
> >> >> successfully, if I
> >> >> 1) clear task2 (leaf task in this case), the previously-successful
> DAG
> >> Run
> >> >> goes back to Running, requeues, and executes the task successfully.
> >> The DAG
> >> >> Run then returns from Running to Success.
> >> >> 2) clear task1 (root task in this case), the previously-successful
> DAG
> >> Run
> >> >> goes back to Running, DOES NOT requeue or execute the task at all.
> The
> >> DAG
> >> >> Run then returns from Running to Success though it never ran the task.
> >> >>
> >> >> 1) is expected and previous behavior. 2) is a regression.
> >> >>
> >> >> The only workaround is to use the CLI to run the task cleared. Here
> are
> >> >> some images :
> >> >> *After Clearing the Tasks*
> >> >> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
> >> >> 202017-03-14%2014.09.34.png?dl=0
> >> >>
> >> >> *After DAG Runs return to Success*
> >> >> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
> >> >> 202017-03-14%2014.09.49.png?dl=0
> >> >>
> >> >> This is a major regression because it will force everyone to use the
> >> CLI
> >> >> for things that they would normally use the UI for.
> >> >>
> >> >> -s
> >> >>
> >> >>
> >> >> -s
> >> >>
> >> >>
> >> >>> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
> >> wrote:
> >> >>>
> >> >>> +1 (non-binding)!
> >> >>>
> >> >>> On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand <
> san...@apache.org>
> >> >>> wrote:
> >> >>>
> >>  +1 (binding)
> >> 
> >> 
> >>  On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
> >>  maximebeauche...@gmail.com> wrote:
> >> 
> >> > +1 (binding)
> >> >
> >> > On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel  >
> >>  wrote:
> >> >
> >> >> +1 (binding)
> >> >>
> >> >> Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
> >> >> rules where the parent nodes were joining with a SKIP. But I kind of
> >> >> should have known this was coming. Apart from that I had a successful
> >> >> run last night.
> >> >>
> >> >>
> >> >> On Tue, Mar 14, 2017 at 1:37 AM siddharth anand <
> san...@apache.org
> >> >>>
> >> > wrote:
> >> >>
> >> >> I'm going to deploy this to staging now. Fab work Bolke!
> >> >> -s
> >> >>
> >> >> On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
> >> >> dan.davy...@airbnb.com
> >> >>> .
> >> >> invalid
> >> >>> wrote:
> >> >>
> >> >>> I'll test this on staging as soon as I get a chance (the testing
> >> >> is
> >> >>> non-blocking on the rc5). Bolke very much in particular :).
> >> >>>
> >> >>> On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
> >> >>> jlo...@apache.org>
> >> >>> wrote:
> >> >>>
> >>  +1 (binding) extremely impressed by the work and diligence all
> >> >>> contributors
> >>  have 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Bolke de Bruin
FYI: When all root tasks (i.e. the last tasks to run) have succeeded, the DagRun
is considered successful and the scheduler will not consider any other tasks in
the dag run. The code is here:
https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L4095
for version 1.8, and hasn’t changed significantly since 1.7:
https://github.com/apache/incubator-airflow/blob/airbnb_rb1.7.1_4/airflow/models.py#L2678
. As 1.7 is more task based and 1.8 more dag run based, the behaviour of 1.7
and 1.8 might differ (Sid is investigating).

Thus, to make sure the tasks run, you will need to clear the root task as well,
so downstream clearing will definitely work. As explained above, I’m not sure
this is a change in behaviour. Since most use cases will clear downstream (it’s
the default), a workaround is available and I don’t consider it a blocker.
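
To illustrate the rule above, here is a minimal sketch. It is not the code in
airflow/models.py; the function names and the dict-based DAG are hypothetical
and only model "a run is successful once every last task has succeeded".

```python
# Hypothetical, simplified model of the rule; not Airflow's implementation.
SUCCESS = "success"


def root_tasks(downstream):
    """Return the tasks without downstream tasks (the last tasks to run)."""
    return sorted(t for t, children in downstream.items() if not children)


def dag_run_is_successful(downstream, states):
    """A dag run counts as successful once every root task has succeeded."""
    return all(states.get(t) == SUCCESS for t in root_tasks(downstream))


# Two-task DAG: task1 runs first, task2 runs last.
downstream = {"task1": ["task2"], "task2": []}

# Clearing only task1 leaves task2 in SUCCESS, so the run still looks done
# and nothing forces task1 to be looked at again.
only_first_cleared = dag_run_is_successful(
    downstream, {"task1": None, "task2": SUCCESS}
)

# Clearing task1 together with its downstream also resets task2, so the run
# is no longer considered successful and the tasks get scheduled again.
cleared_with_downstream = dag_run_is_successful(
    downstream, {"task1": None, "task2": None}
)
```

In this model, clearing a task without its downstream leaves the run
"successful", which is why clearing downstream works as a workaround.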

- Bolke.

> On 15 Mar 2017, at 09:17, Chris Riccomini  wrote:
> 
> @Sid, does this happen if you clear downstream as well?
> 
> On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini 
> wrote:
> 
>> Has anyone been able to reproduce Sid's issue?
>> 
>> On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
>> wrote:
>> 
>>> That is not an airflow error, but a Kerberos error. Try executing the
>>> kinit command on the command line by yourself.
>>> 
>>> Bolke
>>> 
>>> Sent from my iPhone
>>> 
 On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
>>> wrote:
 
 `airflow kerberos` is broken in 1.8-rc5
 https://issues.apache.org/jira/browse/AIRFLOW-987
 Hopefully fix can be part of the 1.8 release.
 
 
 
 --
 Ruslan Dautkhanov
 
> On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
>>> wrote:
> 
> FYI,
> I've just hit a major bug in the release candidate related to "clear
>>> task"
> behavior.
> 
> I've been running airflow in both stage and prod since yesterday on
>>> rc5 and
> have reproduced this in both environments. I will file a JIRA for this
> tonight, but wanted to send a note over email as well.
> 
> In my example, I have a 2 task DAG. For a given DAG run that has
>>> completed
> successfully, if I
> 1) clear task2 (leaf task in this case), the previously-successful DAG
>>> Run
> goes back to Running, requeues, and executes the task successfully.
>>> The DAG
> Run the returns from Running to Success.
> 2) clear task1 (root task in this case), the previously-successful DAG
>>> Run
> goes back to Running, DOES NOT requeue or execute the task at all. The
>>> DAG
> Run the returns from Running to Success though it never ran the task.
> 
> 1) is expected and previous behavior. 2) is a regression.
> 
> The only workaround is to use the CLI to run the task cleared. Here are
> some images :
> *After Clearing the Tasks*
> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
> 202017-03-14%2014.09.34.png?dl=0
> 
> *After DAG Runs return to Success*
> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
> 202017-03-14%2014.09.49.png?dl=0
> 
> This is a major regression because it will force everyone to use the
>>> CLI
> for things that they would normally use the UI for.
> 
> -s
> 
> 
> -s
> 
> 
>> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
>>> wrote:
>> 
>> +1 (non-binding)!
>> 
>> On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand 
>> wrote:
>> 
>>> +1 (binding)
>>> 
>>> 
>>> On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
>>> maximebeauche...@gmail.com> wrote:
>>> 
 +1 (binding)
 
 On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
>>> wrote:
 
> +1 (binding)
> 
> Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
>>> rules
> where the parent nodes where joining with a SKIP. But I can of
> should
 have
> known this was coming. Apart of that I had a successful run last
>> night.
> 
> 
> On Tue, Mar 14, 2017 at 1:37 AM siddharth anand > 
 wrote:
> 
> I'm going to deploy this to staging now. Fab work Bolke!
> -s
> 
> On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
> dan.davy...@airbnb.com
>> .
> invalid
>> wrote:
> 
>> I'll test this on staging as soon as I get a chance (the testing
> is
>> non-blocking on the rc5). Bolke very much in particular :).
>> 
>> On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
>> jlo...@apache.org>
>> wrote:
>> 
>>> +1 (binding) extremely impressed by the work and diligence all
>> contributors
>>> have put in to getting 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Chris Riccomini
+1 (binding)

On Wed, Mar 15, 2017 at 9:19 AM, Bolke de Bruin  wrote:

> I have asked Sid to create a Jira and to make it reproducible.
> Nevertheless, I do not consider it a blocker as a workaround exists and it
> is relatively small in scope (while slightly annoying, I understand that).
>
> Let’s get 1.8 out and do bug fixes in 1.8.1. More bugs will inevitably pop
> up :).
>
> - Bolke
>
> > On 15 Mar 2017, at 09:04, Chris Riccomini  wrote:
> >
> > Has anyone been able to reproduce Sid's issue?
> >
> > On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
> wrote:
> >
> >> That is not an airflow error, but a Kerberos error. Try executing the
> >> kinit command on the command line by yourself.
> >>
> >> Bolke
> >>
> >> Sent from my iPhone
> >>
> >>> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
> >> wrote:
> >>>
> >>> `airflow kerberos` is broken in 1.8-rc5
> >>> https://issues.apache.org/jira/browse/AIRFLOW-987
> >>> Hopefully fix can be part of the 1.8 release.
> >>>
> >>>
> >>>
> >>> --
> >>> Ruslan Dautkhanov
> >>>
>  On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
> >> wrote:
> 
>  FYI,
>  I've just hit a major bug in the release candidate related to "clear
> >> task"
>  behavior.
> 
>  I've been running airflow in both stage and prod since yesterday on
> rc5
> >> and
>  have reproduced this in both environments. I will file a JIRA for this
>  tonight, but wanted to send a note over email as well.
> 
>  In my example, I have a 2 task DAG. For a given DAG run that has
> >> completed
>  successfully, if I
>  1) clear task2 (leaf task in this case), the previously-successful DAG
> >> Run
>  goes back to Running, requeues, and executes the task successfully.
> The
> >> DAG
>  Run the returns from Running to Success.
>  2) clear task1 (root task in this case), the previously-successful DAG
> >> Run
>  goes back to Running, DOES NOT requeue or execute the task at all. The
> >> DAG
>  Run the returns from Running to Success though it never ran the task.
> 
>  1) is expected and previous behavior. 2) is a regression.
> 
>  The only workaround is to use the CLI to run the task cleared. Here
> are
>  some images :
>  *After Clearing the Tasks*
>  https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
>  202017-03-14%2014.09.34.png?dl=0
> 
>  *After DAG Runs return to Success*
>  https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
>  202017-03-14%2014.09.49.png?dl=0
> 
>  This is a major regression because it will force everyone to use the
> CLI
>  for things that they would normally use the UI for.
> 
>  -s
> 
> 
>  -s
> 
> 
> > On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
> >> wrote:
> >
> > +1 (non-binding)!
> >
> > On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand  >
> > wrote:
> >
> >> +1 (binding)
> >>
> >>
> >> On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
> >> maximebeauche...@gmail.com> wrote:
> >>
> >>> +1 (binding)
> >>>
> >>> On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
> >> wrote:
> >>>
>  +1 (binding)
> 
>  Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS
> trigger
> >> rules
>  where the parent nodes where joining with a SKIP. But I can of
>  should
> >>> have
>  known this was coming. Apart of that I had a successful run last
> > night.
> 
> 
>  On Tue, Mar 14, 2017 at 1:37 AM siddharth anand <
> san...@apache.org
> >
> >>> wrote:
> 
>  I'm going to deploy this to staging now. Fab work Bolke!
>  -s
> 
>  On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
>  dan.davy...@airbnb.com
> > .
>  invalid
> > wrote:
> 
> > I'll test this on staging as soon as I get a chance (the testing
>  is
> > non-blocking on the rc5). Bolke very much in particular :).
> >
> > On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
> > jlo...@apache.org>
> > wrote:
> >
> >> +1 (binding) extremely impressed by the work and diligence all
> > contributors
> >> have put in to getting these blockers fixed, Bolke in
>  particular.
> >>
> >> On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer <
> > art...@apache.org>
> > wrote:
> >>
> >>> +1 (binding)
> >>>
> >>> Thanks again for steering us through Bolke.
> >>>
> >>> Best,
> >>> Arthur
> >>>
> >>> On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin <
> >> bdbr...@gmail.com
> 
> >> wrote:
> >>>
>  Dear All,
> 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Chris Riccomini
@Sid, does this happen if you clear downstream as well?

On Wed, Mar 15, 2017 at 9:04 AM, Chris Riccomini 
wrote:

> Has anyone been able to reproduce Sid's issue?
>
> On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin 
> wrote:
>
>> That is not an airflow error, but a Kerberos error. Try executing the
>> kinit command on the command line by yourself.
>>
>> Bolke
>>
>> Sent from my iPhone
>>
>> > On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
>> wrote:
>> >
>> > `airflow kerberos` is broken in 1.8-rc5
>> > https://issues.apache.org/jira/browse/AIRFLOW-987
>> > Hopefully fix can be part of the 1.8 release.
>> >
>> >
>> >
>> > --
>> > Ruslan Dautkhanov
>> >
>> >> On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
>> wrote:
>> >>
>> >> FYI,
>> >> I've just hit a major bug in the release candidate related to "clear
>> task"
>> >> behavior.
>> >>
>> >> I've been running airflow in both stage and prod since yesterday on
>> rc5 and
>> >> have reproduced this in both environments. I will file a JIRA for this
>> >> tonight, but wanted to send a note over email as well.
>> >>
>> >> In my example, I have a 2 task DAG. For a given DAG run that has
>> completed
>> >> successfully, if I
>> >> 1) clear task2 (leaf task in this case), the previously-successful DAG
>> Run
>> >> goes back to Running, requeues, and executes the task successfully.
>> The DAG
>> >> Run the returns from Running to Success.
>> >> 2) clear task1 (root task in this case), the previously-successful DAG
>> Run
>> >> goes back to Running, DOES NOT requeue or execute the task at all. The
>> DAG
>> >> Run the returns from Running to Success though it never ran the task.
>> >>
>> >> 1) is expected and previous behavior. 2) is a regression.
>> >>
>> >> The only workaround is to use the CLI to run the task cleared. Here are
>> >> some images :
>> >> *After Clearing the Tasks*
>> >> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
>> >> 202017-03-14%2014.09.34.png?dl=0
>> >>
>> >> *After DAG Runs return to Success*
>> >> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
>> >> 202017-03-14%2014.09.49.png?dl=0
>> >>
>> >> This is a major regression because it will force everyone to use the
>> CLI
>> >> for things that they would normally use the UI for.
>> >>
>> >> -s
>> >>
>> >>
>> >> -s
>> >>
>> >>
>> >>> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
>> wrote:
>> >>>
>> >>> +1 (non-binding)!
>> >>>
>> >>> On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand 
>> >>> wrote:
>> >>>
>>  +1 (binding)
>> 
>> 
>>  On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
>>  maximebeauche...@gmail.com> wrote:
>> 
>> > +1 (binding)
>> >
>> > On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
>>  wrote:
>> >
>> >> +1 (binding)
>> >>
>> >> Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
>>  rules
>> >> where the parent nodes where joining with a SKIP. But I can of
>> >> should
>> > have
>> >> known this was coming. Apart of that I had a successful run last
>> >>> night.
>> >>
>> >>
>> >> On Tue, Mar 14, 2017 at 1:37 AM siddharth anand > >>>
>> > wrote:
>> >>
>> >> I'm going to deploy this to staging now. Fab work Bolke!
>> >> -s
>> >>
>> >> On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
>> >> dan.davy...@airbnb.com
>> >>> .
>> >> invalid
>> >>> wrote:
>> >>
>> >>> I'll test this on staging as soon as I get a chance (the testing
>> >> is
>> >>> non-blocking on the rc5). Bolke very much in particular :).
>> >>>
>> >>> On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
>> >>> jlo...@apache.org>
>> >>> wrote:
>> >>>
>>  +1 (binding) extremely impressed by the work and diligence all
>> >>> contributors
>>  have put in to getting these blockers fixed, Bolke in
>> >> particular.
>> 
>>  On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer <
>> >>> art...@apache.org>
>> >>> wrote:
>> 
>> > +1 (binding)
>> >
>> > Thanks again for steering us through Bolke.
>> >
>> > Best,
>> > Arthur
>> >
>> > On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin <
>>  bdbr...@gmail.com
>> >>
>>  wrote:
>> >
>> >> Dear All,
>> >>
>> >> Finally, I have been able to make the FIFTH RELEASE
>> >> CANDIDATE
>>  of
>>  Airflow
>> >> 1.8.0 available at: https://dist.apache.org/repos/
>> >> dist/dev/incubator/airflow/ > >> repos/dist/dev/incubator/airflow/> , public keys are
>> >>> available
>> > at
>> >> https://dist.apache.org/repos/dist/release/incubator/
>> >>> airflow/
>>  <
>> >> https://dist.apache.org/repos/dist/release/incubator/
>> >>> airflow/>

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Bolke de Bruin
I have asked Sid to create a Jira and to make it reproducible. Nevertheless,
I do not consider it a blocker as a workaround exists and it is relatively
small in scope (while slightly annoying, I understand that).

Let’s get 1.8 out and do bug fixes in 1.8.1. More bugs will inevitably pop up 
:).

- Bolke

> On 15 Mar 2017, at 09:04, Chris Riccomini  wrote:
> 
> Has anyone been able to reproduce Sid's issue?
> 
> On Tue, Mar 14, 2017 at 11:17 PM, Bolke de Bruin  wrote:
> 
>> That is not an airflow error, but a Kerberos error. Try executing the
>> kinit command on the command line by yourself.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov 
>> wrote:
>>> 
>>> `airflow kerberos` is broken in 1.8-rc5
>>> https://issues.apache.org/jira/browse/AIRFLOW-987
>>> Hopefully fix can be part of the 1.8 release.
>>> 
>>> 
>>> 
>>> --
>>> Ruslan Dautkhanov
>>> 
 On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand 
>> wrote:
 
 FYI,
 I've just hit a major bug in the release candidate related to "clear
>> task"
 behavior.
 
 I've been running airflow in both stage and prod since yesterday on rc5
>> and
 have reproduced this in both environments. I will file a JIRA for this
 tonight, but wanted to send a note over email as well.
 
 In my example, I have a 2 task DAG. For a given DAG run that has
>> completed
 successfully, if I
 1) clear task2 (leaf task in this case), the previously-successful DAG
>> Run
 goes back to Running, requeues, and executes the task successfully. The
>> DAG
 Run the returns from Running to Success.
 2) clear task1 (root task in this case), the previously-successful DAG
>> Run
 goes back to Running, DOES NOT requeue or execute the task at all. The
>> DAG
 Run the returns from Running to Success though it never ran the task.
 
 1) is expected and previous behavior. 2) is a regression.
 
 The only workaround is to use the CLI to run the task cleared. Here are
 some images :
 *After Clearing the Tasks*
 https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
 202017-03-14%2014.09.34.png?dl=0
 
 *After DAG Runs return to Success*
 https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
 202017-03-14%2014.09.49.png?dl=0
 
 This is a major regression because it will force everyone to use the CLI
 for things that they would normally use the UI for.
 
 -s
 
 
 -s
 
 
> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang 
>> wrote:
> 
> +1 (non-binding)!
> 
> On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand 
> wrote:
> 
>> +1 (binding)
>> 
>> 
>> On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
>> maximebeauche...@gmail.com> wrote:
>> 
>>> +1 (binding)
>>> 
>>> On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
>> wrote:
>>> 
 +1 (binding)
 
 Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
>> rules
 where the parent nodes where joining with a SKIP. But I can of
 should
>>> have
 known this was coming. Apart of that I had a successful run last
> night.
 
 
 On Tue, Mar 14, 2017 at 1:37 AM siddharth anand  
>>> wrote:
 
 I'm going to deploy this to staging now. Fab work Bolke!
 -s
 
 On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
 dan.davy...@airbnb.com
> .
 invalid
> wrote:
 
> I'll test this on staging as soon as I get a chance (the testing
 is
> non-blocking on the rc5). Bolke very much in particular :).
> 
> On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
> jlo...@apache.org>
> wrote:
> 
>> +1 (binding) extremely impressed by the work and diligence all
> contributors
>> have put in to getting these blockers fixed, Bolke in
 particular.
>> 
>> On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer <
> art...@apache.org>
> wrote:
>> 
>>> +1 (binding)
>>> 
>>> Thanks again for steering us through Bolke.
>>> 
>>> Best,
>>> Arthur
>>> 
>>> On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin <
>> bdbr...@gmail.com
 
>> wrote:
>>> 
 Dear All,
 
 Finally, I have been able to make the FIFTH RELEASE
 CANDIDATE
>> of
>> Airflow
 1.8.0 available at: https://dist.apache.org/repos/
 dist/dev/incubator/airflow/  , public keys are
> available
>>> at
 

Re: Make Scheduler More Centralized

2017-03-15 Thread Gerard Toonstra
Hi Rui,

I worked a bit on the scheduler and added some of my comments below.


On Tue, Mar 14, 2017 at 11:08 PM, Rui Wang 
wrote:

> Hi,
> The design doc below I created is trying to make airflow scheduler more
> centralized. Briefly speaking, I propose moving state change of
> TaskInstance to scheduler. You can see the reasons for this change below.
>
>
> Could you take a look and comment if you see anything does not make sense?
>
> -Rui
>
> 
> --
> Current: The state of TaskInstance is changed by both the scheduler and the
> worker. On the worker side, the worker monitors the TaskInstance and changes
> the state to RUNNING, then to SUCCESS if the task succeeds, or to UP_FOR_RETRY
> or FAILED if the task fails. The worker also handles the failure email logic
> and the failure callback logic.
> Proposal: The general idea is to make a centralized scheduler and make
> workers dumb. A worker should not change the state of a TaskInstance, but just
> execute what it is assigned and report the result of the task. Instead,
> the scheduler should make the decision on TaskInstance state changes.
> Ideally, workers should not even handle the failure emails and callbacks
> unless the scheduler asks them to do so.
>

I had a look at the whole pipeline: scheduling, verifying dependencies and
context, forwarding to the task queue, and then receiving and starting the
task at the worker end, where some of the earlier final-step verifications
are done again.

The current design is a collaborative one, which relies on the underlying
messaging framework being active. The scheduler eventually kicks off a task,
the "guaranteed" MQ layer is active, and at the worker end processing starts
and immediately reports results.

The benefit of having the task set the state to RUNNING is that you now know
some worker actually did pick it up and started working on it. The SLA is
responsible for ensuring the task finishes in time; if that gets violated,
manual intervention is required.

In a more independent design, you send a task for completion, but you have no
idea about the timeframe in which the task should start and finish. The
independent design makes fewer assumptions about the underlying MQ layer (it
assumes the layer may not be active), but it introduces the really complex and
nasty side-effect that it now must also work out when a task should have
started and finished.
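
As a rough illustration of that side-effect, the sketch below shows the kind
of deadline bookkeeping an independent scheduler would need. Everything in it
is hypothetical (the function name, the two time windows, the timestamps); it
is not taken from Airflow or from the design doc.

```python
from datetime import datetime, timedelta


def task_is_overdue(sent_at, started_at, finished_at, now,
                    start_within, run_within):
    """Decide, from the scheduler's side only, whether a task looks stuck.

    start_within and run_within are assumed per-task windows; picking good
    values for them is exactly the hard part described above.
    """
    if started_at is None:
        # Nobody reported picking the task up: is it still in the MQ layer?
        return now - sent_at > start_within
    if finished_at is None:
        # Started but not finished: has it been running for too long?
        return now - started_at > run_within
    return False


sent = datetime(2017, 3, 15, 9, 0)
never_started = task_is_overdue(
    sent, None, None, now=sent + timedelta(minutes=10),
    start_within=timedelta(minutes=5), run_within=timedelta(hours=1),
)
still_in_window = task_is_overdue(
    sent, sent + timedelta(minutes=1), None, now=sent + timedelta(minutes=10),
    start_within=timedelta(minutes=5), run_within=timedelta(hours=1),
)
```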



> Why: The worker does not have as much information as the scheduler has.
> Bugs have been observed where a worker gets into trouble but cannot decide
> how to change the task state due to lack of information. Although there is
> the airflow metadata DB, it is still not easy to share all the information
> that the scheduler has with the workers.
>

Can you give some specific examples here, for example a JIRA?


>
> We can also ensure a consistent environment. There are slight differences
> in the chef recipes for the different workers which can cause strange
> issues when DAGs parse on one but not the other.
>

I don't see how this relates to an independent scheduler; it sounds like a
deployment pipeline issue.


>
> In the meantime, moving state changes to the scheduler can reduce the
> complexity of airflow. It especially helps when airflow needs to move to
> distributed schedulers. In that case, state changes made everywhere by both
> schedulers and workers are harder to maintain.
> How to change: After lots of discussion, the following steps will be done:
>
> 1. Add a new column to TaskInstance table. Worker will fill this column
> with the task process exit code.
>

This introduces at least one more state that has to be inspected.


> 2. Worker will only set TaskInstance state to RUNNING when it is ready to
> run the task. There was debate on moving RUNNING to the scheduler as well. If
> RUNNING moves to the scheduler, either the scheduler marks the TaskInstance
> RUNNING before it gets into the queue, or the scheduler checks the status code
> in the column above, which is updated by the worker when it is ready to run
> the task. In the former case, from the user's perspective, it is bad to mark a
> TaskInstance as RUNNING when the worker is not ready to run it; the user could
> be confused. In the latter case, the scheduler could mark the task as RUNNING
> late due to the schedule interval, which is still not a good user experience.
> Since only the worker knows when it is ready to run the task, the worker
> should still deliver this message to the user by setting the RUNNING state.
>

The scheduler should not set the state to RUNNING, because it has no idea
whether the underlying MQ layer has actually forwarded the task and a worker
has successfully parsed and started the DAG. QUEUED is helpful to identify
issues with flooding the MQ layer or maybe inactivity there.
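
To make the split in steps 1 and 2 of the proposal concrete, here is a minimal
sketch. It is an assumption-laden model, not Airflow code: the row class, the
function names and the retry rule (try_number / max_tries) are hypothetical;
only the state names and the exit-code column come from the proposal.

```python
from dataclasses import dataclass
from typing import Optional

QUEUED, RUNNING = "queued", "running"
SUCCESS, UP_FOR_RETRY, FAILED = "success", "up_for_retry", "failed"


@dataclass
class TaskInstanceRow:
    state: str = QUEUED
    exit_code: Optional[int] = None  # the proposed new column
    try_number: int = 0
    max_tries: int = 1               # hypothetical: total attempts allowed


def worker_start(ti):
    """Worker side: RUNNING is the only state the worker sets itself."""
    ti.try_number += 1
    ti.exit_code = None
    ti.state = RUNNING


def worker_finish(ti, exit_code):
    """Worker side: report the process exit code, leave the state alone."""
    ti.exit_code = exit_code


def scheduler_reconcile(ti):
    """Scheduler side: derive the final state from the reported exit code."""
    if ti.state != RUNNING or ti.exit_code is None:
        return ti.state              # nothing reported yet
    if ti.exit_code == 0:
        ti.state = SUCCESS
    elif ti.try_number < ti.max_tries:
        ti.state = UP_FOR_RETRY
    else:
        ti.state = FAILED
    return ti.state


ti = TaskInstanceRow(max_tries=2)
worker_start(ti)
worker_finish(ti, 1)
first_attempt = scheduler_reconcile(ti)   # failed once, one try left
worker_start(ti)
worker_finish(ti, 1)
second_attempt = scheduler_reconcile(ti)  # no tries left
```

Note that the scheduler only notices the result on its next pass over the
row, so the final state lags behind the worker by up to one scheduler cycle.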

I'd then go for additional messaging on the MQ layer to indicate that a
worker has picked
up a task and started working on it, which I'm not sure can be done on
celery, probably directly
on ActiveMQ, but 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Bolke de Bruin
That is not an Airflow error, but a Kerberos error. Try executing the kinit
command on the command line yourself.

Bolke

Sent from my iPhone

> On 14 Mar 2017, at 23:11, Ruslan Dautkhanov  wrote:
> 
> `airflow kerberos` is broken in 1.8-rc5
> https://issues.apache.org/jira/browse/AIRFLOW-987
> Hopefully fix can be part of the 1.8 release.
> 
> 
> 
> -- 
> Ruslan Dautkhanov
> 
>> On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand  wrote:
>> 
>> FYI,
>> I've just hit a major bug in the release candidate related to "clear task"
>> behavior.
>> 
>> I've been running airflow in both stage and prod since yesterday on rc5 and
>> have reproduced this in both environments. I will file a JIRA for this
>> tonight, but wanted to send a note over email as well.
>> 
>> In my example, I have a 2 task DAG. For a given DAG run that has completed
>> successfully, if I
>> 1) clear task2 (leaf task in this case), the previously-successful DAG Run
>> goes back to Running, requeues, and executes the task successfully. The DAG
>> Run the returns from Running to Success.
>> 2) clear task1 (root task in this case), the previously-successful DAG Run
>> goes back to Running, DOES NOT requeue or execute the task at all. The DAG
>> Run the returns from Running to Success though it never ran the task.
>> 
>> 1) is expected and previous behavior. 2) is a regression.
>> 
>> The only workaround is to use the CLI to run the task cleared. Here are
>> some images :
>> *After Clearing the Tasks*
>> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
>> 202017-03-14%2014.09.34.png?dl=0
>> 
>> *After DAG Runs return to Success*
>> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
>> 202017-03-14%2014.09.49.png?dl=0
>> 
>> This is a major regression because it will force everyone to use the CLI
>> for things that they would normally use the UI for.
>> 
>> -s
>> 
>> 
>> -s
>> 
>> 
>>> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang  wrote:
>>> 
>>> +1 (non-binding)!
>>> 
>>> On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand 
>>> wrote:
>>> 
 +1 (binding)
 
 
 On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
 maximebeauche...@gmail.com> wrote:
 
> +1 (binding)
> 
> On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
 wrote:
> 
>> +1 (binding)
>> 
>> Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
 rules
>> where the parent nodes where joining with a SKIP. But I can of
>> should
> have
>> known this was coming. Apart of that I had a successful run last
>>> night.
>> 
>> 
>> On Tue, Mar 14, 2017 at 1:37 AM siddharth anand >> 
> wrote:
>> 
>> I'm going to deploy this to staging now. Fab work Bolke!
>> -s
>> 
>> On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
>> dan.davy...@airbnb.com
>>> .
>> invalid
>>> wrote:
>> 
>>> I'll test this on staging as soon as I get a chance (the testing
>> is
>>> non-blocking on the rc5). Bolke very much in particular :).
>>> 
>>> On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
>>> jlo...@apache.org>
>>> wrote:
>>> 
 +1 (binding) extremely impressed by the work and diligence all
>>> contributors
 have put in to getting these blockers fixed, Bolke in
>> particular.
 
 On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer <
>>> art...@apache.org>
>>> wrote:
 
> +1 (binding)
> 
> Thanks again for steering us through Bolke.
> 
> Best,
> Arthur
> 
> On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin <
 bdbr...@gmail.com
>> 
 wrote:
> 
>> Dear All,
>> 
>> Finally, I have been able to make the FIFTH RELEASE
>> CANDIDATE
 of
 Airflow
>> 1.8.0 available at: https://dist.apache.org/repos/
>> dist/dev/incubator/airflow/ > repos/dist/dev/incubator/airflow/> , public keys are
>>> available
> at
>> https://dist.apache.org/repos/dist/release/incubator/
>>> airflow/
 <
>> https://dist.apache.org/repos/dist/release/incubator/
>>> airflow/>
 .
>> It
>>> is
>> tagged with a local version “apache.incubating” so it
>> allows
>>> upgrading
> from
>> earlier releases.
>> 
>> Issues fixed since rc4:
>> 
>> [AIRFLOW-900] Double trigger should not kill original task
> instance
>> [AIRFLOW-900] Fixes bugs in LocalTaskJob for double run
> protection
>> [AIRFLOW-932] Do not mark tasks removed when backfilling
>> [AIRFLOW-961] run onkill when SIGTERMed
>> [AIRFLOW-910] Use parallel task execution for backfills
>> [AIRFLOW-967] Wrap strings in native for py2 ldap
>>> compatibility
>> [AIRFLOW-941] Use 

Re: [VOTE] Release Airflow 1.8.0 based on Airflow 1.8.0rc5

2017-03-15 Thread Ruslan Dautkhanov
`airflow kerberos` is broken in 1.8-rc5
https://issues.apache.org/jira/browse/AIRFLOW-987
Hopefully fix can be part of the 1.8 release.



-- 
Ruslan Dautkhanov

On Tue, Mar 14, 2017 at 6:19 PM, siddharth anand  wrote:

> FYI,
> I've just hit a major bug in the release candidate related to "clear task"
> behavior.
>
> I've been running airflow in both stage and prod since yesterday on rc5 and
> have reproduced this in both environments. I will file a JIRA for this
> tonight, but wanted to send a note over email as well.
>
> In my example, I have a 2-task DAG. For a given DAG run that has completed
> successfully, if I
> 1) clear task2 (leaf task in this case), the previously-successful DAG Run
> goes back to Running, requeues, and executes the task successfully. The DAG
> Run then returns from Running to Success.
> 2) clear task1 (root task in this case), the previously-successful DAG Run
> goes back to Running, DOES NOT requeue or execute the task at all. The DAG
> Run then returns from Running to Success though it never ran the task.
>
> 1) is expected and previous behavior. 2) is a regression.
>
> The only workaround is to use the CLI to run the task cleared. Here are
> some images :
> *After Clearing the Tasks*
> https://www.dropbox.com/s/wmuxt0krwx6wurr/Screenshot%
> 202017-03-14%2014.09.34.png?dl=0
>
> *After DAG Runs return to Success*
> https://www.dropbox.com/s/qop933rzgdzchpd/Screenshot%
> 202017-03-14%2014.09.49.png?dl=0
>
> This is a major regression because it will force everyone to use the CLI
> for things that they would normally use the UI for.
>
> -s
>
>
> -s
>
>
> On Tue, Mar 14, 2017 at 1:32 PM, Daniel Huang  wrote:
>
> > +1 (non-binding)!
> >
> > On Tue, Mar 14, 2017 at 11:35 AM, siddharth anand 
> > wrote:
> >
> > > +1 (binding)
> > >
> > >
> > > On Tue, Mar 14, 2017 at 8:42 AM, Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Tue, Mar 14, 2017 at 3:59 AM, Alex Van Boxel 
> > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > Note: we had to revert all our ONE_SUCCESS with ALL_SUCCESS trigger
> > > rules
> > > > > where the parent nodes where joining with a SKIP. But I can of
> should
> > > > have
> > > > > known this was coming. Apart of that I had a successful run last
> > night.
> > > > >
> > > > >
> > > > > On Tue, Mar 14, 2017 at 1:37 AM siddharth anand  >
> > > > wrote:
> > > > >
> > > > > I'm going to deploy this to staging now. Fab work Bolke!
> > > > > -s
> > > > >
> > > > > On Mon, Mar 13, 2017 at 2:16 PM, Dan Davydov <
> dan.davy...@airbnb.com
> > .
> > > > > invalid
> > > > > > wrote:
> > > > >
> > > > > > I'll test this on staging as soon as I get a chance (the testing
> is
> > > > > > non-blocking on the rc5). Bolke very much in particular :).
> > > > > >
> > > > > > On Mon, Mar 13, 2017 at 10:46 AM, Jeremiah Lowin <
> > jlo...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 (binding) extremely impressed by the work and diligence all
> > > > > > contributors
> > > > > > > have put in to getting these blockers fixed, Bolke in
> particular.
> > > > > > >
> > > > > > > On Mon, Mar 13, 2017 at 1:07 AM Arthur Wiedmer <
> > art...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Thanks again for steering us through Bolke.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Arthur
> > > > > > > >
> > > > > > > > On Sun, Mar 12, 2017 at 9:59 PM, Bolke de Bruin <
> > > bdbr...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Dear All,
> > > > > > > > >
> > > > > > > > > Finally, I have been able to make the FIFTH RELEASE
> CANDIDATE
> > > of
> > > > > > > Airflow
> > > > > > > > > 1.8.0 available at: https://dist.apache.org/repos/
> > > > > > > > > dist/dev/incubator/airflow/  > > > > > > > > repos/dist/dev/incubator/airflow/> , public keys are
> > available
> > > > at
> > > > > > > > > https://dist.apache.org/repos/dist/release/incubator/
> > airflow/
> > > <
> > > > > > > > > https://dist.apache.org/repos/dist/release/incubator/
> > airflow/>
> > > .
> > > > > It
> > > > > > is
> > > > > > > > > tagged with a local version “apache.incubating” so it
> allows
> > > > > > upgrading
> > > > > > > > from
> > > > > > > > > earlier releases.
> > > > > > > > >
> > > > > > > > > Issues fixed since rc4:
> > > > > > > > >
> > > > > > > > > [AIRFLOW-900] Double trigger should not kill original task
> > > > instance
> > > > > > > > > [AIRFLOW-900] Fixes bugs in LocalTaskJob for double run
> > > > protection
> > > > > > > > > [AIRFLOW-932] Do not mark tasks removed when backfilling
> > > > > > > > > [AIRFLOW-961] run onkill when SIGTERMed
> > > > > > > > > [AIRFLOW-910] Use parallel task execution for backfills
> > > > > > > > > [AIRFLOW-967] Wrap strings in native for py2 ldap
> > compatibility
> > > > > > > > >