Re: Concurrent schedulers

2017-05-23 Thread siddharth anand
I did run into "double SLA miss alarms" firing, but that was on 1.7.x. I
haven't tested whether that is still an issue in 1.8.x.

-s

On Tue, May 23, 2017 at 8:46 AM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> Awesome. I wasn't aware of DagRun locking, this is even better!
>
> Max


Re: Concurrent schedulers

2017-05-23 Thread Maxime Beauchemin
Awesome. I wasn't aware of DagRun locking; this is even better!

Max

On Mon, May 22, 2017 at 11:39 PM, Bolke de Bruin wrote:

> Hi Max,
>
> We seem to be in quite good order already. We are testing with multi-master
> MySQL and will also test multi-master Postgres. As we are already doing
> DagRun-level locking, DAG-level locking does not seem to be required. Tasks
> are also being locked, so if multiple schedulers are running everything seems
> to be quite fine. If one of the schedulers restarts, it starts checking for
> orphaned tasks by checking the executor queue, which is unique to every
> scheduler. This will result in some tasks being dequeued and then requeued.
> So Airflow is robust enough to stay alive then (with my patch for deadlocks
> applied), but some things are a bit sub-optimal.
>
> As mentioned, we are still stress testing this setup and we might find more.
>
> Bolke


Re: Concurrent schedulers

2017-05-23 Thread Bolke de Bruin
Hi Max,

We seem to be in quite good order already. We are testing with multi-master
MySQL and will also test multi-master Postgres. As we are already doing
DagRun-level locking, DAG-level locking does not seem to be required. Tasks
are also being locked, so if multiple schedulers are running everything seems
to be quite fine. If one of the schedulers restarts, it starts checking for
orphaned tasks by checking the executor queue, which is unique to every
scheduler. This will result in some tasks being dequeued and then requeued. So
Airflow is robust enough to stay alive then (with my patch for deadlocks
applied), but some things are a bit sub-optimal.
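
To make the task locking and the restart behaviour above concrete, here is a
minimal sketch. It is not Airflow's code: the table task_instance, its columns,
and the functions claim_task and reset_orphans are hypothetical names, and it
uses SQLite from the standard library rather than MySQL or Postgres. It shows
one way a task can be claimed by exactly one scheduler (a conditional UPDATE),
and one reading of why a restarting scheduler requeues tasks: anything marked
queued that is not in its own executor queue looks orphaned to it.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table for illustration; not Airflow's schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_instance ("
        " task_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " queued_by TEXT)"
    )
    conn.commit()


def claim_task(conn, task_id, scheduler_id):
    """Move a task from 'scheduled' to 'queued'; True only for the one winner."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE task_instance SET state = 'queued', queued_by = ? "
            "WHERE task_id = ? AND state = 'scheduled'",
            (scheduler_id, task_id),
        )
    # The WHERE clause matches at most once, so a second scheduler sees 0 rows.
    return cur.rowcount == 1


def reset_orphans(conn, executor_queue):
    """Reset queued tasks that are absent from this scheduler's executor queue."""
    with conn:
        rows = conn.execute(
            "SELECT task_id FROM task_instance "
            "WHERE state = 'queued' ORDER BY task_id"
        ).fetchall()
        orphans = [task_id for (task_id,) in rows if task_id not in executor_queue]
        conn.executemany(
            "UPDATE task_instance SET state = 'scheduled', queued_by = NULL "
            "WHERE task_id = ?",
            [(task_id,) for task_id in orphans],
        )
    return orphans
```

In this sketch a task queued by another scheduler is also reset, because the
executor queue being checked only knows about the local scheduler's tasks. That
would match the dequeue-and-requeue effect described above.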

As mentioned we are still stress testing this setup and we might find more.

Bolke

> On 22 May 2017, at 18:19, Maxime Beauchemin wrote:
> 
> Things that might be needed for a correct multi-scheduler setup:
> * DAG-level lock while the DAG is being evaluated
> * DAG-level lock expiration, to recover from a situation where the
> lock wasn't released
> * Accumulation of the list of task instances to run in the database (as
> opposed to cross-process communication to the master process)
> * A clearly defined master cycle that reads the list of accumulated task
> instances from the DB, then dedups, prioritizes and schedules them. That
> master cycle should have a lock (and lock expiration) as well.
> 
> Max



Re: Concurrent schedulers

2017-05-22 Thread Maxime Beauchemin
Things that might be needed for a correct multi-scheduler setup:
* DAG-level lock while the DAG is being evaluated
* DAG-level lock expiration, to recover from a situation where the
lock wasn't released
* Accumulation of the list of task instances to run in the database (as
opposed to cross-process communication to the master process)
* A clearly defined master cycle that reads the list of accumulated task
instances from the DB, then dedups, prioritizes and schedules them. That
master cycle should have a lock (and lock expiration) as well.
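
As an illustration of the first two bullets (a lock plus lock expiration), here
is a minimal sketch of a lease stored in the database. Nothing in it comes from
Airflow: the table dag_lock and the functions try_lock and unlock are
hypothetical, and SQLite stands in for the real metadata database. A scheduler
gets the lock only if no unexpired lease exists, so a lease left behind by a
crashed scheduler can be taken over once it expires.

```python
import sqlite3
import time


def init(conn):
    # Hypothetical lock table: one row per locked DAG.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dag_lock ("
        " dag_id TEXT PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_lock(conn, dag_id, owner, ttl, now=None):
    """Return True if this call acquired the lock (the lock is not re-entrant)."""
    now = time.time() if now is None else now
    with conn:  # single transaction: commit on success, roll back on error
        # Take over an existing row only if its lease has expired.
        cur = conn.execute(
            "UPDATE dag_lock SET owner = ?, expires_at = ? "
            "WHERE dag_id = ? AND expires_at <= ?",
            (owner, now + ttl, dag_id, now),
        )
        if cur.rowcount == 1:
            return True
        # Otherwise try to create the row; this is ignored if a live lease exists.
        cur = conn.execute(
            "INSERT OR IGNORE INTO dag_lock (dag_id, owner, expires_at) "
            "VALUES (?, ?, ?)",
            (dag_id, owner, now + ttl),
        )
        return cur.rowcount == 1


def unlock(conn, dag_id, owner):
    """Release the lock, but only if `owner` is the one holding it."""
    with conn:
        conn.execute(
            "DELETE FROM dag_lock WHERE dag_id = ? AND owner = ?",
            (dag_id, owner),
        )
```

The ttl would need to be longer than the slowest DAG evaluation; otherwise two
schedulers could end up evaluating the same DAG at once. The same pattern could
presumably guard the master cycle in the last bullet.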

Max

On Mon, May 22, 2017 at 12:27 AM, Bolke de Bruin wrote:

> Hi Stephen,
>
> We are currently stress testing Airflow for use in a multi-master setup.
> One of my team members is doing a write-up that should show up online
> shortly. TL;DR: in its current state Airflow will need some patches in
> order to run concurrently. One issue is that Airflow can hit a database
> deadlock, which will stop the scheduler from running. I have a patch for
> that out here (https://github.com/apache/incubator-airflow/pull/2267) that
> works fine on Postgres/MySQL (tests don’t pass on SQLite yet due to
> limitations of SQLite).
>
> Your global scheduler lock (e.g. via an active-passive configuration) might
> make most sense for now.
>
> Bolke


Re: Concurrent schedulers

2017-05-22 Thread Bolke de Bruin
Hi Stephen,

We are currently stress testing Airflow for use in a multi-master setup. One of
my team members is doing a write-up that should show up online shortly. TL;DR:
in its current state Airflow will need some patches in order to run
concurrently. One issue is that Airflow can hit a database deadlock, which will
stop the scheduler from running. I have a patch for that out here
(https://github.com/apache/incubator-airflow/pull/2267) that works fine on
Postgres/MySQL (tests don’t pass on SQLite yet due to limitations of SQLite).

Your global scheduler lock (e.g. via an active-passive configuration) might make
most sense for now.
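
For a single host, a "global scheduler lock" of this kind could be sketched as
below. This is an assumption about how one might do it, not something Airflow
provides: acquire_scheduler_lock, release_scheduler_lock and the lock file path
are hypothetical, and fcntl is only available on Unix-like systems. A wrapper
would take the lock before starting the scheduler and exit if it is already held.

```python
import fcntl
import os


def acquire_scheduler_lock(path):
    """Return a file descriptor holding the lock, or None if it is already held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Exclusive, non-blocking lock: fails at once if another holder exists.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_scheduler_lock(fd):
    """Release the lock; the kernel also drops it if the process dies."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

Because the kernel releases the lock when the holding process exits, no
expiration logic is needed here. This only protects schedulers on the same
machine; an active-passive pair on different hosts would need a shared lock
(for example in the metadata database) instead.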

Bolke

> On 22 May 2017, at 07:52, Stephen Rigney wrote:
> 
> Hi,
> 
> We're running Airflow in production, but for reliability (n.b. not
> performance) we'd like to confirm whether it is safe to spawn multiple
> instances of the scheduler overlapping in time (otherwise we may need to put
> more effort into ensuring two copies aren't ever spawned at once in our
> environment).
>
> It seems this officially wasn't a supported configuration back in 2015 (
> https://groups.google.com/d/msg/airbnb_airflow/-1wKa3OcwME/uATa8y3YDAAJ ),
> but has sufficient intra-Airflow locking been added that it is now safe to
> start up two temporally overlapping instances of the scheduler for the same
> Airflow system?
>
> Or should we hack in a "global scheduler lock"? We're not looking for
> increased performance through scheduler parallelism, just an assurance that
> if we ever fire up two instances of the scheduler nothing terrible happens.
> 
> 
> Stephen