Awesome. I wasn't aware of DagRun locking, this is even better!

Max

On Mon, May 22, 2017 at 11:39 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Hi Max,
>
> We seem to be in quite good order already. We are testing with multi-master
> MySQL and will also test multi-master Postgres. As we are already doing
> DagRun-level locking, DAG-level locking does not seem to be required. Tasks
> are being locked as well, so if multiple schedulers are running everything
> seems to be quite fine. If one of the schedulers restarts, it starts
> checking for orphaned tasks by checking the executor queue, which is unique
> for every scheduler. This will result in some tasks being dequeued and then
> requeued. So Airflow is robust enough to stay alive then (with my patch for
> deadlocks applied), but some things are a bit sub-optimal.
>
> As mentioned, we are still stress testing this setup and we might find more.
>
> Bolke
>
> > On 22 May 2017, at 18:19, Maxime Beauchemin <maximebeauche...@gmail.com>
> > wrote:
> >
> > Things that might be needed for a correct multi-scheduler setup:
> > * DAG-level lock while being evaluated
> > * DAG-level lock expiration to recover from a potential situation where
> >   the lock wasn't released
> > * Accumulation of the list of task instances to run into the database (as
> >   opposed to cross-process communication to the master process)
> > * Define a clear master cycle that would read the list of accumulated task
> >   instances from the DB, dedup, prioritize and schedule. That master cycle
> >   should have a lock (and lock expiration) as well.
> >
> > Max
> >
> > On Mon, May 22, 2017 at 12:27 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> >> Hi Stephen,
> >>
> >> We are currently stress testing Airflow for use in a multi-master setup.
> >> One of my team members is doing a write-up that should show up online
> >> shortly. TL;DR: in its current state Airflow will need some patches in
> >> order to run concurrently. One issue is that Airflow can have a database
> >> deadlock which will stop the scheduler from running. I have a patch for
> >> that out here (https://github.com/apache/incubator-airflow/pull/2267)
> >> that works fine on Postgres/MySQL (tests don’t pass on SQLite yet due to
> >> limitations of SQLite).
> >>
> >> Your global scheduler lock (e.g. by an active-passive configuration)
> >> might make most sense for now.
> >>
> >> Bolke
> >>
> >>> On 22 May 2017, at 07:52, Stephen Rigney <sjrig...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We're running Airflow in production, but for reliability (n.b. not
> >>> performance) we'd like to confirm whether it is safe to spawn multiple
> >>> instances of the scheduler overlapping in time (otherwise we may need
> >>> to put more effort into assuring two copies aren't ever spawned at once
> >>> in our environment).
> >>>
> >>> It seems this officially wasn't a supported configuration back in 2015
> >>> (https://groups.google.com/d/msg/airbnb_airflow/-1wKa3OcwME/uATa8y3YDAAJ),
> >>> but has sufficient intra-Airflow locking been added that it is now safe
> >>> to start up two temporally overlapping instances of the scheduler for
> >>> the same Airflow system?
> >>>
> >>> Or should we hack in a "global scheduler lock"? We're not looking for
> >>> increased performance by scheduler parallelism, just that if we ever
> >>> fire up two instances of the scheduler nothing terrible happens.
> >>>
> >>> Stephen
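
For readers following the thread: the "lock with expiration" idea that comes up above (Max's DAG-level and master-cycle locks, Stephen's "global scheduler lock") can be illustrated with a small database-backed lease. This is only a rough sketch under stated assumptions, not how Airflow implements its locking. The table name `scheduler_lock` and the functions `init_lock_table`, `try_acquire` and `release` are hypothetical, and SQLite is used purely to keep the example self-contained. A real MySQL or Postgres deployment would need to consider that database's own transaction and row-locking semantics.

```python
import sqlite3
import time


def init_lock_table(conn):
    # Hypothetical table: one row per named lock, with an owner and an expiry time.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " name TEXT PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn, name, owner, ttl, now=None):
    """Try to take (or renew) the lock `name` for `owner` for `ttl` seconds.

    Returns True if `owner` holds the lock afterwards, False otherwise.
    """
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on normal exit, roll back on error
        # Clear the row if it has expired or if we already own it (renewal).
        conn.execute(
            "DELETE FROM scheduler_lock"
            " WHERE name = ? AND (expires_at <= ? OR owner = ?)",
            (name, now, owner),
        )
        try:
            conn.execute(
                "INSERT INTO scheduler_lock (name, owner, expires_at)"
                " VALUES (?, ?, ?)",
                (name, owner, now + ttl),
            )
        except sqlite3.IntegrityError:
            # Another owner still holds an unexpired lock.
            return False
    return True


def release(conn, name, owner):
    # Only the current owner may release the lock.
    with conn:
        conn.execute(
            "DELETE FROM scheduler_lock WHERE name = ? AND owner = ?",
            (name, owner),
        )
```

In this sketch a scheduler process would call `try_acquire` at the start of each cycle and skip the cycle when it returns False. The expiry is what lets a second instance take over if the first one dies without releasing the lock, which is the recovery case Max describes.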