A few related thoughts:
* there may be hiccups around concurrency (pools, queues), though the
worker should double-check that the constraints are still met when firing
the task, so in theory this should be ok
* there may be more "misfires", meaning the task gets sent to the worker,
but by the time it starts, the conditions aren't met anymore because of a
race condition with one of the other schedulers. Here I'm assuming recent
versions of Airflow will simply re-fire the misfires eventually and heal
(see the sketch below)
* cross-DAG prioritization can't really take place anymore, as there's no
longer a shared "ready-to-run" list of task instances that can be sorted
by priority_weight. Whichever scheduler instance fires first is likely to
get the open slots first.
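
To make the first two points concrete, here is a toy sketch (plain Python,
not Airflow internals) of the worker-side double-check: two workers race
for the last slot of a pool, and the loser treats it as a misfire to be
picked up again by a later scheduling loop.

import threading

class Pool:
    """Toy stand-in for an Airflow pool with a fixed number of slots."""

    def __init__(self, slots):
        self.slots = slots
        self._lock = threading.Lock()

    def try_acquire(self):
        # The worker-side double-check: claim a slot atomically, so two
        # schedulers firing tasks into the same pool can't both succeed.
        with self._lock:
            if self.slots > 0:
                self.slots -= 1
                return True
            return False

pool = Pool(slots=1)

def worker(name):
    if pool.try_acquire():
        print(name, "got the slot and runs the task")
    else:
        # A "misfire": conditions changed between scheduling and execution;
        # the task goes back to be re-fired by a later scheduling loop.
        print(name, "misfired and will be retried later")

threads = [threading.Thread(target=worker, args=("worker-%d" % i,))
           for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()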

Max


On Wed, Oct 31, 2018 at 1:00 PM Kevin Yang <yrql...@gmail.com> wrote:

> Finally we're starting to talk about this seriously? Yeah! :D
>
> For your approach, a few thoughts:
>
>    1. Sharding by # of files may not yield the same load--possibly a very
>    different load--since we may have some framework DAG file that produces
>    500 DAGs and takes forever to parse (see the sketch after this list).
>    2. I think Alex Guziel <https://github.com/saguziel> had previously
>    talked about using Apache Helix to shard the scheduler. I haven't
>    looked into it a lot, but it may be something you're interested in. I
>    personally like that idea because we don't need to reinvent the wheel
>    for a lot of stuff (less code to maintain, too ;) ).
>    3. About the DB part, I should be contributing back some changes that
>    can dramatically drop the DB CPU usage. Afterwards I think we should
>    have plenty of headroom (assuming the traffic is ~4,000 DAG files and
>    ~40k concurrently running task instances), so we should probably be
>    fine here.
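>
> To illustrate point 1, here is a toy sketch (made-up parse times, plain
> Python, not Airflow code) of sharding by measured parse cost instead of
> by file count, so one heavy framework file doesn't skew a shard:
>
> import heapq
>
> # Hypothetical measurements of how long each DAG file takes to parse.
> parse_seconds = {
>     "framework_dags.py": 120.0,  # produces 500 DAGs
>     "etl_a.py": 1.5,
>     "etl_b.py": 2.0,
>     "reports.py": 0.8,
> }
>
> def shard_by_load(costs, n_shards):
>     # Greedy bin packing: always hand the heaviest remaining file
>     # to the currently lightest shard.
>     heap = [(0.0, i) for i in range(n_shards)]  # (total cost, shard id)
>     heapq.heapify(heap)
>     shards = [[] for _ in range(n_shards)]
>     for fname, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
>         total, i = heapq.heappop(heap)
>         shards[i].append(fname)
>         heapq.heappush(heap, (total + cost, i))
>     return shards
>
> print(shard_by_load(parse_seconds, n_shards=2))
> # The framework file ends up alone; the small files share the other shard.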
>
> Also I'm kinda curious about your setup and want to understand why you
> need to shard the scheduler, since the scheduler can now actually scale
> up pretty high.
>
> Thank you for initiating the discussion; I think it can turn out to be a
> very valuable and critical one--many people have been thinking about and
> discussing this, and I can't wait to hear the ideas :D
>
> Cheers,
> Kevin Y
>
> On Wed, Oct 31, 2018 at 7:38 AM Deng Xiaodong <xd.den...@gmail.com> wrote:
>
> > Hi Folks,
> >
> > Previously I initiated a discussion about best practices for setting up
> > Airflow, and a few folks agreed that the scheduler may become one of the
> > bottleneck components (we can only run one scheduler instance, it can
> > only scale vertically rather than horizontally, etc.). Especially when
> > we have thousands of DAGs, the scheduling latency may be high.
> >
> > In our team, we have experimented with a naive multiple-scheduler
> > architecture. I would like to share it here, and also seek input from
> > you.
> >
> > **1. Background**
> > - Inside DAG_Folder, we can have sub-folders.
> > - When we start a scheduler instance, we can specify “--subdir” for it,
> > which sets the specific directory that the scheduler is going to
> > “scan” (https://airflow.apache.org/cli.html#scheduler).
> >
> > **2. Our Naive Idea**
> > Say we have 2,000 DAGs. If we run a single scheduler instance, one
> > scheduling loop will traverse all 2,000 DAGs.
> >
> > Our idea is:
> > Step-1: Create multiple sub-directories, say five, under DAG_Folder
> > (subdir1, subdir2, …, subdir5)
> > Step-2: Distribute the DAGs evenly into these sub-directories (400 DAGs
> > in each; see the sketch below)
> > Step-3: Then we can start a scheduler instance on each of 5 different
> > machines, using the command `airflow scheduler --subdir subdir<i>` on
> > machine <i>.
> >
> > Hence eventually, each scheduler only needs to take care of 400 DAGs.
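> >
> > As a minimal sketch of Step-2 (the path below is an assumption; adjust
> > it to your own DAG_Folder), the even distribution can be done
> > round-robin:
> >
> > import shutil
> > from pathlib import Path
> >
> > DAG_FOLDER = Path("/path/to/dags")  # assumed location of DAG_Folder
> > N = 5
> >
> > # Create subdir1 ... subdir5 if they don't exist yet.
> > subdirs = [DAG_FOLDER / ("subdir%d" % i) for i in range(1, N + 1)]
> > for d in subdirs:
> >     d.mkdir(exist_ok=True)
> >
> > # Move each DAG file into a sub-directory, round-robin.
> > for idx, dag_file in enumerate(sorted(DAG_FOLDER.glob("*.py"))):
> >     shutil.move(str(dag_file), str(subdirs[idx % N] / dag_file.name))
> >
> > # Then on machine <i>: airflow scheduler --subdir /path/to/dags/subdir<i>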
> >
> > **3. Test & Results**
> > - We have done a test using 2,000 DAGs (3 tasks in each DAG).
> > - DAGs are stored on network-attached storage (the same drive mounted on
> > all nodes), so we don’t need to worry about DAG_Folder synchronization.
> > - No conflicts were observed (each DAG file is only parsed & scheduled
> > by one scheduler instance).
> > - The scheduling speed improves almost linearly, demonstrating that we
> > can scale the scheduler horizontally.
> >
> > **4. Highlight**
> > - This naive idea doesn’t address scheduler availability.
> > - As Kevin Yang shared earlier in another thread, the database may be
> > another bottleneck when the load is high. But this is not considered
> > here yet.
> >
> >
> > Kindly share your thoughts on this naive idea. Thanks.
> >
> >
> >
> > Best regards,
> > XD
> >
