A few related thoughts:

* There may be hiccups around concurrency (pools, queues), though the worker should double-check that the constraints are still met when firing the task, so in theory this should be OK.
* There may be more "misfires", meaning the task gets sent to the worker, but by the time it starts the conditions aren't met anymore because of a race condition with one of the other schedulers. Here I'm assuming recent versions of Airflow will simply eventually re-fire the misfires and heal.
* Cross-DAG prioritization can't really take place anymore, as there's no shared "ready-to-run" list of task instances that can be sorted by priority_weight. Whichever scheduler instance fires first is likely to get the open slots first.
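To make the last point concrete, here is a toy sketch of the difference. It is not Airflow code: `ReadyTask`, `single_scheduler`, `sharded_schedulers`, and the task names are all made up for illustration, and the only thing borrowed from Airflow is the idea that a higher priority_weight should run first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadyTask:
    task_id: str
    shard: int            # hypothetical: which scheduler instance owns the DAG
    priority_weight: int  # higher should run first


def single_scheduler(ready, open_slots):
    """One shared ready-to-run list, sorted by priority_weight."""
    ranked = sorted(ready, key=lambda t: t.priority_weight, reverse=True)
    return [t.task_id for t in ranked[:open_slots]]


def sharded_schedulers(ready, open_slots, firing_order):
    """Each shard ranks only its own tasks; shards claim slots in firing order."""
    claimed = []
    for shard in firing_order:
        own = sorted(
            (t for t in ready if t.shard == shard),
            key=lambda t: t.priority_weight,
            reverse=True,
        )
        for task in own:
            if len(claimed) == open_slots:
                return claimed
            claimed.append(task.task_id)
    return claimed


ready = [
    ReadyTask("etl_low_a", shard=0, priority_weight=1),
    ReadyTask("etl_low_b", shard=0, priority_weight=2),
    ReadyTask("report_high", shard=1, priority_weight=10),
]

# Shared list: the high-priority task gets one of the two slots.
print(single_scheduler(ready, open_slots=2))
# Sharded, shard 0 fires first: both slots go to low-priority tasks.
print(sharded_schedulers(ready, open_slots=2, firing_order=[0, 1]))
```

With two open slots, the shared list picks `report_high` first, while the sharded version hands both slots to shard 0 simply because it fired first.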

Max

On Wed, Oct 31, 2018 at 1:00 PM Kevin Yang <yrql...@gmail.com> wrote:

> Finally we start to talk about this seriously? Yeah! :D
>
> For your approach, a few thoughts:
>
> 1. Sharding by number of files may not yield the same load, and the load
>    may even be very different, since we may have some framework DAG file
>    that produces 500 DAGs and takes forever to parse.
> 2. I think Alex Guziel <https://github.com/saguziel> had previously
>    talked about using Apache Helix to shard the scheduler. I haven't
>    looked into it much, but it may be something you're interested in. I
>    personally like that idea because we don't need to reinvent the wheel
>    for a lot of stuff (less code to maintain also ;) ).
> 3. About the DB part, I should be contributing back some changes that
>    can dramatically drop the DB CPU usage. Afterwards I think we should
>    have plenty of headroom (assuming the traffic is ~4000 DAG files and
>    ~40k concurrently running task instances), so we should probably be
>    fine here.
>
> Also, I'm kinda curious about your setup and want to understand why you
> need to shard the scheduler, since the scheduler can now scale up pretty
> high actually.
>
> Thank you for initiating the discussion. I think it can turn out to be a
> very valuable and critical discussion. Many people have been thinking
> about and discussing this, and I can't wait to hear the ideas :D
>
> Cheers,
> Kevin Y
>
> On Wed, Oct 31, 2018 at 7:38 AM Deng Xiaodong <xd.den...@gmail.com> wrote:
>
> > Hi Folks,
> >
> > Previously I initiated a discussion about best practices for setting
> > up Airflow, and a few folks agreed that the scheduler may become one
> > of the bottleneck components (we can only run one scheduler instance,
> > can only scale vertically rather than horizontally, etc.). Especially
> > when we have thousands of DAGs, the scheduling latency may be high.
> >
> > In our team, we have experimented with a naive multiple-scheduler
> > architecture. I would like to share it here and also seek your input.
> >
> > **1. Background**
> > - Inside DAG_Folder, we can have sub-folders.
> > - When we start a scheduler instance, we can specify “--subdir” for
> >   it, which sets the specific directory that the scheduler is going to
> >   “scan” (https://airflow.apache.org/cli.html#scheduler).
> >
> > **2. Our Naive Idea**
> > Say we have 2,000 DAGs. If we run one single scheduler instance, one
> > scheduling loop will traverse all 2K DAGs.
> >
> > Our idea is:
> > Step-1: Create multiple sub-directories, say five, under DAG_Folder
> > (subdir1, subdir2, …, subdir5)
> > Step-2: Distribute the DAGs evenly into these sub-directories (400
> > DAGs in each)
> > Step-3: Then we can start a scheduler instance on each of 5 different
> > machines, using the command `airflow scheduler --subdir subdir<i>` on
> > machine <i>.
> >
> > Hence eventually, each scheduler only needs to take care of 400 DAGs.
> >
> > **3. Test & Results**
> > - We ran a test using 2,000 DAGs (3 tasks in each DAG).
> > - DAGs are stored on network-attached storage (the same drive mounted
> >   on all nodes), so we aren’t concerned about DAG_Folder
> >   synchronization.
> > - No conflicts observed (each DAG file is parsed & scheduled by only
> >   one scheduler instance).
> > - The scheduling speed improves almost linearly, demonstrating that we
> >   can scale the scheduler horizontally.
> >
> > **4. Highlight**
> > - This naive idea doesn’t address scheduler availability.
> > - As Kevin Yang shared earlier in another thread, the database may be
> >   another bottleneck when the load is high. But this is not considered
> >   here yet.
> >
> > Kindly share your thoughts on this naive idea. Thanks.
> >
> > Best regards,
> > XD
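
For reference, Step-1 to Step-3 of the quoted proposal, combined with Kevin's point that file count is a poor proxy for load, could be sketched roughly as below. This is only an illustration under assumptions: the per-file parse costs are invented numbers (how to measure them is not covered here), `assign_shards` and `scheduler_command` are hypothetical helpers, and only the `airflow scheduler --subdir subdir<i>` command comes from the thread. With all costs equal, the greedy assignment reduces to the even split by count from Step-2.

```python
import heapq


def assign_shards(parse_cost, n_shards):
    """Greedy balancing: heaviest file first, into the currently lightest shard.

    parse_cost maps a DAG file name to an estimated parse cost (assumed input).
    Returns {shard number (1-based): [file names]}.
    """
    heap = [(0.0, i) for i in range(1, n_shards + 1)]  # (total cost, shard)
    shards = {i: [] for i in range(1, n_shards + 1)}
    # Sort by descending cost, then by name so ties are deterministic.
    for name, cost in sorted(parse_cost.items(), key=lambda kv: (-kv[1], kv[0])):
        total, shard = heapq.heappop(heap)
        shards[shard].append(name)
        heapq.heappush(heap, (total + cost, shard))
    return shards


def scheduler_command(shard):
    """Command from the proposal, to be run on machine <shard>."""
    return f"airflow scheduler --subdir subdir{shard}"


# Hypothetical costs: one "framework" file that dominates parsing time.
costs = {"framework.py": 50.0, "a.py": 1.0, "b.py": 1.0, "c.py": 1.0}
for shard, files in assign_shards(costs, n_shards=2).items():
    print(scheduler_command(shard), files)
```

Actually moving the files into `subdir<i>` is left out on purpose; the sketch only decides which file would go where.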