Thank you Xiaodong for bringing this up, and pardon me for being late on
this thread. Sharing the setup within Airbnb and some ideas/progress,
which should benefit people who are interested in this topic.

*- Setting-up*:
One-time installation on 1.8 with cherry-picks; planning to move to
containerization after releasing 1.10 internally.

*- Executors*:
CeleryExecutor.

*- Scale*:
400 worker nodes, with celery concurrency on each node varying from 2 to
200 (depending on the queue it is serving).

*- Queues*:
We have 9 queues: 2 of them serve tasks with special environment
dependencies (need GPU, need special packages, etc.) and the other 7 serve
tasks with different resource consumption, which leads to worker nodes
with different celery concurrency and cgroup sizes.
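To make the routing idea above concrete, here is a minimal sketch in Python. The queue names, the tiers, and the `pick_queue` helper are all illustrative inventions of mine, not our real configuration:

```python
# Hypothetical sketch of the queue-routing scheme described above:
# tasks with special environment dependencies go to dedicated queues,
# everything else is routed by resource footprint. All names are made up.

SPECIAL_QUEUES = {
    "gpu": "q_gpu",            # tasks that need GPU nodes
    "custom_env": "q_custom",  # tasks that need special packages
}

# 7 resource-based queues, ordered smallest to largest footprint; each is
# served by workers with matching celery concurrency and cgroup sizes.
RESOURCE_QUEUES = ["q_r1", "q_r2", "q_r3", "q_r4", "q_r5", "q_r6", "q_r7"]

def pick_queue(special_dep=None, resource_tier=0):
    """Return the celery queue name a task should be assigned to."""
    if special_dep is not None:
        return SPECIAL_QUEUES[special_dep]
    return RESOURCE_QUEUES[resource_tier]
```

In Airflow itself this just amounts to setting `queue="q_gpu"` on the operator and starting the matching worker with `airflow worker -q q_gpu -c 2` (the `-q`/`-c` flags pick the queues served and the celery concurrency).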

*- SLA*:
5 mins. 3 mins is possible (our current scheduling delay stays in the
0.5-3 min range) but we wanted some more headroom. (Our SLA is evaluated
by a monitoring task running on the cluster at a 5-minute interval. It
compares the current timestamp against the expected scheduling timestamp
(execution_date + 5 min) and sends the time diff in minutes as one data
point.)
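The SLA check described above boils down to simple timestamp arithmetic; a minimal sketch (the function name and shape are mine, not our actual monitoring code):

```python
from datetime import datetime, timedelta

# Expected scheduling time is execution_date + 5 minutes (our SLA window).
SLA_OFFSET = timedelta(minutes=5)

def scheduling_delay_minutes(execution_date, now):
    """Minutes between when the run was expected to be scheduled
    (execution_date + 5 min) and the current timestamp; this difference
    is the data point the monitoring task emits."""
    expected = execution_date + SLA_OFFSET
    return (now - expected).total_seconds() / 60.0
```

A run observed 7 minutes after its execution_date would report a delay of 2.0 minutes, i.e. still within the headroom before the 5-minute SLA target is considered breached.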

*- # of DAGs/Tasks*:
We're maintaining ~9,500 active DAGs produced by ~1,600 DAG files (the
number of DAG files is actually the biggest scheduling bottleneck now).
During peak hours we have ~15,000 tasks running at the same time.


Xiaodong, you had a very good point about the scheduler being the
performance bottleneck and needing HA. Looking forward to your
contribution on the scheduler HA topic!

About scheduler performance/scaling, I've previously sent a proposal
titled "[Proposal] Scale Airflow" to the dev mailing list. Currently I
have one open PR <https://github.com/apache/incubator-airflow/pull/3830>
improving performance of celery executor querying,
one WIP PR <https://github.com/yrqls21/incubator-airflow/pull/3> (fixing
the CI, ETA this weekend) improving scheduler performance, and one PR in
my backlog (we have an internal PR; it needs to be open sourced) improving
enqueuing performance in both the scheduler and the celery executor. From
our internal stress test results, with all three PRs together, our cluster
can handle *4,000 DAG files and 30,000 peak concurrent running tasks (even
when they all need to be scheduled at the same time)* *within our 5 min
SLA* (calculated using our real DAG file parsing times, which can be as
long as 90 seconds for a single DAG file). I think we will have enough
headroom after the changes are merged, so some longer-term improvements
(separate DAG parsing component/service, scheduler sharding, distributing
scheduler responsibility to workers, etc.) can wait a bit. But of course
I'm open to hearing other ideas that can scale Airflow better.
Also I do want to mention that with faster scheduling and DAG parsing, the
DB might become the performance bottleneck. With our stress test setup we
can handle the DB load with an AWS RDS r3.4xlarge instance (only with the
improvement PRs). And the webserver is not scaling very well either, as it
parses all DAG files in a single-process fashion; that is the next item I
plan to work on.


Cheers,
Kevin Y

On Thu, Sep 6, 2018 at 11:14 PM ramandu...@gmail.com <ramandu...@gmail.com>
wrote:

> Yeah, we are seeing the scheduler becoming a bottleneck as the number of
> DAG files increases, since the scheduler can scale vertically but not
> horizontally.
> We are trying multiple independent Airflow setups and distributing the
> load between them.
> But managing these many Airflow clusters is becoming a challenge.
>
> Thanks,
> Raman Gupta
>
> On 2018/09/06 14:55:26, Deng Xiaodong <xd.den...@gmail.com> wrote:
> > Thanks for sharing, Raman.
> >
> > Based on what you shared, I think there are two points that may be worth
> > further discussing/thinking.
> >
> > *Scaling up (given thousands of DAGs):*
> > If you have thousands of DAGs, you may encounter longer scheduling
> > latency (actual start time minus planned start time).
> > For workers, we can scale horizontally by adding more worker nodes,
> > which is relatively straightforward.
> > But the *Scheduler* may become another bottleneck. The Scheduler can
> > only run on one node (please correct me if I'm wrong). Even if we can
> > use multiple threads for it, it has its limits. HA is another concern.
> > This is also what our team is looking into at this moment, since the
> > scheduler is the biggest "bottleneck" we have identified so far (anyone
> > have experience tuning scheduler performance?).
> >
> > *Broker for Celery Executor*:
> > You may want to try RabbitMQ rather than Redis/SQL as the broker.
> > Actually the Celery community had a proposal to deprecate Redis as a
> > broker (of course this proposal was rejected eventually) [
> > https://github.com/celery/celery/issues/3274].
> >
> >
> > Regards,
> > XD
> >
> >
> >
> >
> >
> > On Thu, Sep 6, 2018 at 6:10 PM ramandu...@gmail.com <
> ramandu...@gmail.com>
> > wrote:
> >
> > > Hi,
> > > We have a requirement to scale to run 1000s of concurrent DAGs. With
> > > celery executor we observed that the Airflow worker sometimes gets
> > > stuck if the connection to redis/mysql breaks
> > > (https://github.com/celery/celery/issues/3932,
> > > https://github.com/celery/celery/issues/4457).
> > > Currently we are using Airflow 1.9 with LocalExecutor but are planning
> > > to switch to Airflow 1.10 with the K8s Executor.
> > >
> > > Thanks,
> > > Raman Gupta
> > >
> > >
> > > On 2018/09/05 12:56:38, Deng Xiaodong <xd.den...@gmail.com> wrote:
> > > > Hi folks,
> > > >
> > > > Could you kindly share how your organization is setting up and
> > > > using Airflow, especially in terms of architecture? For example,
> > > >
> > > > - *Setting-Up*: Do you install Airflow in a "one-time" fashion, or
> > > > in a containerized fashion?
> > > > - *Executor:* Which executor are you using (*LocalExecutor*,
> > > > *CeleryExecutor*, etc.)? I believe most production environments are
> > > > using *CeleryExecutor*?
> > > > - *Scale*: If using Celery, normally how many worker nodes do you
> > > > add? (For sure this is up to the workload and performance of your
> > > > worker nodes.)
> > > > - *Queue*: Is the Queue feature
> > > > <https://airflow.apache.org/concepts.html#queues> used in your
> > > > architecture? For what advantage? (For example, explicitly assigning
> > > > network-bound tasks to a worker node whose parallelism can be much
> > > > higher than its # of cores.)
> > > > - *SLA*: Do you have any SLA for your scheduling? (This is inspired
> > > > by @yrqls21's PR 3830
> > > > <https://github.com/apache/incubator-airflow/pull/3830>.)
> > > > - etc.
> > > >
> > > > Setting up Airflow can be done quite flexibly, but I believe there
> > > > is some sort of best practice, especially in organisations where
> > > > scalability is essential.
> > > >
> > > > Thanks for sharing in advance!
> > > >
> > > >
> > > > Best regards,
> > > > XD
> > > >
> > >
> >
>
