Yes, we are seeing the scheduler become a bottleneck as the number of DAG
files increases, since the scheduler can scale vertically but not horizontally.
We are trying multiple independent Airflow setups and distributing the load
between them.
But managing this many Airflow clusters is becoming a challenge.
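
To illustrate what distributing the load between independent setups could look like, here is a minimal sketch that assigns each DAG file to one cluster using a stable hash of its path. This is only an assumption about one possible approach, not a description of our actual tooling; the names `assign_cluster` and `partition` and the example file paths are hypothetical.

```python
import hashlib


def assign_cluster(dag_file: str, num_clusters: int) -> int:
    """Map a DAG file path to a cluster index using a stable hash.

    Hypothetical helper: the same path always maps to the same index
    as long as num_clusters does not change.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    digest = hashlib.sha256(dag_file.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_clusters


def partition(dag_files, num_clusters):
    """Group DAG file paths by the cluster index they are assigned to."""
    buckets = {i: [] for i in range(num_clusters)}
    for path in dag_files:
        buckets[assign_cluster(path, num_clusters)].append(path)
    return buckets


if __name__ == "__main__":
    # Example paths are made up for illustration.
    files = ["dags/etl_orders.py", "dags/etl_users.py", "dags/reports.py"]
    for cluster, paths in partition(files, 2).items():
        print(cluster, paths)
```

A plain modulo keeps the sketch short, but changing the number of clusters would remap most DAG files, so a consistent-hashing scheme may be preferable if clusters are added or removed often.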

Thanks,
Raman Gupta

On 2018/09/06 14:55:26, Deng Xiaodong <xd.den...@gmail.com> wrote: 
> Thanks for sharing, Raman.
> 
> Based on what you shared, I think there are two points that may be worth
> discussing further.
> 
> *Scaling up (given thousands of DAGs):*
> If you have thousands of DAGs, you may encounter longer scheduling latency
> (actual start time minus planned start time).
> For workers, we can scale horizontally by adding more worker nodes, which
> is relatively straightforward.
> But the *Scheduler* may become another bottleneck. The scheduler can only
> run on one node (please correct me if I'm wrong). Even if we can use multiple
> threads for it, it has its limits. HA is another concern. This is also what
> our team is looking into at the moment, since the scheduler is the biggest
> "bottleneck" we have identified so far (does anyone have experience tuning
> scheduler performance?).
> 
> *Broker for Celery Executor*:
> You may want to try RabbitMQ rather than Redis/SQL as the broker. The
> Celery community actually had a proposal to deprecate Redis as a broker
> (this proposal was eventually rejected)
> [https://github.com/celery/celery/issues/3274].
> 
> 
> Regards,
> XD
> 
> 
> 
> 
> 
> On Thu, Sep 6, 2018 at 6:10 PM ramandu...@gmail.com <ramandu...@gmail.com>
> wrote:
> 
> > Hi,
> > We have a requirement to scale to thousands of concurrent DAGs. With the
> > Celery executor we observed that
> > the Airflow worker sometimes gets stuck if the connection to Redis/MySQL
> > breaks (https://github.com/celery/celery/issues/3932
> > https://github.com/celery/celery/issues/4457).
> > Currently we are using Airflow 1.9 with LocalExecutor but are planning to
> > switch to Airflow 1.10 with the Kubernetes executor.
> >
> > Thanks,
> > Raman Gupta
> >
> >
> > On 2018/09/05 12:56:38, Deng Xiaodong <xd.den...@gmail.com> wrote:
> > > Hi folks,
> > >
> > > Could you kindly share how your organization sets up and uses Airflow,
> > > especially in terms of architecture? For example,
> > >
> > > - *Setting-Up*: Do you install Airflow in a "one-time" fashion, or in a
> > > containerized fashion?
> > > - *Executor:* Which executor are you using (*LocalExecutor*,
> > > *CeleryExecutor*, etc)? I believe most production environments are using
> > > *CeleryExecutor*?
> > > - *Scale*: If using Celery, normally how many worker nodes do you add?
> > > (for sure this is up to workloads and performance of your worker nodes).
> > > - *Queue*: Is the Queue feature
> > > <https://airflow.apache.org/concepts.html#queues> used in your
> > > architecture? If so, for what advantage? (for example, explicitly
> > > assigning network-bound tasks to a worker node whose parallelism can be
> > > much higher than its # of cores)
> > > - *SLA*: Do you have any SLA for your scheduling? (this is inspired by
> > > @yrqls21's PR 3830
> > > <https://github.com/apache/incubator-airflow/pull/3830>)
> > > - etc.
> > >
> > > Airflow's setup can be quite flexible, but I believe there are some
> > > best practices, especially in organisations where scalability is
> > > essential.
> > >
> > > Thanks for sharing in advance!
> > >
> > >
> > > Best regards,
> > > XD
> > >
> >
> 
