"celeryd_concurrency" and "parallelism" serve different purposes
"celeryd_concurrency" determines how many worker subprocesses will be spun up on an airflow worker instance (how many concurrent tasks per machine) "parallelism" determines how many tasks airflow will schedule at once; running and queued tasks count against this limit Generally, you will want to set "celeryd_concurrency" to as high a value as your worker nodes can handle (or a lower value and use low powered workers) As a general rule of thumb, you will want to set "parallelism" to be (number of workers) * ("celeryd_concurrency"). A lower "parallelism" value will leave workers idle. If you increase "parallelism", you will get tasks backed up in your celery queue; this is great for high throughput and makes sure workers aren't starved for work but can slightly increase the latency of individual tasks. Once you have those values tuned, you also need to be aware of dag concurrency limits and task pool limits. Each DAG has a concurrency parameter, which defaults to the "dag_concurrency" configuration value. Each Task has a pool parameter, which defaults to `None`. These each decrease the number of running tasks below the parallelism limit. DAG concurrency behaves exactly the same as parallelism but sets a limit on the individual DAG. Depending on how many DAGs you have, when they're scheduled, etc., you will want to set the "dag_concurrency" to a level that doesn't prevent DAGs from stepping on each others' resources too much. You can also set individual limits for individual DAGs. Pools also behave the same as parallelism but operate across DAGs. They are useful for things like limiting the number of concurrent tasks hitting a single database. If you do not specify a pool, the tasks will be placed in the default pool, which has a size defined by the "non_pooled_task_slot_count" configuration variable. If you don't want to think too much about things, try: celeryd_concurrency = 4 or 8 parallelism = celeryd_concurrency * number of workers num_pooled_task_slot_count = parallelism dag_concurrency = parallelism / 2 and then pretend pools and individual dag concurrency limits don't exist On Wed, Jun 13, 2018 at 4:11 AM ramandu...@gmail.com <ramandu...@gmail.com> wrote: > Hi All, > > There seems to be couple of settings in airflow.cfg which controls the > number of tasks that can run in parallel on Airflow( "parallelism" and > "celeryd_concurrency") > In case of celeryExecutor which one is honoured. Do we need to set both or > setting only celeryd_concurrency would work. > > Thanks, > Raman > >
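
P.S. For what it's worth, here is a minimal sketch of how the per-DAG and per-task knobs above look in a DAG file (Airflow 1.x style). The DAG id, pool name, callable, and the "4 workers" sizing in the comments are made up for illustration; the pool itself has to be created beforehand (Admin -> Pools in the UI).

    # Cluster-wide limits live in airflow.cfg. Using the rough sizing above
    # with, say, 4 workers, that might look like:
    #
    #   [core]
    #   parallelism = 32
    #   dag_concurrency = 16
    #   non_pooled_task_slot_count = 32
    #
    #   [celery]
    #   celeryd_concurrency = 8

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        dag_id="example_limits",          # made-up DAG id
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        concurrency=8,                    # per-DAG cap; defaults to dag_concurrency
    )

    def load_rows():
        # placeholder body; imagine this writes to a shared database
        pass

    load = PythonOperator(
        task_id="load_to_db",
        python_callable=load_rows,
        pool="shared_db_pool",            # made-up pool; limits concurrent tasks across all DAGs
        dag=dag,
    )

Tasks assigned to "shared_db_pool" will never run more than that pool's slot count at once, regardless of how much headroom parallelism leaves.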