I've currently deployed my solution, which includes Airflow, to Google
Cloud Platform. I have the following deployed:
* One instance: airflow web server & scheduler running
* 100 worker instances: each piece of work is resource-intensive and
takes about 5 minutes.
* Using the Celery Executor
* Airflow Version: 1.7.1.2

I would like all workers to run in parallel, each processing one piece of
work at a time. Each worker runs 2 DAGs, and each DAG has 1 Task/Operator
(duration ~5 minutes). The Operator records in my Postgres DB when it's
'WORKING' and what it's working on, and when it's 'IDLE'. I start each
worker with a concurrency of 1 and a specific Airflow queue for both DAGs;
the concurrency of 1 is for managing resources.

The highest worker concurrency I've seen is about 20 (of 100) workers
'WORKING' simultaneously; on average ~12 instances are working.

Airflow Configuration:
* parallelism = 100
* dag_concurrency = 100
* max_active_runs_per_dag = 100
* I see on the scheduler instance, 1 (of 4) CPUs pegged at 100%; it
switches between CPUs, but always 99-100%.
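For reference, the relevant airflow.cfg keys look roughly like this (a
sketch of the 1.7-era settings; the [celery] value reflects the per-worker
concurrency of 1 that I also pass when starting the workers):

```ini
[core]
parallelism = 100
dag_concurrency = 100
max_active_runs_per_dag = 100

[celery]
celeryd_concurrency = 1
```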

DAG Definition:

from datetime import datetime, timedelta

from airflow import DAG

yesterday = datetime.combine(datetime.today() - timedelta(1),
                             datetime.min.time())
airflow_queue = 'airflow_worker'
default_args = {
    'owner': 'me',
    'depends_on_past': False,
    'start_date': yesterday,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG('name-of-my-dag',
          default_args=default_args,
          schedule_interval=timedelta(minutes=1))
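The tasks get routed to the dedicated workers via the operator's `queue`
argument. A sketch of that wiring (PythonOperator and the `do_work`
callable stand in for my actual operator, which is not shown here):

```python
from airflow.operators.python_operator import PythonOperator

def do_work(**kwargs):
    # ~5 minutes of resource-intensive work; marks the worker row
    # WORKING on entry and IDLE on exit, as described above.
    pass

task = PythonOperator(
    task_id='do-work',
    python_callable=do_work,
    provide_context=True,
    queue=airflow_queue,  # pin this task to the 'airflow_worker' queue
    dag=dag)
```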

The more I look at this, the more it appears the scheduler "stalls"
(sometimes within about 5 minutes). I've put in a restart of my
airflow-scheduler service every 15 minutes, and this appears to jump-start
processing again. By the end of each 15-minute window, the number of
'WORKING' processors has dropped to 1.

My Worker Database:
              now
-------------------------------
 2016-06-12 20:32:06.582423+00

select status, count(*), min(last_modified), max(last_modified) from worker
where simon_says = 'WORK' group by status;
 status  | count |            min             |            max
---------+-------+----------------------------+----------------------------
 IDLE    |    92 | 2016-06-12 19:27:45.476924 | 2016-06-12 20:31:12.896776
 WORKING |     8 | 2016-06-12 19:54:08.796312 | 2016-06-12 20:31:44.461265

There's one instance that hasn't done anything for about an hour.

-- 
*Randy How*
i-cubed: information integration & imaging, LLC
1600 Prospect Park Way, Suite 109
Fort Collins, CO 80525 | Office: +1-970-482-4400 | Desk: +1-970-372-6180
[email protected] | www.i3.com
