So I changed the scheduler heartbeat to 60 sec:

    [scheduler]
    job_heartbeat_sec = 5
    scheduler_heartbeat_sec = 60
As expected, we now see spiky CPU utilization. But I see a different issue (not so much a
problem, but I'm noting it here in case it helps someone who needs to do something similar).

Suppose I have a DAG with tasks as below:

    A --> B --> C

If A starts and completes its execution in 10 seconds, the next task B cannot start until the
next scheduler heartbeat. That means I may have to wait roughly 60 seconds in the worst case.
So one option is to set the heartbeat to around 30 seconds (or the duration of whatever you
believe is your shortest task). This is just a rough optimization to make sure we make progress
soon enough after a task ends. (A sketch of such a DAG appears after the quoted thread below.)

Thanks,
Harish

On Mon, Jun 13, 2016 at 9:50 AM, harish singh <[email protected]> wrote:

> yup. it is the scheduler process using cpu:
>
> The below is the usage with the default out-of-the-box heartbeat settings.
> I will paste some more numbers as I play with airflow.cfg.
> This is the output of `top`:
>
>   PID USER     PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>    65 airflow  20   0  445488 112648  13308 R  61.3  0.7 834:21.23 airflow
>     1 airflow  20   0   17968   1912   1668 S   0.0  0.0   0:00.04 entrypoint.sh
>    64 airflow  20   0  407972  74876  13328 S   0.0  0.5   0:01.12 airflow
>   154 airflow  20   0  404404  62528   4344 S   0.0  0.4   0:00.11 airflow
>   155 airflow  20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
>   156 airflow  20   0  404404  62532   4344 S   0.0  0.4   0:00.12 airflow
>   157 airflow  20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
>   158 airflow  20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
>   159 airflow  20   0  404404  62532   4344 S   0.0  0.4   0:00.11 airflow
>   160 airflow  20   0  404404  62488   4300 S   0.0  0.4   0:00.12 airflow
>   161 airflow  20   0  404404  62468   4280 S   0.0  0.4   0:00.11 airflow
>   163 airflow  20   0   50356  16680   5956 S   0.0  0.1   0:08.86 gunicorn: maste
>   168 airflow  20   0  457204 119940  13080 S   0.0  0.8   0:18.86 gunicorn: worke
>   170 airflow  20   0  463168 126028  13080 S   0.0  0.8   0:13.81 gunicorn: worke
>   171 airflow  20   0  464936 127672  13080 S   0.0  0.8   0:08.53 gunicorn: worke
>   174 airflow  20   0  467460 130240  13080 S   0.0  0.8   0:08.52 gunicorn: worke
>  7627 airflow  20   0   18208   3192   2672 S   0.0  0.0   0:00.04 bash
>  8091 airflow  20   0   30312   7556   4784 S   0.0  0.0   0:00.00 python
> 28808 airflow  20   0   18208   3352   2836 S   0.0  0.0   0:00.08 bash
> 28822 airflow  20   0   19844   2340   2020 R   0.0  0.0   0:00.01 top
>
> PID 65 is the scheduler.
>
> airflow@80be4e775e55:~$ ps -ef
> UID        PID   PPID  C STIME TTY      TIME CMD
> airflow      1      0  0 Jun12 ?    00:00:00 /bin/bash ./entrypoint.sh
> airflow     64      1  0 Jun12 ?    00:00:01 /usr/bin/python /usr/local/bin/airflow webserver -p 8080
> airflow     65      1 62 Jun12 ?    13:54:22 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    154     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    155     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    156     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    157     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    158     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    159     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    160     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    161     65  0 Jun12 ?    00:00:00 /usr/bin/python /usr/local/bin/airflow scheduler
> airflow    163     64  0 Jun12 ?    00:00:08 gunicorn: master [airflow.www.app:cached_app()]
> airflow    168    163  0 Jun12 ?    00:00:18 gunicorn: worker [airflow.www.app:cached_app()]
> airflow    170    163  0 Jun12 ?    00:00:13 gunicorn: worker [airflow.www.app:cached_app()]
> airflow    171    163  0 Jun12 ?    00:00:08 gunicorn: worker [airflow.www.app:cached_app()]
> airflow    174    163  0 Jun12 ?    00:00:08 gunicorn: worker [airflow.www.app:cached_app()]
> airflow   7627      0  0 Jun12 ?    00:00:00 bash
> airflow   8091   7627  0 Jun12 ?    00:00:00 python
> airflow  28808      0  0 16:44 ?    00:00:00 bash
> airflow  28823  28808  0 16:44 ?    00:00:00 ps -ef
>
> On Mon, Jun 13, 2016 at 8:42 AM, Maxime Beauchemin <[email protected]> wrote:
>
>> Can you confirm that it's the scheduler process using that CPU?
>>
>> The SCHEDULER_HEARTBEAT_SEC configuration defines a minimum duration for
>> scheduling cycles, where the scheduler evaluates all active DagRuns and
>> attempts to kick off task instances whose dependencies are met. Once the
>> cycle is done, the scheduler should sleep until the next heartbeat, so CPU
>> usage should look spiky.
>>
>> Max
>>
>> On Mon, Jun 13, 2016 at 8:26 AM, harish singh <[email protected]> wrote:
>>
>> > Yup, I tried changing the scheduler heartbeat to 60 seconds.
>> > Apart from not getting any update for 60 seconds, what are the side
>> > effects of changing the two heartbeats? It shouldn't impact performance,
>> > should it?
>> >
>> > Also, I understand this CPU usage if there are 100s of DAGs. But with
>> > just one active DAG, doesn't 70% seem high? Especially in my case, where
>> > there are only 10 tasks in the DAG, all just making curl calls
>> > (BashOperators).
>> >
>> > Also, as a side note, in a different environment where we have 10 DAGs
>> > active, the CPU usage stays in the same 70-80% range.
>> >
>> > On Mon, Jun 13, 2016, 8:14 AM Maxime Beauchemin <[email protected]> wrote:
>> >
>> > > The scheduler constantly attempts to schedule tasks, interacting with
>> > > the database and reloading DAG definitions. In most large-ish
>> > > environments, burning up to a CPU to run the scheduler doesn't seem
>> > > outrageous to me.
>> > >
>> > > If you want to reduce the CPU load related to the scheduler, check out
>> > > SCHEDULER_HEARTBEAT_SEC and MAX_THREADS in the scheduler section of
>> > > `airflow.cfg`.
>> > >
>> > > Max
>> > >
>> > > On Sun, Jun 12, 2016 at 1:24 PM, harish singh <[email protected]> wrote:
>> > >
>> > > > Hi guys,
>> > > >
>> > > > We are running airflow (for about 3 months now) inside a docker
>> > > > container on aws.
>> > > >
>> > > > I just did a docker stats to check what's going on. The cpu
>> > > > consumption is huge.
>> > > > We have around 15 DAGs. Only one DAG is turned ON; the remaining are
>> > > > OFF. The DAG runs on an HOURLY schedule.
>> > > >
>> > > > Right now, airflow is consuming almost 1 complete core.
>> > > > It seems there is some unnecessary spinning?
>> > > > This doesn't look like the right behavior.
>> > > > Is there a bug already filed for this? Or perhaps there is something
>> > > > incorrect in the way I am using the airflow configuration; I am not sure.
>> > > > CONTAINER  CPU %    MEM USAGE / LIMIT     MEM %    NET I/O                BLOCK I/O
>> > > > CCC        68.17%   619.7 MB / 2.147 GB   28.85%   1.408 GB / 939.4 MB    7.856 MB / 0 B
>> > > > XXX        64.36%   619.4 MB / 2.147 GB   28.84%   1.211 GB / 807.6 MB    7.856 MB / 0 B
>> > > >
>> > > > Airflow version 1.7.0
>> > > >
>> > > > Airflow config:
>> > > >
>> > > > sql_alchemy_pool_size = 5
>> > > > sql_alchemy_pool_recycle = 3600
>> > > > parallelism = 8
>> > > > dag_concurrency = 8
>> > > > max_active_runs_per_dag = 8
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Harish
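For reference, here is a minimal sketch of the A --> B --> C chain described at the top of this
thread, written against the Airflow 1.7-era API. The dag_id, commands, start date, and default
args are illustrative assumptions, not the actual DAG from this thread:

    # Illustrative sketch only: dag_id, bash commands, and dates are made up.
    # Imports follow the Airflow 1.7-era module layout.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "airflow",
        "start_date": datetime(2016, 6, 1),
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="example_chain",        # hypothetical name
        default_args=default_args,
        schedule_interval="@hourly",   # the DAG in this thread runs hourly
    )

    # Each task is a short curl call, as described in the thread.
    a = BashOperator(task_id="A", bash_command="curl -s http://example.com/a", dag=dag)
    b = BashOperator(task_id="B", bash_command="curl -s http://example.com/b", dag=dag)
    c = BashOperator(task_id="C", bash_command="curl -s http://example.com/c", dag=dag)

    # A --> B --> C: even if A finishes in 10 seconds, B is only queued on the
    # next scheduler pass, hence the worst-case wait of ~scheduler_heartbeat_sec.
    a.set_downstream(b)
    b.set_downstream(c)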

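To illustrate the pacing Max describes above (one scheduling pass per heartbeat, then sleep),
here is a simplified, standalone sketch. This is not Airflow's actual scheduler code; the
function names are made up, and it only shows why CPU looks spiky and why a downstream task can
wait up to roughly one heartbeat after its upstream task finishes:

    # Simplified illustration of a heartbeat-paced loop; not the real scheduler.
    import time

    SCHEDULER_HEARTBEAT_SEC = 60  # the value discussed in this thread


    def run_scheduling_pass():
        """Stand-in for one cycle: evaluate active DAG runs and queue any
        task instances whose upstream dependencies are now met."""
        pass


    def scheduler_loop():
        while True:
            started = time.time()
            run_scheduling_pass()          # CPU spikes here
            elapsed = time.time() - started
            # Sleep off the remainder of the heartbeat, so CPU idles between
            # passes. A task finishing right after a pass waits up to
            # ~SCHEDULER_HEARTBEAT_SEC before its downstream task is considered.
            time.sleep(max(0, SCHEDULER_HEARTBEAT_SEC - elapsed))


    if __name__ == "__main__":
        scheduler_loop()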