[ https://issues.apache.org/jira/browse/AIRFLOW-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089818#comment-16089818 ]
Rick Otten edited comment on AIRFLOW-401 at 7/17/17 1:28 PM:
-------------------------------------------------------------
According to the logs, the scheduler has restarted itself regularly ever since we set it up; restarts were more frequent before we added a large number of tasks. This is what the logs look like during a routine restart (which cleans up all of the dead children):
{code:none}
Jul 15 12:54:38 myserver airflow[4467]: [2017-07-15 12:54:38,253] {jobs.py:1313} INFO - Exited execute loop
Jul 15 12:54:43 myserver systemd[1]: airflow-scheduler.service: Service hold-off time over, scheduling restart.
Jul 15 12:54:43 myserver systemd[1]: Stopped Airflow scheduler daemon.
Jul 15 12:54:43 myserver systemd[1]: Started Airflow scheduler daemon.
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,364] {__init__.py:57} INFO - Using executor LocalExecutor
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,450] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python3.5/lib2to3/Grammar.txt
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,476] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python3.5/lib2to3/PatternGrammar.txt
Jul 15 12:54:44 myserver airflow[20289]: ____________ _____________
Jul 15 12:54:44 myserver airflow[20289]: ____ |__( )_________ __/__ /________ __
Jul 15 12:54:44 myserver airflow[20289]: ____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / /
Jul 15 12:54:44 myserver airflow[20289]: ___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ /
Jul 15 12:54:44 myserver airflow[20289]: _/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/
Jul 15 12:54:44 myserver airflow[20289]:
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,777] {jobs.py:1264} INFO - Starting the scheduler
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1280} INFO - Processing files using up to 4 processes at a time
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1282} INFO - Running execute loop for -1 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1284} INFO - Processing each file at most 100 times
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1286} INFO - Process each file at most once every 120 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1288} INFO - Checking for new files in /home/airflow/airflow/dags every 300 seconds
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,778] {jobs.py:1291} INFO - Searching for files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,779] {jobs.py:1294} INFO - There are 20 files in /home/airflow/airflow/dags
Jul 15 12:54:44 myserver airflow[20289]: /home/airflow/airflow/lib/python3.5/site-packages/airflow/www/app.py:23: FlaskWTFDeprecationWarning: "flask_wtf.CsrfProtect" has been renamed to "CSRFProtect" and will be removed in 1.0.
Jul 15 12:54:44 myserver airflow[20289]: csrf = CsrfProtect()
Jul 15 12:54:44 myserver airflow[20289]: [2017-07-15 12:54:44,994] {jobs.py:1356} INFO - Resetting state for orphaned tasks
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,008] {jobs.py:1366} INFO - Resetting backup_clarivoypg 2017-07-14 06:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,018] {jobs.py:1366} INFO - Resetting daily_web_response_processor 2017-07-14 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,023] {jobs.py:1366} INFO - Resetting ping_pgbouncers 2017-07-15 09:00:00
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,027] {jobs.py:1407} INFO - Heartbeating the process manager
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,032] {dag_processing.py:627} INFO - Started a process (PID: 20462) to generate tasks for /home/airflow/airflow/dags/my_daily.py - logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_daily.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,036] {dag_processing.py:627} INFO - Started a process (PID: 20463) to generate tasks for /home/airflow/airflow/dags/my_hourly.py - logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_hourly.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,042] {dag_processing.py:627} INFO - Started a process (PID: 20464) to generate tasks for /home/airflow/airflow/dags/my_odd_job.py - logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,047] {dag_processing.py:627} INFO - Started a process (PID: 20465) to generate tasks for /home/airflow/airflow/dags/my_other_odd_job.py - logging into /home/airflow/airflow/logs/scheduler/2017-07-15/my_other_odd_job.py.log
Jul 15 12:54:45 myserver airflow[20289]: [2017-07-15 12:54:45,049] {jobs.py:1443} INFO - Heartbeating the executor
{code}


> scheduler gets stuck without a trace
> ------------------------------------
>
>                 Key: AIRFLOW-401
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-401
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executor, scheduler
>    Affects Versions: Airflow 1.7.1.3
>            Reporter: Nadeem Ahmed Nazeer
>            Assignee: Bolke de Bruin
>            Priority: Minor
>              Labels: celery, kombu
>         Attachments: Dag_code.txt, schduler_cpu100%.png, scheduler_stuck_7hours.png, scheduler_stuck.png
>
>
> The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop.
> The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both Celery and Local executors but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.
> Scheduler configs:
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> executor = LocalExecutor
> parallelism = 32
> Please help. I would be happy to provide any other information needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
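Editor's note: the systemd messages in the log above ("Service hold-off time over, scheduling restart" followed by Stopped/Started) are what a `Restart=` policy produces when the scheduler process exits on its own, e.g. after reaching the loop limit set with `-n`/`--num_runs`. A unit file along the following lines would produce that restart pattern; this is an illustrative sketch only, and the paths, user name, and `RestartSec` value are assumptions, since the reporter's actual unit file is not shown in this issue:

```ini
# /etc/systemd/system/airflow-scheduler.service -- illustrative sketch,
# not the reporter's actual configuration.
[Unit]
Description=Airflow scheduler daemon
After=network.target

[Service]
User=airflow
# -n (--num_runs) makes the scheduler exit after that many scheduler loops,
# so systemd restarts a fresh process and the dead children are cleaned up.
ExecStart=/home/airflow/airflow/bin/airflow scheduler -n 3
Restart=always
# The "hold-off time over" message refers to this delay between the
# scheduler exiting and systemd starting it again.
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

Under such a configuration the periodic restarts in the log are expected behaviour rather than a crash, which matches the commenter's observation that the restarts are routine.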