Hey all, I've upgraded in production. Things seem to be working so far (it's only been an hour), but I am seeing this in the scheduler logs:
File Path                            PID    Runtime  Last Runtime  Last Run
-----------------------------------  -----  -------  ------------  -------------------
...
/etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
...

It seems to be taking more than 15 minutes to parse this DAG. Any idea
what's causing this?

Scheduler config:

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
max_threads = 2
child_process_log_directory = /var/log/airflow/scheduler

The db.py file itself doesn't interact with any outside systems, so I
would not have expected it to take this long. It does, however,
programmatically generate many DAGs within the single .py file (the
general pattern is sketched at the end of this thread).

A snippet of the scheduler log is here:
https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65

Note the occasional 10-15 second gaps. Any idea what's going on?

Cheers,
Chris

On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> I created:
>
> - AIRFLOW-791: At start up all running dag_runs are being checked, but
>   not fixed
> - AIRFLOW-790: DagRuns do not exist for certain tasks, but don't get
>   fixed
> - AIRFLOW-788: Context unexpectedly added to hive conf
> - AIRFLOW-792: Allow fixing of schedule when wrong start_date /
>   interval was specified
>
> I created AIRFLOW-789 to update UPDATING.md; it is also out as a PR.
>
> Please note that I don't consider any of these to be blockers for a
> 1.8.0 release; they can be fixed in 1.8.1, so we are still on track for
> an RC on Feb 2. However, people who run a restarting scheduler
> (run_duration is set) and have a lot of running dag runs won't like
> AIRFLOW-791, so a workaround for it would be nice. (We just updated the
> dag_runs directly in the database to 'finished' before a certain date;
> a sketch of this appears at the end of this thread. We are also not
> using the run_duration, though.)
>
> Bolke
>
>
> > On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > Will do. And thanks.
> >
> > Adding another issue:
> >
> > * Some of our DAGs are not getting scheduled for some unknown reason.
> > Need to investigate why.
> >
> > Related, but not the root cause:
> > * Logging is so chatty that it gets really hard to find the real
> > issue.
> >
> > Bolke.
> >
> >> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davy...@airbnb.com.INVALID> wrote:
> >>
> >> I'd be happy to lend a hand fixing these issues, and hopefully some
> >> others are too. Do you mind creating JIRAs for these, since you have
> >> the full context? I have created a JIRA for (1) and have assigned it
> >> to myself:
> >> https://issues.apache.org/jira/browse/AIRFLOW-780
> >>
> >> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>
> >>> This is to report back on some of the (early) experiences we have
> >>> with Airflow 1.8.0 (beta 1 at the moment):
> >>>
> >>> 1. The UI does not show faulty DAGs, leading to confusion for
> >>> developers. When a faulty dag was placed in the dags folder, the UI
> >>> used to report a parsing error. Now it doesn't, because parsing
> >>> happens in a separate process that does not report errors back.
> >>>
> >>> 2. The hive hook sets 'airflow.ctx.dag_id' in hive. We run in a
> >>> secure environment, which requires this variable to be whitelisted
> >>> if it is modified (this needs to be added to UPDATING.md).
> >>>
> >>> 3. DagRuns do not exist for certain tasks, but don't get fixed. The
> >>> log gets flooded without any suggestion of what to do.
> >>>
> >>> 4. At start up, all running dag_runs are checked. We seemed to have
> >>> a lot of "left over" dag_runs (a couple of thousand):
> >>> - The checking was logged at INFO level, which forces an fsync for
> >>> every log message, making it very slow.
> >>> - The checking happened at every restart, yet the dag_runs' states
> >>> were not being updated.
> >>> - These dag_runs would never be marked as anything other than
> >>> running, for some reason.
> >>> -> Workaround applied: update all dag_runs before a certain date to
> >>> 'finished' directly in SQL.
> >>> -> Need to investigate why these dag_runs did not get marked
> >>> 'finished'/'failed'.
> >>>
> >>> 5. Our umask is set to 027.
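
A minimal sketch of the programmatic DAG-generation pattern Chris
describes above. The real db.py does not appear anywhere in this thread,
so TABLES, the dag ids, and the operator choice below are all
hypothetical. The point is that the scheduler re-executes the entire
module on every parse pass, so any per-DAG setup cost is multiplied by
the number of generated DAGs:

    # Hypothetical sketch, not the actual db.py: one module that
    # generates many DAGs. The scheduler re-runs this whole file on
    # every scan of the dags folder.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    TABLES = ['table_%d' % i for i in range(200)]  # assumed source list

    for table in TABLES:
        dag = DAG(
            dag_id='el_%s' % table,
            start_date=datetime(2017, 1, 1),
            schedule_interval=timedelta(hours=1),
        )
        DummyOperator(task_id='extract', dag=dag)
        # Exposing each DAG as a module-level global is how the
        # scheduler discovers it.
        globals()[dag.dag_id] = dag

With hundreds of generated DAGs, even a modest amount of module-level
work per DAG (loops, string building, heavy imports) can stretch a
single parse pass toward the runtimes shown in Chris's table.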
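And a hedged sketch of the dag_run workaround Bolke describes for
AIRFLOW-791: bulk-marking stale 'running' dag_runs as finished directly
in the metadata database. Bolke did this in raw SQL; the version below
goes through the ORM instead. The cutoff date is made up, and since
Airflow has no literal 'finished' state, State.SUCCESS is assumed here.
Back up the metadata database before attempting anything like this:

    # Assumed workaround sketch: mark all dag_runs still 'running'
    # before a cutoff date as 'success' so the scheduler stops
    # re-checking them at every restart.
    from datetime import datetime

    from airflow import settings
    from airflow.models import DagRun
    from airflow.utils.state import State

    session = settings.Session()
    cutoff = datetime(2017, 1, 1)  # hypothetical cutoff date

    (session.query(DagRun)
        .filter(DagRun.state == State.RUNNING)
        .filter(DagRun.execution_date < cutoff)
        .update({DagRun.state: State.SUCCESS},
                synchronize_session=False))
    session.commit()

Whether SUCCESS or FAILED is the right terminal state depends on what
those left-over runs actually did; the thread doesn't say.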