Hey all,

I've upgraded on production. Things seem to be working so far (it's only
been an hour), but I am seeing this in the scheduler logs:

File Path                            PID    Runtime  Last Runtime  Last Run
-----------------------------------  -----  -------  ------------  -------------------
...
/etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
...

It seems to be taking more than 15 minutes to parse this DAG file. Any idea
what's causing this? Scheduler config:

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
max_threads = 2
child_process_log_directory = /var/log/airflow/scheduler

The db.py file itself doesn't interact with any outside systems, so I
wouldn't have expected it to take this long. It does, however,
programmatically generate many DAGs within the single .py file (a rough
sketch of the pattern is below).
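To give a sense of the pattern, here is a simplified sketch (not the real
file; the table list, the DummyOperator, the dag_id naming, and the schedule
are all placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Placeholder list - the real file derives many more entries programmatically.
TABLES = ['table_a', 'table_b']

for table in TABLES:
    dag_id = 'el_%s' % table
    dag = DAG(dag_id,
              start_date=datetime(2017, 1, 1),
              schedule_interval='@hourly')
    DummyOperator(task_id='load_%s' % table, dag=dag)
    # the scheduler discovers DAGs through module-level globals
    globals()[dag_id] = dag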

A snippet of the scheduler log is here:

https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65

Note how there are 10-15 second gaps occasionally. Any idea what's going on?

Cheers,
Chris

On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> I created:
>
> - AIRFLOW-791: At start up all running dag_runs are being checked, but not
> fixed
> - AIRFLOW-790: DagRuns do not exist for certain tasks, but don’t get fixed
> - AIRFLOW-788: Context unexpectedly added to hive conf
> - AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval
> was specified
>
> I created AIRFLOW-789 to update UPDATING.md; it is also out as a PR.
>
> Please note that I don't consider any of these blockers for a release of
> 1.8.0 and they can be fixed in 1.8.1 - so we are still on track for an RC on
> Feb 2. However, if people are using a restarting scheduler (run_duration is
> set) and have a lot of running dag runs, they won’t like AIRFLOW-791. So a
> workaround for this would be nice (we just updated dag_runs directly in the
> database to say ‘finished’ before a certain date, but we are also not using
> the run_duration).
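>
> Roughly, something like the following (a sketch via the ORM against the
> metadata DB - we actually ran plain SQL, and the cutoff date and the
> 'success' state here are just examples):
>
> from datetime import datetime
>
> from airflow import settings
> from airflow.models import DagRun
>
> session = settings.Session()
> cutoff = datetime(2017, 1, 1)  # example cutoff date
> # mark stale "running" dag_runs older than the cutoff as done
> session.query(DagRun).filter(
>     DagRun.state == 'running',
>     DagRun.execution_date < cutoff,
> ).update({DagRun.state: 'success'}, synchronize_session=False)
> session.commit()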
>
> Bolke
>
>
>
> > On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > Will do. And thanks.
> >
> > Adding another issue:
> >
> > * Some of our DAGs are not getting scheduled for some unknown reason.
> > Need to investigate why.
> >
> > Related but not root cause:
> > * Logging is so chatty that it gets really hard to find the real issue
> >
> > Bolke.
> >
> >> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> wrote:
> >>
> >> I'd be happy to lend a hand fixing these issues and hopefully some
> others
> >> are too. Do you mind creating jiras for these since you have the full
> >> context? I have created a JIRA for (1) and have assigned it to myself:
> >> https://issues.apache.org/jira/browse/AIRFLOW-780
> >>
> >> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbr...@gmail.com>
> wrote:
> >>
> >>> This is to report back on some of the (early) experiences we have with
> >>> Airflow 1.8.0 (beta 1 at the moment):
> >>>
> >>> 1. The UI does not show a faulty DAG, leading to confusion for
> >>> developers. Previously, when a faulty dag was placed in the dags folder
> >>> the UI would report a parsing error. Now it doesn’t, due to the separate
> >>> parsing process (which does not report errors back).
> >>>
> >>> 2. The hive hook sets ‘airflow.ctx.dag_id’ in hive
> >>> We run in a secure environment which requires this variable to be
> >>> whitelisted if it is modified (needs to be added to UPDATING.md)
> >>>
> >>> 3. DagRuns do not exist for certain tasks, but don’t get fixed
> >>> The log gets flooded without any suggestion of what to do
> >>>
> >>> 4. At startup all running dag_runs are checked; we seemed to have a
> >>> lot of “left over” dag_runs (a couple of thousand)
> >>> - Checking was logged at INFO -> requires an fsync for every log
> >>> message, making it very slow
> >>> - Checking would happen at every restart, but the dag_runs’ states were
> >>> not being updated
> >>> - These dag_runs would never be marked anything other than running, for
> >>> some reason
> >>> -> Applied a workaround to update all dag_runs in SQL before a certain
> >>> date to “finished”
> >>> -> Need to investigate why these dag_runs did not get marked
> >>> “finished/failed”
> >>>
> >>> 5. Our umask is set to 027
> >>>
> >>>
> >
>
>
