I was talking with Alex about the DB case offline; for those we could support
a force-refresh argument with an interval parameter.

Manifests would need to be hierarchical, but I feel like that would
inevitably spin out into a full-blown build system.

On Mon, Apr 24, 2017 at 3:02 PM, Arthur Wiedmer <arthur.wied...@gmail.com>
wrote:

> What if the DAG actually depends on configuration that only exists in a
> database and is retrieved by the Python code generating the DAG?
>
> Just asking because we have this case in production here. It is slowly
> changing, so it still fits within the Airflow framework, but you cannot
> just watch a file...
>
> Best,
> Arthur
>
> On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > Inotify can work without a daemon. Just fire a call to the API when a
> > file changes. Just a few lines in bash.
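A minimal sketch of that watch-and-notify loop, as a portable Python stand-in (inotify proper needs the `inotifywait` tool or a third-party binding, neither of which is assumed here; the class and its behavior are illustrative, not anything in Airflow):

```python
import os

class ChangeScanner:
    """Polling stand-in for the inotify idea. Each scan() returns the
    .py files whose modification time changed since the previous scan;
    the caller would then fire the refresh call for just those files."""

    def __init__(self, dag_folder):
        self.dag_folder = dag_folder
        self._mtimes = {}  # path -> last seen modification time

    def scan(self):
        changed = []
        for root, _dirs, files in os.walk(self.dag_folder):
            for name in sorted(files):
                if not name.endswith(".py"):
                    continue
                path = os.path.join(root, name)
                mtime = os.stat(path).st_mtime
                if self._mtimes.get(path) != mtime:
                    # New or modified file: report it and remember its mtime.
                    changed.append(path)
                    self._mtimes[path] = mtime
        return changed
```

With real inotify the kernel pushes events instead of us polling, but the bookkeeping (path, last change, fire once per change) is the same.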
> >
> > If you bundle your dependencies in a zip, you should be fine with the
> > above. Or if we start using manifests that list the files that are
> > needed in a dag...
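The manifest idea could look something like this (entirely hypothetical scheme and names; nothing like this exists in Airflow today): each dag id maps to the set of files it needs, so a changed file maps straight back to the dags worth reprocessing.

```python
def dags_affected_by(changed_file, manifests):
    """Hypothetical manifest lookup: 'manifests' maps each dag id to
    the set of file paths that dag needs. Returns the dag ids that
    should be reprocessed when 'changed_file' changes."""
    return sorted(
        dag_id
        for dag_id, files in manifests.items()
        if changed_file in files
    )
```

Usage would be e.g. `dags_affected_by("dags/lib/sql_helpers.py", manifests)` after the watcher reports a change.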
> >
> >
> > Sent from my iPhone
> >
> > > On 24 Apr 2017, at 22:46, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> > wrote:
> > >
> > > One idea to solve this is to use a daemon that uses inotify to watch
> > > for changes in files and then reprocesses just those files. The hard
> > > part is that without any kind of dependency/build system for DAGs, it
> > > can be hard to tell which DAGs depend on which files.
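A crude approximation of that dependency mapping, without any build system, is to trace a DAG file's imports with the stdlib `modulefinder` (this is a sketch of a workaround, not anything Airflow does; it misses dynamic imports and non-Python files, which is part of why the problem is hard):

```python
from modulefinder import ModuleFinder

def files_imported_by(dag_file):
    """Best-effort dependency discovery: run ModuleFinder over a DAG
    file and return the set of source files it pulls in via imports.
    Dynamic imports, data files, and zips are not seen."""
    finder = ModuleFinder()
    finder.run_script(dag_file)
    return {
        mod.__file__
        for mod in finder.modules.values()
        if getattr(mod, "__file__", None)
    }
```

Inverting that map (file -> DAG files importing it) would tell the daemon what to reprocess on a change.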
> > >
> > > On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra <gtoons...@gmail.com>
> > > wrote:
> > >
> > >> Hey,
> > >>
> > >> I've seen some people complain about DAG file processing times. An
> > >> issue was raised about this today:
> > >>
> > >> https://issues.apache.org/jira/browse/AIRFLOW-1139
> > >>
> > >> I attempted to provide a good explanation of what's going on. Feel
> > >> free to validate and comment.
> > >>
> > >>
> > >> I'm noticing that the file processor is a bit naive in the way it
> > >> reprocesses DAGs. It doesn't look at the DAG interval, for example,
> > >> so it looks like it reprocesses all files continuously in one big
> > >> batch, even if we can determine that the next "schedule" for all of
> > >> a file's dags is in the future.
> > >>
> > >>
> > >> Wondering if a change in the DagFileProcessingManager could optimize
> > >> things a bit here.
> > >>
> > >> In the part where it gets the simple_dags from a file it's currently
> > >> processing:
> > >>
> > >>                for simple_dag in processor.result:
> > >>                    simple_dags.append(simple_dag)
> > >>
> > >> The file_path is in the context, and the simple_dags should be able
> > >> to provide the next interval date for each dag in the file.
> > >>
> > >> The idea is to add files to a sorted deque keyed by
> > >> "next_schedule_datetime" (the minimum next interval date across the
> > >> file's dags), so that when we build the "files_paths_to_queue" list,
> > >> we can remove files whose dags we know won't have a new dagrun for a
> > >> while.
> > >>
> > >> One gotcha to resolve after that is dealing with files getting
> > >> updated with new dags, changed dag definitions, renames, and
> > >> different interval schedules.
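The proposal above could be sketched roughly like this, using a min-heap rather than a literal sorted deque (class and method names are illustrative, not Airflow's API): files are keyed by the minimum next interval date of their dags, so building the queue can skip files with no dagrun due yet.

```python
import heapq
from datetime import datetime, timedelta

class FileScheduleIndex:
    """Sketch of skipping files whose dags have no upcoming dagrun.
    Each file is keyed by the earliest next interval date among the
    dags it defines."""

    def __init__(self):
        self._heap = []  # (next_schedule_datetime, file_path)

    def record(self, file_path, next_run_dates):
        # next_run_dates: the next interval date for each dag in the file.
        heapq.heappush(self._heap, (min(next_run_dates), file_path))

    def files_due(self, now):
        # Pop every file whose earliest dag schedule is not in the future;
        # everything left on the heap can be skipped this cycle.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

The gotchas mentioned (new dags, renames, changed intervals) would mean re-`record`-ing a file after each actual parse, since the heap only knows what the last parse saw.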
> > >>
> > >> Worth a PR to glance over?
> > >>
> > >> Rgds,
> > >>
> > >> Gerard
> > >>
> >
>
