I was talking with Alex about the DB case offline; for those we could support a force-refresh argument with an interval parameter.
Manifests would need to be hierarchical, but I feel like it would inevitably spin out into a full-blown build system.

On Mon, Apr 24, 2017 at 3:02 PM, Arthur Wiedmer <arthur.wied...@gmail.com> wrote:

> What if the DAG actually depends on configuration that only exists in a
> database and is retrieved by the Python code generating the DAG?
>
> Just asking because we have this case in production here. It is slowly
> changing, so still fits within the Airflow framework, but you cannot just
> watch a file...
>
> Best,
> Arthur
>
> On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > Inotify can work without a daemon. Just fire a call to the API when a
> > file changes. Just a few lines in bash.
> >
> > If you bundle your dependencies in a zip, you should be fine with the
> > above. Or if we start using manifests that list the files that are
> > needed in a dag...
> >
> > Sent from my iPhone
> >
> > > On 24 Apr 2017, at 22:46, Dan Davydov <dan.davy...@airbnb.com.INVALID>
> > > wrote:
> > >
> > > One idea to solve this is to use a daemon that uses inotify to watch
> > > for changes in files and then reprocesses just those files. The hard
> > > part is that without any kind of dependency/build system for DAGs, it
> > > can be hard to tell which DAGs depend on which files.
> > >
> > > On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra <gtoons...@gmail.com>
> > > wrote:
> > >
> > >> Hey,
> > >>
> > >> I've seen some people complain about DAG file processing times. An
> > >> issue was raised about this today:
> > >>
> > >> https://issues.apache.org/jira/browse/AIRFLOW-1139
> > >>
> > >> I attempted to provide a good explanation of what's going on. Feel
> > >> free to validate and comment.
> > >>
> > >> I'm noticing that the file processor is a bit naive in the way it
> > >> reprocesses DAGs. It doesn't look at the DAG interval, for example,
> > >> so it looks like it reprocesses all files continuously in one big
> > >> batch, even if we can determine that the next "schedule" for all of a
> > >> file's dags is in the future.
> > >>
> > >> I'm wondering if a change in the DagFileProcessingManager could
> > >> optimize things a bit here.
> > >>
> > >> In the part where it gets the simple_dags from a file it's currently
> > >> processing:
> > >>
> > >>     for simple_dag in processor.result:
> > >>         simple_dags.append(simple_dag)
> > >>
> > >> the file_path is in the context, and the simple_dags should be able
> > >> to provide the next interval date for each dag in the file.
> > >>
> > >> The idea is to add files to a sorted deque by "next_schedule_datetime"
> > >> (the minimum next interval date), so that when we build the list
> > >> "files_paths_to_queue", we can remove files whose dags we know won't
> > >> have a new dagrun for a while.
> > >>
> > >> One gotcha to resolve after that is dealing with files getting
> > >> updated with new dags, changed dag definitions, renames, and
> > >> different interval schedules.
> > >>
> > >> Worth a PR to glance over?
> > >>
> > >> Rgds,
> > >>
> > >> Gerard
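For what it's worth, Gerard's "sorted deque by next_schedule_datetime" idea could be sketched roughly like below. This is a toy illustration, not actual DagFileProcessingManager code: the FileProcessQueue class and its method names are invented here, and it uses a heap rather than a deque so re-parsed files can be re-inserted in order.

```python
import heapq
from datetime import datetime, timedelta

class FileProcessQueue:
    """Toy sketch: order file paths by the earliest next schedule of any
    DAG they contain, and only queue files whose next schedule is due."""

    def __init__(self):
        self._heap = []    # (next_schedule_datetime, file_path), min-first
        self._latest = {}  # file_path -> most recent datetime, to detect stale entries

    def record_result(self, file_path, next_schedule_datetime):
        # Called after a file has been processed; next_schedule_datetime is
        # the minimum next interval date across all dags in that file.
        self._latest[file_path] = next_schedule_datetime
        heapq.heappush(self._heap, (next_schedule_datetime, file_path))

    def files_paths_to_queue(self, now):
        # Pop every file whose earliest next schedule is not in the future;
        # files further down the heap can be skipped for a while.
        due = []
        while self._heap and self._heap[0][0] <= now:
            when, path = heapq.heappop(self._heap)
            if self._latest.get(path) == when:  # skip superseded (stale) entries
                due.append(path)
        return due
```

The stale-entry check is one cheap way to handle the gotcha mentioned above: when a file is re-parsed and its dags (and hence its next interval date) change, a fresh entry is pushed and the old one is ignored when it surfaces.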