[I wrote this while offline without having received the full conversation,
sorry if it's a bit off and looks like it's disregarding previous comments]

On Mon, Apr 24, 2017 at 4:09 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> With configuration as code, you can't really tell whether the DAG
> definition has changed just from whether the module file was altered. The
> Python module could be importing other modules that have been changed,
> could have read a config file somewhere on the drive that might have
> changed, or read from a DB that is constantly getting mutated.
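>
> For illustration, a minimal sketch (paths and names hypothetical): this
> module never changes on disk, yet the DAG it builds can differ on every
> parse.
>
>     import json
>     from datetime import datetime
>
>     from airflow import DAG
>     from airflow.operators.dummy_operator import DummyOperator
>
>     # This file is mutated out-of-band; the .py module itself never changes.
>     with open("/etc/pipelines/tables.json") as f:
>         tables = json.load(f)
>
>     dag = DAG("dynamic_example", start_date=datetime(2017, 1, 1))
>     for table in tables:
>         DummyOperator(task_id="load_%s" % table, dag=dag)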
>
> There are also issues around the fact that Python caches modules in
> `sys.modules`, so even though the crawler is re-interpreting modules,
> imported modules wouldn't get re-interpreted [as our DAG authors expected]
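>
> A quick illustration of the caching behavior (the dag_helpers module is
> hypothetical):
>
>     import importlib, sys
>
>     import dag_helpers   # module body executes once, cached in sys.modules
>     import dag_helpers   # no-op: Python hands back the cached module
>
>     assert sys.modules["dag_helpers"] is dag_helpers
>
>     # To genuinely re-read the file you'd have to reload it explicitly,
>     # or drop it from the cache before the next import:
>     importlib.reload(dag_helpers)
>     # del sys.modules["dag_helpers"]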
>
> For these reasons [and others I won't get into here], we decided that the
> scheduler would use a subprocess pool and re-interpret the DAGs from
> scratch at every cycle, insulating the different DAGs and guaranteeing no
> interpreter caching.
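>
> A minimal sketch of that approach, not the actual scheduler code: spawned,
> single-use workers give every file a clean interpreter and a clean
> sys.modules.
>
>     import multiprocessing as mp
>
>     def parse_dag_file(path):
>         # Runs in a fresh child process, so every import in the DAG file
>         # (and anything it imports) is re-executed from scratch.
>         namespace = {"__file__": path}
>         with open(path) as f:
>             exec(compile(f.read(), path, "exec"), namespace)
>         return path
>
>     if __name__ == "__main__":
>         files = ["/path/to/dags/example.py"]  # hypothetical
>         ctx = mp.get_context("spawn")  # don't inherit the parent's modules
>         with ctx.Pool(processes=4, maxtasksperchild=1) as pool:
>             pool.map(parse_dag_file, files)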
>
> Side note: yaml parsing is much more expensive than parsing other markup
> languages, so I'd recommend avoiding it for storing DAG configuration. Our
> longest-to-parse DAGs at Airbnb were reading yaml to build a DAG, and I
> believe someone wrote custom logic to avoid reparsing the yaml at every
> cycle. Parsing equivalent json or hocon was an order of magnitude faster.
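>
> This is easy to check against your own config; a rough benchmark sketch
> (file name hypothetical, assumes PyYAML is installed and both documents
> hold the same data):
>
>     import json, timeit
>     import yaml
>
>     with open("dag_config.json") as f:
>         raw_json = f.read()
>     raw_yaml = yaml.safe_dump(json.loads(raw_json))
>
>     print("json:", timeit.timeit(lambda: json.loads(raw_json), number=100))
>     print("yaml:", timeit.timeit(lambda: yaml.safe_load(raw_yaml), number=100))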
>
> Max
>
> On Mon, Apr 24, 2017 at 2:55 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> Inotify can work without a daemon. Just fire a call to the API when a
>> file changes. Just a few lines in bash.
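>>
>> The same idea sketched in Python rather than bash, using the watchdog
>> library; the refresh endpoint here is hypothetical, not a real Airflow API:
>>
>>     import requests
>>     from watchdog.events import FileSystemEventHandler
>>     from watchdog.observers import Observer
>>
>>     class DagFileChanged(FileSystemEventHandler):
>>         def on_modified(self, event):
>>             if event.src_path.endswith(".py"):
>>                 # Tell the scheduler to reprocess just this file.
>>                 requests.post("http://localhost:8080/api/dag_refresh",
>>                               json={"file": event.src_path})
>>
>>     observer = Observer()
>>     observer.schedule(DagFileChanged(), "/path/to/dags", recursive=True)
>>     observer.start()
>>     observer.join()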
>>
>> If you bundle your dependencies in a zip, you should be fine with the
>> above. Or if we start using manifests that list the files that are needed
>> in a dag...
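>>
>> A hypothetical manifest could be as simple as a mapping from each dag file
>> to the files it depends on (nothing like this exists in Airflow today):
>>
>>     MANIFEST = {
>>         "dags/my_dag.py": ["helpers/transforms.py", "config/tables.yaml"],
>>     }
>>
>>     def dag_files_to_reprocess(changed_path):
>>         """Every dag file whose manifest mentions the changed file."""
>>         return [dag for dag, deps in MANIFEST.items()
>>                 if changed_path == dag or changed_path in deps]
>>
>>     print(dag_files_to_reprocess("helpers/transforms.py"))
>>     # -> ['dags/my_dag.py']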
>>
>>
>> > On 24 Apr 2017, at 22:46, Dan Davydov <dan.davy...@airbnb.com.INVALID>
>> wrote:
>> >
>> > One idea to solve this is to use a daemon that uses inotify to watch for
>> > changes in files and then reprocess just those files. The hard part is
>> > that, without any kind of dependency/build system for DAGs, it's
>> > difficult to tell which DAGs depend on which files.
>> >
>> > On Mon, Apr 24, 2017 at 1:21 PM, Gerard Toonstra <gtoons...@gmail.com>
>> > wrote:
>> >
>> >> Hey,
>> >>
>> >> I've seen some people complain about DAG file processing times. An issue
>> >> was raised about this today:
>> >>
>> >> https://issues.apache.org/jira/browse/AIRFLOW-1139
>> >>
>> >> I attempted to provide a good explanation of what's going on. Feel free
>> >> to validate and comment.
>> >>
>> >>
>> >> I'm noticing that the file processor is a bit naive in the way it
>> >> reprocesses DAGs. It doesn't look at the DAG interval, for example, so it
>> >> appears to reprocess all files continuously in one big batch, even when
>> >> we can determine that the next "schedule" for all of a file's dags is in
>> >> the future.
>> >>
>> >>
>> >> Wondering if a change in the DagFileProcessingManager could optimize
>> >> things a bit here.
>> >>
>> >> In the part where it gets the simple_dags from a file it's currently
>> >> processing:
>> >>
>> >>                for simple_dag in processor.result:
>> >>                    simple_dags.append(simple_dag)
>> >>
>> >> the file_path is in the context and the simple_dags should be able to
>> >> provide the next interval date for each dag in the file.
>> >>
>> >> The idea is to add files to a deque sorted by "next_schedule_datetime"
>> >> (the minimum next interval date across a file's dags), so that when we
>> >> build the list "files_paths_to_queue", we can skip files whose dags we
>> >> know won't have a new dagrun for a while.
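>> >>
>> >> Roughly like this (the names are made up, not Airflow's actual ones; it
>> >> assumes each simple_dag can report its next schedule datetime):
>> >>
>> >>     import heapq
>> >>     from datetime import datetime
>> >>
>> >>     heap = []  # (earliest next run across the file's dags, file_path)
>> >>
>> >>     def record_result(file_path, simple_dags):
>> >>         next_run = min(d.next_schedule_datetime for d in simple_dags)
>> >>         heapq.heappush(heap, (next_run, file_path))
>> >>
>> >>     def files_due_for_processing(now=None):
>> >>         now = now or datetime.utcnow()
>> >>         due = []
>> >>         while heap and heap[0][0] <= now:
>> >>             due.append(heapq.heappop(heap)[1])
>> >>         return due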
>> >>
>> >> One gotcha to resolve after that is handling files that get updated with
>> >> new dags, changed dag definitions, renames, or different schedule
>> >> intervals.
>> >>
>> >> Worth a PR to glance over?
>> >>
>> >> Rgds,
>> >>
>> >> Gerard
>> >>
>>
>
>
