Hey,

I've seen some people complain about DAG file processing times. An issue
was raised about this today:

https://issues.apache.org/jira/browse/AIRFLOW-1139

I attempted to provide an explanation of what's going on. Feel free to
validate and comment.


I'm noticing that the file processor is a bit naive in the way it
reprocesses DAGs. It doesn't look at the DAG schedule interval, for
example, so it appears to reprocess all files continuously in one big
batch, even when we can determine that the next "schedule" for all of a
file's dags is in the future.


Wondering if a change in the DagFileProcessorManager could optimize things
a bit here.

In the part where it gets the simple_dags from a file it's currently
processing:

                for simple_dag in processor.result:
                    simple_dags.append(simple_dag)

The file_path is available in that context, and the simple_dags should be
able to provide the next interval date for each dag in the file.
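As a rough sketch of what I mean (hypothetical: the tuple shape and the helper name are mine, not the actual simple_dag interface), the "next interval date" for a file would just be the minimum over its dags:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: given (dag_id, last_run, interval) info that the
# simple_dags of one file could expose, compute the earliest datetime at
# which any dag in that file could need a new dagrun.
def next_schedule_datetime(simple_dags):
    """Return the minimum next-run datetime across one file's dags."""
    candidates = [
        last_run + interval
        for (_dag_id, last_run, interval) in simple_dags
    ]
    # No schedulable dags in the file: always treat it as due.
    return min(candidates) if candidates else datetime.min

dags = [
    ("hourly_dag", datetime(2017, 4, 20, 10), timedelta(hours=1)),
    ("daily_dag", datetime(2017, 4, 20, 0), timedelta(days=1)),
]
print(next_schedule_datetime(dags))  # → 2017-04-20 11:00:00
```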

The idea is to add files to a deque sorted by "next_schedule_datetime" (the
minimum next interval date across a file's dags), so that when we build the
list "files_paths_to_queue" we can skip files whose dags we know won't need
a new dagrun for a while.
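Something like the following (a sketch only; the class and method names are made up, and a heap stands in for the sorted deque):

```python
import heapq
from datetime import datetime

# Hypothetical sketch of the proposed skip logic: keep
# (next_schedule_datetime, file_path) pairs in a min-heap, and when
# building files_paths_to_queue only release files whose earliest next
# run is already due.
class FileScheduleIndex:
    def __init__(self):
        self._heap = []

    def record(self, file_path, next_schedule):
        heapq.heappush(self._heap, (next_schedule, file_path))

    def due_files(self, now):
        """Pop and return every file whose next schedule is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, path = heapq.heappop(self._heap)
            due.append(path)
        return due

idx = FileScheduleIndex()
idx.record("dags/hourly.py", datetime(2017, 4, 20, 11))
idx.record("dags/weekly.py", datetime(2017, 4, 24, 0))
print(idx.due_files(datetime(2017, 4, 20, 12)))  # → ['dags/hourly.py']
```

Files not yet due simply stay in the heap and get picked up on a later pass.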

One gotcha to resolve after that is dealing with files being updated with
new dags, changed dag definitions, renames, or different schedule
intervals.

Worth a PR to glance over?

Rgds,

Gerard
