[ https://issues.apache.org/jira/browse/AIRFLOW-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541666#comment-16541666 ]
Ash Berlin-Taylor commented on AIRFLOW-1729:
--------------------------------------------

Closer to a fuller fix is this diff:

{code:python}
diff --git a/airflow/models.py b/airflow/models.py
index 089befef..e722b609 100755
--- a/airflow/models.py
+++ b/airflow/models.py
@@ -522,12 +522,27 @@ class DagBag(BaseDagBag, LoggingMixin):
         if os.path.isfile(dag_folder):
             self.process_file(dag_folder, only_if_updated=only_if_updated)
         elif os.path.isdir(dag_folder):
+            patterns_by_dir = {}
             for root, dirs, files in os.walk(dag_folder, followlinks=True):
-                patterns = []
+                patterns = patterns_by_dir.get(root, []).copy()
+                self.log.info("Root %s dirs %r patterns %r", root, dirs, patterns)
                 ignore_file = os.path.join(root, '.airflowignore')
                 if os.path.isfile(ignore_file):
+                    self.log.info("Loading %s", ignore_file)
                     with open(ignore_file, 'r') as f:
                         patterns += [p for p in f.read().split('\n') if p]
+                #dirs[:] = list[d for d in dirs if not any([re.findall(p, os.path.join(root, d)) for p in patterns])]
+
+                # If we can ignore any subdirs entirely we should - fewer paths
+                # to walk is better. We have to modify the ``dirs`` array in
+                # place for this to affect os.walk
+                dirs[:] = [d for d in dirs if not any(re.findall(p, os.path.join(root, d)) for p in patterns)]
+
+                # We want patterns defined in a parent folder's .airflowignore to
+                # apply to subdirs too
+                for d in dirs:
+                    patterns_by_dir[os.path.join(root, d)] = patterns
+
                 for f in files:
                     try:
                         filepath = os.path.join(root, f)
{code}

Reasons I haven't just opened a PR with that: we need to add tests for this so it doesn't break again, and we should de-duplicate this code with the almost identical code in airflow.utils.dag_processing.
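For anyone who wants to try the idea outside of Airflow, here is a minimal standalone sketch of the same approach: prune ``dirs`` in place so os.walk never descends into ignored directories, and hand each directory's accumulated patterns down to its subdirectories. The function name ``walk_with_ignores``, its return value, the ``.py`` filter, and the use of ``re.search`` instead of ``re.findall`` are assumptions made for illustration; this is not the code in airflow/models.py.

```python
import os
import re


def walk_with_ignores(dag_folder):
    """Hypothetical helper: walk ``dag_folder`` honouring .airflowignore files.

    Returns (kept_files, visited_dirs) so the pruning can be observed.
    """
    kept_files = []
    visited_dirs = []
    patterns_by_dir = {}

    for root, dirs, files in os.walk(dag_folder, followlinks=True):
        visited_dirs.append(root)

        # Start from the patterns inherited from parent directories (copied,
        # so that sibling directories do not see each other's additions).
        patterns = list(patterns_by_dir.get(root, []))

        ignore_file = os.path.join(root, ".airflowignore")
        if os.path.isfile(ignore_file):
            with open(ignore_file, "r") as f:
                patterns += [p for p in f.read().split("\n") if p]

        # Prune ignored subdirectories. The slice assignment mutates the list
        # that os.walk holds; rebinding ``dirs`` would have no effect.
        dirs[:] = [
            d for d in dirs
            if not any(re.search(p, os.path.join(root, d)) for p in patterns)
        ]

        # Patterns from this level also apply to every remaining subdirectory.
        for d in dirs:
            patterns_by_dir[os.path.join(root, d)] = patterns

        for name in files:
            filepath = os.path.join(root, name)
            # Assumption for this sketch: only .py files are candidates.
            if not name.endswith(".py"):
                continue
            if any(re.search(p, filepath) for p in patterns):
                continue
            kept_files.append(filepath)

    return kept_files, visited_dirs
```

Note that, as in the diff, each pattern is matched against the full path, so a pattern that happens to match part of the DAG folder's own path would prune everything beneath it.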

> Ignore whole directories in .airflowignore
> ------------------------------------------
>
>                 Key: AIRFLOW-1729
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1729
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: Airflow 2.0
>            Reporter: Cedric Hourcade
>            Assignee: Kamil Sambor
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> The .airflowignore file lets you prevent files from being scanned for DAGs.
> But even if we blacklist a full directory, {{os.walk}} will still go through
> it no matter how deep it is and skip its files one by one, which can be an
> issue when you keep big .git or virtualenv directories around.
> I suggest adding something like:
> {code}
> dirs[:] = [d for d in dirs if not any([re.findall(p, os.path.join(root, d)) for p in patterns])]
> {code}
> to prune the directories here:
> https://github.com/apache/incubator-airflow/blob/cfc2f73c445074e1e09d6ef6a056cd2b33a945da/airflow/utils/dag_processing.py#L208-L209
> and in {{list_py_file_paths}}