[ 
https://issues.apache.org/jira/browse/AIRFLOW-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541666#comment-16541666
 ] 

Ash Berlin-Taylor commented on AIRFLOW-1729:
--------------------------------------------

Closer to a fuller fix is this diff:

{code:python}
diff --git a/airflow/models.py b/airflow/models.py
index 089befef..e722b609 100755
--- a/airflow/models.py
+++ b/airflow/models.py
@@ -522,12 +522,27 @@ class DagBag(BaseDagBag, LoggingMixin):
         if os.path.isfile(dag_folder):
             self.process_file(dag_folder, only_if_updated=only_if_updated)
         elif os.path.isdir(dag_folder):
+            patterns_by_dir = {}
             for root, dirs, files in os.walk(dag_folder, followlinks=True):
-                patterns = []
+                patterns = patterns_by_dir.get(root, []).copy()
+                self.log.info("Root %s dirs %r patterns %r", root, dirs, patterns)
                 ignore_file = os.path.join(root, '.airflowignore')
                 if os.path.isfile(ignore_file):
+                    self.log.info("Loading %s", ignore_file)
                     with open(ignore_file, 'r') as f:
                         patterns += [p for p in f.read().split('\n') if p]
+                    #dirs[:] = list[d for d in dirs if not any([re.findall(p, os.path.join(root, d)) for p in patterns])]
+
+                # If we can ignore any subdirs entirely we should - fewer paths
+                # to walk is better. We have to modify the ``dirs`` array in
+                # place for this to affect os.walk
+                dirs[:] = [d for d in dirs if not any(re.findall(p, os.path.join(root, d)) for p in patterns)]
+
+                # We want patterns defined in a parent folder's .airflowignore to
+                # apply to subdirs too
+                for d in dirs:
+                    patterns_by_dir[os.path.join(root, d)] = patterns
+
                 for f in files:
                     try:
                         filepath = os.path.join(root, f)
{code}

I haven't opened a PR with that yet for two reasons: it needs tests so that it
doesn't break again, and the logic should be de-duplicated with the almost
identical code in airflow.utils.dag_processing.
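
For anyone picking up the testing side, the logic in the diff can be pulled out
into a standalone function that is easy to exercise against a temporary
directory. The following is only a rough sketch under some assumptions: the
name list_unignored_files and its ignore_name parameter are made up for
illustration and are not part of Airflow, it uses re.search where the diff
uses re.findall (same truthiness), and it uses list(...) rather than .copy()
so it would also run on Python 2.

```python
import os
import re


def list_unignored_files(dag_folder, ignore_name=".airflowignore"):
    """Hypothetical helper: list files under dag_folder that no ignore
    pattern (local or inherited from a parent directory) matches."""
    patterns_by_dir = {}
    found = []
    for root, dirs, files in os.walk(dag_folder, followlinks=True):
        # Start from a copy of the patterns handed down by the parent.
        patterns = list(patterns_by_dir.get(root, []))
        ignore_file = os.path.join(root, ignore_name)
        if os.path.isfile(ignore_file):
            with open(ignore_file, "r") as f:
                patterns += [p for p in f.read().split("\n") if p]

        def ignored(path):
            return any(re.search(p, path) for p in patterns)

        # Slice assignment mutates the list that os.walk holds, so pruned
        # directories are never descended into. Rebinding ``dirs`` to a new
        # list would have no effect on the walk.
        dirs[:] = [d for d in dirs if not ignored(os.path.join(root, d))]

        # Hand the accumulated patterns down to the surviving subdirs.
        for d in dirs:
            patterns_by_dir[os.path.join(root, d)] = patterns

        for name in files:
            path = os.path.join(root, name)
            if name != ignore_name and not ignored(path):
                found.append(path)
    return sorted(found)
```

A test along these lines would want to cover at least three cases: a directory
pruned by a pattern in its parent's .airflowignore, a pattern inherited two
levels down, and a pattern in one subdirectory not leaking into a sibling.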

> Ignore whole directories in .airflowignore
> ------------------------------------------
>
>                 Key: AIRFLOW-1729
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1729
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: Airflow 2.0
>            Reporter: Cedric Hourcade
>            Assignee: Kamil Sambor
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> The .airflowignore file lets you prevent files from being scanned for DAGs.
> But even if we blacklist a full directory, {{os.walk}} will still descend
> into it no matter how deep it is and skip files one by one, which can be an
> issue when you keep big .git or virtualenv directories around.
> I suggest adding something like:
> {code}
> dirs[:] = [d for d in dirs if not any([re.findall(p, os.path.join(root, d)) 
> for p in patterns])]
> {code}
> to prune the directories here: 
> https://github.com/apache/incubator-airflow/blob/cfc2f73c445074e1e09d6ef6a056cd2b33a945da/airflow/utils/dag_processing.py#L208-L209
>  and in {{list_py_file_paths}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
