[ 
https://issues.apache.org/jira/browse/AIRFLOW-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Bruce updated AIRFLOW-6796:
-----------------------------------
    Description: 
With serialization of DAGs enabled, `SerializedDagModel.remove_deleted_dags` 
called from `DagFileProcessManager.refresh_dag_dir` can delete the 
serialization of DAGs if they were loaded via a DagBag and globals in a 
different `.py` file:

Consider something like this:
 {{/home/airflow/dags/loader.py}}
{code:python}
dag_bags = []
dag_bags.append(models.DagBag('/home/airflow/project-a/dags')
dag_bags.append(models.DagBag('/home/airflow/project-b/dags')

for dag_bag in dag_bags:
    for dag in dag_bag:
      globals()[dag.dag_id] = dag{code}
with files:

```

{{/home/airflow/project-a/dags/dag-a.py}}
{{/home/airflow/project-b/dags/dag-b.py}}

```

The list of file paths passed to {{SerializedDagModel.remove_deleted_dags}} is 
only going to contain {{/home/airflow/dags/loader.py}} and the method will 
remove the serializations for the DAGs in dag-a.py and dag-b.py

With non-serialized DAGs, airflow seems to mark DAGs as inactive based on when 
the scheduler last processed them - I wonder if we should make these two 
methods consistent?

  was:
With serialization of DAGs enabled, `SerializedDagModel.remove_deleted_dags` 
called from `DagFileProcessManager.refresh_dag_dir` can delete the 
serialization of DAGs if they were loaded via a DagBag and globals in a 
different `.py` file:

Consider something like this:
 {{/home/airflow/dags/loader.py}}
{code:python}
dag_bags = []
dag_bags.append(models.DagBag('/home/airflow/project-a/dags')
dag_bags.append(models.DagBag('/home/airflow/project-b/dags')

for dag_bag in dag_bags:
    for dag in dag_bag:
      globals()[dag.dag_id] = dag{code}
with files:
{code:python}
 
{code}
{{/home/airflow/project-a/dags/dag-a.py}}
{code:python}
 
{code}
{{/home/airflow/project-b/dags/dag-b.py}}

The list of file paths passed to {{SerializedDagModel.remove_deleted_dags}} is 
only going to contain {{/home/airflow/dags/loader.py}} and the method will 
remove the serializations for the DAGs in dag-a.py and dag-b.py

With non-serialized DAGs, airflow seems to mark DAGs as inactive based on when 
the scheduler last processed them - I wonder if we should make these two 
methods consistent?


> Serialized DAGs can be incorrectly deleted
> ------------------------------------------
>
>                 Key: AIRFLOW-6796
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6796
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: serialization
>    Affects Versions: 1.10.9
>            Reporter: Matthew Bruce
>            Priority: Major
>
> With serialization of DAGs enabled, `SerializedDagModel.remove_deleted_dags` 
> called from `DagFileProcessManager.refresh_dag_dir` can delete the 
> serialization of DAGs if they were loaded via a DagBag and globals in a 
> different `.py` file:
> Consider something like this:
>  {{/home/airflow/dags/loader.py}}
> {code:python}
> dag_bags = []
> dag_bags.append(models.DagBag('/home/airflow/project-a/dags')
> dag_bags.append(models.DagBag('/home/airflow/project-b/dags')
> for dag_bag in dag_bags:
>     for dag in dag_bag:
>       globals()[dag.dag_id] = dag{code}
> with files:
> ```
> {{/home/airflow/project-a/dags/dag-a.py}}
> {{/home/airflow/project-b/dags/dag-b.py}}
> ```
> The list of file paths passed to {{SerializedDagModel.remove_deleted_dags}} 
> is only going to contain {{/home/airflow/dags/loader.py}} and the method will 
> remove the serializations for the DAGs in dag-a.py and dag-b.py
> With non-serialized DAGs, airflow seems to mark DAGs as inactive based on 
> when the scheduler last processed them - I wonder if we should make these two 
> methods consistent?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to