[ https://issues.apache.org/jira/browse/AIRFLOW-6796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Bruce updated AIRFLOW-6796: ----------------------------------- Description: With serialization of DAGs enabled, `SerializedDagModel.remove_deleted_dags` called from `DagFileProcessManager.refresh_dag_dir` can delete the serialization of DAGs if they were loaded via a DagBag and globals in a different `.py` file: Consider something like this: {{/home/airflow/dags/loader.py}} {code:python} dag_bags = [] dag_bags.append(models.DagBag('/home/airflow/project-a/dags') dag_bags.append(models.DagBag('/home/airflow/project-b/dags') for dag_bag in dag_bags: for dag in dag_bag: globals()[dag.dag_id] = dag{code} with files: ``` {{/home/airflow/project-a/dags/dag-a.py}} {{/home/airflow/project-b/dags/dag-b.py}} ``` The list of file paths passed to {{SerializedDagModel.remove_deleted_dags}} is only going to contain {{/home/airflow/dags/loader.py}} and the method will remove the serializations for the DAGs in dag-a.py and dag-b.py With non-serialized DAGs, airflow seems to mark DAGs as inactive based on when the scheduler last processed them - I wonder if we should make these two methods consistent? was: With serialization of DAGs enabled, `SerializedDagModel.remove_deleted_dags` called from `DagFileProcessManager.refresh_dag_dir` can delete the serialization of DAGs if they were loaded via a DagBag and globals in a different `.py` file: Consider something like this: {{/home/airflow/dags/loader.py}} {code:python} dag_bags = [] dag_bags.append(models.DagBag('/home/airflow/project-a/dags') dag_bags.append(models.DagBag('/home/airflow/project-b/dags') for dag_bag in dag_bags: for dag in dag_bag: globals()[dag.dag_id] = dag{code} with files: {code:python} {code} {{/home/airflow/project-a/dags/dag-a.py}} {code:python} {code} {{/home/airflow/project-b/dags/dag-b.py}} The list of file paths passed to {{SerializedDagModel.remove_deleted_dags}} is only going to contain {{/home/airflow/dags/loader.py}} and the method will remove the serializations for the DAGs in dag-a.py and dag-b.py With non-serialized DAGs, airflow seems to mark DAGs as inactive based on when the scheduler last processed them - I wonder if we should make these two methods consistent? > Serialized DAGs can be incorrectly deleted > ------------------------------------------ > > Key: AIRFLOW-6796 > URL: https://issues.apache.org/jira/browse/AIRFLOW-6796 > Project: Apache Airflow > Issue Type: Bug > Components: serialization > Affects Versions: 1.10.9 > Reporter: Matthew Bruce > Priority: Major > > With serialization of DAGs enabled, `SerializedDagModel.remove_deleted_dags` > called from `DagFileProcessManager.refresh_dag_dir` can delete the > serialization of DAGs if they were loaded via a DagBag and globals in a > different `.py` file: > Consider something like this: > {{/home/airflow/dags/loader.py}} > {code:python} > dag_bags = [] > dag_bags.append(models.DagBag('/home/airflow/project-a/dags') > dag_bags.append(models.DagBag('/home/airflow/project-b/dags') > for dag_bag in dag_bags: > for dag in dag_bag: > globals()[dag.dag_id] = dag{code} > with files: > ``` > {{/home/airflow/project-a/dags/dag-a.py}} > {{/home/airflow/project-b/dags/dag-b.py}} > ``` > The list of file paths passed to {{SerializedDagModel.remove_deleted_dags}} > is only going to contain {{/home/airflow/dags/loader.py}} and the method will > remove the serializations for the DAGs in dag-a.py and dag-b.py > With non-serialized DAGs, airflow seems to mark DAGs as inactive based on > when the scheduler last processed them - I wonder if we should make these two > methods consistent? -- This message was sent by Atlassian Jira (v8.3.4#803005)