dstandish commented on a change in pull request #21879:
URL: https://github.com/apache/airflow/pull/21879#discussion_r816482866
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
conn_uri = conn.get_uri()
with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as DAG runs, task runs, and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with the command ``airflow db clean``.
+
+See :ref:`db clean usage<cli-db-clean>` for more details.
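As a sketch of what such an invocation might look like (the cutoff date here is an arbitrary example, and ``--dry-run`` is described in the ``usage-cli.rst`` section of this same PR):

```shell
# Preview how many rows would be deleted from each table,
# without actually deleting anything (example cutoff date).
airflow db clean --clean-before-timestamp '2022-01-01' --dry-run
```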
+
+Upgrades and downgrades
+-----------------------
+
+Back up your database
+^^^^^^^^^^^^^^^^^^^^^
+
+It's always wise to back up the metadata database before undertaking any operation that modifies it.
+
+Disabling the scheduler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+You might consider disabling the Airflow cluster while you perform such maintenance. One way to do so is to set the param ``[scheduler] > use_job_schedule`` to ``False`` and wait for any running DAGs to complete; after this, no new DAG runs will be created unless externally triggered.
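In ``airflow.cfg`` terms, that setting would look roughly like this (a sketch based on the section/option named above):

```ini
[scheduler]
# Stop the scheduler from creating new scheduled DAG runs.
# Externally triggered runs are still possible.
use_job_schedule = False
```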
+
+Another way to accomplish roughly the same thing is to use the ``dags pause`` command. You *must* keep track of which DAGs were already paused before you begin this operation; otherwise, when it comes time to unpause, you won't know which ones should remain paused! So first run ``airflow dags list``, store the list of unpaused DAGs somewhere, and later unpause only those.
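As a sketch of that bookkeeping (the file name is arbitrary, and the ``awk`` filter is an assumption about the ``dags list`` output columns; adjust it to your Airflow version's output format):

```shell
# Record which DAGs are currently unpaused (sketch; the last column of
# `dags list -o plain` is assumed to be the paused flag).
airflow dags list -o plain | awk '$NF == "False" {print $1}' > unpaused_dags.txt

# Pause everything that was running on a schedule.
while read -r dag_id; do airflow dags pause "$dag_id"; done < unpaused_dags.txt

# ... perform the maintenance ...

# Unpause only the DAGs that were unpaused before.
while read -r dag_id; do airflow dags unpause "$dag_id"; done < unpaused_dags.txt
```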
+
+Upgrades
+^^^^^^^^
+
+Some database migrations can be time-consuming. If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command.
Review comment:
```suggestion
Some database migrations can be time-consuming. If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.
```
##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
"sd": "2020-11-29T14:53:56.931243+00:00",
"ed": "2020-11-29T14:53:57.126306+00:00"
}
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+   It's strongly recommended that you back up the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
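Putting those options together, a full invocation might look like the following sketch (the ``--tables`` flag spelling and the table names are assumptions based on the description above; check ``airflow db clean --help`` for your version):

```shell
# Delete records older than the cutoff, restricted to two tables.
# Keep --dry-run while experimenting; it only prints affected row counts.
airflow db clean \
  --clean-before-timestamp '2022-01-01 00:00:00' \
  --tables 'log,task_instance' \
  --dry-run
```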
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE``, so deletes in one table may trigger deletes in others. For example, the ``task_instance`` table keys to the ``dag_run`` table, so if a DagRun record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun. If you delete all DAG runs, Airflow may schedule an old DAG run that was already completed, e.g. if you have set ``catchup=True``. So the ``db clean`` will preserve the latest non-manually-triggered DAG run to preserve continuity in scheduling.
Review comment:
```suggestion
Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun. If you delete all DAG runs, Airflow may schedule an old DAG run that was already completed, e.g. if you have set ``catchup=True``. So the ``db clean`` command will preserve the latest non-manually-triggered DAG run to preserve continuity in scheduling.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]