dstandish commented on a change in pull request #21879:
URL: https://github.com/apache/airflow/pull/21879#discussion_r816482866



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -577,3 +577,41 @@ For connection, use :envvar:`AIRFLOW_CONN_{CONN_ID}`.
     conn_uri = conn.get_uri()
     with mock.patch.dict("os.environ", AIRFLOW_CONN_MY_CONN=conn_uri):
         assert "cat" == Connection.get("my_conn").login
+
+Metadata DB maintenance
+-----------------------
+
+Over time, the metadata database will increase its storage footprint as more DAG runs, task runs, and event logs accumulate.
+
+You can use the Airflow CLI to purge old data with the command ``airflow db clean``.
+
+See :ref:`db clean usage<cli-db-clean>` for more details.
+
+Upgrades and downgrades
+-----------------------
+
+Backup your database
+^^^^^^^^^^^^^^^^^^^^
+
+It's always wise to back up the metadata database before undertaking any operation that modifies it.
+
+Disabling the scheduler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+You might consider disabling the Airflow cluster while you perform such maintenance.  One way to do so is to set the param ``[scheduler] > use_job_schedule`` to ``False`` and wait for any running DAGs to complete; after this, no new DAG runs will be created unless externally triggered.
+
+Another way to accomplish roughly the same thing is to use the ``dags pause`` command.  You *must* keep track of the DAGs that are paused before you begin this operation; otherwise, when it comes time to unpause, you won't know which ones should remain paused.  So first run ``airflow dags list``, store the list of unpaused DAGs, and keep this list somewhere so that later you can unpause only these.
+
+Upgrades
+^^^^^^^^
+
+Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command.

Review comment:
    ```suggestion
    Some database migrations can be time-consuming.  If your metadata database is very large, consider pruning some of the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade.
    ```
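
The pause-and-record workflow described in the diff above could be sketched roughly as follows. This is a minimal sketch against a running Airflow 2.x deployment; the JSON field names and the string value of ``paused`` in the ``dags list`` output are assumptions that should be verified against your Airflow version.

```shell
# Record which DAGs are currently unpaused, so exactly this set can be
# restored after maintenance. The "paused" field name/format in the JSON
# output is an assumption -- check `airflow dags list -o json` first.
airflow dags list -o json \
  | jq -r '.[] | select(.paused != "True") | .dag_id' > unpaused_dags.txt

# Pause every DAG that was recorded as unpaused.
while read -r dag_id; do
  airflow dags pause "$dag_id"
done < unpaused_dags.txt

# ... perform the maintenance (backup, db clean, upgrade) here ...

# Afterwards, unpause only the DAGs that were unpaused before.
while read -r dag_id; do
  airflow dags unpause "$dag_id"
done < unpaused_dags.txt
```

Keeping the list in a file (rather than relying on memory) is what makes the restore step safe if the maintenance window is interrupted.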

##########
File path: docs/apache-airflow/usage-cli.rst
##########
@@ -199,3 +199,52 @@ Both ``json`` and ``yaml`` formats make it easier to manipulate the data using c
     "sd": "2020-11-29T14:53:56.931243+00:00",
     "ed": "2020-11-29T14:53:57.126306+00:00"
   }
+
+.. _cli-db-clean:
+
+Purge history from metadata database
+------------------------------------
+
+.. note::
+
+  It's strongly recommended that you back up the metadata database before running the ``db clean`` command.
+
+The ``db clean`` command works by deleting from each table the records older than the provided ``--clean-before-timestamp``.
+
+You can optionally provide a list of tables to perform deletes on. If no list of tables is supplied, all tables will be included.
+
+You can use the ``--dry-run`` option to print the row counts in the primary tables to be cleaned.
+
+Beware cascading deletes
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Keep in mind that some tables have foreign key relationships defined with ``ON DELETE CASCADE``, so deletes in one table may trigger deletes in others.  For example, the ``task_instance`` table references the ``dag_run`` table, so if a DagRun record is deleted, all of its associated task instances will also be deleted.
+
+
+Special handling for dag runs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all DAG runs, Airflow may schedule an old DAG run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` will preserve the latest non-manually-triggered DAG run to preserve continuity in scheduling.

Review comment:
    ```suggestion
    Commonly, Airflow determines which DagRun to run next by looking up the latest DagRun.  If you delete all DAG runs, Airflow may schedule an old DAG run that was already completed, e.g. if you have set ``catchup=True``.  So the ``db clean`` command will preserve the latest non-manually-triggered DAG run to preserve continuity in scheduling.
    ```
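
Putting together the flags described in the diff above, a typical ``db clean`` session might look like the following. This is a sketch only: it requires a running Airflow 2.3+ environment, the backup step assumes a Postgres metadata database named ``airflow`` (adjust for your backend), and the cutoff date and table list are arbitrary examples.

```shell
# Always take a backup first. pg_dump shown as an example for Postgres;
# use the equivalent tool for your metadata database backend.
pg_dump airflow > airflow_metadata_backup.sql

# Dry run: print the row counts per table that would be deleted,
# without deleting anything.
airflow db clean --clean-before-timestamp '2021-01-01' --dry-run

# Real run, restricted to two tables. Cascading deletes may still touch
# related tables (e.g. task_instance rows for deleted dag_run records).
airflow db clean --clean-before-timestamp '2021-01-01' --tables dag_run,log
```

Running the dry run first and comparing the reported counts against expectations is the cheapest guard against an over-aggressive cutoff timestamp.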



