*tl;dr*
1. The GitHub Action for building docs is running out of space. I think we should archive really old documentation for large packages to cloud storage.
2. Contributing to and building Airflow docs is hard. We should migrate to a framework, preferably one that uses markdown (although I acknowledge an rst -> md conversion would be a massive overhaul).
*Problem Summary*
I recently set out to implement what I thought would be a straightforward feature: warn users when they are viewing documentation for non-current versions of Airflow and link them to the current/stable version <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the airflow-site repo <https://github.com/apache/airflow-site>, which contains all of the archived docs (that is, documentation for non-current versions), and from there I ran into a brick wall. I want to raise some concerns that I've developed after trying to contribute what feel like a couple of reasonably small docs updates:
1. airflow-site
    1. Elad pointed out the problem posed by the sheer size of the archived docs <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943> (more on this later).
    2. The airflow-site repo is confusing and rather poorly documented.
        1. Hugo (the static site generator) is in the repo, but appears to be used only for the landing pages.
        2. To view any documentation locally other than the landing pages, you apparently have to run the site.sh script and then copy the output from one directory to another.
    3. All of the archived docs are raw HTML, which makes migrating to a static site generator a significant challenge, and which in turn makes it hard to stop the archived docs from growing and growing. Perhaps this is the wheel Khaleesi was referring to <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
2. airflow
    1. Building Airflow docs is a challenge. A build takes several minutes and doesn't support auto-rebuild, so the slightest issue can mean waiting through build after build until the changes are just so. I tried implementing sphinx-autobuild <https://github.com/executablebooks/sphinx-autobuild> to no avail.
    2. Sphinx/reStructuredText has a steep learning curve.

*The most acute issue: disk space*
The size of the archived docs is causing the docs build GitHub Action to almost run out of space. From the "Build site" Action from a couple weeks ago <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458> (expand the build site step, scroll all the way to the bottom, expand the `df -h` command), we can see the GitHub Actions runner is nearly out of space:

df -h
*Filesystem      Size  Used  Avail  Use%  Mounted on*
/dev/root        84G   82G   2.1G   98%  /

The available space is down to 1.8G in the most recent Action <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176> — roughly 0.3G consumed in about two weeks. If that trend holds (and the docs only grow with each release), we have maybe two or three months before the Action runner runs out of disk space. Here's a breakdown of the space consumed by the 10 largest package documentation directories:

du -h -d 1 docs-archive/ | sort -h -r
*14G*  docs-archive/
*4.0G* docs-archive//apache-airflow-providers-google
*3.2G* docs-archive//apache-airflow
*1.7G* docs-archive//apache-airflow-providers-amazon
*560M* docs-archive//apache-airflow-providers-microsoft-azure
*254M* docs-archive//apache-airflow-providers-cncf-kubernetes
*192M* docs-archive//apache-airflow-providers-apache-hive
*153M* docs-archive//apache-airflow-providers-snowflake
*139M* docs-archive//apache-airflow-providers-databricks
*104M* docs-archive//apache-airflow-providers-docker
*101M* docs-archive//apache-airflow-providers-mysql

*Proposed solution: archive old docs html for large packages to cloud storage*
I'm wondering if it would be reasonable to truly archive the docs for some of the older versions of these packages — perhaps keeping only the last 18 months? Maybe we could drop the html in a blob storage bucket, with instructions for rebuilding the docs if it's ever absolutely necessary?
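To make that concrete, here's a rough, untested sketch of what pruning one package could look like. The bucket name (`gs://airflow-docs-archive`), the KEEP count, and the use of gsutil are all placeholders I made up, not anything that exists today:

```
# Hypothetical sketch: archive all but the newest few doc versions of one
# package to a GCS bucket, then remove them from the repo.
PACKAGE="apache-airflow-providers-google"
KEEP=5  # or derive from release dates to keep ~18 months of versions

cd "docs-archive/${PACKAGE}"
# sort -V orders the version directories semantically; head -n -"$KEEP"
# (GNU coreutils) emits everything except the newest $KEEP versions.
for version in $(ls -d */ | sort -V | head -n -"${KEEP}"); do
  gsutil -m rsync -r "${version}" "gs://airflow-docs-archive/${PACKAGE}/${version}"
  rm -r "${version}"
done
```

Something like this could presumably run in CI on a schedule, so the retention policy enforces itself rather than relying on someone remembering to prune.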
*Improving docs building moving forward*
There's an open issue for migrating the docs to a framework <https://github.com/apache/airflow-site/issues/719>, but it's not at all a straightforward task for the archived docs. I think we should institute a policy of archiving old documentation to cloud storage after X amount of time, and adopt a framework for building docs in a scalable and sustainable way moving forward. Maybe we could chat with the Iceberg folks about how they moved from mkdocs to hugo <https://github.com/apache/iceberg/issues/3616>?

Shoutout to Utkarsh for helping me through all this!
