Yes. Moving the old versions somewhere we can keep/archive static snapshots of those historical docs and publish them from there. What you proposed is exactly the solution I thought might be best as well.
It would be a great task to contribute to the stability of our docs generation in the future. I don't think it's a matter of discussing in detail how to do it (18 months is a good start, and you can parameterize it); it's a matter of someone committing to it and simply doing it :). So yes, I personally am all for it, and if I understand correctly that you are looking for agreement on doing it, big +1 from my side - happy to help with providing access to our S3 buckets.

J.
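P.S. To make that concrete, here is a rough sketch of what such an archival job could look like. Purely illustrative, not existing repo tooling: it assumes the docs-archive/<package>/<version>/ layout, the bucket name is a placeholder, and since the version directories carry no release dates it keeps the newest N versions per package as a stand-in for the 18-month cutoff.

    # Sketch only -- not existing tooling. Bucket name is a placeholder;
    # "keep the newest N versions" stands in for the 18-month cutoff.
    import shutil
    import subprocess
    from pathlib import Path

    from packaging.version import InvalidVersion, Version

    DOCS_ARCHIVE = Path("docs-archive")
    BUCKET = "s3://airflow-docs-archive"  # placeholder bucket name
    KEEP_NEWEST = 5  # parameterize, per the discussion above


    def as_version(path: Path) -> Version | None:
        """Parse a directory name as a version; None for dirs like 'stable'."""
        try:
            return Version(path.name)
        except InvalidVersion:
            return None


    for package in sorted(p for p in DOCS_ARCHIVE.iterdir() if p.is_dir()):
        versions = [d for d in package.iterdir() if d.is_dir() and as_version(d)]
        versions.sort(key=as_version, reverse=True)  # newest first
        for old in versions[KEEP_NEWEST:]:
            dest = f"{BUCKET}/{package.name}/{old.name}/"
            # the aws CLI does the recursive upload; check=True aborts on failure
            subprocess.run(["aws", "s3", "sync", str(old), dest], check=True)
            shutil.rmtree(old)  # drop the local copy only after a successful sync

The one design point that matters is the order of operations: upload, verify the sync succeeded, and only then delete the local copy.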
On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter <[email protected]> wrote:

> *tl;dr*
>
> 1. The GitHub Action for building docs is running out of space. I think
>    we should archive really old documentation for large packages to cloud
>    storage.
> 2. Contributing to and building Airflow docs is hard. We should migrate
>    to a framework, preferably one that uses markdown (although I
>    acknowledge rst -> md will be a massive overhaul).
>
> *Problem Summary*
> I recently set out to implement what I thought would be a straightforward
> feature: warn users when they are viewing documentation for non-current
> versions of Airflow and link them to the current/stable version
> <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the
> airflow-site <https://github.com/apache/airflow-site> repo, which contains
> all of the archived docs (that is, documentation for non-current
> versions), and from there, I ran into a brick wall.
>
> I want to raise some concerns that I've developed after trying to
> contribute what feel like a couple of reasonably small docs updates:
>
> 1. airflow-site
>    1. Elad pointed out the problem posed by the sheer size of the archived
>       docs
>       <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
>       (more on this later).
>    2. The airflow-site repo is confusing and rather poorly documented.
>       1. Hugo (the static site generator) exists, but appears to be used
>          only for the landing pages.
>       2. To view any documentation locally other than the landing pages,
>          you need to run the site.sh script and then copy the output from
>          one directory to another.
>    3. All of the archived docs are raw HTML, which makes migrating to a
>       static site generator a significant challenge and makes it difficult
>       to stop the archived docs from growing and growing. Perhaps this is
>       the wheel Khaleesi was referring to
>       <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> 2. airflow
>    1. Building Airflow docs is a challenge. It takes several minutes and
>       doesn't support auto-build, so the slightest issue can mean waiting
>       again and again until the changes are just so. I tried implementing
>       sphinx-autobuild <https://github.com/executablebooks/sphinx-autobuild>
>       to no avail.
>    2. Sphinx/reStructuredText has a steep learning curve.
>
> *The most acute issue: disk space*
> The size of the archived docs is causing the docs build GitHub Action to
> almost run out of space. From the "Build site" Action from a couple weeks
> ago
> <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> (expand the build site step, scroll all the way to the bottom, expand the
> `df -h` command), we can see the GitHub Actions runner is nearly out of
> space:
>
>     df -h
>     Filesystem      Size  Used  Avail  Use%  Mounted on
>     /dev/root        84G   82G   2.1G   98%  /
>
> The available space is down to 1.8G in the most recent Action
> <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> If we assume that trend holds, we have about two months before the Action
> runner runs out of disk space. Here's a breakdown of the space consumed by
> the 10 largest package documentation directories:
>
>     du -h -d 1 docs-archive/ | sort -h -r
>      14G  docs-archive/
>     4.0G  docs-archive/apache-airflow-providers-google
>     3.2G  docs-archive/apache-airflow
>     1.7G  docs-archive/apache-airflow-providers-amazon
>     560M  docs-archive/apache-airflow-providers-microsoft-azure
>     254M  docs-archive/apache-airflow-providers-cncf-kubernetes
>     192M  docs-archive/apache-airflow-providers-apache-hive
>     153M  docs-archive/apache-airflow-providers-snowflake
>     139M  docs-archive/apache-airflow-providers-databricks
>     104M  docs-archive/apache-airflow-providers-docker
>     101M  docs-archive/apache-airflow-providers-mysql
>
> *Proposed solution: Archive old docs HTML for large packages to cloud
> storage*
> I'm wondering if it would be reasonable to truly archive the docs for some
> of the older versions of these packages. Perhaps keep only the last 18
> months? Maybe we could drop the HTML in a blob storage bucket with
> instructions for rebuilding the docs if absolutely necessary?
>
> *Improving docs building moving forward*
> There's an open Issue <https://github.com/apache/airflow-site/issues/719>
> for migrating the docs to a framework, but it's not at all a
> straightforward task for the archived docs. I think we should institute a
> policy of archiving old documentation to cloud storage after X time and
> use a framework for building docs in a scalable and sustainable way moving
> forward. Maybe we could chat with the Iceberg folks about how they moved
> from mkdocs to hugo? <https://github.com/apache/iceberg/issues/3616>
>
> Shoutout to Utkarsh for helping me through all this!
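P.P.S. A quick back-of-the-envelope check on that two-month runway estimate, using the `df` figures quoted above and assuming the two linked runs are roughly two weeks apart ("a couple weeks ago" vs. the most recent):

    # Sanity check of the runway estimate from the df figures quoted above.
    # Assumes the two linked runs are ~2 weeks apart; treat as an upper bound.
    free_then_gb = 2.1   # "Build site" run from a couple weeks ago
    free_now_gb = 1.8    # most recent run
    weeks_between = 2

    burn_per_week = (free_then_gb - free_now_gb) / weeks_between  # ~0.15 GB/week
    weeks_left = free_now_gb / burn_per_week                      # ~12 weeks
    print(f"~{weeks_left:.0f} weeks of runway at the recent burn rate")

That comes out closer to three months than two, but docs space burns in bursts with every Airflow and provider release, so the shorter estimate is the safer one to plan around.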
