Yes. Moving the old versions to somewhere we can keep/archive static
historical versions of those docs and publish them from there.
What you proposed is exactly the solution I thought might be best as well.

It would be a great task, and it would contribute to the stability of our
docs generation in the future.

I don't think it's a matter of discussing in detail how to do it (18 months
is a good start, and you can parameterize it); it's a matter of someone
committing to it and simply doing it :).

So yes, I personally am all for it, and if I understand correctly that you
are looking for agreement on doing it, big +1 from my side - happy to help
by providing access to our S3 buckets.

J.

On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
<[email protected]> wrote:

> *tl;dr*
>
>    1. The GitHub Action for building docs is running out of space. I think
>    we should archive really old documentation for large packages to cloud
>    storage.
>    2. Contributing to and building Airflow docs is hard. We should migrate
>    to a framework, preferably one that uses markdown (although I acknowledge
>    rst -> md will be a massive overhaul).
>
> *Problem Summary*
> I recently set out to implement what I thought would be a straightforward
> feature: warn users when they are viewing documentation for non-current
> versions of Airflow and link them to the current/stable version
> <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the
> airflow-site <https://github.com/apache/airflow-site> repo, which contains
> all of the archived docs (that is, documentation for non-current versions),
> and from there, I ran into a brick wall.
>
> I want to raise some concerns that I've developed after trying to
> contribute what feel like a couple of reasonably small docs updates:
>
>    1. airflow-site
>       1. Elad pointed out the problem posed by the sheer size of archived
>       docs
>       <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
>       (more on this later).
>       2. The airflow-site repo is confusing and rather poorly documented.
>          1. Hugo (the static site generator) is present, but it appears to
>          be used only for the landing pages.
>          2. To view any documentation locally other than the landing pages,
>          you'll need to run the site.sh script and then copy the output
>          from one dir to another?
>       3. All of the archived docs are raw HTML, which makes migrating to a
>       static site generator a significant challenge and, in turn, makes it
>       difficult to keep the archived docs from growing indefinitely.
>       Perhaps this is the wheel Khaleesi was referring to
>       <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
>    2. airflow
>       1. Building Airflow docs is a challenge. A build takes several minutes
>       and doesn't support auto-rebuild, so the slightest issue can mean
>       waiting through build after build until the changes are just so. I
>       tried implementing sphinx-autobuild
>       <https://github.com/executablebooks/sphinx-autobuild> to no avail
>       (a rough sketch of what auto-rebuild usually looks like follows this
>       list).
>       2. Sphinx/reStructuredText has a steep learning curve.
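>
> For context, in a plain Sphinx project an auto-rebuild loop would usually
> look something like the following; the source and output paths here are
> just illustrative assumptions, and Airflow's docs go through their own
> build tooling, which is part of why this didn't work out of the box:
>
> pip install sphinx-autobuild
> # watch the source dir, rebuild on change, and serve the result locally
> sphinx-autobuild docs/apache-airflow docs/_build/html --port 8000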
>
> *The most acute issue: disk space*
> The size of the archived docs is causing the docs build GitHub Action to
> almost run out of space. From the "Build site" Action from a couple weeks
> ago
> <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> (expand the build site step, scroll all the way to the bottom, expand the
> `df -h` command), we can see the GitHub Action runner (or whatever it's
> called) is nearly running out of space:
>
> df -h
>   Filesystem      Size  Used Avail Use% Mounted on
>   /dev/root        84G   82G  2.1G  98% /
>
>
> The available space is down to 1.8G on the most recent Action
> <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> If we assume that trend is accurate, we have about two months before the
> Action runner runs out of disk space. Here's a breakdown of the space
> consumed by the 10 largest package documentation directories:
>
> du -h -d 1 docs-archive/ | sort -h -r
>  14G docs-archive/
> 4.0G docs-archive//apache-airflow-providers-google
> 3.2G docs-archive//apache-airflow
> 1.7G docs-archive//apache-airflow-providers-amazon
> 560M docs-archive//apache-airflow-providers-microsoft-azure
> 254M docs-archive//apache-airflow-providers-cncf-kubernetes
> 192M docs-archive//apache-airflow-providers-apache-hive
> 153M docs-archive//apache-airflow-providers-snowflake
> 139M docs-archive//apache-airflow-providers-databricks
> 104M docs-archive//apache-airflow-providers-docker
> 101M docs-archive//apache-airflow-providers-mysql
>
>
> *Proposed solution: Archive old docs HTML for large packages to cloud
> storage*
> I'm wondering if it would be reasonable to truly archive the docs for some
> of the older versions of these packages - perhaps keeping only the last 18
> months on the site? Maybe we could drop the HTML in a blob storage bucket,
> along with instructions for rebuilding the docs if absolutely necessary?
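>
> As a rough sketch of what the archival step could look like (the bucket
> name and the package/version path below are made up, and this assumes the
> standard AWS CLI):
>
> # copy one archived version to an infrequent-access bucket...
> aws s3 sync docs-archive/apache-airflow-providers-google/1.0.0 \
>     s3://airflow-docs-archive/apache-airflow-providers-google/1.0.0 \
>     --storage-class STANDARD_IA
> # ...then drop the local copy so it no longer ships with the site build
> rm -rf docs-archive/apache-airflow-providers-google/1.0.0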
>
> *Improving docs building moving forward*
> There's an open Issue <https://github.com/apache/airflow-site/issues/719>
> for
> migrating the docs to a framework, but it's not at all a straightforward
> task for the archived docs. I think that we should institute a policy of
> archiving old documentation to cloud storage after X time and use a
> framework for building docs in a scalable and sustainable way moving
> forward. Maybe we could chat with iceberg folks about how they moved from
> mkdocs to hugo? <https://github.com/apache/iceberg/issues/3616>
>
>
> Shoutout to Utkarsh for helping me through all this!
>
