+1 for moving archived docs outside of airflow-site. Even if that might mean a little more maintenance in case we need to propagate changes to all historical versions, and we would have to handle two repositories, that seems like a minor downside compared to the quality-of-life improvement it would bring for airflow-site contributions.
On Thu, Oct 19, 2023 at 4:11 PM, Jarek Potiuk <ja...@potiuk.com> wrote:

> Let me just clarify (because that could be unclear) what my +1 was about.
>
> I was not talking (and I believe Ryan was not talking either) about
> removing the old docs, but about archiving them and serving them from
> elsewhere (cloud storage).
>
> I think discussing a change to more shared HTML/JS/CSS is also a good idea
> to optimise things, but it can probably be handled separately as a longer
> effort of redesigning how the docs are built. But by all means we could
> also work on that.
>
> Maybe I jumped to conclusions, but the easiest, tactical solution (for the
> most acute issue - size) is to just move the old generated HTML docs out
> of the git repository of "airflow-site" and, in the "github_pages" branch,
> replace them with redirects to the files served from cloud storage (and I
> believe this is what Ryan hinted at).
>
> Those redirects could be automatically generated for all historical
> versions, and they will be small. We are already doing it for individual
> pages when navigating between versions, but we could easily replace all
> the historical docs with:
>
> <html><head><meta http-equiv="refresh" content="0;
> url=https://new-archive-docs-airflow-url/airflow/version/document.url"/></head></html>
>
> Low-tech, surely, and "legacy", but it will solve the size problem
> instantly. We currently have 115,148 such files, which would go down to
> about 20 MB of files - peanuts compared to the current 17 GB (!) we have.
>
> We can also inject into the moved "storage" docs a header that informs
> the reader that this is old/archived documentation, with a single redirect
> to the "live"/"stable" site for newer versions of the docs (which I
> believe is what sparked Ryan's work). This can be done at least as a
> "quick" remediation for the size issue, and it might allow the current
> scheme to work without an ever-growing repo and without using up space for
> the build action.
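A minimal sketch of the redirect-stub generation Jarek describes above. The archive base URL is the same placeholder used in his example, and the function names are mine, not anything that exists in airflow-site:

```python
from pathlib import Path

# Placeholder base URL, as in Jarek's example - the real bucket URL
# would be decided when the archive storage is set up.
ARCHIVE_BASE = "https://new-archive-docs-airflow-url"

def redirect_stub(relative_path: str) -> str:
    """Return a tiny HTML page that immediately redirects to the archived copy."""
    return (
        '<html><head><meta http-equiv="refresh" '
        f'content="0; url={ARCHIVE_BASE}/{relative_path}"/></head></html>'
    )

def write_stub(site_root: Path, relative_path: str) -> None:
    """Replace an archived HTML file with its redirect stub at the same path,
    so old URLs keep working."""
    target = site_root / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(redirect_stub(relative_path))
```

Each stub is under 200 bytes, so on the order of 115,000 stubs comes to roughly 20 MB, consistent with Jarek's estimate.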
> If we have such an automated mechanism in place, we could periodically
> archive old docs. All that without changing our build process - we would
> simply keep old "past" docs elsewhere (still accessible for users).
>
> Not much should change for users IMHO - if they go to an old version of
> the docs or use old, archived URLs, they would end up seeing the same
> content/navigation they see today (with extra information that it's an old
> version served from a different URL). When they go to an "old" version of
> the documentation they could be redirected to the new one - same HTML, but
> hosted on cloud storage, fully statically. We already do that with the
> "redirect" mechanism.
>
> In the meantime, someone could also work on a strategic solution -
> changing the current build process - but that is, I think, a different and
> much more complex step requiring a lot of effort. And it could simply end
> up with regenerating whatever is left as "live" documentation (leaving the
> archived docs intact).
>
> That's at least what I see as a possible set of steps to take.
>
> J.
>
> On Thu, Oct 19, 2023 at 2:14 PM utkarsh sharma <utkarshar...@gmail.com>
> wrote:
>
> > Hey everyone,
> >
> > Thanks, Ryan, for starting the thread :)
> >
> > Big +1 for archiving docs older than 18 months. We can still make the
> > older docs available in `rst` form.
> >
> > But eventually we might run into this problem again because of the
> > growing number of providers. I think the main cause of this issue is the
> > generated static HTML pages and the way we serve them using GitHub
> > Pages. The generated pages have lots of common code -
> > HTML (headers/navigation/breadcrumbs/footer etc.), CSS, and JS - which
> > is repeated for every provider and every version of that provider. If we
> > had a more dynamic way (Django/Flask servers) of serving the documents,
> > we could save all the space taken by the common HTML/CSS/JS.
> >
> > But the downsides of this approach are:
> >
> > 1. We need to run a server.
> > 2. It also requires changes in the existing document build process to
> > produce only partial HTML documents.
> >
> > Thanks,
> > Utkarsh Sharma
> >
> > On Thu, Oct 19, 2023 at 4:08 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > Yes. Moving the old versions to somewhere we can keep/archive static
> > > historical versions of those docs and publish them from there. What
> > > you proposed is exactly the solution I thought might be best as well.
> > >
> > > It would be a great task to contribute to the stability of our docs
> > > generation in the future.
> > >
> > > I don't think it's a matter of discussing in detail how to do it (18
> > > months is a good start, and you can parameterize it). It's a matter of
> > > someone committing to it and simply doing it :).
> > >
> > > So yes, I personally am all for it, and if I understand correctly
> > > that you are looking for agreement on doing it, big +1 from my side -
> > > happy to help with providing access to our S3 buckets.
> > >
> > > J.
> > >
> > > On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
> > > <ryan.hat...@astronomer.io.invalid> wrote:
> > >
> > > > *tl;dr*
> > > >
> > > > 1. The GitHub Action for building docs is running out of space. I
> > > > think we should archive really old documentation for large packages
> > > > to cloud storage.
> > > > 2. Contributing to and building Airflow docs is hard. We should
> > > > migrate to a framework, preferably one that uses markdown (although
> > > > I acknowledge rst -> md will be a massive overhaul).
> > > >
> > > > *Problem Summary*
> > > > I recently set out to implement what I thought would be a
> > > > straightforward feature: warn users when they are viewing
> > > > documentation for non-current versions of Airflow and link them to
> > > > the current/stable version
> > > > <https://github.com/apache/airflow/pull/34639>.
> > > > Jed pointed me to the airflow-site
> > > > <https://github.com/apache/airflow-site> repo, which contains all
> > > > of the archived docs (that is, documentation for non-current
> > > > versions), and from there, I ran into a brick wall.
> > > >
> > > > I want to raise some concerns that I've developed after trying to
> > > > contribute what feel like a couple of reasonably small docs updates:
> > > >
> > > > 1. airflow-site
> > > >    1. Elad pointed out the problem posed by the sheer size of the
> > > >    archived docs
> > > >    <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
> > > >    (more on this later).
> > > >    2. The airflow-site repo is confusing, and rather poorly
> > > >    documented.
> > > >       1. Hugo (the static site generator) exists, but appears to
> > > >       only be used for the landing pages.
> > > >       2. In order to view any documentation locally other than the
> > > >       landing pages, you'll need to run the site.sh script and then
> > > >       copy the output from one dir to another?
> > > >    3. All of the archived docs are raw HTML, making migrating to a
> > > >    static site generator a significant challenge, which makes it
> > > >    difficult to prevent the archived docs from growing and growing.
> > > >    Perhaps this is the wheel Khaleesi was referring to
> > > >    <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> > > > 2. airflow
> > > >    1. Building Airflow docs is a challenge. It takes several
> > > >    minutes and doesn't support auto-build, so the slightest issue
> > > >    could require waiting again and again until the changes are just
> > > >    so. I tried implementing sphinx-autobuild
> > > >    <https://github.com/executablebooks/sphinx-autobuild> to no
> > > >    avail.
> > > >    2. Sphinx/reStructuredText has a steep learning curve.
> > > > *The most acute issue: disk space*
> > > > The size of the archived docs is causing the docs build GitHub
> > > > Action to almost run out of space. From the "Build site" Action
> > > > from a couple of weeks ago
> > > > <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> > > > (expand the build site step, scroll all the way to the bottom, and
> > > > expand the `df -h` command), we can see the GitHub Action runner
> > > > (or whatever it's called) is nearly running out of space:
> > > >
> > > > df -h
> > > > *Filesystem   Size  Used  Avail  Use%  Mounted on*
> > > > /dev/root     84G   82G   2.1G   98%   /
> > > >
> > > > The available space is down to 1.8G in the most recent Action
> > > > <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> > > > If we assume that trend is accurate, we have about two months
> > > > before the Action runner runs out of disk space.
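The runway estimate above can be sanity-checked with back-of-envelope arithmetic. The rate below is inferred from the two linked runs (2.1 GB free dropping to 1.8 GB over roughly two weeks), and a linear model is optimistic, since the archive grows with every new provider release:

```python
def weeks_of_runway(avail_gb: float, consumed_gb: float, weeks_elapsed: float) -> float:
    """Linear extrapolation: weeks until the runner's disk is full."""
    rate = consumed_gb / weeks_elapsed  # GB consumed per week
    return avail_gb / rate

# Available space fell from 2.1 GB to 1.8 GB over roughly two weeks of runs.
runway = weeks_of_runway(avail_gb=1.8, consumed_gb=2.1 - 1.8, weeks_elapsed=2)
```

That gives about 12 weeks under the linear model; with growth accelerating as providers multiply, Ryan's "about two months" is a plausible practical bound.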
> > > > Here's a breakdown of the space consumed by the 10 largest package
> > > > documentation directories:
> > > >
> > > > du -h -d 1 docs-archive/ | sort -h -r
> > > > * 14G*  docs-archive/
> > > > *4.0G*  docs-archive//apache-airflow-providers-google
> > > > *3.2G*  docs-archive//apache-airflow
> > > > *1.7G*  docs-archive//apache-airflow-providers-amazon
> > > > *560M*  docs-archive//apache-airflow-providers-microsoft-azure
> > > > *254M*  docs-archive//apache-airflow-providers-cncf-kubernetes
> > > > *192M*  docs-archive//apache-airflow-providers-apache-hive
> > > > *153M*  docs-archive//apache-airflow-providers-snowflake
> > > > *139M*  docs-archive//apache-airflow-providers-databricks
> > > > *104M*  docs-archive//apache-airflow-providers-docker
> > > > *101M*  docs-archive//apache-airflow-providers-mysql
> > > >
> > > > *Proposed solution: archive old docs HTML for large packages to
> > > > cloud storage*
> > > > I'm wondering if it would be reasonable to truly archive the docs
> > > > for some of the older versions of these packages. Perhaps
> > > > everything older than the last 18 months? Maybe we could drop the
> > > > HTML in a blob storage bucket, with instructions for building the
> > > > docs if absolutely necessary?
> > > >
> > > > *Improving docs building moving forward*
> > > > There's an open issue
> > > > <https://github.com/apache/airflow-site/issues/719> for migrating
> > > > the docs to a framework, but it's not at all a straightforward task
> > > > for the archived docs. I think we should institute a policy of
> > > > archiving old documentation to cloud storage after X time, and use
> > > > a framework for building docs in a scalable and sustainable way
> > > > moving forward. Maybe we could chat with the Iceberg folks about
> > > > how they moved from mkdocs to Hugo?
> > > > <https://github.com/apache/iceberg/issues/3616>
> > > >
> > > > Shoutout to Utkarsh for helping me through all this!
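The "archive anything older than 18 months" policy discussed in the thread can be expressed as a small pure function over release dates. The version-to-date mapping in the example is invented for illustration only, not real release history, and this helper is a sketch, not anything that exists in airflow-site:

```python
from datetime import date, timedelta

def versions_to_archive(release_dates: dict[str, date], today: date,
                        months: int = 18) -> set[str]:
    """Return the versions released before the cutoff.

    Approximates a month as 30 days. The newest version is always kept,
    so a package never loses its current/stable docs.
    """
    cutoff = today - timedelta(days=30 * months)
    newest = max(release_dates, key=release_dates.get)
    return {v for v, d in release_dates.items() if d < cutoff and v != newest}

# Hypothetical release dates, for illustration only:
dates = {
    "2.2.0": date(2021, 10, 11),
    "2.5.0": date(2022, 12, 2),
    "2.7.2": date(2023, 10, 12),
}
old = versions_to_archive(dates, today=date(2023, 10, 19))  # -> {"2.2.0"}
```

Parameterizing `months`, as Jarek suggested, keeps the policy adjustable without touching the build process.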