potiuk commented on issue #44373: URL: https://github.com/apache/airflow/issues/44373#issuecomment-2597193486
> [@potiuk](https://github.com/potiuk) Do you know how breeze serves the container when running `breeze build-docs`? I'm trying to use [sphinx-autobuild](https://github.com/sphinx-doc/sphinx-autobuild/tree/main) and port forward the docs page, but I can't quite figure out how.

That's quite unlikely to be easy without a fairly big redesign. Note that I am not the author of it; I merely shuffled around the scripts and code that were there from the beginning, so I do not know more than what you can learn by looking at it and reverse engineering how it works. And if someone would like to redesign this (on top of what is already planned for fixing the infrastructure pieces described here), they are absolutely welcome.

In short (you can see the sources in the links below):

1) `breeze build-docs` takes the parameters passed via the breeze CLI and converts them into build params that are then used as arguments of a shell script, `/opt/airflow/scripts/in_container/run_docs_build.sh`, that is run inside the container: https://github.com/apache/airflow/blob/main/dev/breeze/src/airflow_breeze/commands/developer_commands.py#L700 That breeze command also makes sure that the image is ready and rebuilt. But essentially it calls the shell script like this:

```python
cmd = "/opt/airflow/scripts/in_container/run_docs_build.sh " + " ".join(
    [shlex.quote(arg) for arg in doc_builder.args_doc_builder]
)
```

2) This script does some housekeeping, and cleanup/removal of stuff at completion, but essentially it calls (in the container):

```
python -m docs.build_docs "${@}"
```

passing the parameters on to the `build_docs` Python script defined in https://github.com/apache/airflow/blob/main/docs/build_docs.py
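Just to make that dispatch concrete: it is nothing more than shell-quoting the arguments and invoking the script. A simplified, hypothetical sketch of the pattern (not the actual breeze code; the flag values are made-up examples):

```python
import shlex
import subprocess

# Each argument is shell-quoted and appended to the in-container
# script invocation, so arbitrary values survive the shell boundary.
doc_builder_args = ["--package-filter", "apache-airflow-providers-google"]
cmd = "/opt/airflow/scripts/in_container/run_docs_build.sh " + " ".join(
    shlex.quote(arg) for arg in doc_builder_args
)
# In breeze this command line runs inside the Breeze container; here
# we simply hand it to a local shell for illustration.
subprocess.run(["bash", "-c", cmd], check=True)
```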
3) That `build_docs` script can even be run directly with `--help`, and it shows the parameters it can take (defined with argparse here: https://github.com/apache/airflow/blob/main/docs/build_docs.py#L465). That includes selecting the packages for which the documentation should be built. The script does quite a few more things:

* It fetches inventories from public intersphinx inventories, so that links to source code referred to from libraries can be linked automatically by sphinx.
* It also fetches "our" package inventories (prepared in the canary run and published in Airflow's Amazon S3 bucket), so that, for example, if you only build one provider and refer to another one or to airflow, sphinx can properly build intersphinx links to those "external" documents. Each provider, airflow, helm, and the "providers index" are separate "sphinx packages" linked to each other via intersphinx inventories (a generic conf.py sketch of this wiring is at the end of this comment). When you build a package locally, the inventory is regenerated and produced as part of the build, so when you build several packages locally, they can refer to each other's newly added APIs and pages. This, for example, allows us to see that some links are missing when pages or links are moved and we need to refer to them: with intersphinx we will see warnings (and error out) when such links are wrong.
* It then selects the packages to build, prioritising those that do not have inventories yet - those should be built first, so that other packages can use their inventories when they are built together.
* Then the packages are built. The build is parallelised across processors: each package is built by one of N = CPU-count worker processes. This way, building the whole documentation on a 16-core machine takes less than 10 minutes rather than the roughly 1.5h it would take if the builds ran sequentially.
* Then there is an interesting mechanism that retries building packages, so that they can link to other locally built packages. Sometimes a package being built has a new page that other packages refer to (after refactors and such); those other packages will fail until the inventories for the source package are built. As packages are built in parallel, some of them might fail in the first pass and need a 2nd or 3rd pass to succeed (depending on how many "circular" or transitive package dependencies we have). This happens up to 3 times - see here: https://github.com/apache/airflow/blob/main/docs/build_docs.py#L534 (a rough sketch of this pattern is also at the end of this comment).
* Building itself happens in the document builder class: https://github.com/apache/airflow/blob/main/docs/exts/docs_build/docs_builder.py - that class prepares all the parameters needed to run the sphinx command that builds such a package. The command is derived here (for each package separately): https://github.com/apache/airflow/blob/main/docs/exts/docs_build/docs_builder.py#L237 Essentially this:

```python
build_cmd = [
    "sphinx-build",
    "-T",  # show full traceback on exception
    "--color",  # do emit colored output
    "-b",  # builder to use
    "html",
    "-d",  # path for the cached environment and doctree files
    self._doctree_dir,
    "-c",
    DOCS_DIR,
    "-w",  # write warnings (and errors) to given file
    self.log_build_warning_filename,
    self._src_dir,
    self._build_dir,  # path to output directory
]
```

* But then, each individual package is built via pretty complex sphinx extensions - and here, I think, is where the difficulty with autobuild lies: we have a lot of custom extensions and a complex build configuration, which might make it rather difficult and cumbersome to run such an autobuild. Essentially, the extensions are configured here: https://github.com/apache/airflow/blob/main/docs/conf.py This conf.py is a complex configuration-retrieving piece that sphinx loads when it is invoked as above; it determines what exactly will be configured, which internal parameters are used, and which extensions will be used when a concrete package is built. There are lots of ifs, exclusions, config params etc. that are dynamically calculated based on which package is being built. Finally, all the extensions used by sphinx are here: https://github.com/apache/airflow/tree/main/docs/exts - there are quite a few of those: loading data from provider.yaml, example includes, intersphinx extensions, operators and hooks references, handling .txt redirects, templates that generate some summary or overview pages, and so on.

I think that's about it (though every time I look I find something new and interesting, so I likely did not cover everything).
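To make the retry mechanism above more concrete, here is a rough sketch of the "build in parallel, retry the failures" pattern. This is not Airflow's actual code; `build_one_package` and all names here are made up for illustration:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def build_one_package(package: str) -> bool:
    """Hypothetical stand-in: build the docs of one package, True on success."""
    print(f"building docs for {package}")
    return True

def build_all(packages: list[str], max_passes: int = 3) -> list[str]:
    remaining = list(packages)
    for _ in range(max_passes):
        if not remaining:
            break
        # One pass: build every still-failing package in parallel,
        # one worker process per CPU.
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            results = list(pool.map(build_one_package, remaining))
        # A package that failed may succeed in a later pass, once the
        # freshly generated inventories of its siblings exist.
        remaining = [pkg for pkg, ok in zip(remaining, results) if not ok]
    return remaining  # whatever still fails after all passes

if __name__ == "__main__":
    print("still failing:", build_all(["apache-airflow", "helm-chart"]))
```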
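And since intersphinx inventories come up several times above, here is what that wiring looks like in a generic Sphinx `conf.py`. This is plain upstream Sphinx, not Airflow's actual (and much more dynamic) configuration:

```python
# Generic Sphinx conf.py fragment - NOT Airflow's conf.py.
# Each mapping entry points at another docs tree; sphinx fetches that
# tree's objects.inv inventory and uses it to resolve cross-package
# references at build time.
extensions = ["sphinx.ext.intersphinx"]

intersphinx_mapping = {
    # "name": (base URL of the target docs, inventory location);
    # None means "fetch <base URL>/objects.inv".
    "python": ("https://docs.python.org/3/", None),
    # A locally built sibling package could instead point at its
    # freshly generated inventory on disk:
    # "sibling": ("https://example.invalid/docs/", "_build/html/objects.inv"),
}
```

With entries like these, a reference such as ``:py:class:`pathlib.Path` `` in one package resolves to a link into the other package's published docs, which is exactly how the separately built Airflow "sphinx packages" refer to each other.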