potiuk commented on issue #44373:
URL: https://github.com/apache/airflow/issues/44373#issuecomment-2597193486

   > [@potiuk](https://github.com/potiuk) Do you know how breeze serves the container when running `breeze build-docs`? I'm trying to use [sphinx-autobuild](https://github.com/sphinx-doc/sphinx-autobuild/tree/main) and port forward the docs page, but I can't quite figure out how.
   
   That's quite unlikely to be easy without a fair amount of redesign. Note I am not the author of it; I merely shuffled around the scripts and code that were there from the beginning, so I do not know more than what I learned by looking at it and reverse engineering how it works. If someone would like to redesign this (on top of what is already planned in fixing the infrastructure pieces described here), they are absolutely welcome.
   
   In short (you can see the sources in the links below):
   
   1) `breeze build-docs` takes the parameters passed via the breeze CLI and converts them into doc build params that are then used as arguments of a shell script `/opt/airflow/scripts/in_container/run_docs_build.sh` that is run inside the container.
   
   
https://github.com/apache/airflow/blob/main/dev/breeze/src/airflow_breeze/commands/developer_commands.py#L700
   
   That breeze command also makes sure the image is ready and rebuilt if needed. But essentially it calls this shell script:
   
   ```python
       cmd = "/opt/airflow/scripts/in_container/run_docs_build.sh " + " ".join(
           [shlex.quote(arg) for arg in doc_builder.args_doc_builder]
       )
   ```
   
   2) This script does some housekeeping and cleanup at completion, but essentially it calls (inside the container):
   
   ```
   python -m docs.build_docs "${@}" 
   ```
   
   So it passes the parameters along to the `build_docs` Python script defined in https://github.com/apache/airflow/blob/main/docs/build_docs.py
   
   3) This one can even be run directly with `--help` to show the parameters it can take (defined with argparse here: https://github.com/apache/airflow/blob/main/docs/build_docs.py#L465). That includes selecting the packages for which the documentation should be built.
   
   This script does quite a few more things:
   
   * it fetches public intersphinx inventories, so that links to source code and documentation referenced from external libraries can be automatically resolved by Sphinx
   
   * it also fetches "our" package inventories (prepared in the canary run and published in Airflow's Amazon S3 bucket), so that, for example, if you only build one provider and refer to another one or to airflow, Sphinx can properly build intersphinx links to those "external" documents. Each provider, airflow, helm, and the "providers index" are separate "sphinx packages" linked to each other via intersphinx inventories. When you build a package locally, the inventory is regenerated and produced as part of the build, so when you build several packages locally they can refer to each other's newly added APIs and pages. This, for example, allows us to see that some links are missing when pages or links are moved and we need to refer to them - with intersphinx we will see warnings (and error out) when such links are wrong.
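
   As an illustration, an intersphinx mapping in a Sphinx `conf.py` looks roughly like this (a generic Sphinx sketch, not Airflow's actual configuration - the package names and URLs below are just examples). Sphinx downloads each `objects.inv` inventory and uses it to resolve cross-package references:

   ```python
   # generic intersphinx example (not Airflow's real conf.py)
   # each entry maps a name to (base docs URL, inventory location);
   # None means "fetch objects.inv from the base URL"
   extensions = ["sphinx.ext.intersphinx"]

   intersphinx_mapping = {
       "python": ("https://docs.python.org/3/", None),
       "requests": ("https://requests.readthedocs.io/en/stable/", None),
   }
   ```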
   
   * It then selects packages to build, prioritising those that do not have inventories yet - because those should be built first, so that the other packages can use their inventories when they are built together
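
   Purely as an illustration of that ordering idea (the names below are made up, and this is not the actual `build_docs.py` code):

   ```python
   # hypothetical sketch: packages whose intersphinx inventory is not available yet
   # sort first (False < True), so the remaining packages can resolve references
   # against the freshly generated inventories
   selected_packages = ["apache-airflow-providers-google", "apache-airflow", "helm-chart"]
   fetched_inventories = {"apache-airflow", "helm-chart"}

   build_order = sorted(selected_packages, key=lambda pkg: pkg in fetched_inventories)
   print(build_order)  # the provider with no inventory yet comes first
   ```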
   
   * then the packages are built - the build is parallelised to make use of multiple processors: each package is built by one of N = number-of-CPUs workers. This way, building the whole documentation on a 16-core machine takes less than 10 minutes rather than the ~1.5h it would take if the packages were built sequentially
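
   The parallelisation idea, in a very rough sketch (this is not the real code - the actual script has its own process management and error collection):

   ```python
   # hypothetical sketch: one worker per CPU, each worker builds one package at a time
   import multiprocessing

   def build_package(package_name: str) -> bool:
       # placeholder standing in for "run sphinx-build for this package"
       print(f"building docs for {package_name}")
       return True

   if __name__ == "__main__":
       packages = ["apache-airflow", "helm-chart", "docker-stack"]
       with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
           results = pool.map(build_package, packages)
   ```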
    
   * then there is an interesting mechanism that retries building packages so they can link to other locally built packages - sometimes when a package is being built it has a new page that other packages refer to (after refactors and such), and those other packages will fail until the inventory of the source package has been built. As packages are built in parallel, some packages may fail in the first pass and need a 2nd or 3rd pass in order to succeed (depending on how many "circular" or transitive package dependencies we have). This happens up to 3 times. See here: https://github.com/apache/airflow/blob/main/docs/build_docs.py#L534
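
   Conceptually the retry loop looks something like this (an illustrative sketch only, with a stubbed-out `build_package` - not the actual implementation):

   ```python
   # hypothetical sketch of the "up to 3 passes" retry loop
   def build_package(package_name: str) -> bool:
       # placeholder: pretend the build succeeds once its dependencies' inventories exist
       return True

   to_build = ["apache-airflow", "apache-airflow-providers-google", "helm-chart"]
   for attempt in range(1, 4):  # at most 3 passes
       failed = [pkg for pkg in to_build if not build_package(pkg)]
       if not failed:
           break
       # packages that failed (e.g. on missing intersphinx targets) are retried,
       # now that inventories from the successful builds are available
       to_build = failed
   ```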
    
   * Building itself happens in the document builder class: https://github.com/apache/airflow/blob/main/docs/exts/docs_build/docs_builder.py - that class prepares all the parameters needed to run the sphinx-build command for such a package. This command is derived here (for each package separately):
   
   
https://github.com/apache/airflow/blob/main/docs/exts/docs_build/docs_builder.py#L237
   
   Essentially this:
   
   ```python
           build_cmd = [
               "sphinx-build",
               "-T",  # show full traceback on exception
               "--color",  # do emit colored output
               "-b",  # builder to use
               "html",
               "-d",  # path for the cached environment and doctree files
               self._doctree_dir,
               "-c",
               DOCS_DIR,
               "-w",  # write warnings (and errors) to given file
               self.log_build_warning_filename,
               self._src_dir,
               self._build_dir,  # path to output directory
           ]
   ```
   
   
   * But then, each individual package is built via pretty complex sphinx extensions (and here, I think, is where the complexity of making autobuild work lies - we have a lot of extensions and a complex build configuration, which might make it rather difficult and cumbersome to run such an autobuild).
   
   Essentially the extensions are configured here:
   
   https://github.com/apache/airflow/blob/main/docs/conf.py
   
   And this conf.py is a complex configuration-retrieving piece that sphinx loads when it is invoked as above - it determines what exactly will be configured, which internal parameters are used and which extensions will be loaded when a concrete package is built. There are lots of ifs, exclusions, config params etc. that are dynamically calculated based on which package is built.
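
   The pattern is roughly the following (a hypothetical, heavily simplified sketch - the environment variable name and the package checks below are assumptions for illustration, not the real conf.py contents):

   ```python
   # hypothetical sketch of a conf.py that adapts itself to the package being built
   import os

   # assumption for illustration: which package is being built is passed in
   # by the build script via an environment variable
   PACKAGE_NAME = os.environ.get("AIRFLOW_PACKAGE_NAME", "apache-airflow")

   extensions = ["sphinx.ext.intersphinx", "sphinx.ext.autodoc"]
   exclude_patterns = ["_build"]

   if PACKAGE_NAME == "apache-airflow":
       # core-only extensions and pages
       extensions.append("exampleinclude")
   elif PACKAGE_NAME.startswith("apache-airflow-providers-"):
       # provider packages exclude core-only material
       exclude_patterns.append("apache-airflow/**")
   ```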
   
   Finally, all the extensions used by sphinx are here: https://github.com/apache/airflow/tree/main/docs/exts - there are quite a few of them: loading data from provider.yaml, example includes, intersphinx extensions, operators and hooks references, handling .txt redirects, templates to generate some summary or overview pages and so on.
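
   All of those follow the standard Sphinx extension pattern - roughly something like this (a generic skeleton, not one of the actual Airflow extensions):

   ```python
   # minimal generic Sphinx extension skeleton
   from sphinx.application import Sphinx

   def on_builder_inited(app: Sphinx) -> None:
       # this is where a real extension would e.g. generate summary pages
       # or load data from provider.yaml files
       pass

   def setup(app: Sphinx) -> dict:
       # sphinx calls setup() for every name listed in conf.py's `extensions`
       app.connect("builder-inited", on_builder_inited)
       return {"parallel_read_safe": True, "parallel_write_safe": True}
   ```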
   
   
   I think that's about it (though every time I look I find something new and interesting, so I likely did not cover everything).
   