Jarek, I really appreciate you sharing your experience and expertise here. I think Beam would benefit from adopting some of these practices. Kerry
On Fri, Aug 26, 2022, 7:35 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I'm curious Jarek, does Airflow take any dependencies on popular libraries like pandas, numpy, pyarrow, scipy, etc... which users are likely to have their own dependency on? I think these dependencies are challenging in a different way than the client libraries - ideally we would support a wide version range so as not to require users to upgrade those libraries in lockstep with Beam. However in some cases our dependency is pretty tight (e.g. the DataFrame API's dependency on pandas), so we need to make sure to explicitly test with multiple different versions. Does Airflow have any similar issues?
>
> Yes, we do (all of those, I think :) ). The complete set of all our deps can be found here: https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt (continuously updated, and we have different sets for different Python versions).
>
> We took a rather interesting and unusual approach (more details in my talk) - mainly because Airflow is both an application to install (for users) and a library to use (for DAG authors), and the two have contradictory expectations (installation stability versus flexibility in upgrading/downgrading dependencies). Our approach is really about making sure water and fire play well with each other.
>
> Most of those dependencies come from optional extras (list of all extras here: https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html). More often than not, the "problematic" dependencies you mention are transitive dependencies through some client libraries we use (for example, the Apache Beam SDK is a big contributor to those :) ).
>
> Airflow "core" itself has far fewer dependencies: https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt (175 currently), and we actively made sure that all the "pandas" of this world are only optional extra deps.
> Now - the interesting thing is that we use "constraints" (the links with dependencies that I posted are those constraint files) to pin versions of the dependencies that are "golden" - i.e. we test those continuously in our CI, and we automatically upgrade the constraints when all the unit and integration tests pass.
>
> There is a little bit of complexity, and sometimes conflicts to handle (as `pip` has to find the right set of deps that will work for all our optional extras), but eventually we have exactly one "golden" set of constraints at any moment in time for main (or a v2-x branch - we have a separate set for each branch) that we are dealing with. And this is the only set of dependency versions that Airflow gets tested with. Note - these are *constraints*, not *requirements* - that makes a whole world of difference.
>
> Then, when we release Airflow, we "freeze" the constraints with the version tag. We know they work because all our tests pass with them in CI.
>
> Then we communicate to our users (and we use it in our Docker image) that the only "supported" way of installing Airflow is using `pip` with constraints: https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html. We do not support poetry or pipenv - we leave it up to users to handle those (until poetry/pipenv support constraints - which we are waiting for, and there is an issue where I explained why it is useful). It looks like this: `pip install "apache-airflow==2.3.4" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"` (there are different constraints for each Airflow version and Python version you have).
>
> Constraints have this nice feature that they are only used during the "pip install" phase and thrown out immediately after the install is complete. They do not create "hard" requirements for Airflow.
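[Editorial illustration, not part of the original thread.] The constraints-vs-requirements distinction Jarek describes can be sketched like this: a package listed only in a constraints file is never installed by itself, it only pins the version *if* something else requests it. File names below are hypothetical:

```shell
# A constraints file pins versions but creates no install obligation.
cat > constraints.txt <<'EOF'
numpy==1.23.2
pandas==1.4.3
EOF

# The requirements file is what actually asks for packages.
cat > requirements.txt <<'EOF'
numpy
EOF

# With:  pip install -r requirements.txt --constraint constraints.txt
# pandas is constrained but never requested, so it is NOT installed;
# numpy is requested and would resolve to exactly 1.23.2.
# (The pip command is shown as a comment to avoid a network install here.)
echo "constraint lines: $(grep -c '==' constraints.txt)"
```

This is why throwing the constraints away after install works: nothing in the installed metadata records them, so users remain free to upgrade later.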
> Airflow still has a number of "lower-bound" limits on a number of dependencies, but we try to avoid putting upper bounds at all (only in specific cases, and we document them), and our bounds are rather relaxed. This way we achieve three things:
>
> 1) when someone does not use constraints and has a problem with a broken dependency - we tell them to use constraints - this is what we as a community commit to and support
> 2) by using the constraints mechanism we do not limit our users if they want to upgrade or downgrade any dependencies. They are free to do it (as long as it fits the - rather relaxed - lower/upper bounds of Airflow). But "with great power comes great responsibility" - if they want to do that, THEY have to make sure that Airflow will work. We make no guarantees there.
> 3) we are not limited by the 3rd-party libraries that come as extras - if you do not use those, the limits do not apply
>
> I think this works really well - but it is rather complex to set up and maintain. I built a whole complex set of scripts, and I have the whole `breeze` ("It's a breeze to develop Airflow" is the theme) development/CI environment based on docker and docker-compose that allows us to automate all of that.
>
> J.
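[Editorial illustration, not part of the original thread.] The "lower bound only" policy described above looks roughly like this as a requirements fragment (hypothetical file and illustrative pins, not Airflow's real setup files):

```shell
# Lower-bound-only specifiers: each dep states a minimum version but no
# maximum, so users may upgrade freely (at their own risk, per point 2).
cat > lower-bound-deps.txt <<'EOF'
attrs>=22.1.0
colorlog>=4.0.2
EOF

# Count the lower-bound specifiers we just wrote.
grep -c '>=' lower-bound-deps.txt
```

The payoff is exactly the split Jarek describes: the *bounds* define what could possibly work, while the *constraints* define the one combination the project actually tests and supports.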