Happy to help and I hope we can work together with Valentyn and others to get the "google clients" approach improved :)
J.

On Fri, Aug 26, 2022 at 3:40 PM Kerry Donny-Clark via dev <dev@beam.apache.org> wrote:

> Jarek, I really appreciate you sharing your experience and expertise here.
> I think Beam would benefit from adopting some of these practices.
> Kerry
>
> On Fri, Aug 26, 2022, 7:35 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>>> I'm curious Jarek, does Airflow take any dependencies on popular
>>> libraries like pandas, numpy, pyarrow, scipy, etc... which users are
>>> likely to have their own dependency on? I think these dependencies are
>>> challenging in a different way than the client libraries - ideally we
>>> would support a wide version range so as not to require users to
>>> upgrade those libraries in lockstep with Beam. However in some cases
>>> our dependency is pretty tight (e.g. the DataFrame API's dependency on
>>> pandas), so we need to make sure to explicitly test with multiple
>>> different versions. Does Airflow have any similar issues?
>>
>> Yes, we do (all of those, I think :) ). The complete set of all our deps
>> can be found here:
>> https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt
>> (continuously updated; we have different sets for different Python
>> versions).
>>
>> We took a rather interesting and unusual approach (more details in my
>> talk) - mainly because Airflow is both an application to install (for
>> users) and a library to use (for DAG authors), and the two have
>> contradictory expectations (installation stability versus flexibility in
>> upgrading/downgrading dependencies). Our approach is really smart in
>> making sure water and fire play well with each other.
>>
>> Most of those dependencies come from optional extras (the list of all
>> extras is here:
>> https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html).
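The optional-extras split described above can be sketched roughly like this (a toy model, not Airflow's real packaging metadata; the extra names and version floors are invented for illustration, the real list is behind the extra-packages-ref link):

```python
# Hypothetical sketch of the "heavy deps behind optional extras" pattern:
# the core declares few hard dependencies, and libraries like pandas only
# arrive when a user explicitly opts in (pip install "pkg[pandas]").
EXTRAS_REQUIRE = {
    "pandas": ["pandas>=1.3"],
    "google": ["google-api-core>=2.7", "google-cloud-storage>=2.0"],
    "amazon": ["boto3>=1.15"],
}

def deps_for_install(extras):
    """Flatten the extra deps a 'pip install pkg[a,b]'-style request adds."""
    deps = []
    for name in extras:
        deps.extend(EXTRAS_REQUIRE.get(name, []))
    return deps

# A core install adds nothing optional; opting in pulls the heavy libraries:
print(deps_for_install([]))          # []
print(deps_for_install(["pandas"]))  # ['pandas>=1.3']
```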
>> More often than not, the "problematic" dependencies you mention are
>> transitive dependencies through some client libraries we use (for
>> example, the Apache Beam SDK is a big contributor to those :) ).
>>
>> Airflow "core" itself has far fewer dependencies:
>> https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt
>> (175 currently), and we actively made sure that all the "pandas" of this
>> world are only optional extra deps.
>>
>> Now - the interesting thing is that we use "constraints" (the links with
>> dependencies that I posted above are those constraints) to pin versions
>> of the dependencies that are "golden" - i.e. we test those continuously
>> in our CI, and we automatically upgrade the constraints when all the
>> unit and integration tests pass.
>> There is a little bit of complexity, and sometimes conflicts to handle
>> (as `pip` has to find the right set of deps that will work for all our
>> optional extras), but eventually we have just one "golden" set of
>> constraints at any moment in time in main (or a v2-x branch - we have a
>> separate set for each branch). And this is the only set of dependency
>> versions that Airflow gets tested with. Note - these are *constraints*,
>> not *requirements* - and that makes a whole world of difference.
>>
>> Then when we release Airflow, we "freeze" the constraints with the
>> version tag. We know they work because all our tests pass with them in
>> CI.
>>
>> Then we communicate to our users (and we use it in our Docker image)
>> that the only "supported" way of installing Airflow is using `pip` with
>> constraints:
>> https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html
>> We do not support poetry or pipenv - we leave it up to users to handle
>> them (until poetry/pipenv support constraints - which we are waiting
>> for, and there is an issue where I explained why it is useful).
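The constraints-vs-requirements distinction described above can be illustrated with a toy model (this is emphatically not pip's real resolver; package names and versions are invented): requirements *add* packages to install, while constraints only *pin* versions of packages requested some other way.

```python
# Toy sketch of how a constraints file behaves during "pip install":
# a constrained package that nothing requested is simply ignored, while
# a requested package is capped to its pinned version.

def resolve(requested, available, constraints):
    """Pick the newest available version of each requested package,
    capped to the constrained version when one is pinned."""
    result = {}
    for pkg in requested:
        versions = sorted(available[pkg])
        if pkg in constraints:
            versions = [v for v in versions if v == constraints[pkg]]
        result[pkg] = versions[-1]
    return result

available = {
    "pandas": ["1.3.0", "1.4.3", "1.5.0"],
    "numpy": ["1.22.0", "1.23.2", "1.24.0"],
}
# A "golden" constraints set may pin far more than you ask to install:
constraints = {"pandas": "1.4.3", "numpy": "1.23.2", "scipy": "1.9.0"}

# Constraints never install anything by themselves: scipy is pinned but
# not requested, so it does not appear in the result.
print(resolve(["pandas"], available, constraints))  # {'pandas': '1.4.3'}

# Without a pin, the newest version wins:
print(resolve(["numpy"], available, {}))            # {'numpy': '1.24.0'}
```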
>> It looks like this: `pip install "apache-airflow==2.3.4" --constraint
>> "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"`
>> (there are different constraints for each Airflow version and Python
>> version you have).
>>
>> Constraints have this nice feature that they are only used during the
>> "pip install" phase and thrown out immediately after the install is
>> complete. They do not create "hard" requirements for Airflow. Airflow
>> still has a number of "lower-bound" limits for its dependencies, but we
>> try to avoid putting upper bounds at all (only in specific cases, and
>> documenting them), and our bounds are rather relaxed. This way we
>> achieve three things:
>>
>> 1) when someone does not use constraints and has a problem with a
>> broken dependency - we tell them to use constraints - this is what we
>> as a community commit to and support
>> 2) by using the constraints mechanism we do not limit our users if they
>> want to upgrade or downgrade any dependencies. They are free to do it
>> (as long as it fits the rather relaxed lower/upper bounds of Airflow).
>> But "with great power comes great responsibility" - if they want to do
>> that, THEY have to make sure that Airflow will work. We make no
>> guarantees there.
>> 3) we are not limited by the 3rd-party libraries that come as extras -
>> if you do not use those, the limits do not apply
>>
>> I think this works really well - but it is rather complex to set up and
>> maintain. I built a whole complex set of scripts, and I have the whole
>> `breeze` ("It's a breeze to develop airflow" is the theme)
>> development/CI environment based on docker and docker-compose that
>> allows us to automate all of that.
>>
>> J.
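Point 2 above rests on the relaxed, mostly lower-bound-only version ranges: the exact pins in the constraints file matter only at install time, while at runtime only the declared floor (and the occasional documented ceiling) is checked. A minimal sketch of that bound check, with invented versions and bounds, assuming simple dotted version strings:

```python
# Illustrative sketch (not Airflow's or pip's actual code) of why
# lower-bound-only pins leave users free to upgrade: the constraints file
# pinned pandas==1.4.3 at install time, but the runtime requirement is
# only a floor like pandas>=1.3, so a later upgrade stays in bounds.

def version_tuple(v):
    """Turn '1.4.3' into (1, 4, 3) for tuple comparison."""
    return tuple(int(part) for part in v.split("."))

def satisfies(installed, lower=None, upper=None):
    """Check an installed version against a relaxed lower/upper bound."""
    iv = version_tuple(installed)
    if lower is not None and iv < version_tuple(lower):
        return False
    if upper is not None and iv >= version_tuple(upper):
        return False
    return True

# A user who upgrades past the install-time pin is still within the
# declared bounds, so it is their responsibility if something breaks:
print(satisfies("1.5.0", lower="1.3"))  # True - upgrade allowed
print(satisfies("1.2.0", lower="1.3"))  # False - below the floor
```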