Jarek, I really appreciate you sharing your experience and expertise here.
I think Beam would benefit from adopting some of these practices.
Kerry

On Fri, Aug 26, 2022, 7:35 AM Jarek Potiuk <ja...@potiuk.com> wrote:

>
>> I'm curious Jarek, does Airflow take any dependencies on popular
>> libraries like pandas, numpy, pyarrow, scipy, etc... which users are likely
>> to have their own dependency on? I think these dependencies are challenging
>> in a different way than the client libraries - ideally we would support a
>> wide version range so as not to require users to upgrade those libraries in
>> lockstep with Beam. However in some cases our dependency is pretty tight
>> (e.g. the DataFrame API's dependency on pandas), so we need to make sure to
>> explicitly test with multiple different versions. Does Airflow have any
>> similar issues?
>>
>
> Yes, we do (all of those, I think :) ). The complete set of all our deps
> can be found here:
> https://github.com/apache/airflow/blob/constraints-main/constraints-3.9.txt
> (continuously updated, and we have different sets for different Python
> versions).
>
> We took a rather interesting and unusual approach (more details in my
> talk) - mainly because Airflow is both an application to install (for
> users) and a library to use (for DAG authors), and the two have
> contradictory expectations (installation stability versus flexibility in
> upgrading/downgrading dependencies). Our approach goes a long way towards
> making fire and water play well with each other.
>
> Most of those dependencies come from optional extras (list of all
> extras here:
> https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html).
> More often than not, the "problematic" dependencies you mention are
> transitive dependencies pulled in through some client libraries we use
> (for example the Apache Beam SDK is a big contributor to those :).
>
> Airflow "core" itself has far fewer dependencies
> https://github.com/apache/airflow/blob/constraints-main/constraints-no-providers-3.9.txt
> (175 currently) and we actively made sure that all the "pandas" of this
> world are only optional extra deps.
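>
> As a rough illustration of that split (the extra names are real Airflow
> extras, but the exact command shapes and versions are just a sketch, not
> taken from the docs):
>
>     # "core" Airflow only - the much smaller dependency set mentioned above
>     pip install "apache-airflow==2.3.4"
>
>     # selecting extras is what pulls in the client libraries (and their
>     # transitive dependencies) on top of the core set
>     pip install "apache-airflow[google,apache.beam]==2.3.4"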
>
> Now - the interesting thing is that we use "constraints" (the links with
> dependencies that I posted above are those constraints) to pin the
> "golden" versions of the dependencies - i.e. we test those continuously
> in our CI and we automatically upgrade the constraints when all the unit
> and integration tests pass.
> There is a little bit of complexity and sometimes conflicts to handle (as
> `pip` has to find the right set of deps that will work for all our optional
> extras), but eventually we have exactly one "golden" set of constraints at
> any moment in time for main (or for a v2-* branch - we have a separate set
> for each branch). And this is the only set of dependency versions that
> Airflow gets tested with. Note - these are *constraints*, not
> *requirements* - that makes a whole world of difference.
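>
> To make that distinction concrete: a constraints file is just a flat list
> of pins (the versions below are made up for illustration - see the real
> files linked above):
>
>     # constraints-3.9.txt (excerpt, illustrative versions)
>     pandas==1.3.5
>     numpy==1.22.4
>     pyarrow==6.0.1
>
> Unlike a requirements file, it never installs anything by itself - it
> only decides which version gets picked for a package *if* the resolver
> installs that package at all.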
>
> Then when we release airflow, we "freeze" the constraints with the version
> tag. We know they work because all our tests pass with them in CI.
>
> Then we communicate to our users (and we use it in our Docker image) that
> the only "supported" way of installing airflow is using `pip` with
> constraints:
> https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html.
> We do not support poetry or pipenv - we leave it up to users to handle
> those (until poetry/pipenv support constraints, which we are waiting for -
> there is an issue where I explained why it would be useful). It looks like
> this: `pip install "apache-airflow==2.3.4" --constraint "
> https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"`
> (there are different constraints for each airflow version and Python
> version you have).
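>
> Concretely, the constraints URL encodes both of those versions - the
> pattern looks roughly like this (placeholders, not the exact docs wording):
>
>     # https://raw.githubusercontent.com/apache/airflow/constraints-<AIRFLOW_VERSION>/constraints-<PYTHON_VERSION>.txt
>     pip install "apache-airflow==2.3.4" --constraint \
>       "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.7.txt"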
>
> Constraints have this nice feature that they are only used during the "pip
> install" phase and are thrown away immediately after the install is
> complete. They do not create "hard" requirements for airflow. Airflow
> still has "lower-bound" limits for a number of dependencies, but we try to
> avoid putting upper-bounds at all (only in specific, documented cases) and
> our bounds are rather relaxed. This way we achieve three things:
>
> 1) when someone does not use constraints and has a problem with a broken
> dependency - we tell them to use constraints - this is what we as a
> community commit to and support
> 2) by using the constraints mechanism we do not limit our users if they
> want to upgrade or downgrade any dependencies. They are free to do it (as
> long as it fits the rather relaxed lower/upper bounds of Airflow). But
> "with great power comes great responsibility" - if they want to do that,
> THEY have to make sure that airflow will work. We make no guarantees there
> (see the sketch after this list).
> 3) we are not limited by the 3rd-party libraries that come as extras - if
> you do not use those extras, their limits do not apply
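>
> A tiny sketch of point 2 (the package and versions are arbitrary examples,
> not a recommendation):
>
>     # install the supported way, resolved against the golden constraints
>     pip install "apache-airflow==2.3.4" --constraint \
>       "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.4/constraints-3.9.txt"
>
>     # later, move a single dependency past (or below) the pinned version;
>     # users are free to do this as long as it fits Airflow's own relaxed
>     # bounds, but from then on verifying that airflow still works is on them
>     pip install --upgrade "pandas>=1.4"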
>
> I think this works really well - but it is rather complex to set up and
> maintain - I built a whole set of scripts and the whole `breeze` ("It's a
> breeze to develop Airflow" is the theme) development/CI environment, based
> on docker and docker-compose, that allows us to automate all of that.
>
> J.
>
