Hey Ash (and everyone).

I think we are really close to some consensus, but we are coming from two
directions. I do agree there will be some of the deps that are more
important to keep "in-check" than the rest.
I think it makes perfect sense to distinguish the dependencies we really
"astrongly feel" are important for airflow vs. those that "are important to
our users".
I am perfectly ok with upper-bounding deps that we really think are
important.

But with >500 deps total for aifrflow (transitive) I think we should have.
shortlist of those we care only - and the rest should be upper-open by
default.
And "short" is a good name, because those are the dependencies that we take
responsibility of manually upgrading when needed.

I would be happy if we can come up wit the list that is really important
and "core" to airflow - and "really likely" to break things with new major
release. And those are the things we care more as an "application" that our
users care as "libraries". This is my subjective list (those are all things
taht we care :

* sqlalchemy
* alembic
* flask (and all flask libs)
* flask-app-builder
* werkzeug  (yeah. not surprisingly - including last meetup I spoke about
Airflow, my previous presented mentioned that Werkzeug comes up as
"problematic case" and I was only able to confirm that :).

+ all those that we know break things (for example we know a bunch of
google deps <2.0.0 that we know are breaking things already - but we can
easily describe those)

I think if we upper-bound those (and make appropriate comment in our
setup.py/cfg)  this would be pretty "good" setup. And having a shortlist of
those that we want to keep upper-bound makes sense and is manageable.

WDYT?

J


On Mon, Jan 24, 2022 at 10:30 PM Collin McNulty
<col...@astronomer.io.invalid> wrote:

> Ash,
>
> I think your code snippet will result in getting the latest version of
> pandas even if there is an upper constraint on pandas in apache-airflow. I
> simulated with these commands on the latest pip available.
>
> ```
> pip install apache-airflow "pandas<1.4"
> pip freeze | grep pandas
> pip install -U pandas
> pip freeze | grep pandas
> ```
>
> Collin McNulty
>
> On Mon, Jan 24, 2022 at 3:11 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> To your inital thoughts:
>> - lower bounds: yeah, introduce them when we need to, so a new package
>> could be added with no lower bound, and then a bound placed only when we
>> discover we need a minium version (include bug fix etc. PR author can
>> choose wether or not to have a lower bound initially.)
>> - not have upper bound by default: see below
>> - only introduce upper bound if breaking change: see below
>> - specify reason for upper bound. yes, 100% to this
>>
>>
>> I think you have created a strawman argument here to say that the
>> '<2.0'constraint is useless', as Panda's documentation[1] say they follow
>> SemVer so 2.0 is going to break _something_. It didn't help us with this
>> specific problem.
>>
>> As you mentioned (possibly only in slack) constraints protect against
>> initial install only, but not future in-place upgrades. Without an upper
>> constraint I could do
>>
>> ```
>> pip install --constraint $FILE apache-airflow
>> pip install -U pandas
>> ```
>>
>> And then airflow could be broken. And being able to upgrade without
>> constraints files is the only way to apply a security upgrade to a module
>> (otherwise we'd have to update the constraint files for all supported
>> versions when ever any of our transitive dependencies has a security
>> update, not something we are in a position to do).
>>
>> I think it also depends how we use the dependency in Airflow -- some of
>> them are so core that we don't want anything to break, but others (such as
>> Pandas in this case) where our use of them in Airflow is actually fairly
>> superficial.
>>
>> So I think I would just tone down your middle suggestions slightly --
>> Upper limits dependencies can be optional. I think the only place we really
>> differ is what is "foreseen".
>>
>> -ash
>>
>> [1]: https://pandas.pydata.org/docs/development/policies.html
>>
>>
>> On Sun, Jan 23 2022 at 17:35:04 +0100, Jarek Potiuk <ja...@potiuk.com>
>> wrote:
>>
>> Just to illustrate it better. Here is an extremely good example from
>> today:
>>
>> Our main build started to fail today (
>> https://github.com/apache/airflow/runs/4913148486?check_suite_focus=true)
>> because Pandas released a new 1.4.0 version last  night:
>> https://pypi.org/project/pandas/1.4.0/
>> The root cause of the failure was that Pandas 1.4.0 requires SqlAlchemy
>> 1.4.0 or above. But surprisingly -  there is no hard limit in the
>> released Pandas. What's more - Pandas does not even mention that
>> it **might** need sqlalchemy at all (and in version >=1.4.0 !).
>>
>> What happens is that in runtime, during running our tests we get this
>> error:
>> `Pandas requires version '1.4.0' or newer of 'sqlalchemy' (version
>> '1.3.24' currently installed).`
>>
>> We already had a limitation (not documented why) on Pandas
>>
>> pandas_requirement = 'pandas>=0.17.1, <2.0'
>>
>> However this `<2.0` was completely useless in this case, because it's 1.4
>> version that broke parts of Airflow.
>> We assumed that the <2.0 limitation will protect us (apparently) but it
>> did not. However, the failure in tests protected our constraints from being
>> upgraded and in our "main" pandas is still 1.3.4. Our users who follow "use
>> constraints" and those who use Airflow images, are fully protected against
>> those kinds of problems.
>>
>> This particular problem will likely be solved in a few days when Flask
>> Application Builder 3.4.4rc1 (
>> https://pypi.org/project/Flask-AppBuilder/3.4.4rc1/|) will be released
>> (It moves the upper bound limit for sqlalchemy to <1.5 and that was the
>> only thing that blocked us from getting sqlalchemy 1.4).
>>
>> My point is that simply we do not know if any future version of any
>> dependency will break Airflow. And any reasonable assumptions about that
>> guessing from "Major" version is IMHO just wild guessing. And also by
>> adding such limits - we are limiting our users to update to higher version
>> of dependencies when it is harmless (very similarly as Flask Application
>> Builder blocked us in this case from migration to sqlalchemy 1.4).
>>
>> The temporary fix (until FAB 3.4.4 is released) is here:
>>
>> https://github.com/apache/airflow/pull/21045
>>
>> I believe after FAB 3.4.4 is released we remove the fix we should just
>> get rid of the pandas limitation. Constraints of ours protect our users
>> when they install Airflow "according to our instructions with constraints"
>> - which is the only official way of installing Airflow. We are protecting
>> our users from those kind of events, but at the same time we should relax
>> pretty much all the upper bounds to not block our users in the future in
>> case they need to move to higher versions of dependencies released in the
>> future that we have no idea if they will break, or not Airflow..
>>
>> Let me know your thoughts - I think this Pandas case is great to
>> illustrate my point.
>>
>> J.
>>
>>
>>
>>
>> On Sun, Jan 23, 2022 at 1:40 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Hello everyone,
>>>
>>> TL;DR; I think there is one thing about dependencies that we should have
>>> some agreement on to make some common approach. Namely about using upper
>>> bounds on our dependencies.
>>> Should we set some rules on when we set upper-bounds on the deps ? What
>>> rules should they be?
>>>
>>> Currently we use constraints to make sure Airflow in specific version
>>> can be installed using "golden" set of dependencies. Those are part of our
>>> CI, automatically updated and tests which makes it really nice as they are
>>> "fixed" for installation of particular version but they continuously
>>> upgrade (even across major versions) when they pass tests in CI.
>>>
>>> This all happens pretty much automatically by our CI, except the cases
>>> where our dependencies are upper-limited. We occasionally do some
>>> "setup.py", "setup.cfg" changes manually to bump some upper limits, but
>>> this is more the result of some external request or "occasional" cleanup to
>>> do it - rather than a regular process.
>>>
>>> Now, we have some dependencies that are upper-bound for good reasons and
>>> they are documented. We also have a number of dependencies that are not
>>> upper-limited. But I think this pretty inconsistent. Small excerpt from
>>> setup.cfg:
>>>
>>>     # Logging is broken with itsdangerous > 2
>>>     itsdangerous>=1.1.0, <2.0
>>>     # Jinja2 3.1 will remove the 'autoescape' and 'with' extensions,
>>> which would
>>>     # break Flask 1.x, so we limit this for future compatibility. Remove
>>> this
>>>     # when bumping Flask to >=2.
>>>     jinja2>=2.10.1,<3.1
>>>     ...
>>>     # python daemon crashes with 'socket operation on non-socket' for
>>> python 3.8+ in version < 2.2.4
>>>     # https://pagure.io/python-daemon/issue/34
>>>     python-daemon>=2.2.4
>>>     python-dateutil>=2.3, <3
>>>     python-nvd3~=0.15.0
>>>
>>> * itsdangerous is upper limited and the reason is specified in the
>>> comment (though we do not know when we could remove the limit)
>>>
>>> * Jinja is upper-limited and we know not only why but also when it can
>>> be removed
>>> * Python-daemon is not upper-limited but it has a comment why it is
>>> "lower-limited" (which is pretty useless IMHO)
>>> * python-dateutil is upper-limited but we do not know why
>>> * python-nvd3 is also upper limited (~0.15.0 - limits it to any 0.15.x
>>> version but 0.16 could not be installed)
>>>
>>> I think there are many inconsistencies and the way we treat dependency
>>> limits is pretty inconsistent - we have no rules agreed.
>>>
>>> I would love to discuss how we can standardize it or at least set some
>>> rules we can follow.
>>>
>>> My initial thoughts are:
>>>
>>> * we can (but do not have to) have lower bounds if we need to
>>> protect and we know about some limitations, but we do not need to document
>>> those. Just set the limit. The nice thing about lower bounds that they do
>>> not "decay" over time. By default latest "eligible" version is installed by
>>> PIP (but of course if other deps have conflicting upper bounds, keeping
>>> lower limits when unnecessary is a problem.
>>> * by default we should not have upper-bounds. We know it limits our
>>> users and constraints + CI tests are nicely handling the scenario when
>>> things are breaking.
>>> * we should only introduce upper bounds if we know that there is
>>> breaking change (or that it is very likely and "foreseen" -
>>> betas/rc2/discussions about upcoming breaking changes in the dependencies
>>> * when we introduce upper-bounds we should always specify why and what
>>> is the condition to remove them
>>>
>>> WDYT?
>>>
>>> J.
>>>
>>>
>>>
>>>

Reply via email to