After some discussions and some tests by myself and Astronomer Docker, I
think I finally have a complete proposal for a good, consistent solution
to our dependency management.

It uses well-known standards, does not introduce any new tools (like
Poetry), and I think it serves all the use cases we have very well.

I wanted to summarise it here before I create it as an official AIP. I will
need to finalise the approach to complete the work on the official
production image.

Here are some basic assumptions:

1) We keep the current setup.py-based process used to install Airflow
for development: *'pip install -e .[extras]'*. It's nice and modular. It has
open-ended dependencies (some of them constrained due to compatibility
issues). This will break from time to time on fresh installs when some
transitive dependencies change (as with yesterday's pymssql drama). No
changes whatsoever for people who continue installing it this way. The same
setup.py is used to install Airflow in production (but see point 7 below
for constrained builds).

2) On top of setup.py, we will have a standard *requirements.txt* file as
well. This will be a snapshot of the current "working" dependencies (including
all transitive dependencies!). This file can be generated automatically by
pip freeze (with some scripting) and possibly hooked into the pre-commit
checks to make sure it stays self-maintained when new dependencies are
added. With the current Breeze + pre-commit setup we can do this fully
automatically, so it will happen behind the scenes (and it will be
verified in CI). This requirements.txt file will be a set of "KNOWN TO BE
GOOD" requirements. It will contain ONLY 'dependency==VERSION' entries -
a specific version of each requirement. A nice side effect of the
requirements.txt file is that IDEs like PyCharm and VSCode will
automatically suggest installing missing/outdated requirements in your
virtualenv, so you do not have to remember to do it.
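To illustrate the generation part, here is a minimal sketch of what such a
script could look like (the file name and function are made up for this
example - this is not code that exists in the repo yet):

    # Hypothetical helper that regenerates requirements.txt from the packages
    # installed in the current virtualenv / CI image. Every package (including
    # transitive ones) gets pinned to the exact installed version.
    import subprocess
    from pathlib import Path

    def generate_requirements(output_file: str = "requirements.txt") -> None:
        # 'pip freeze' lists every installed package as 'name==version'
        frozen = subprocess.run(
            ["pip", "freeze"], check=True, capture_output=True, text=True
        ).stdout
        lines = [
            line for line in frozen.splitlines()
            # skip editable installs such as '-e .' (Airflow itself)
            if not line.startswith("-e ")
        ]
        Path(output_file).write_text("\n".join(sorted(lines)) + "\n")

    if __name__ == "__main__":
        generate_requirements()

A pre-commit hook could then run such a script and fail if the regenerated
file differs from the committed one - that is what would keep it
self-maintained.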

3) By having *requirements.txt*, we can add Dependabot
<https://dependabot.com/> to upgrade dependencies in this file
automatically (it works by creating a PR every time a new version of one of
the dependencies is released). It works nicely in some of our other
projects - see for example Oozie-2-Airflow here
<https://github.com/GoogleCloudPlatform/oozie-to-airflow/commits/master>.
It has the benefit that it prepares super-nice, isolated commits/PRs. Those
PRs (example here
<https://github.com/GoogleCloudPlatform/oozie-to-airflow/pull/438>) have
all the information about the changes and commits, and they are nice and
actionable in terms of checking for breaking changes before merging
(see the appendix below). The nice thing about Dependabot is that it creates
a PR and our CI tests it automatically, so committers only have to
double-check, take a look at the release notes, and merge it
straight away. I will have to check whether Dependabot can be used for Apache
repos though (Dependabot is owned by GitHub). It's an optional step, and if
we cannot use Dependabot we can do the same manually, but that requires some
more complex scripting (see the sketch below).
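As a rough idea of what that manual fallback could look like (a hypothetical
script written only for illustration - the function names are made up, and a
real version would open one isolated PR per bump, like Dependabot does):

    # List outdated packages and bump their pins in requirements.txt.
    import json
    import subprocess
    from pathlib import Path

    def outdated_packages() -> list:
        # pip's own view of which installed packages have newer releases
        output = subprocess.run(
            ["pip", "list", "--outdated", "--format", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        return json.loads(output)

    def bump_pin(package: str, new_version: str,
                 requirements_file: str = "requirements.txt") -> None:
        # rewrite the pinned version of a single package in requirements.txt
        path = Path(requirements_file)
        lines = []
        for line in path.read_text().splitlines():
            name = line.partition("==")[0].strip().lower()
            lines.append(f"{package}=={new_version}"
                         if name == package.lower() else line)
        path.write_text("\n".join(lines) + "\n")

    if __name__ == "__main__":
        for pkg in outdated_packages():
            print(f"{pkg['name']}: {pkg['version']} -> {pkg['latest_version']}")
            bump_pin(pkg["name"], pkg["latest_version"])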

4) We will use the requirements.txt file in CI for PRs, so that we know
CI always uses a "known to be good" set of requirements. Every PR will have
to have a good set of requirements to be merged. No more surprise transitive
dependency changes.

5) We will have a special cron job run daily in CI, and this job will use '*pip
install -e .[all]*' (not the requirements.txt) - this way we will
have a mechanism to detect problems with transitive dependencies early,
without breaking PR builds.

6) We can use the requirements.txt file as a constraints file as well (!).
For example, you should be able to run '*pip install -e .[gcp]
--constraint requirements.txt*' and it will install the "known to be good"
versions, but only for those extras we want.

7) I believe I can customise setup.py to add an extra "*constrained*" extra
which will also use the information from requirements.txt. This way users
will be able to do `*pip install
"apache-airflow[gcp,aws,constrained]==1.10.6"*` and get a "known to be good"
installation of Airflow - even if some transitive dependencies break the
"clean" install. I am not 100% sure it is possible exactly this way, but
I think in the worst case it will be `*pip install
"apache-airflow[constrained,gcp-constrained,aws-constrained]==1.10.6"*` - we
can generate a "-constrained" version of each of the extras automatically
based on requirements.txt. I already had a POC for that. And then it
should still be possible to upgrade each of the dependencies on its own.
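Here is a minimal sketch of how the "-constrained" extras could be generated
in setup.py (the extras and version ranges below are made up for illustration,
the helper names are hypothetical, and the real setup.py is of course much
bigger):

    from pathlib import Path
    from setuptools import setup

    # the regular, open-ended extras we already have (only two shown here)
    EXTRAS_REQUIREMENTS = {
        "gcp": ["google-cloud-storage>=1.16.0", "google-cloud-bigquery>=1.15.0"],
        "aws": ["boto3>=1.7.0"],
    }

    def pinned_versions() -> dict:
        # map lower-cased package name -> 'package==version' from requirements.txt
        pins = {}
        for line in Path("requirements.txt").read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, version = line.partition("==")
            pins[name.lower()] = f"{name}=={version}"
        return pins

    def constrained(requirements: list) -> list:
        # replace open-ended requirements with the pinned versions where known
        pins = pinned_versions()
        result = []
        for req in requirements:
            # crude extraction of the bare package name - good enough for a sketch
            name = req.split(">")[0].split("<")[0].split("=")[0].strip().lower()
            result.append(pins.get(name, req))
        return result

    extras = dict(EXTRAS_REQUIREMENTS)
    for extra, requirements in EXTRAS_REQUIREMENTS.items():
        extras[extra + "-constrained"] = constrained(requirements)

    setup(
        name="apache-airflow",
        extras_require=extras,
        # ... all the other existing setup() arguments stay as they are ...
    )

Installing with `pip install -e ".[gcp-constrained]"` would then pull the
pinned versions for that extra, while the plain "gcp" extra keeps the open
ranges.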

8) Last but not least - we will use the requirements.txt file to build the
production image. One of the requirements for a good official image
<https://github.com/docker-library/official-images/blob/master/README.md#repeatability>
is repeatability - which boils down to dependency pinning. Currently we
have no mechanism to enforce that; with requirements.txt it will be
possible.

Let me know what you think!

J.


Appendix: example dependabot-generated commit:

Bump pylint from 2.4.3 to 2.4.4

Bumps [pylint](https://github.com/PyCQA/pylint) from 2.4.3 to 2.4.4.
- [Release notes](https://github.com/PyCQA/pylint/releases)
- [Changelog](https://github.com/PyCQA/pylint/blob/master/ChangeLog)
- [Commits](https://github.com/PyCQA/pylint/compare/pylint-2.4.3...pylint-2.4.4)






On Fri, Aug 2, 2019 at 8:34 AM Felix Uellendall <felue...@pm.me.invalid>
wrote:

> I understand what Ash and Jarek are saying. I actually use Airflow with
> custom plugins to build an end product that fully satisfies our needs, and when
> writing new hooks and operators I don't want to see "airflow uses
> requirement foo=1 but you have foo=2" - but actually that sometimes also
> just works. So why not let the developer decide if he wants to risk breaking
> it?
> I like the idea of first defining groups for core and non-core
> dependencies. I think only needing to keep the core ones always working will
> make things a lot easier.
> So I would suggest that the core ones have ranges including minor releases
> and the non-core ones get pinned by the default installation (and CI).
>
> I think the more dependencies we add (which happens automatically as Airflow
> grows), the more willing we will be to set more specific version ranges
> and maybe even pin them in the end.
>
> Felix
>
> Sent from ProtonMail mobile
>
> -------- Original Message --------
> On Aug 2, 2019, 06:10, Jarek Potiuk wrote:
>
> > Ash is totally right - that's exactly the difficulty we face. Airflow is
> > both a library and an end product, and this makes the usual advice (pin if
> > you are an end product, don't pin if you are a library) not really useful.
> > From the very beginning of my adventures with Airflow I was for pinning
> > everything (and using Dependabot or similar - I use it for other projects),
> > but over time I realised that this is a very short-sighted approach. It does
> > not take into account the "library" point of view. Depending on which kind
> > of Airflow user you are, you have contradicting requirements. If you are a
> > user of Airflow who just wants to use it as an "end product", you want
> > pinning. If you want to develop your own operators or extend existing ones,
> > you use Airflow as a "library" and you do not want pinning.
> >
> > I also proposed at the beginning of this thread that we split core
> > requirements (and pin them) and non-core ones (and don't pin them). But
> > unfortunately it ain't easy to separate those two sets in a clear way.
> >
> > That's why the idea of choosing at installation time (and not at build
> > time) whether you want to install "loose" or "frozen" dependencies is so
> > appealing.
> > Possibly the best solution would be that 'pip install airflow' gives you
> > the pinned versions and there is some other way to get the loose ones. But
> > I think we are a bit at the mercy of pip - this does not seem to be
> > possible.
> >
> > Then - it looks like using extras to add a "pinning" mechanism is the
> > next best idea.
> >
> > I am not afraid of the complexity. We can fully automate generating those
> > pinned requirements. I already have some ideas about how we can make sure
> > that we keep those requirements in sync while developing and how they can
> > end up frozen in a release. I would like to run a POC on that, but in short
> > it is another "by-product" of the CI image we have now. Our CI image is the
> > perfect source of frozen requirements - we know the requirements in the CI
> > image are OK, and we can use them to generate the standard
> > "requirements.txt" file and keep it updated via a local update script
> > (and pre-commit hooks), plus we can verify that it is updated in the CI. We
> > can then write a custom setup.py that will use that requirements.txt and
> > the existing "extras" to generate "pinned" extras automatically. That
> > sounds fully doable, with very limited maintenance effort.
> >
> > J.
> >
> > On Thu, Aug 1, 2019 at 10:45 PM Qingping Hou <q...@scribd.com> wrote:
> >
> >> On Thu, Aug 1, 2019 at 1:33 PM Chen Tong <cix...@gmail.com> wrote:
> >> > It is sometimes hard to distinguish whether something is a library or
> >> > an application. Take an operator as an example: a non-technical person
> >> > may think of it as a well-built application, while an engineer may
> >> > consider it a library and extend functionality on top of it.
> >> >
> >>
> >> Yeah, I agree. Personally, I would consider operators to be a library
> >> due to the expectation that other people will import them in their own
> >> projects/source tree.
> >>
> >> For things like REST endpoint handlers and perhaps the scheduler, it
> >> seems safe to assume all changes and improvements will happen within the
> >> Airflow source tree. In that case, it's safe to classify that part of the
> >> code as an application and freeze all its dependencies. The current plugin
> >> system might make this slightly more complicated because people can still
> >> extend the core with custom code. But again, in an ideal world, plugins
> >> should be self-contained and communicate with the core through a
> >> well-defined interface ;)
> >>
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>



-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
