Does anyone think this one needs voting? I think it's an incremental change
(but one that gives us a stable installation mechanism for installing any
released version of Airflow).

J.


On Sun, Mar 15, 2020 at 1:17 PM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> I think I have finally found a good and simple solution that I will also
> be able to use in the production image.
>
> It boils down to adding an "apache-airflow-pinned" package in addition to
> "apache-airflow". This package is the same as apache-airflow but it has
> pinned dependencies (so it will never fail to install because of transitive
> dependencies). The original 'apache-airflow' package is untouched - it
> works as previously and will sometimes fail when some transitive
> dependency changes.
>
> With this change we also have a self-managed, standard python
> requirements.txt file (every time you update dependencies in setup.py, it
> will be regenerated via pre-commit when you rebuild the breeze image). This
> also makes IDE integration nicer (a lot of IDEs will automatically manage
> new requirements from the requirements.txt file).
>
> It is designed to be self-managing.
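>
> To give an idea of what the regeneration step behind that pre-commit hook
> could look like, here is a minimal sketch (the file name and filtering
> rules are hypothetical, not the actual script in the repo):
>
>     # update_requirements.py - hypothetical sketch of the regeneration step
>     import subprocess
>
>     # Capture the fully resolved environment (direct + transitive deps).
>     frozen = subprocess.run(
>         ["pip", "freeze"], capture_output=True, text=True, check=True
>     ).stdout.splitlines()
>
>     # Drop editable installs and airflow itself, so that only third-party
>     # dependencies end up pinned in the file.
>     pinned = [
>         line for line in frozen
>         if not line.startswith("-e ") and not line.startswith("apache-airflow")
>     ]
>
>     with open("requirements.txt", "w") as f:
>         f.write("\n".join(sorted(pinned)) + "\n")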
>
> The PR is here: https://github.com/apache/airflow/pull/7730 - looking
> forward to reviews and feedback.
>
> I also described it in CONTRIBUTING.rst so I just copy the description
> here:
>
> Pinned Airflow version and requirements.txt file
> <https://github.com/PolideaInternal/airflow/blob/AIRFLOW-7067-pinned-version-of-airflow/CONTRIBUTING.rst#id18>
>
> Airflow is not a standard python project. Most of the python projects fall
> into one of two types - application or library. As described in [Should I
> pin my python dependencies versions](
> https://stackoverflow.com/questions/28509481/should-i-pin-my-python-dependencies-versions)
> the decision whether to pin (freeze) requirements for a python project depends
> on the type. For applications, dependencies should be pinned, but for
> libraries, they should be open.
>
> For applications, pinning the dependencies makes installation more stable
> in the future - because otherwise newly released (even transitive)
> dependencies might cause installation to fail. For libraries, the
> dependencies should be open to allow several different libraries with the
> same requirements to be installed at the same time.
>
> The problem is that Apache Airflow is a bit of both - an application to
> install and a library to be used when you are developing your own operators
> and DAGs.
>
> This - seemingly unsolvable - puzzle is solved as follows:
>
>    - by default, when you install the apache-airflow package, the
>    dependencies are as open as possible while still allowing
>    apache-airflow to install. This means that the 'apache-airflow' package
>    might fail to install in case a direct or transitive dependency is
>    released that breaks the installation. In such a case, when installing
>    apache-airflow, you might need to provide additional constraints (for
>    example pip install apache-airflow==1.10.2 'Werkzeug<1.0.0')
>    - we have a requirements.txt file generated automatically based on the
>    latest set of working and tested requirement versions. You can also use
>    that file as a constraints file when installing Apache Airflow - either
>    from the sources pip install -e . --constraint requirements.txt or
>    from the pypi package pip install apache-airflow --constraint
>    requirements.txt. Note that this will also work with extras, for
>    example pip install .[gcp] --constraint requirements.txt or pip
>    install apache-airflow[gcp] --constraint requirements.txt
>    - we also release the apache-airflow-pinned package which has all
>    dependencies frozen. This is guaranteed to work and be installable no
>    matter whether new versions of direct or transitive dependencies are
>    released. If you only want to install airflow and start using it -
>    install the apache-airflow-pinned package.
>
> The requirements.txt file is maintained automatically as we update
> dependencies and verified at CI time via pre-commit check/generation. It
> reflects the current set of dependencies installed in the CI image of
> Apache Airflow. The same set of requirements will be used to produce the
> production image.
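>
> To illustrate how the apache-airflow-pinned package described above could
> reuse this file (a sketch under my assumptions only, not the actual
> setup.py from the PR), the pinned distribution could simply read the
> generated requirements.txt and use it verbatim as install_requires:
>
>     # setup.py of the pinned package - illustrative sketch only
>     from setuptools import setup
>
>     with open("requirements.txt") as f:
>         # Every line is already an exact "package==version" pin, so the
>         # resulting package cannot be broken by a new transitive release.
>         pinned_requirements = [
>             line.strip()
>             for line in f
>             if line.strip() and not line.startswith("#")
>         ]
>
>     setup(
>         name="apache-airflow-pinned",
>         install_requires=pinned_requirements,
>         # version, packages and remaining metadata omitted in this sketch
>     )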
>
> J.
>
>
> On Sat, Nov 30, 2019 at 1:33 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
>> Just to update the plan and provide a summary for those who did not read
>> my TL;DR mail (and to include probot integration), here is a shorter
>> version of my proposal:
>>
>>    1. Keep the current process of *pip install -e .[<extras>]* unchanged -
>>    taking into account the version limits from setup.py
>>    2. Add a *requirements.txt* file with a fixed set of requirements
>>    (always *==*, including transitive deps). Those will be *managed by
>>    probot*, similarly to dependabot.
>>    3. PRs in CI will use the requirements.txt file as a constraint file
>>    (known to work) when running tests to avoid transitive dependency problems
>>    4. A daily CRON job installs all deps from scratch without
>>    *requirements.txt* - to detect problems with the current set of
>>    dependencies early (see the sketch after this list). An example of that
>>    from last night [1]: the CRON job failed and I could prepare the fix [2]
>>    while contributors' PRs continued to work.
>>    5. Airflow installation for development will be guaranteed to work
>>    with *pip install -e .[<extras>] --constraint requirements.txt* from
>>    master - no matter the state of transitive dependencies
>>    6. We will modify setup.py to also have a way to take the
>>    constraints, to guarantee installation works. Something like: *pip
>>    install "apache-airflow[<extras>,constrained]==1.10.6"*
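>>
>> To illustrate the check behind point 4 (a purely hypothetical sketch - the
>> script name and filtering rules are made up, this is not an existing
>> script), the nightly job could compare the freshly resolved environment
>> against the committed requirements.txt and fail when they drift:
>>
>>     # check_requirements_drift.py - hypothetical nightly drift check
>>     import subprocess
>>     import sys
>>
>>     # What pip actually resolved in the fresh "from scratch" install.
>>     frozen = {
>>         line.split("==")[0].lower(): line.strip()
>>         for line in subprocess.run(
>>             ["pip", "freeze"], capture_output=True, text=True, check=True
>>         ).stdout.splitlines()
>>         if "==" in line
>>     }
>>
>>     # What the repository currently pins.
>>     with open("requirements.txt") as f:
>>         pinned = {
>>             line.split("==")[0].lower(): line.strip()
>>             for line in f
>>             if "==" in line
>>         }
>>
>>     drift = sorted(name for name in pinned if frozen.get(name) != pinned[name])
>>     if drift:
>>         print("requirements.txt is out of date for:", ", ".join(drift))
>>         sys.exit(1)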
>>
>> I'd love to hear more comments and ask for your opinion on this proposal.
>>
>> [1] Failed CRON job from last night - triggered likely by "from scratch"
>> dependencies: https://travis-ci.org/apache/airflow/builds/618832805
>> [2] Fix to the last night CRON job failure:
>> https://github.com/apache/airflow/pull/6691
>> [3] AIP-8 - Split Hooks/Operators to separate packages:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
>> [4] AIP-21 - Changes in import paths:
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths
>>
>> *@Dan Davydov* -> Fully agree we should have something like that. I
>> think this should be part of AIP-8 [3] when we get there. I hope we can
>> learn a lot about those when we get AIP-21 [4] fully implemented and the
>> backport packages generated and used by people as preparation for the
>> Airflow 2.0 migration (I hope it will happen soon - we are very close to
>> having the automation in place to move all the packages from contrib,
>> worked on by Kamil Breguła). Having such a (probot-managed) set of
>> requirements + a way to test it automatically is - I think - an important
>> prerequisite to start designing it (and to work out a reasonable approach
>> to cross-dependency problems).
>>
>> J.
>>
>> On Mon, Nov 18, 2019 at 6:18 PM Dan Davydov <ddavy...@twitter.com.invalid>
>> wrote:
>>
>>> My 2 cents on the long-term plan:
>>> Once Airflow has Dag Isolation (i.e. DAG python dependencies are
>>> completely
>>> decoupled from Airflow), we should pin the core Airflow deps, and package
>>> operators separately with version ranges (for the reasons Ash mentioned
>>> about libraries vs applications).
>>>
>>> On Sat, Nov 16, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com>
>>> wrote:
>>>
>>> > After some discussions and some tests by myself and Astronomer docker,
>>> > I think I finally have a complete proposal for a good, consistent
>>> > solution to our dependency management.
>>> >
>>> > It uses well-known standards, does not introduce any new tools (like
>>> > poetry) and I think it serves all the use cases we have very well.
>>> >
>>> > I wanted to summarise it here before I create it as an official AIP. I
>>> > will need to finalise the approach to complete the work on the official
>>> > production image.
>>> >
>>> > Here are some basic assumptions:
>>> >
>>> > 1) We keep the current process of setup.py which is used to install
>>> > Airflow for development: *'pip install -e .[extras]'*. It's nice and
>>> > modular. It has open dependencies (some of them constrained due to
>>> > compatibility issues). This will break from time to time for a fresh
>>> > install when some transitive dependency changes (as with yesterday's
>>> > pymssql drama). No changes whatsoever for people who continue installing
>>> > it this way. The same setup.py is used to install airflow in production
>>> > (but see point 7 below for constrained builds).
>>> >
>>> > 2) On top of setup.py, we will have a standard *requirements.txt* file
>>> > as well. This will be a snapshot of the current "working" dependencies
>>> > (including all transitive dependencies!). This file might be generated
>>> > automatically by pip freeze (with some scripting) and possibly made part
>>> > of the pre-commit checks to make sure it is self-maintainable when new
>>> > dependencies are added. With the current Breeze + pre-commit setup we
>>> > can do it fully automatically, so that it will happen behind the scenes
>>> > (and it will be verified in CI). This requirements.txt file will be a
>>> > set of "KNOWN TO BE GOOD" requirements. It will contain ONLY
>>> > 'dependency==VERSION' entries - specific versions of each requirement. A
>>> > nice side effect of the requirements.txt file is that if you use IDEs
>>> > like PyCharm or VSCode, they will automatically suggest installing
>>> > missing/outdated requirements in your virtualenv, so you do not have to
>>> > remember to do it.
>>> >
>>> > 3) By having *requirements.txt* we can add a dependabot
>>> > <https://dependabot.com/> that will upgrade dependencies in this file
>>> > automatically (it creates a PR every time a new version of one of the
>>> > dependencies is released). It works nicely with some of our other
>>> > projects - see for example Oozie-2-Airflow here
>>> > <https://github.com/GoogleCloudPlatform/oozie-to-airflow/commits/master>.
>>> > It has the benefit that it prepares super-nice, isolated commits/PRs.
>>> > Those PRs (example here
>>> > <https://github.com/GoogleCloudPlatform/oozie-to-airflow/pull/438>) have
>>> > all the information about the changes and commits. They are nice and
>>> > actionable in terms of checking whether we have breaking changes before
>>> > merging (example below). The nice thing about dependabot is that it will
>>> > create a PR and our CI will test it automatically. So committers will
>>> > just have to double-check, take a look at the release notes and will be
>>> > able to merge it straight away. I will have to check if dependabot can
>>> > be used for Apache repos though (dependabot is owned by GitHub). It's an
>>> > optional step, and we can do it manually if we cannot use dependabot
>>> > (but that requires some more complex scripting).
>>> >
>>> > 4) We will use the requirements.txt file in CI in PRs so that we know
>>> > that CI always uses a "known to be good" set of requirements. Every PR
>>> > will have to have a good set of requirements to be merged. No more
>>> > transitive dependency breakages.
>>> >
>>> > 5) We have a special CRON job run daily in CI and this job will use
>>> > '*pip install -e .[all]*' (not using the requirements.txt) - this way we
>>> > will have a mechanism to detect problems with transitive dependencies
>>> > early - without breaking PR builds.
>>> >
>>> > 6) We can use the requirements.txt file as a constraints file as well
>>> > (!). For example you should be able to run '*pip install -e .[gcp]
>>> > --constraint requirements.txt*' and it will install the "known to be
>>> > good" versions but only for the extras we want.
>>> >
>>> > 7) I believe I can customise setup.py to add an extra "*constrained*"
>>> > extra which will also use the information from the requirements.txt.
>>> > This way users will be able to do `*pip install
>>> > "apache-airflow[gcp,aws,constrained]==1.10.6"*` and have a "known to be
>>> > good" installation of Airflow - even if some transitive dependencies
>>> > break the "clean" install. I am not 100% sure if it is possible exactly
>>> > this way, but I think in the worst case it will be `*pip install
>>> > "apache-airflow[constrained,gcp-constrained,aws-constrained]==1.10.6"*` -
>>> > we can generate a "-constrained" version of each of the extras
>>> > automatically based on the requirements.txt. I already had a POC for
>>> > that. And then it should be possible to upgrade each of the dependencies
>>> > on its own.
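>>> >
>>> > A rough sketch of how such generation could work (hypothetical helper
>>> > names and placeholder extras - not the actual POC):
>>> >
>>> >     # inside setup.py - hypothetical generation of "-constrained" extras
>>> >     def load_pins(path="requirements.txt"):
>>> >         with open(path) as f:
>>> >             return {
>>> >                 line.split("==")[0].lower(): line.strip()
>>> >                 for line in f
>>> >                 if "==" in line
>>> >             }
>>> >
>>> >     PINS = load_pins()
>>> >
>>> >     def constrained(deps):
>>> >         # Swap every open requirement for its pinned counterpart,
>>> >         # falling back to the open spec when no pin is known.
>>> >         return [PINS.get(d.split(">=")[0].split("==")[0].lower(), d)
>>> >                 for d in deps]
>>> >
>>> >     # Placeholder extras - the real ones come from the existing setup.py.
>>> >     extras_require = {
>>> >         "gcp": ["google-cloud-storage>=1.16"],
>>> >         "aws": ["boto3>=1.9"],
>>> >     }
>>> >     extras_require.update({
>>> >         name + "-constrained": constrained(deps)
>>> >         for name, deps in list(extras_require.items())
>>> >     })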
>>> >
>>> > 8) Last but not least - we will use the requirements.txt file to build
>>> > the production image. One of the requirements for a good official image
>>> > <https://github.com/docker-library/official-images/blob/master/README.md#repeatability>
>>> > is repeatability - which boils down to dependency-pinning. Currently we
>>> > have no mechanism to enforce that. With requirements.txt this will be
>>> > possible.
>>> >
>>> > Let me know what you think!
>>> >
>>> > J.
>>> >
>>> >
>>> > Appendix: example dependabot-generated commit:
>>> >
>>> > Bump pylint from 2.4.3 to 2.4.4
>>> >
>>> > Bumps [pylint](https://github.com/PyCQA/pylint) from 2.4.3 to 2.4.4.
>>> > - [Release notes](https://github.com/PyCQA/pylint/releases)
>>> > - [Changelog](https://github.com/PyCQA/pylint/blob/master/ChangeLog)
>>> > - [Commits](PyCQA/pylint@pylint-2.4.3...pylint-2.4.4
>>> > <https://github.com/PyCQA/pylint/compare/pylint-2.4.3...pylint-2.4.4>)
>>> >
>>> > On Fri, Aug 2, 2019 at 8:34 AM Felix Uellendall
>>> > <felue...@pm.me.invalid> wrote:
>>> >
>>> > > I understand what Ash and Jarek are saying. I actually use airflow
>>> > > with custom plugins to have an end product that fully satisfies our
>>> > > needs, and when writing new hooks and operators I don't want to see
>>> > > "airflow uses requirement foo=1 but you have foo=2" - although
>>> > > sometimes that also just works. So why not let the developer decide
>>> > > whether they want to risk breaking it?
>>> > > I like the idea of first defining groups for core and non-core
>>> > > dependencies. I think only needing to manage the core ones so that
>>> > > they always work will be a lot easier.
>>> > > So I would suggest that for core ones we have ranges including minor
>>> > > releases, and non-core ones get pinned by the default installation
>>> > > (and CI).
>>> > >
>>> > > I think the more dependencies we add (which happens automatically as
>>> > > airflow grows), the more willing we will be to set more specific
>>> > > version ranges and maybe even pin them in the end.
>>> > >
>>> > > Felix
>>> > >
>>> > > Sent from ProtonMail mobile
>>> > >
>>> > > -------- Original Message --------
>>> > > On Aug 2, 2019, 06:10, Jarek Potiuk wrote:
>>> > >
>>> > > > Ash is totally right - that's exactly the difficulty we face.
>>> > > > Airflow is both a library and an end product, and this makes the
>>> > > > usual advice (pin if you are an end product, don't pin if you are a
>>> > > > library) not really useful. From the very beginning of my adventures
>>> > > > with Airflow I was for pinning everything (and using dependabot or
>>> > > > similar - I use it for other projects), but over time I realised that
>>> > > > this is a very short-sighted approach. It does not take into account
>>> > > > the "library" point of view. Depending on which kind of airflow user
>>> > > > you are, you have contradicting requirements. If you are a user of
>>> > > > airflow who just wants to use it as an "end product", you want
>>> > > > pinning. If you want to develop your own operators or extend existing
>>> > > > ones - you use airflow as a "library" and you do not want pinning.
>>> > > >
>>> > > > I also proposed at the beginning of that thread that we split core
>>> > > > requirements (and pin them) and non-core ones (and don't pin them).
>>> > > > But it is not easy to separate those two sets in a clear way,
>>> > > > unfortunately.
>>> > > >
>>> > > > That's why the idea of choosing at the installation time (and not at
>>> > > > build time) whether you want to install "loose" or "frozen"
>>> > > > dependencies is so appealing.
>>> > > > Possibly the best solution could be that you 'pip install airflow'
>>> > > > and you get the pinned versions, and then some other way to get the
>>> > > > loose ones. But I think we are a bit at the mercy of pip - this does
>>> > > > not seem to be possible.
>>> > > >
>>> > > > Then - it looks like using extras to add a "pinning" mechanism is
>>> > > > the next best idea.
>>> > > >
>>> > > > I am not afraid of the complexity. We can fully automate generating
>>> > > > those pinned requirements. I already have some ideas how we can make
>>> > > > sure that we keep those requirements in sync while developing and how
>>> > > > they can end up frozen in a release. I would like to run a POC on
>>> > > > that, but in short it is another "by-product" of the CI image we have
>>> > > > now. Our CI image is the perfect source of frozen requirements - we
>>> > > > know those requirements in the CI image are ok and we can use them to
>>> > > > generate the standard "requirements.txt" file and keep it updated via
>>> > > > some local update script (and pre-commit hooks) + we can verify that
>>> > > > it is updated in CI. We can then write a custom setup.py that will
>>> > > > use that requirements.txt and the existing "extras" to generate
>>> > > > "pinned" extras automatically. That sounds fully doable, with very
>>> > > > limited maintenance effort.
>>> > > >
>>> > > > J.
>>> > > >
>>> > > > On Thu, Aug 1, 2019 at 10:45 PM Qingping Hou <q...@scribd.com>
>>> > > > wrote:
>>> > > >
>>> > > >> On Thu, Aug 1, 2019 at 1:33 PM Chen Tong <cix...@gmail.com>
>>> > > >> wrote:
>>> > > >> > It is sometimes hard to distinguish whether something is a library
>>> > > >> > or an application. Take an operator as an example: a non-technical
>>> > > >> > person may think of it as a well-built application while an
>>> > > >> > engineer may consider it a library and extend its functionality.
>>> > > >> >
>>> > > >>
>>> > > >> Yeah, I agree. Personally, I would consider operators to be a
>>> > > >> library due to the expectation that other people will import them
>>> > > >> into their own projects/source tree.
>>> > > >>
>>> > > >> For things like REST endpoint handlers and perhaps the scheduler, it
>>> > > >> seems safe to assume all changes and improvements will happen within
>>> > > >> the Airflow source tree. In that case, it's safe to classify that
>>> > > >> part of the code as an application and freeze all its dependencies.
>>> > > >> The current plugin system might make this slightly complicated
>>> > > >> because people can still extend the core with custom code. But
>>> > > >> again, in an ideal world, plugins should be self-contained and
>>> > > >> communicate with the core through a well defined interface ;)
>>> > > >>
>>> > > >
>>> > > > --
>>> > > >
>>> > > > Jarek Potiuk
>>> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
>>> > > >
>>> > > > M: +48 660 796 129 <+48660796129>
>>> > > > [image: Polidea] <https://www.polidea.com/>
>>> >
>>> >
>>> >
>>> > --
>>> >
>>> > Jarek Potiuk
>>> > Polidea <https://www.polidea.com/> | Principal Software Engineer
>>> >
>>> > M: +48 660 796 129 <+48660796129>
>>> > [image: Polidea] <https://www.polidea.com/>
>>> >
>>>
>>
>>
>> --
>>
>> Jarek Potiuk
>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>
>> M: +48 660 796 129 <+48660796129>
>> [image: Polidea] <https://www.polidea.com/>
>>
>>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
