Does anyone think this one needs voting? I think it's an incremental change (but one that helps provide a stable installation mechanism for installing any released version of Airflow).
J.

On Sun, Mar 15, 2020 at 1:17 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

I think I have finally found a good and simple solution that I will also be able to use in the production image.

It boils down to adding an "apache-airflow-pinned" package in addition to "apache-airflow". This package is the same as apache-airflow but it has pinned dependencies (so it will never fail to install because of transitive dependencies). The original 'apache-airflow' package is untouched - it works as previously, and sometimes it will fail when some transitive dependencies change.

With this change we also have a self-managed, standard python requirements.txt file (every time you update setup.py with dependencies, it will be regenerated via pre-commit when you rebuild the Breeze image). This will also make it nice for IDE integration (a lot of IDEs will automatically manage new requirements from the requirements.txt file).

It is designed to be self-managing.

The PR is here: https://github.com/apache/airflow/pull/7730 - looking forward to reviews and feedback.

I also described it in CONTRIBUTING.rst, so I just copy the description here:

Pinned Airflow version and requirements.txt file
<https://github.com/PolideaInternal/airflow/blob/AIRFLOW-7067-pinned-version-of-airflow/CONTRIBUTING.rst#id18>

Airflow is not a standard python project. Most python projects fall into one of two types - application or library. As described in [Should I pin my python dependencies versions](https://stackoverflow.com/questions/28509481/should-i-pin-my-python-dependencies-versions), the decision whether to pin (freeze) requirements for a python project depends on the type. For applications, dependencies should be pinned, but for libraries, they should be open.

For applications, pinning the dependencies makes them more stable to install in the future - because new releases of (even transitive) dependencies might cause installation to fail. For libraries, the dependencies should be open, to allow several different libraries with the same requirements to be installed at the same time.

The problem is that Apache Airflow is a bit of both - an application to install and a library to be used when you are developing your own operators and DAGs.

This - seemingly unsolvable - puzzle is solved as follows:

- by default, when you install the apache-airflow package, the dependencies are as open as possible while still allowing apache-airflow to install. This means that the 'apache-airflow' package might fail to install when a direct or transitive dependency is released that breaks the installation. In such a case, when installing apache-airflow you might need to provide additional constraints (for example pip install apache-airflow==1.10.2 Werkzeug<1.0.0)
- we have a requirements.txt file generated automatically based on the set of all latest working and tested requirement versions. You can also use that file as a constraints file when installing apache-airflow - either from the sources pip install -e . --constraint requirements.txt or from the PyPI package pip install apache-airflow --constraint requirements.txt. Note that this will also work with extras, for example pip install .[gcp] --constraint requirements.txt or pip install apache-airflow[gcp] --constraint requirements.txt
- we also release the apache-airflow-pinned package, which has all dependencies frozen. This is guaranteed to work and be installable no matter what versions of the direct or transitive dependencies are released. If you want to only install airflow and start using it - install the apache-airflow-pinned package (a rough sketch of the idea follows below).
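To illustrate the idea - this is only a rough sketch with made-up helper names, not the actual code from the PR - the pinned variant could simply reuse the committed requirements.txt (which contains only frozen 'package==version' lines) as hard pins for install_requires:

    # Illustrative sketch only (hypothetical helper, not the code from the PR):
    # read the frozen "package==version" lines from requirements.txt so that a
    # pinned distribution can use them directly as install_requires.
    from pathlib import Path
    from typing import List

    def read_pinned_requirements(path: str = "requirements.txt") -> List[str]:
        """Return the frozen requirement lines, skipping comments and blanks."""
        pins = []
        for line in Path(path).read_text().splitlines():
            line = line.strip()
            if not line or line.startswith(("#", "-e ", "-r ")):
                continue
            pins.append(line)
        return pins

    # In the pinned variant of setup.py one could then pass, roughly:
    #     setup(name="apache-airflow-pinned",
    #           install_requires=read_pinned_requirements(), ...)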
The requirements.txt file is maintained automatically as we update dependencies and it is verified at CI time via a pre-commit check/generation. It reflects the current set of dependencies installed in the CI image of Apache Airflow. The same set of requirements will be used to produce the production image.

J.

On Sat, Nov 30, 2019 at 1:33 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Just to update the plan and have some summary for those who did not read my TL;DR mail (and to include probot integration). Here is a shorter summary of the proposal I have:

1. Keep the current process of *pip install -e .[<extras>]* unchanged - taking into account the limits from setup.py
2. Add a *requirements.txt* file with a fixed set of requirements (always pinned with "==" - including transitive deps). Those will be *managed by probot*, similarly to dependabot (a rough sketch of such a regeneration script follows below the list).
3. PRs in CI will use the requirements.txt file as a constraint file (known to work) when running tests, to avoid transitive dependency problems
4. A daily CRON job installs all deps from scratch, without *requirements.txt*, to detect problems with the current set of dependencies early. An example of that from last night [1]: the CRON job failed and I could prepare the fix [2] while PRs of contributors continued to work.
5. Airflow installation for development will be guaranteed to work with *pip install -e .[<extras>] --constraint requirements.txt* from master - no matter the state of transitive dependencies
6. We will modify setup.py to also have a way to take the constraints, to guarantee that installation works. Something like: *pip install apache-airflow==1.10.6[<extras>,constrained]*
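To make point 2 a bit more concrete, here is a rough sketch of the kind of regeneration script I have in mind (the helper name and exclusion list are made up - this is not the actual Breeze/pre-commit code):

    # Hypothetical sketch: regenerate requirements.txt from the currently
    # installed environment, so it always contains exact "package==version"
    # pins, including transitive dependencies.
    import subprocess
    from pathlib import Path

    # do not pin airflow itself or editable/local installs (assumed exclusions)
    EXCLUDED_PREFIXES = ("apache-airflow", "-e ")

    def regenerate_requirements(output: str = "requirements.txt") -> None:
        frozen = subprocess.run(
            ["pip", "freeze"], check=True, capture_output=True, text=True
        ).stdout.splitlines()
        pins = [line for line in frozen if not line.startswith(EXCLUDED_PREFIXES)]
        Path(output).write_text("\n".join(sorted(pins)) + "\n")

    if __name__ == "__main__":
        regenerate_requirements()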
I'd love to hear more comments and your opinions on that proposal.

[1] Failed CRON job from last night - likely triggered by "from scratch" dependencies: https://travis-ci.org/apache/airflow/builds/618832805
[2] Fix for last night's CRON job failure: https://github.com/apache/airflow/pull/6691
[3] AIP-21 - Changes in import paths: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths
[4] AIP-8 - Split Hooks/Operators to separate packages: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303

@Dan Davydov -> Fully agree we should have something like that. I think this should be part of AIP-8 [4] when we get there. I hope we can learn a lot about it when we get AIP-21 [3] fully implemented and the backport packages generated and used by people as preparation for the Airflow 2.0 migration (I hope it will happen soon - we are very close to having automation in place to move all the packages from contrib - worked on by Kamil Breguła). Having such a (probot-managed) set of requirements + a way to test it automatically is - I think - an important prerequisite to start designing it (and to work out a reasonable approach for cross-dependency problems).

J.

On Mon, Nov 18, 2019 at 6:18 PM Dan Davydov <ddavy...@twitter.com.invalid> wrote:

My 2 cents on the long-term plan: once Airflow has DAG isolation (i.e. DAG python dependencies are completely decoupled from Airflow), we should pin the core Airflow deps, and package operators separately with version ranges (for the reasons Ash mentioned about libraries vs applications).

On Sat, Nov 16, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

After some discussions and some tests by myself and Astronomer docker, I think I finally have a complete proposal for a good, consistent solution for our dependency management.

It uses well-known standards, does not introduce any new tools (like poetry), and I think it serves all the use cases we have very well.

I wanted to summarise it here before I create it as an official AIP. I will need to finalise the approach to complete the work on the official production image.

Here are some basic assumptions:

1) We keep the current process of setup.py which is used to install Airflow for development: *'pip install -e .[extras]'*. It's nice and modular. It has open dependencies (some of them constrained due to compatibilities). This will break from time to time for a fresh install when some transitive dependencies change (like with yesterday's pymssql drama). No changes whatsoever for people who continue installing it this way. The same setup.py is used to install airflow in production (but see point 7 below for constrained builds).

2) On top of setup.py, we will have a standard *requirements.txt* file as well. This will be a snapshot of the current "working" dependencies (including all transitive dependencies!). This file might be generated automatically by pip freeze (with some scripting) and possibly made part of the pre-commit checks to make sure it is self-maintaining when new dependencies are added. With the current Breeze + pre-commit setup we can do it fully automatically, so that it will happen behind the scenes (and it will be verified in CI). This requirements.txt file will be a set of "KNOWN TO BE GOOD" requirements. It will have ONLY 'dependency==VERSION' entries - specific versions of each requirement. A nice side-effect of the requirements.txt file is that if you use IDEs like PyCharm or VSCode, they will automatically suggest installing missing/outdated requirements in your virtualenv, so you do not have to remember about it.

3) By having *requirements.txt* we can add dependabot <https://dependabot.com/>, which will upgrade dependencies in this file automatically (it works in such a way that it creates a PR every time one of the dependencies is released). It works nicely with some of our other projects - see for example Oozie-2-Airflow here <https://github.com/GoogleCloudPlatform/oozie-to-airflow/commits/master>. It has the benefit that it prepares super-nice, isolated commits/PRs. Those PRs (example here <https://github.com/GoogleCloudPlatform/oozie-to-airflow/pull/438>) have all the information about changes and commits. Those are nice and actionable in terms of checking if we have breaking changes before merging (example below). The nice thing about dependabot is that it will create a PR and our CI will test it automatically.
So committers will just have to double-check, take a look at the release notes, and will be able to merge it straight away. I will have to check if dependabot can be used for Apache repos though (dependabot is owned by GitHub). It's an optional step and we can do it manually if we cannot use dependabot (but that requires some more complex scripting).

4) We will use the requirements.txt file in CI for PRs, so that we know CI always uses a "known to be good" set of requirements. Every PR will have to have a good set of requirements to be merged. No more surprises from transitive dependency changes.

5) We have a special CRON job run in CI daily, and this job will use '*pip install -e .[all]*' (not using the requirements.txt) - this way we will have a mechanism to detect problems with transitive dependencies early, without breaking PR builds.

6) We can use the requirements.txt file as a constraints file as well (!). For example, you should be able to run '*pip install -e .[gcp] --constraint requirements.txt*' and it will install the "known to be good" versions, but only for those extras we want.

7) I believe I can customise setup.py to add an extra "*constrained*" extra which will also use the information from the requirements.txt. This way the users will be able to do `*pip install apache-airflow==1.10.6[gcp,aws,constrained]*` and have a "known to be good" installation of Airflow - even if some transitive dependencies break the "clean" install. I am not 100% sure if it is possible exactly this way, but I think in the worst case it will be `*pip install apache-airflow==1.10.6[constrained,gcp-constrained,aws-constrained]*` - we can generate a "-constrained" version of each of the extras automatically based on the requirements.txt (a rough sketch follows the list). I already had a POC for that. And then it should be possible to upgrade each of the dependencies on its own.

8) Last but not least - we will use the requirements.txt file to build the production image. One of the requirements for a good official image <https://github.com/docker-library/official-images/blob/master/README.md#repeatability> is repeatability - which boils down to dependency pinning. Currently we have no mechanism to enforce that. With requirements.txt this will be possible.
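To make point 7 a bit more concrete, here is roughly the kind of POC I mean - only a sketch with made-up helper names, not the final setup.py:

    # Proof-of-concept sketch: derive an "<extra>-constrained" variant of every
    # extra by replacing each open requirement with the exact version frozen
    # in requirements.txt (falling back to the open requirement if not pinned).
    import re
    from pathlib import Path
    from typing import Dict, List

    def load_pins(path: str = "requirements.txt") -> Dict[str, str]:
        """Map lower-cased package name -> frozen 'package==version' line."""
        pins = {}
        for line in Path(path).read_text().splitlines():
            line = line.strip()
            if "==" in line and not line.startswith("#"):
                name = line.split("==", 1)[0].strip().lower()
                pins[name] = line
        return pins

    def constrained_extras(extras: Dict[str, List[str]],
                           pins: Dict[str, str]) -> Dict[str, List[str]]:
        """Add an '<extra>-constrained' entry for every extra, with pinned versions."""
        result = dict(extras)
        for extra, requirements in extras.items():
            pinned = []
            for req in requirements:
                # strip any version specifier / extras / marker to get the bare name
                name = re.split(r"[\s<>=!;\[]", req, maxsplit=1)[0].lower()
                pinned.append(pins.get(name, req))
            result[extra + "-constrained"] = pinned
        return result

    # In setup.py one could then pass, roughly:
    #     setup(..., extras_require=constrained_extras(EXTRAS_REQUIREMENTS, load_pins()))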
Let me know what you think!

J.

Appendix: example dependabot-generated commit:

Bump pylint from 2.4.3 to 2.4.4

Bumps [pylint](https://github.com/PyCQA/pylint) from 2.4.3 to 2.4.4.
- [Release notes](https://github.com/PyCQA/pylint/releases)
- [Changelog](https://github.com/PyCQA/pylint/blob/master/ChangeLog)
- [Commits](https://github.com/PyCQA/pylint/compare/pylint-2.4.3...pylint-2.4.4)

On Fri, Aug 2, 2019 at 8:34 AM Felix Uellendall <felue...@pm.me.invalid> wrote:

I understand what Ash and Jarek are saying. I actually use airflow with custom plugins to have an end product that fully satisfies our needs, and when writing new hooks and operators I don't want to see "airflow uses requirement foo=1 but you have foo=2" - but actually that sometimes just works anyway. So why not let the developer decide whether they want to risk breaking it?

I like the idea of first defining groups for core and non-core dependencies. I think only needing to keep the core ones always working will be a lot easier. So I would suggest that core dependencies get ranges including minor releases, and non-core ones get pinned by the default installation (and CI).

I think the more dependencies we add (which happens automatically as airflow grows), the more we will be willing to set more specific version ranges and maybe even pin them in the end.

Felix

Sent from ProtonMail mobile

-------- Original Message --------
On Aug 2, 2019, 06:10, Jarek Potiuk wrote:

Ash is totally right - that's exactly the difficulty we face. Airflow is both a library and an end product, and this makes the usual advice (pin if you are an end product, don't pin if you are a library) not really useful. From the very beginning of my adventures with Airflow I was for pinning everything (and using dependabot or similar - I use it for other projects), but over time I realised that this is a very short-sighted approach: it does not take into account the "library" point of view. Depending on which kind of Airflow user you are, you have contradicting requirements. If you are a user of airflow who just wants to use it as an "end product", you want pinning. If you want to develop your own operators or extend existing ones - you use airflow as a "library" and you do not want pinning.

I also proposed at the beginning of this thread that we split core requirements (and pin them) and non-core ones (and don't pin them). But unfortunately it isn't easy to separate those two sets in a clear way.

That's why the idea of choosing at installation time (and not at build time) whether you want to install "loose" or "frozen" dependencies is so appealing. Possibly the best solution would be that you 'pip install airflow' and you get the pinned versions, and then there is some other way to get the loose ones. But I think we are a bit at the mercy of pip here - this does not seem to be possible.

Then - it looks like using extras to add a "pinning" mechanism is the next best idea.

I am not afraid of the complexity. We can fully automate generating those pinned requirements. I already have some ideas how we can make sure that we keep those requirements in sync while developing and how they can end up frozen in a release. I would like to run a POC on that, but in short it is another "by-product" of the CI image we have now. Our CI image is the perfect source of frozen requirements - we know those requirements in the CI image are ok and we can use them to generate the standard "requirements.txt" file and keep it updated via some local update script (and pre-commit hooks) + we can verify that they are updated in CI. We can then write a custom setup.py that will use that requirements.txt and the existing "extras" and generate "pinned" extras automatically. That sounds fully doable and with very limited maintenance effort.
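As for verifying that the file is updated in CI - a minimal sketch of what such a check could look like, assuming a pip-freeze-based regeneration like the one sketched earlier in this thread (the names and exclusions are hypothetical):

    # Hypothetical check: fail if the committed requirements.txt no longer
    # matches what a fresh "pip freeze" of the CI image would produce.
    import subprocess
    import sys
    from pathlib import Path

    def current_pins() -> str:
        frozen = subprocess.run(
            ["pip", "freeze"], check=True, capture_output=True, text=True
        ).stdout.splitlines()
        pins = [line for line in frozen if not line.startswith(("apache-airflow", "-e "))]
        return "\n".join(sorted(pins)) + "\n"

    if __name__ == "__main__":
        committed = Path("requirements.txt").read_text()
        if committed != current_pins():
            print("requirements.txt is out of date - please regenerate it.")
            sys.exit(1)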
Our CI image is >>> the >>> > > > perfect source of frozen requirements - we know those requirements >>> in >>> > CI >>> > > > image are ok and we can use them to generate the standard >>> > > > "requirements.txt" file and keep it updated via some local update >>> > script >>> > > > (and pre-commit hooks) + we can verify that they are updated in >>> the CI. >>> > > We >>> > > > can then write custom setup.py that will use that requirements.txt >>> and >>> > > the >>> > > > existing "extras" and generate "pinned" extras automatically. That >>> > sounds >>> > > > like fully doable and with very limited maintenance effort. >>> > > > >>> > > > J. >>> > > > >>> > > > On Thu, Aug 1, [2019](tel:2019) at 10:45 PM Qingping Hou < >>> > q...@scribd.com> >>> > > wrote: >>> > > > >>> > > >> On Thu, Aug 1, [2019](tel:2019) at 1:33 PM Chen Tong < >>> > cix...@gmail.com> >>> > > wrote: >>> > > >> > It is sometimes hard to distinguish if it is a library or an >>> > > application. >>> > > >> > Take operator as an example, a non-tech people may think it as a >>> > > >> well-built >>> > > >> > application while an engineer may consider it as a library and >>> > extends >>> > > >> > functionalities on it. >>> > > >> > >>> > > >> >>> > > >> Yeah, I agree. Personally, I would consider operator to be a >>> library >>> > > >> due to the expectation that other people will import them in >>> their own >>> > > >> projects/source tree. >>> > > >> >>> > > >> For things like REST endpoint handlers and perhaps scheduler, it >>> seems >>> > > >> safe to assume all changes and improvements will happen within >>> Airflow >>> > > >> source tree. In that case, it's safe to classify that part of >>> code as >>> > > >> application and freeze all its dependencies. The current plugin >>> system >>> > > >> might make this slightly complicated because people can still >>> extend >>> > > >> the core with custom code. But again, in an ideal world, plugins >>> > > >> should be self-contained and communicating with core through a >>> well >>> > > >> defined interface ;) >>> > > >> >>> > > > >>> > > > -- >>> > > > >>> > > > Jarek Potiuk >>> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer >>> > > > >>> > > > M: [+48 660 796 129](tel:+48660796129) >>> > <[+48660796129](tel:+48660796129)> >>> > > > [image: Polidea] <https://www.polidea.com/> >>> > >>> > >>> > >>> > -- >>> > >>> > Jarek Potiuk >>> > Polidea <https://www.polidea.com/> | Principal Software Engineer >>> > >>> > M: +48 660 796 129 <+48660796129> >>> > [image: Polidea] <https://www.polidea.com/> >>> > >>> >> >> >> -- >> >> Jarek Potiuk >> Polidea <https://www.polidea.com/> | Principal Software Engineer >> >> M: +48 660 796 129 <+48660796129> >> [image: Polidea] <https://www.polidea.com/> >> >> > > -- > > Jarek Potiuk > Polidea <https://www.polidea.com/> | Principal Software Engineer > > M: +48 660 796 129 <+48660796129> > [image: Polidea] <https://www.polidea.com/> > > -- Jarek Potiuk Polidea <https://www.polidea.com/> | Principal Software Engineer M: +48 660 796 129 <+48660796129> [image: Polidea] <https://www.polidea.com/>