Just to update the plan and provide a summary for those who did not read my TL;DR mail (and to include the probot integration). Here is a shorter summary of my proposal:
1. Keep the current process of *pip install -e .[<extras>]* unchanged - taking into account the limits from setup.py.
2. Add a *requirements.txt* file with a fixed set of requirements (always pinned with *==*, including transitive deps). Those will be *managed by probot*, similarly to dependabot.
3. PRs in CI will use the requirements.txt file as a constraint file (known to work) when running tests, to avoid transitive-dependency problems.
4. A daily CRON job installs all deps from scratch, without *requirements.txt*, to detect problems with the current set of dependencies early. An example from last night [1]: the CRON job failed and I could prepare the fix [2] while contributors' PRs continued to work.
5. Airflow installation for development will be guaranteed to work with *pip install -e .[<extras>] --constraint requirements.txt* from master - no matter the state of transitive dependencies.
6. We will modify setup.py to also have a way to take the constraints, to guarantee that installation works. Something like: *pip install 'apache-airflow[<extras>,constrained]==1.10.6'* (a rough sketch of one way to implement this follows below).
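To make point 6 a bit more concrete, here is a minimal sketch of how setup.py could expose such a "constrained" extra. This is only an illustration of the idea, not the actual implementation - the helper name and the example "gcp" entry are placeholders:

    # Sketch only - NOT the real Airflow setup.py; the "gcp" extra below
    # is a made-up placeholder.
    from pathlib import Path
    from setuptools import setup

    def pinned_requirements():
        # Read "name==version" pins from the checked-in requirements.txt,
        # skipping blank lines and comments.
        lines = Path("requirements.txt").read_text().splitlines()
        return [line.strip() for line in lines
                if line.strip() and not line.strip().startswith("#")]

    extras = {
        "gcp": ["google-api-python-client>=1.6"],  # open-ended, as today
    }
    # The "constrained" extra layers every known-good pin on top of
    # whatever open-ended extras the user selected.
    extras["constrained"] = pinned_requirements()

    setup(
        name="apache-airflow",
        version="1.10.6",
        extras_require=extras,
    )

The nice property is that the pins live in one file (requirements.txt) and setup.py merely reads them, so nothing needs to be maintained twice.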
I'd love to hear more comments, and to ask for your opinion on this proposal.

[1] Failed CRON job from last night - likely triggered by "from scratch" dependencies: https://travis-ci.org/apache/airflow/builds/618832805
[2] Fix for last night's CRON job failure: https://github.com/apache/airflow/pull/6691
[3] AIP-8 - Split Hooks/Operators to separate packages: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
[4] AIP-21 - Changes in import paths: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths

*@Dan Davydov* -> Fully agree we should have something like that. I think this should be part of AIP-8 [3] when we get there. I hope we will learn a lot about those cross-dependencies once AIP-21 [4] is fully implemented and the backport packages are generated and used as preparation for the Airflow 2.0 migration (I hope it will happen soon - we are very close to having the automation in place to move all the packages from contrib, worked on by Kamil Breguła). Having such a (probot-managed) set of requirements, plus a way to test it automatically, is - I think - an important prerequisite for starting to design that (and for working out a reasonable approach to the cross-dependency problems).

J.

On Mon, Nov 18, 2019 at 6:18 PM Dan Davydov <[email protected]> wrote:

> My 2 cents on the long-term plan: once Airflow has DAG isolation (i.e. DAG Python dependencies are completely decoupled from Airflow), we should pin the core Airflow deps and package operators separately with version ranges (for the reasons Ash mentioned about libraries vs applications).
>
> On Sat, Nov 16, 2019 at 6:22 AM Jarek Potiuk <[email protected]> wrote:
>
> > After some discussions, and some tests by myself and Astronomer's Docker folks, I think I finally have a complete proposal for a good, consistent solution for our dependency management.
> >
> > It uses well-known standards, does not introduce any new tools (like poetry), and I think it serves all the use cases we have very well.
> >
> > I wanted to summarise it here before I create it as an official AIP. I will need to finalise the approach to complete the work on the official production image.
> >
> > Here are some basic assumptions:
> >
> > 1) We keep the current setup.py process, which is used to install Airflow for development: *'pip install -e .[extras]'*. It's nice and modular. It has open dependencies (some of them constrained due to compatibilities). This will break from time to time for fresh installs when some transitive dependencies change (or like yesterday's pymssql drama). There are no changes whatsoever for people who continue installing it this way. The same setup.py is used to install Airflow in production (but see point 7 below for constrained builds).
> >
> > 2) On top of setup.py, we will have a standard *requirements.txt* file as well. This will be a snapshot of the current "working" dependencies (including all transitive dependencies!). This file can be generated automatically by pip freeze, with some scripting (a rough sketch of such a script follows after point 6), and possibly made part of the pre-commit checks so that it maintains itself when new dependencies are added. With the current Breeze + pre-commit setup we can do it fully automatically, behind the scenes (and verify it in CI). This requirements.txt file will be a set of "KNOWN TO BE GOOD" requirements, containing ONLY 'dependency==VERSION' entries - a specific version of each requirement. A nice side effect of the requirements.txt file is that IDEs like PyCharm and VSCode will automatically suggest installing missing/outdated requirements in your virtualenv, so you do not have to remember to do it.
> >
> > 3) With *requirements.txt* in place we can add dependabot <https://dependabot.com/>, which will upgrade the dependencies in this file automatically (it creates a PR every time one of the dependencies is released). It works nicely in some of our other projects - see for example Oozie-2-Airflow here <https://github.com/GoogleCloudPlatform/oozie-to-airflow/commits/master>. It has the benefit of preparing super-nice, isolated commits/PRs. Those PRs (example here <https://github.com/GoogleCloudPlatform/oozie-to-airflow/pull/438>) have all the information about the changes and commits, which makes them actionable when checking for breaking changes before merging (example below). The nice thing about dependabot is that it creates a PR and our CI tests it automatically, so committers just have to double-check, take a look at the release notes, and merge straight away. I will have to check whether dependabot can be used for Apache repos, though (dependabot is owned by GitHub). It's an optional step, and we can do the upgrades manually if we cannot use dependabot (but that requires some more complex scripting).
> >
> > 4) We will use the requirements.txt file in CI for PRs, so that CI always uses a "known to be good" set of requirements. Every PR will have to have a good set of requirements to be merged. No more transitive-dependency breakage.
> >
> > 5) We will have a special CRON job run daily in CI which uses '*pip install -e .[all]*' (not using requirements.txt) - this gives us a mechanism to detect problems with transitive dependencies early, without breaking PR builds.
> >
> > 6) We can use the requirements.txt file as a constraints file as well (!). For example, you should be able to run '*pip install -e .[gcp] --constraint requirements.txt*' and it will install the "known to be good" versions, but only for the extras you want.
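> > As an illustration of the "scripting" mentioned in point 2, here is a minimal sketch of what the regeneration script could look like. The file name and the decision to drop editable installs are assumptions - the real script would be wired into Breeze/pre-commit:
> >
> >     # update_requirements.py - illustrative sketch only.
> >     import subprocess
> >
> >     def main():
> >         # Capture the full set of installed (transitive) dependencies.
> >         frozen = subprocess.run(
> >             ["pip", "freeze"], capture_output=True, text=True, check=True
> >         ).stdout.splitlines()
> >         # Drop editable installs (e.g. airflow itself installed with -e).
> >         pins = [line for line in frozen if line and not line.startswith("-e ")]
> >         with open("requirements.txt", "w") as file:
> >             file.write("\n".join(sorted(pins, key=str.lower)) + "\n")
> >
> >     if __name__ == "__main__":
> >         main()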
> > 7) I believe I can customise setup.py to add an extra "*constrained*" extra which will also use the information from requirements.txt. This way users will be able to do `*pip install apache-airflow[gcp,aws,constrained]==1.10.6*` and get a "known to be good" installation of Airflow - even if some transitive dependencies break the "clean" install. I am not 100% sure it is possible exactly this way, but I think in the worst case it will be `*pip install apache-airflow[constrained,gcp-constrained,aws-constrained]==1.10.6*` - we can generate a "-constrained" version of each of the extras automatically, based on requirements.txt (see the sketch right after this point). I already had a POC for that. It should then be possible to upgrade each of the dependencies on its own.
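> > For that worst case, here is a rough sketch of how the "-constrained" variants could be generated automatically. This is a self-contained toy example - the extras shown, the pinned version, and the deliberately naive name parsing are all made up:
> >
> >     # Toy sketch of auto-generating "<extra>-constrained" variants.
> >     import re
> >
> >     # Known-good pins, as they would be parsed from requirements.txt
> >     # (the version here is a placeholder).
> >     PINS = {"google-api-python-client": "google-api-python-client==1.7.11"}
> >
> >     def requirement_name(req):
> >         # Naive: take everything before the first version operator.
> >         return re.split(r"[<>=!~\[;]", req, maxsplit=1)[0].strip().lower()
> >
> >     def constrained(deps):
> >         # Swap each open-ended requirement for its known-good pin, if any.
> >         return [PINS.get(requirement_name(dep), dep) for dep in deps]
> >
> >     extras = {"gcp": ["google-api-python-client>=1.6"]}
> >     for name, deps in list(extras.items()):
> >         extras[name + "-constrained"] = constrained(deps)
> >
> >     print(extras["gcp-constrained"])  # ['google-api-python-client==1.7.11']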
> > 8) Last but not least - we will use the requirements.txt file to build the production image. One of the requirements for a good official image <https://github.com/docker-library/official-images/blob/master/README.md#repeatability> is repeatability - which boils down to dependency pinning. Currently we have no mechanism to enforce that. With requirements.txt it becomes possible.
> >
> > Let me know what you think!
> >
> > J.
> >
> > Appendix: example dependabot-generated commit:
> >
> > Bump pylint from 2.4.3 to 2.4.4
> >
> > Bumps [pylint](https://github.com/PyCQA/pylint) from 2.4.3 to 2.4.4.
> > - [Release notes](https://github.com/PyCQA/pylint/releases)
> > - [Changelog](https://github.com/PyCQA/pylint/blob/master/ChangeLog)
> > - [Commits](PyCQA/[email protected] <https://github.com/PyCQA/pylint/compare/pylint-2.4.3...pylint-2.4.4>)
> >
> > On Fri, Aug 2, 2019 at 8:34 AM Felix Uellendall <[email protected]> wrote:
> >
> > > I understand what Ash and Jarek are saying. I actually use Airflow with custom plugins to get an end product that fully satisfies our needs, and when writing new hooks and operators I don't want to see "airflow uses requirement foo=1 but you have foo=2" - especially since that sometimes just works anyway. So why not let developers decide whether they want to risk breaking it?
> > > I like the idea of first defining groups for core and non-core dependencies. Only having to keep the core ones always working will be a lot easier. So I would suggest version ranges (including minor releases) for the core ones, and pinning the non-core ones in the default installation (and CI).
> > >
> > > I think the more dependencies we add (which happens automatically as Airflow grows), the more willing we will be to narrow the version ranges, and maybe even pin them in the end.
> > >
> > > Felix
> > >
> > > Sent from ProtonMail mobile
> > >
> > > -------- Original Message --------
> > > On Aug 2, 2019, 06:10, Jarek Potiuk wrote:
> > >
> > > > Ash is totally right - that's exactly the difficulty we face. Airflow is both a library and an end product, and this makes the usual advice (pin if you are an end product, don't pin if you are a library) not really useful. From the very beginning of my adventures with Airflow I was for pinning everything (and using dependabot or similar - I use it for other projects), but over time I realised that this is a very short-sighted approach: it does not take the "library" point of view into account. Depending on which kind of Airflow user you are, you have contradictory requirements. If you just want to use Airflow as an "end product", you want pinning. If you want to develop your own operators or extend existing ones, you use Airflow as a "library" and you do not want pinning.
> > > >
> > > > I also proposed at the beginning of this thread that we split core requirements (and pin them) from non-core ones (and don't pin those). But unfortunately it ain't easy to separate those two sets cleanly.
> > > >
> > > > That's why the idea of choosing at installation time (not at build time) whether you want to install "loose" or "frozen" dependencies is so appealing. Possibly the best solution would be that 'pip install airflow' gives you the pinned versions and some other mechanism gives you the loose ones, but I think we are a bit at the mercy of pip here - this does not seem to be possible.
> > > >
> > > > Then it looks like using extras to add a "pinning" mechanism is the next best idea.
> > > >
> > > > I am not afraid of the complexity. We can fully automate generating those pinned requirements. I already have some ideas for how to keep those requirements in sync while developing and how they can end up frozen in a release. I would like to run a POC on that, but in short it is another "by-product" of the CI image we have now. Our CI image is the perfect source of frozen requirements - we know the requirements in the CI image are OK, and we can use them to generate the standard "requirements.txt" file, keep it updated via a local update script (and pre-commit hooks), and verify in CI that it is up to date (a minimal sketch of such a check follows below). We can then write a custom setup.py that uses that requirements.txt together with the existing "extras" to generate "pinned" extras automatically. That sounds fully doable, with very limited maintenance effort.
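> > > > For the verification part, a minimal sketch of the kind of check the pre-commit hook (or CI) could run. The file name and the rule "everything must be an exact '==' pin, comments and blank lines aside" are assumptions:
> > > >
> > > >     # check_pinned.py - sketch of a pre-commit-style verification.
> > > >     import sys
> > > >
> > > >     def main(path="requirements.txt"):
> > > >         bad = []
> > > >         for line in open(path):
> > > >             line = line.strip()
> > > >             if not line or line.startswith("#"):
> > > >                 continue  # comments and blank lines are fine
> > > >             if "==" not in line:
> > > >                 bad.append(line)  # open-ended spec - not repeatable
> > > >         if bad:
> > > >             print("Unpinned requirements:", ", ".join(bad))
> > > >             sys.exit(1)
> > > >
> > > >     if __name__ == "__main__":
> > > >         main()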
> > > > J.
> > > >
> > > > On Thu, Aug 1, 2019 at 10:45 PM Qingping Hou <[email protected]> wrote:
> > > >
> > >> On Thu, Aug 1, 2019 at 1:33 PM Chen Tong <[email protected]> wrote:
> > >> > It is sometimes hard to distinguish whether something is a library or an application. Take an operator as an example: a non-technical person may think of it as a well-built application, while an engineer may consider it a library and extend its functionality.
> > >>
> > >> Yeah, I agree. Personally, I would consider operators to be a library, given the expectation that other people will import them in their own projects/source trees.
> > >>
> > >> For things like REST endpoint handlers and perhaps the scheduler, it seems safe to assume all changes and improvements will happen within the Airflow source tree. In that case, it's safe to classify that part of the code as an application and freeze all its dependencies. The current plugin system might make this slightly more complicated, because people can still extend the core with custom code. But again, in an ideal world, plugins should be self-contained and communicate with the core through a well-defined interface ;)
> > > >
> > > > --
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > M: +48 660 796 129
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > M: +48 660 796 129

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129
