Pretty hard pass from me on airflow_ext. If it's released by Airflow I want it to live under airflow.* (anyone else is free to release packages under any namespace they choose).
That said, I think I've got something that works:

    /Users/ash/.virtualenvs/test-providers/lib/python3.7/site-packages/notairflow/__init__.py module level code running
    /Users/ash/.virtualenvs/test-providers/lib/python3.7/site-packages/notairflow/providers/gcp/__init__.py module level code running

Let me test it again in a few different cases etc.

-a

On 4 November 2019 14:00:24 GMT, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Hey Ash,

Thanks for the offer. I must admit pkgutil and package namespaces are not the best documented part of Python. I dug a bit deeper and found a similar problem: https://github.com/pypa/setuptools/issues/895. It seems that even if it is not explicitly explained in the pkgutil documentation, this comment (assuming it is right) explains everything: *"That's right. All parents of a namespace package must also be namespace packages, as they will necessarily share that parent name space (farm and farm.deps in this example)."*

There are a few possibilities mentioned in the issue for how this can be worked around, but those are far from perfect solutions. They would require patching the already-installed airflow's __init__.py to manipulate the search path. Still, from my tests I do not know whether this would be possible at all, because of the non-trivial __init__.py we have (and use) in the *airflow* package.

We have a few PRs now waiting for a decision on that one, I think, so maybe we can simply agree that we should use another package (I really like *"airflow_ext"* :D) and use it from now on? What do you (and others) think? I'd love to start voting on it soon.

J.

On Thu, Oct 31, 2019 at 5:37 PM Ash Berlin-Taylor <a...@apache.org> wrote:

Let me run some tests too - I've used them a bit in the past. I thought since we only want to make airflow.providers a namespace package it might work for us. Will report back next week.

-ash

On 31 October 2019 15:58:22 GMT, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

The same repo (so a mono-repo approach). All packages would be in the "airflow_integrations" directory. It's mainly about moving the operator/hook/sensor files to a different directory structure. It might be done pretty much without changing the current installation/development model:

1) We can add a setup.py command to install all the packages in -e mode in the main setup.py (to make it easier to install all deps in one go).
2) We can add dependencies in setup.py extras to install the appropriate packages. For example the [google] extra will require the 'apache-airflow-integrations-providers-google' package - or 'apache-airflow-providers-google' if we decide to drop -integrations from the package name to make it shorter. (A sketch of this wiring follows below.)

The only potential drawback I see is a slightly more involved IDE setup. This way the installation method for both dev and prod remains simple. In the future we can have a separate release schedule for the packages (AIP-8), but for now we can stick to the same version for the 'apache-airflow' and 'apache-airflow-integrations-*' packages (+ a separate release schedule for backporting needs).
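A minimal sketch of how that extras wiring could look in the main setup.py - the package names and the extras idea come from the points above; everything else (the exact setup() contents) is an illustrative assumption, not a decision:

    # main setup.py (sketch; only the extras wiring is shown)
    from setuptools import setup

    setup(
        name="apache-airflow",
        # ... core packages, install_requires, etc. ...
        extras_require={
            # "pip install apache-airflow[google]" would then also pull in
            # the separately released provider distribution:
            "google": ["apache-airflow-integrations-providers-google"],
        },
    )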
Here again is the structure of the repo (we will likely be able to use native namespaces, so I removed some needless __init__.py files):

    |-- airflow
    |   |-- __init__.py
    |   |-- operators            -> fundamental operators are here
    |-- tests                    -> tests for core airflow are here (optionally we can move them under "airflow")
    |-- setup.py                 -> setup.py for the "apache-airflow" package
    |-- airflow_integrations
        |-- providers
        |   |-- google
        |       |-- setup.py     -> setup.py for the "apache-airflow-integrations-providers-google" package
        |       |-- airflow_integrations
        |       |   |-- providers
        |       |       |-- google
        |       |           |-- __init__.py
        |       |-- tests        -> tests for the "apache-airflow-integrations-providers-google" package
        |           |-- __init__.py
        |-- protocols
            |-- setup.py         -> setup.py for the "apache-airflow-integrations-protocols" package
            |-- airflow_integrations
            |   |-- protocols
            |       |-- __init__.py
            |-- tests            -> tests for the "apache-airflow-integrations-protocols" package

J.

On Thu, Oct 31, 2019 at 3:38 PM Kaxil Naik <kaxiln...@gmail.com> wrote:

So create another package in a different repo? Or the same repo with a separate setup.py file that has airflow as a dependency?

On Thu, Oct 31, 2019 at 2:32 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

TL;DR: I did some more testing on how namespaces work. I still believe the only way to use namespaces is to have a separate (for example "airflow_integrations") package for all backportable packages.

I am not sure if someone has used namespaces before, but after reading and trying them out, the main blocker seems to be that we have non-trivial code in airflow's "__init__.py" (including class definitions, imported sub-packages and plugin initialisation). Details are in https://packaging.python.org/guides/packaging-namespace-packages/ but it's a long one, so let me summarize my findings:

- In order to use the "airflow.providers" package we would have to declare "airflow" as a namespace.
- It can be done in three different ways:
  - omitting __init__.py in this package (native/implicit namespace),
  - making the __init__.py of the "airflow" package in main airflow (and all other packages) be "*__path__ = __import__('pkgutil').extend_path(__path__, __name__)*" (pkgutil style),
  - or "*__import__('pkg_resources').declare_namespace(__name__)*" (pkg_resources style).

The first is not possible (we already have an __init__.py in "airflow"). The other two are not possible because we already have quite a lot in airflow's "__init__.py", and both the pkgutil and pkg_resources styles state: "*Every* distribution that uses the namespace package must include an identical *__init__.py*. If any distribution does not, it will cause the namespace logic to fail and the other sub-packages will not be importable. *Any additional code in __init__.py will be inaccessible.*"

I even tried adding those pkgutil/pkg_resources lines to airflow and did some experimenting with it - but it does not work. Pip install fails at the plugins_manager as "airflow.plugins" is not accessible (kind of expected), but I am sure there would be other problems as well. :(

Basically - we cannot turn "airflow" into a namespace because it has "__init__.py" logic :(. So I think it still holds that if we want to use namespaces, we should use another package. *"airflow_integrations"* is the current candidate, but we can think of a nicer/shorter one: "airflow_ext", "airflow_int", "airflow_x", "airflow_mod", "airflow_next", "airflow_xt", "airflow_", "ext_airflow", .... Interestingly, "airflow_" is the one suggested by PEP 8 to avoid conflicts with Python names (which is a different case, but kind of close). What do you think?

J.
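For concreteness, the two "identical __init__.py" styles quoted above are each a single line - every distribution sharing the "airflow" namespace would need exactly that line (and nothing else) as its airflow/__init__.py, which is precisely what our non-trivial __init__.py rules out:

    # pkgutil style: airflow/__init__.py in *every* distribution
    __path__ = __import__('pkgutil').extend_path(__path__, __name__)

    # pkg_resources style: alternatively, airflow/__init__.py would be
    __import__('pkg_resources').declare_namespace(__name__)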
On Tue, Oct 29, 2019 at 4:51 PM Kaxil Naik <kaxiln...@gmail.com> wrote:

The namespace feature looks promising and from your tests it looks like it would work well from Airflow 2.0 onwards. I will look at it in depth and see if I have more suggestions or opinions on it.

On Tue, Oct 29, 2019 at 3:32 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

TL;DR: We did some testing on namespaces and packaging (and potential backporting options for 1.10.* Python 3 Airflows) and we think it's best to move to namespaces quickly and use a different package name, "airflow_integrations", for all non-fundamental integrations. Unless we missed some tricks, we cannot use airflow.* sub-packages for the 1.10.* backportable packages.

Example:

- the "*apache-airflow*" package provides "airflow.*" (this is what we have today)
- "*apache-airflow-providers-google*" provides the "airflow.providers.google.*" packages

If we install both packages (old apache-airflow 1.10.6 and new apache-airflow-providers-google from 2.0), it seems that the "airflow.providers.google.*" package cannot be imported. This is a bit of a problem if we would like to backport the operators from Airflow 2.0 to Airflow 1.10 in a way that is forward-compatible. We really want users who started using backported operators in 1.10.* to not have to change imports in their DAGs to run them in Airflow 2.0.

We discussed it internally in our team and considered several options, but we think the best way will be to go straight to "namespaces" in Airflow 2.0 and to have the integrations (as discussed in the AIP-21 discussion) in a separate "*airflow_integrations*" package. It moves even further towards the AIP-8 implementation and plays together very well in terms of the "stewardship" now being discussed in AIP-21. But we will still keep (for now) a single release process for all packages for 2.0 (except for the backporting, which can be done per-provider before the 2.0 release) and provide a foundation for more complex release cycles in future versions.

Here is how the new Airflow 2.0 repository could look (I only show a subset of dirs, but they are representative). For those whose email fixed/color font gets corrupted, here is an image of this structure: https://pasteboard.co/IEesTih.png

    |-- airflow
    |   |-- __init__.py
    |   |-- operators            -> fundamental operators are here
    |-- tests                    -> tests for core airflow are here (optionally we can move them under "airflow")
    |-- setup.py                 -> setup.py for the "apache-airflow" package
    |-- airflow_integrations
        |-- providers
        |   |-- google
        |       |-- setup.py     -> setup.py for the "apache-airflow-integrations-providers-google" package
        |       |-- airflow_integrations
        |       |   |-- __init__.py
        |       |   |-- providers
        |       |       |-- __init__.py
        |       |       |-- google
        |       |           |-- __init__.py
        |       |-- tests        -> tests for the "apache-airflow-integrations-providers-google" package
        |           |-- __init__.py
        |-- protocols
            |-- setup.py         -> setup.py for the "apache-airflow-integrations-protocols" package
            |-- airflow_integrations
            |   |-- protocols
            |       |-- __init__.py
            |-- tests            -> tests for the "apache-airflow-integrations-protocols" package
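Under this layout, each per-provider setup.py could be quite small. A minimal sketch (the version, the pin and the find_namespace_packages call are illustrative assumptions, not decisions):

    # airflow_integrations/providers/google/setup.py (sketch)
    from setuptools import setup, find_namespace_packages

    setup(
        name="apache-airflow-integrations-providers-google",
        version="0.0.1",  # assumption: versioning scheme not decided yet
        # picks up airflow_integrations.providers.google.* and keeps working
        # if the shared parent dirs later drop their __init__.py (native
        # namespace packages):
        packages=find_namespace_packages(include=["airflow_integrations*"]),
        install_requires=["apache-airflow>=1.10.6"],  # assumption: pin TBD
    )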
There are a number of pros for this solution:

- We could use the standard namespaces feature of Python to build multiple packages: https://packaging.python.org/guides/packaging-namespace-packages/
- Installation for users will be the same as previously. We could install the needed packages automatically when particular extras are used (pip install apache-airflow[google] could install both "apache-airflow" and "apache-airflow-integrations-providers-google").
- We could have a custom setup.py installation process for developers that installs all the packages in development ("-e ." mode) in a single operation.
- In the case of transfer packages we could have nice error messages informing users that the other package needs to be installed. For example the S3->GCS operator would import "airflow_integrations.providers.amazon.*" and if that fails it could raise "Please install the [amazon] extra to use me." (see the sketch below).
- We could implement numerous optimisations in the way we run tests in CI (for example, run all the "providers" tests only with sqlite, run tests in parallel, etc.).
- We could implement it gradually - we do not have to take a "big bang" approach - we can implement it provider-by-provider and test it with one provider (Google) first to make sure that all the mechanisms are working.
- For now we could have the monorepo approach where all the packages are developed in concert - avoiding the dependency problems for now (but allowing for back-portability to 1.10).
- We will have clear boundaries between packages and the ability to test for unwanted/hidden dependencies between packages.
- We could switch to the (much better) sphinx-apidoc package to continue building a single documentation set for all of those (sphinx-apidoc has support for namespaces).

As we are working on the GCP move from contrib to core, we could make the effort to test and try all of this before we merge it to master, so that it will be ready for others (and we could help with most of the moves afterwards). It seems complex, but in fact in most cases it will be a very simple move between packages, and it can be done incrementally, so there is little risk in doing this I think.

J.
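A minimal sketch of the import-guard idea from the transfer-packages point above (the module path, hook name and message wording are illustrative assumptions):

    # sketch: inside the S3->GCS transfer operator's module
    try:
        from airflow_integrations.providers.amazon.aws.hooks.s3 import S3Hook  # noqa: F401
    except ImportError:
        raise ImportError(
            "The S3->GCS operator requires the Amazon provider package. "
            "Please install the [amazon] extra to use me: "
            "pip install 'apache-airflow[amazon]'"
        )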
On Mon, Oct 28, 2019 at 11:45 PM Kevin Yang <yrql...@gmail.com> wrote:

Tomasz and Ash have good points about the overhead of having separate repos. But as we grow bigger and more mature, I would prefer to have what was described in AIP-8. It shouldn't be extremely hard for us to come up with good strategies to handle the overhead, and AIP-8 already talked about how it can benefit us. IMO, on a high level, having a clear separation of core vs. hooks/operators would make the project much more scalable, and the gains would outweigh the cost we pay. That being said, I'm supportive of this moving-towards-AIP-8-while-learning approach; quite a good practice for tackling a big project. Looking forward to reading the AIP.

Cheers,
Kevin Y

On Mon, Oct 28, 2019 at 6:21 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

We are checking how we can use namespaces in a back-portable way and we will have a POC soon, so that we will all be able to see what it looks like.

J.

On Mon, Oct 28, 2019 at 1:24 PM Ash Berlin-Taylor <a...@apache.org> wrote:

I'll have to read your proposal in detail (sorry, no time right now!), but I'm broadly in favour of this approach, and I think keeping them _in_ the same repo is the best plan -- that makes writing and testing cross-cutting changes easier.

-a

On 28 Oct 2019, at 12:14, Tomasz Urbaszek <tomasz.urbas...@polidea.com> wrote:

I think utilizing namespaces should reduce a lot of the problems raised by using separate repos (who will manage it? how do we release? where should the repo be?).

Bests,
Tomek

On Sun, Oct 27, 2019 at 11:54 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Thanks Bas for the comments! Let me share my thoughts below.

On Sun, Oct 27, 2019 at 9:23 AM Bas Harenslak <basharens...@godatadriven.com> wrote:

Hi Jarek,

I definitely see a future in creating separate installable packages for various operators/hooks/etc. (as in AIP-8). This would IMO strip the "core" Airflow down to only what's needed and result in a small package without a ton of dependencies (and make it more maintainable, with shorter tests, etc. etc.). Not exactly sure though what you're proposing in your e-mail - is it a new AIP for an intermediate step towards AIP-8?

It's a new AIP I am proposing. For now it's only for backporting the new 2.0 import paths to the 1.10.* series. It's more "incrementally going in the direction of AIP-8 and learning the difficulties involved" than implementing AIP-8 fully. We are taking advantage of the changes in import paths from AIP-21, which make it possible to have both old and new (optional) operators available in the 1.10.* series of Airflow. I think there is a lot more to do for a full implementation of AIP-8: decisions on how to maintain and install those operator groups separately, a stewardship model/organisation for the separate groups, how to manage cross-dependencies, procedures for releasing the packages, etc. I think of this new AIP also as a learning effort - we would learn more about how separate packaging works and then we could follow up with the full AIP-8 implementation for a "modular" Airflow. AIP-8 could then be implemented in Airflow 2.1 for example - or 3.0 if we start following semantic versioning - based on those learnings. It's a bit of a "have your cake and eat it too" situation: we can try out modularity in 1.10.* while cutting the scope of 2.0 and not yet implementing the full management/release procedure for AIP-8.

Thinking about this, I think there are still a few grey areas (which would be good to discuss in a new AIP, or continue on AIP-8):

* In your email you speak only about the 3 big cloud providers (btw, I made a PR for migrating all AWS components -> https://github.com/apache/airflow/pull/6439). Is there a plan for splitting components other than Google/AWS/Azure?

We could indeed add more groups as part of this new AIP (as an extension of AIP-21 and a prerequisite of AIP-8). We already see how moving/deprecation works for the providers package - it works rather nicely for GCP/Google - and there is nothing to prevent us from extending it to cover other groups of operators/hooks. If you look at the current structure of the documentation done by Kamil, we can follow the structure there and move the operators/hooks accordingly (https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html): Fundamentals; ASF: Apache Software Foundation; Azure: Microsoft Azure; AWS: Amazon Web Services; GCP: Google Cloud Platform; Service integrations; Software integrations; Protocol integrations. I am happy to include that in the AIP - if others agree it's a good idea. Of those groups, I think only Fundamentals should not be back-ported. The others should be rather easy to port (if we decide to). We already have quite a lot of those in the new GCP operators for 2.0, so starting with the GCP/Google group is a good idea. Following with the cloud providers first is also a good thing.
For example, we now have support from the Google Composer team to do this separation for GCP (and we learn from it), and then we can claim stewardship in our team for releasing the Python 3 / Airflow 1.10-compatible "airflow-google" packages. Possibly other cloud providers/teams might follow (if they see the value in it), and there could be different stewards for those. And then we can do other groups if we decide to. I think this way we can learn whether AIP-8 is manageable and what real problems we are going to face.

* Each "plugin", e.g. GCP, would be a separate repo - should we create some sort of blueprint for such packages?

I think we do not need separate repos (at all), but in this new AIP we can test that before we decide to go for AIP-8. IMHO the monorepo approach will work here rather nicely. We could use Python 3 native namespaces (https://packaging.python.org/guides/packaging-namespace-packages/) for the sub-packages when we go full AIP-8. For now we could simply package the new operators in a separate pip package for the Python 3 version of the 1.10.* series only. We only need to test whether it works well as another package providing 'airflow.providers.*' after apache-airflow is installed (providing the 'airflow' package). But I think we can make it work. I don't think we really need to split the repos; namespaces will work just fine and allow easier management of cross-repository dependencies (though we may learn otherwise). For sure we will not need it for the newly proposed AIP of backporting groups to 1.10, and we can defer that decision to AIP-8 implementation time.

* In which Airflow version do we start raising deprecation warnings, and in which version would we remove the originals?

I think we should do what we did in the GCP case already. The old "imports" for operators can be marked as deprecated in Airflow 2.0 (and removed in 2.1, or 3.0 if we start following semantic versioning). We can, however, do it earlier in 1.10.7 or 1.10.8 if we release those (without removing the old operators yet - just raise deprecation warnings and inform users that for Python 3 the new "airflow-google", "airflow-aws" etc. packages can be installed and they can switch to them).

J.
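A minimal sketch of what such a deprecation shim at an old import path could look like (the module and class names below are illustrative assumptions, not the actual paths):

    # sketch: old-path module, e.g. airflow/contrib/operators/s3_to_gcs_operator.py
    import warnings

    warnings.warn(
        "This module is deprecated. Please use the new provider package "
        "import path instead.",
        DeprecationWarning,
        stacklevel=2,
    )

    # re-export the new implementation under the old name
    from airflow_integrations.providers.google.operators.s3_to_gcs import (  # noqa: E402,F401
        S3ToGCSOperator,
    )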
Cheers,
Bas

On 27 Oct 2019, at 08:33, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Hello - any comments on that? I am happy to make it into an AIP :)?

On Sun, Oct 13, 2019 at 5:53 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

*Motivation*

I think we really should start thinking about making it easier for our users to migrate to 2.0. After implementing some recent changes related to AIP-21 - Changes in import paths (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths), I think I have an idea that might help with it.

*Proposal*

We could package some of the new and improved 2.0 operators (moved to the "providers" package) and let them be used in a Python 3 environment of Airflow 1.10.x. This can be done case-by-case per "cloud provider". It should not be obligatory and should be largely driven by each provider. It's not yet the full AIP-8 "Split Hooks/Operators into separate packages" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303). It's merely backporting some operators/hooks to get them working in 1.10. But by doing it we might try out the concept of splitting, learn about the maintenance problems, and maybe implement the full *AIP-8* approach in 2.1, consistently across the board.

*Context*

Part of AIP-21 was to move the import paths for cloud providers to a separate providers/<PROVIDER> package. An example of that (the first provider we have already almost fully migrated) is the providers/google package (further divided into gcp/gsuite etc.). We've done a massive migration of all the Google-related operators, created a few missing ones, and retrofitted some old operators to follow GCP best practices, fixing a number of problems - also implementing Python 3 and Pylint compatibility. Some of these operators/hooks are not backwards compatible. Those that are compatible are still available via the old imports, with a deprecation warning. We've added missing tests (including system tests) and missing features - improving some of the Google operators - giving users more capabilities and fixing some issues.

Those operators should pretty much "just work" in Airflow 1.10.x (any recent version) on Python 3. We should be able to release a separate pip-installable package of those operators that users can install into Airflow 1.10.x. Any user will be able to install this separate package in their Airflow 1.10.x installation and start using the new "provider" operators in parallel with the old 1.10.x operators. Other providers ("microsoft", "amazon") might follow the same approach if they want. We could even at some point decide to move some of the core operators in a similar fashion (for example following the structure proposed in the latest documentation: fundamentals / software / etc. - https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html).

*Pros and cons*

There are a number of pros:

- Users will have an easier migration path if they are deeply vested in the 1.10.* version.
- It's possible to migrate in stages for people who are also vested in py2: *py2 (1.10) -> py3 (1.10) -> py3 + new operators (1.10) -> py3 + 2.0*
- Moving to the new operators in py3 can be done gradually. Old operators will continue to work while the new ones are used more and more.
- People will be incentivised to migrate to Python 3 before 2.0 is out (by using the new operators).
- Each provider "package" can have an independent release schedule - and add functionality to already-released Airflow versions.
- We do not take any functionality away from users - we just add more options.
- The releases can be - similarly to the main Airflow releases - voted on separately by the PMC after the "stewards" of each package (per provider) perform a round of testing on 1.10.* versions.
- Users will start migrating to the new operators earlier and have a smoother switch to 2.0 later.
- The latest improved operators will start

There are three cons I can think of:

- There will be quite a lot of duplication between old and new operators (they will co-exist in 1.10). That might confuse users and cause problems with cooperation between different operators/hooks.
- Having the new operators in 1.10 Python 3 might keep people from migrating to 2.0.
- It will require some maintenance and separate release overhead.

I already spoke to the Composer team @ Google and they are very positive about this. I also spoke to Ash, and it seems it might also be OK for the Astronomer team.
We have Google's backing and support, and we can provide maintenance and support for those packages - being an example for other providers of how they can do it.

Let me know what you think - and whether I should maybe make it into an official AIP?

J.