TL;DR: We did some testing of namespaces and packaging (and of potential
backporting options for 1.10.* Python 3 Airflow) and we think it's best to
adopt namespaces right away and use a separate package name,
"airflow-integrations", for all non-fundamental integrations.

Unless we missed some tricks, we cannot use airflow.* sub-packages for the
1.10.* backportable packages. Example:

   - "*apache-airflow"* package provides: "airflow.*" (this is what we have
   today)
   - "*apache-airflow-providers-google*": provides
   "airflow.providers.google.*" packages

If we install both packages (the old apache-airflow 1.10.6 and a new
apache-airflow-providers-google from 2.0) - it seems that
the "airflow.providers.google.*" package cannot be imported. This is a
problem if we would like to backport the operators from Airflow 2.0 to
Airflow 1.10 in a forward-compatible way. We really want users who started
using backported operators in 1.10.* not to have to change imports in their
DAGs to run them in Airflow 2.0.
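
(Side note: for the "airflow" package to be shared between distributions it
would need to be declared as a namespace package in every distribution that
contributes to it. Below is a minimal, purely illustrative sketch assuming
pkgutil-style namespaces; Airflow 1.10 ships a regular airflow/__init__.py
full of code instead, which is presumably why the merge does not work.)

    # Hypothetical airflow/__init__.py that every distribution contributing
    # to a shared "airflow" namespace would need (pkgutil-style namespaces).
    # Airflow 1.10 ships a regular airflow/__init__.py with real code
    # instead, so its __path__ is never extended with other distributions'
    # directories.
    __path__ = __import__("pkgutil").extend_path(__path__, __name__)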

We discussed it internally in our team and considered several options, but
we think the best way will be to go straight to "namespaces" in Airflow 2.0
and to put the integrations (as discussed in AIP-21) in a separate
"*airflow_integrations*" package. This moves us further towards the AIP-8
implementation and fits well with the "stewardship" model currently
discussed in AIP-21. But we will still keep (for now) a single release
process for all packages for 2.0 (except for the backporting, which can be
done per-provider before the 2.0 release) and provide a foundation for more
complex release cycles in future versions.

Here is how the new Airflow 2.0 repository could look (I only show a subset
of dirs, but they are representative). For those whose email client corrupts
fixed-width formatting, here is an image of this structure:
https://pasteboard.co/IEesTih.png

|-- airflow
|   |-- __init__.py
|   |-- operators          -> fundamental operators are here
|-- tests                  -> tests for core airflow are here (optionally we
|                             can move them under "airflow")
|-- setup.py               -> setup.py for the "apache-airflow" package
|-- airflow_integrations
|   |-- providers
|   |   |-- google
|   |   |   |-- setup.py   -> setup.py for the
|   |   |   |                 "apache-airflow-integrations-providers-google" package
|   |   |   |-- airflow_integrations
|   |   |   |   |-- __init__.py
|   |   |   |   |-- providers
|   |   |   |   |   |-- __init__.py
|   |   |   |   |   |-- google
|   |   |   |   |   |   |-- __init__.py
|   |   |   |   |   |   |-- tests  -> tests for the
|   |   |   |   |   |   |             "apache-airflow-integrations-providers-google" package
|   |-- __init__.py
|   |-- protocols
|   |   |-- setup.py       -> setup.py for the
|   |   |                     "apache-airflow-integrations-protocols" package
|   |   |-- airflow_integrations
|   |   |   |-- protocols
|   |   |   |   |-- __init__.py
|   |   |   |   |-- tests  -> tests for the
|   |   |   |                 "apache-airflow-integrations-protocols" package

There are a number of pros for this solution:

   - We could use the standard namespace packages feature of Python to build
   multiple packages:
   https://packaging.python.org/guides/packaging-namespace-packages/
   - Installation for users will be the same as before. We could install
   the needed packages automatically when particular extras are used
   (pip install apache-airflow[google] could install both "apache-airflow" and
   "apache-airflow-integrations-providers-google") - see the first sketch
   after this list.
   - We could have a custom setup.py installation process for developers
   that installs all the packages in development mode ("pip install -e .") in
   a single operation.
   - In the case of transfer operators we could have nice error messages
   informing users that the other package needs to be installed (for example
   an S3->GCS operator would import "airflow_integrations.providers.amazon.*"
   and, if that fails, raise "Please install the [amazon] extra to use me.") -
   see the second sketch after this list.
   - We could implement numerous optimisations in how we run tests in CI
   (for example running all the "providers" tests only with sqlite, running
   tests in parallel, etc.)
   - We could implement it gradually - we do not need a "big bang"
   approach - we can implement it in a "provider-by-provider" way and test it
   with one provider (Google) first to make sure that all the mechanisms are
   working.
   - For now we could keep the monorepo approach where all the packages
   are developed in concert, avoiding the dependency problems (while still
   allowing for back-portability to 1.10).
   - We will have clear boundaries between packages and the ability to test
   for unwanted/hidden dependencies between packages.
   - We could switch to the (much better) sphinx-apidoc tool and continue
   building a single documentation set for all of these packages
   (sphinx-apidoc has support for namespace packages).
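
First sketch (extras): a hypothetical fragment of the core "apache-airflow"
setup.py showing how an extra could pull in a provider distribution -
package names and versions below are illustrative only:

    # Hypothetical fragment of the core "apache-airflow" setup.py, mapping
    # the existing [google] extra to the new provider distribution so that
    # "pip install apache-airflow[google]" pulls in both packages.
    from setuptools import find_packages, setup

    setup(
        name="apache-airflow",
        version="2.0.0.dev0",  # placeholder
        packages=find_packages(include=["airflow", "airflow.*"]),
        extras_require={
            "google": ["apache-airflow-integrations-providers-google"],
            # illustrative - other providers would follow the same pattern
            "amazon": ["apache-airflow-integrations-providers-amazon"],
        },
    )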

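Second sketch (error messages for transfer operators): a hypothetical import
guard - the module path and hook name are purely illustrative:

    # Hypothetical import guard inside an S3 -> GCS transfer operator that
    # lives in the google provider package but also needs the amazon
    # provider package to be installed.
    try:
        from airflow_integrations.providers.amazon.aws.hooks.s3 import S3Hook
    except ImportError:
        raise ImportError(
            "Could not import the Amazon provider package. Please install "
            "the [amazon] extra (pip install 'apache-airflow[amazon]') to "
            "use this operator."
        )
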
As we are working on the GCP move from contrib to core, we could make the
effort to test and try this out before we merge it to master, so that it
will be ready for others (and we could help with most of the moves
afterwards). It seems complex, but in most cases it will be a very simple
move between packages, and it can be done incrementally, so I think there is
little risk in doing this.

J.


On Mon, Oct 28, 2019 at 11:45 PM Kevin Yang <yrql...@gmail.com> wrote:

> Tomasz and Ash got good points about the overhead of having separate repos.
> But while we grow bigger and more mature, I would prefer to have what was
> described in AIP-8. It shouldn't be extremely hard for us to come up with
> good strategies to handle the overhead. AIP-8 already talked about how it
> can benefit us. IMO on a high level, having a clear separation of core vs.
> hooks/operators would make the project much more scalable and the gains
> would outweigh the cost we pay.
>
> That being said, I'm supportive of this "moving towards AIP-8 while learning"
> approach; it's quite a good practice for tackling a big project. Looking
> forward to reading the AIP.
>
>
> Cheers,
> Kevin Y
>
> On Mon, Oct 28, 2019 at 6:21 AM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > We are checking how we can use namespaces in a back-portable way and we
> > will have a POC soon so that we will all be able to see how it looks.
> >
> > J.
> >
> > On Mon, Oct 28, 2019 at 1:24 PM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> >
> > > I'll have to read your proposal in detail (sorry, no time right now!),
> > > but I'm broadly in favour of this approach, and I think keeping them _in_
> > > the same repo is the best plan -- that makes writing and testing
> > > cross-cutting changes easier.
> > >
> > > -a
> > >
> > > > On 28 Oct 2019, at 12:14, Tomasz Urbaszek <tomasz.urbas...@polidea.com>
> > > > wrote:
> > > >
> > > > I think utilizing namespaces should reduce a lot of the problems raised
> > > > by using separate repos (who will manage it? how to release? where
> > > > should the repo be?).
> > > >
> > > > Bests,
> > > > Tomek
> > > >
> > > > On Sun, Oct 27, 2019 at 11:54 AM Jarek Potiuk <
> > jarek.pot...@polidea.com>
> > > > wrote:
> > > >
> > > >> Thanks Bas for comments! Let me share my thoughts below.
> > > >>
> > > >> On Sun, Oct 27, 2019 at 9:23 AM Bas Harenslak <
> > > >> basharens...@godatadriven.com>
> > > >> wrote:
> > > >>
> > > >>> Hi Jarek, I definitely see a future in creating separate
> installable
> > > >>> packages for various operators/hooks/etc (as in AIP-8). This would
> > IMO
> > > >>> strip the “core” Airflow to only what’s needed and result in a
> small
> > > >>> package without a ton of dependencies (and make it more
> maintainable,
> > > >>> shorter tests, etc etc etc). Not exactly sure though what you’re
> > > >> proposing
> > > >>> in your e-mail, is it a new AIP for an intermediate step towards
> > AIP-8?
> > > >>>
> > > >>
> > > >> It's a new AIP I am proposing.  For now it's only for backporting
> the
> > > new
> > > >> 2.0 import paths to 1.10.* series.
> > > >>
> > > >> It's more of an "incrementally going in the direction of AIP-8 and
> > > >> learning some of the difficulties involved" than implementing AIP-8
> > > >> fully. We are taking
> > > >> advantage of changes in import paths from AIP-21 which make it
> > possible
> > > to
> > > >> have both old and new (optional) operators available in 1.10.*
> series
> > of
> > > >> Airflow. I think there is a lot more to do for full implementation
> of
> > > >> AIP-8: decisions on how to maintain and install those operator groups
> > > >> separately, a stewardship model/organisation for the separate groups,
> > > >> how to manage cross-dependencies, procedures for releasing the packages,
> > > >> etc.
> > > >>
> > > >> I think about this new AIP also as a learning effort - we would
> learn
> > > more
> > > >> how separate packaging works and then we can follow up with AIP-8
> full
> > > >> implementation for "modular" Airflow. Then AIP-8 could be
> implemented
> > in
> > > >> Airflow 2.1 for example - or 3.0 if we start following semantic
> > > versioning
> > > >> - based on those learnings. It's a bit of a case of having our cake and
> > > >> eating it too. We can try out modularity in 1.10.* while cutting the
> > > scope
> > > >> of 2.0 and not implementing full management/release procedure for
> > AIP-8
> > > >> yet.
> > > >>
> > > >>
> > > >>> Thinking about this, I think there are still a few grey areas
> (which
> > > >> would
> > > >>> be good to discuss in a new AIP, or continue on AIP-8):
> > > >>>
> > > >>>  *   In your email you speak only about the 3 big cloud providers
> > > >>> (btw I made a PR for migrating all AWS components ->
> > > >>> https://github.com/apache/airflow/pull/6439). Is there a plan for
> > > >>> splitting other components than Google/AWS/Azure?
> > > >>>
> > > >>
> > > >> We could add more groups as part of this new AIP indeed (as an
> > > extension to
> > > >> AIP-21 and pre-requisite to AIP-8). We already see how
> > > moving/deprecation
> > > >> works for the providers package - it works for GCP/Google rather
> > nicely.
> > > >> But there is nothing to prevent us from extending it to cover other
> > > groups
> > > >> of operators/hooks. If you look at the current structure of
> > > documentation
> > > >> done by Kamil, we can follow the structure there and move the
> > > >> operators/hooks accordingly
> > > >> (https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html):
> > > >>
> > > >>      Fundamentals, ASF: Apache Software Foundation, Azure: Microsoft
> > > >> Azure, AWS: Amazon Web Services, GCP: Google Cloud Platform, Service
> > > >> integrations, Software integrations, Protocol integrations.
> > > >>
> > > >> I am happy to include that in the AIP - if others agree it's a good
> > > idea.
> > > >> Out of those groups -  I think only Fundamentals should not be
> > > back-ported.
> > > >> Others should be rather easy to port (if we decide to). We already
> > have
> > > >> quite a lot of those in the new GCP operators for 2.0. So starting
> > with
> > > >> GCP/Google group is a good idea. Also following with Cloud Providers
> > > first
> > > >> is a good thing. For example we have now support from Google
> Composer
> > > team
> > > >> to do this separation for GCP (and we learn from it) and then we can
> > > claim
> > > >> the stewardship in our team for releasing the python 3/ Airflow
> > > >> 1.10-compatible "airflow-google" packages. Possibly other Cloud
> > > >> Providers/teams might follow this (if they see the value in it) and
> > > there
> > > >> could be different stewards for those. And then we can do other
> groups
> > > if
> > > >> we decide to. I think this way we can learn whether AIP-8 is
> > manageable
> > > and
> > > >> what real problems we are going to face.
> > > >>
> > > >>>  *   Each “plugin”, e.g. GCP, would be a separate repo; should we
> > > >>> create some sort of blueprint for such packages?
> > > >>>
> > > >>
> > > >> I think we do not need separate repos (at all) but in this new AIP
> we
> > > can
> > > >> test it before we decide to go for AIP-8. IMHO the monorepo approach
> > > >> will work here rather nicely. We could use Python 3 native namespaces
> > > >> <https://packaging.python.org/guides/packaging-namespace-packages/> for
> > > >> the sub-packages when we go full AIP-8. For now we could simply package
> > > >> the new operators in a separate pip package for the Python 3 version of
> > > >> the 1.10.* series only.
> > > >> We only need to test if it works well with another package providing
> > > >> 'airflow.providers.*' after apache-airflow is installed (providing
> > > >> 'airflow' package). But I think we can make it work. I don't think we
> > > >> really need to split the repos; namespaces will work just fine and allow
> > > >> easier management of cross-repository dependencies (but we can learn
> > > >> otherwise). For sure we will not need it for the new proposed AIP of
> > > >> backporting groups to 1.10 and we can defer that decision to AIP-8
> > > >> implementation time.
> > > >>
> > > >>
> > > >>>  *   In which Airflow version do we start raising deprecation
> > warnings
> > > >>> and in which version would we remove the original?
> > > >>>
> > > >>
> > > >> I think we should do what we did in the GCP case already. Those old
> > > >> "imports" for operators can be marked as deprecated in Airflow 2.0 (and
> > > >> removed in 2.1 or 3.0 if we start following semantic versioning). We can
> > > >> however do it before, in 1.10.7 or 1.10.8, if we release those (without
> > > >> removing the old operators yet - just raising deprecation warnings and
> > > >> informing users that for Python 3 the new "airflow-google", "airflow-aws"
> > > >> etc. packages can be installed and that they can switch to them).
> > > >>
> > > >> J.
> > > >>
> > > >>
> > > >>>
> > > >>> Cheers,
> > > >>> Bas
> > > >>>
> > > >>> On 27 Oct 2019, at 08:33, Jarek Potiuk <jarek.pot...@polidea.com
> > > <mailto:
> > > >>> jarek.pot...@polidea.com>> wrote:
> > > >>>
> > > >>> Hello - any comments on that? I am happy to make it into an AIP :)?
> > > >>>
> > > >>> On Sun, Oct 13, 2019 at 5:53 PM Jarek Potiuk <
> > jarek.pot...@polidea.com
> > > >>> <mailto:jarek.pot...@polidea.com>>
> > > >>> wrote:
> > > >>>
> > > >>> *Motivation*
> > > >>>
> > > >>> I think we really should start thinking about making it easier to
> > > >>> migrate to 2.0 for our users. After implementing some recent changes
> > > >>> related to AIP-21 - Changes in import paths
> > > >>> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths>
> > > >>> I think I have an idea that might help with it.
> > > >>>
> > > >>> *Proposal*
> > > >>>
> > > >>> We could package some of the new and improved 2.0 operators (moved
> to
> > > >>> "providers" package) and let them be used in Python 3 environment
> of
> > > >>> airflow 1.10.x.
> > > >>>
> > > >>> This can be done case-by-case per "cloud provider". It should not be
> > > >>> obligatory and should be largely driven by each provider. It's not yet
> > > >>> the full AIP-8 - Split Hooks/Operators into separate packages
> > > >>> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303>.
> > > >>> It's merely backporting some operators/hooks to get them to work in 1.10.
> > But
> > > by
> > > >>> doing it we might try out the concept of splitting, learn about
> > > >> maintenance
> > > >>> problems and maybe implement full *AIP-8 *approach in 2.1
> > consistently
> > > >>> across the board.
> > > >>>
> > > >>> *Context*
> > > >>>
> > > >>> Part of the AIP-21 was to move import paths for Cloud providers to
> > > >>> separate providers/<PROVIDER> package. An example for that (the
> first
> > > >>> provider we already almost migrated) was providers/google package
> > > >> (further
> > > >>> divided into gcp/gsuite etc).
> > > >>>
> > > >>> We've done a massive migration of all the Google-related operators,
> > > >>> created a few missing ones, retrofitted some old operators to follow
> > > >>> GCP best practices, and fixed a number of problems - also implementing
> > > >>> Python 3 and Pylint compatibility. Some of these operators/hooks are not
> > > backwards
> > > >>> compatible. Those that are compatible are still available via the
> old
> > > >>> imports with deprecation warning.
> > > >>>
> > > >>> We've added missing tests (including system tests) and missing
> > > features -
> > > >>> improving some of the Google operators - giving the users more
> > > >> capabilities
> > > >>> and fixing some issues. Those operators should pretty much "just
> > work"
> > > in
> > > >>> Airflow 1.10.x (any recent version) for Python 3. We should be able
> > to
> > > >>> release a separate pip-installable package for those operators that
> > > users
> > > >>> should be able to install in Airflow 1.10.x.
> > > >>>
> > > >>> Any user will be able to install this separate package in their
> > Airflow
> > > >>> 1.10.x installation and start using those new "provider" operators
> in
> > > >>> parallel to the old 1.10.x operators. Other providers ("microsoft",
> > > >>> "amazon") might follow the same approach if they want. We could
> even
> > at
> > > >>> some point decide to move some of the core operators in similar
> > fashion
> > > >>> (for example following the structure proposed in the latest
> > > >> documentation:
> > > >>> fundamentals / software / etc.
> > > >>>
> > https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html)
> > > >>>
> > > >>> *Pros and cons*
> > > >>>
> > > >>> There are a number of pros:
> > > >>>
> > > >>>  - Users will have an easier migration path if they are deeply
> > > >>>  invested in the 1.10.* version
> > > >>>  - It's possible to migrate in stages for people who are also
> vested
> > in
> > > >>>  py2: *py2 (1.10) -> py3 (1.10) -> py3 + new operators (1.10) ->
> py3
> > +
> > > >>>  2.0*
> > > >>>  - Moving to new operators in py3 + new operators can be done
> > > >>>  gradually. Old operators will continue to work while new can be
> used
> > > >> more
> > > >>>  and more
> > > >>>  - People will get incentivised to migrate to python 3 before 2.0
> is
> > > >>>  out (by using new operators)
> > > >>>  - Each provider "package" can have independent release schedule -
> > and
> > > >>>  add functionality in already released Airflow versions.
> > > >>>  - We do not take out any functionality from the users - we just
> add
> > > >>>  more options
> > > >>>  - The releases can be - similarly to the main Airflow releases - voted
> > > >>>  on separately by the PMC after the "stewards" of the package (per
> > > >>>  provider) perform a round of testing on 1.10.* versions.
> > > >>>  - Users will start migrating to new operators earlier and have
> > > >>>  smoother switch to 2.0 later
> > > >>>  - The latest improved operators will start
> > > >>>
> > > >>> There are three cons I could think of:
> > > >>>
> > > >>>  - There will be quite a lot of duplication between old and new
> > > >>>  operators (they will co-exist in 1.10). That might lead to
> confusion
> > > of
> > > >>>  users and problems with cooperation between different
> > operators/hooks
> > > >>>  - Having new operators in 1.10 python 3 might keep people from
> > > >>>  migrating to 2.0
> > > >>>  - It will require some maintenance and separate release overhead.
> > > >>>
> > > >>> I already spoke to the Composer team @Google and they are very positive
> > > >>> about this. I also spoke to Ash and it seems it might also be OK for the
> > > >>> Astronomer team. We have Google's backing and support, and we can provide
> > > >> maintenance
> > > >>> and support for those packages - being an example for other
> providers
> > > how
> > > >>> they can do it.
> > > >>>
> > > >>> Let me know what you think - and whether I should make it into an
> > > >> official
> > > >>> AIP maybe?
> > > >>>
> > > >>> J.
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>>
> > > >>> Jarek Potiuk
> > > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >>>
> > > >>> M: +48 660 796 129 <+48660796129>
> > > >>> [image: Polidea] <https://www.polidea.com/>
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>>
> > > >>> Jarek Potiuk
> > > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >>>
> > > >>> M: +48 660 796 129 <+48660796129>
> > > >>> [image: Polidea] <https://www.polidea.com/>
> > > >>>
> > > >>>
> > > >>
> > > >> --
> > > >>
> > > >> Jarek Potiuk
> > > >> Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >>
> > > >> M: +48 660 796 129 <+48660796129>
> > > >> [image: Polidea] <https://www.polidea.com/>
> > > >>
> > > >
> > > >
> > > > --
> > > >
> > > > Tomasz Urbaszek
> > > > Polidea <https://www.polidea.com/> | Junior Software Engineer
> > > >
> > > > M: +48 505 628 493 <+48505628493>
> > > > E: tomasz.urbas...@polidea.com <tomasz.urbasz...@polidea.com>
> > > >
> > > > Unique Tech
> > > > Check out our projects! <https://www.polidea.com/our-work>
> > >
> > >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
