That was indeed a long one. I think I like where you are going with it and it
feels like a nice step if the goal is to consider a full break. I also think
consolidating everything related to a given provider in one place like that
will make it much more intuitive for contributors.
I have two immediate thoughts on it. First, can this be used as an
opportunity to reorganize the system tests tree into the provider's test tree?
Instead of having `tests/system/providers/{foo}`, maybe we can move those
system tests alongside the other provider tests like this:
[cid:cd7da5a0-c387-47cb-bb5f-a593baa4f757]
I'm not sure if that would mess with how pytest runs the system tests, but
maybe it's something that can be done.
My other thought is that I think I am missing something here as far as the new
organization goes. I haven't spent the time that you have spent looking at how
to make this work, and I also realize this is early PoC discussion so maybe
this isn't the end target, but the new paths are very long and redundant. Is
that for automation reasons? For example in the image you shared:
`airflow/providers/airbyte/hooks/airbyte.py` becomes
`airflow/providers/airbyte/src/airflow/providers/airbyte/hooks/airbyte.py`
`tests/providers/airbyte/hooks/test_airbyte.py` becomes
`airflow/providers/airbyte/tests/airflow/providers/airbyte/hooks/test_airbyte.py.
If that redundancy can be removed, that would be best. /src/ basically maps to
package root, and /tests/ is also a top-level directory/module so I'm guessing
the plan is just to copy those directory trees over into the Airflow project
tree using a script?
- Dennis
________________________________
From: Jarek Potiuk <[email protected]>
Sent: Sunday, December 11, 2022 5:25 PM
To: [email protected]
Subject: [EXTERNAL] [PROPOSAL] (Internal) move of provider packages to isolated
"providers" sub-folders
CAUTION: This email originated from outside of the organization. Do not click
links or open attachments unless you can confirm the sender and know the
content is safe.
Hello everyone,
TL;DR: I have a proposal to reorganise the way how our providers are kept in
our sources to reflect more the standard ways of packaging of the providers.
It's a REALLY long one, so If you are not ready to deep dive in discussion on
the structure of the airflow project, you might want to skip that one :).
NOTE. This is not (yet) a discussion about separating the providers out
completely from "airflow" repo. This discussion is yet to happen in the future
and it has much more "social" than "technology" aspects and this discussion is
something we know will come.
But before we get to "split providers in repo" discussion, I would like (and I
also know few other people think in a similar direction - for sure Ash and TP
that we have at least few - including some recent chats with) how our provider
packages are "packaged" inside our repo.
Actually once we succeed "interna" separation of providers - each provider
will be much more "sandboxed" and if we decide to split them out to a separate
repo (or repos) it will be much easier. Or maybe we get to the conclusion that
such separation inside a mono-repo is "good enough" and we decide not to split
to separate repos (which is also a possibility).
CURRENT STATE:
The current state is that "airflow.providers.*" source folders live under the
"airflow" package in our sources. This historically comes from 1.10 and the way
we back-ported old providers and has some good and bad properties - the good
property is that you can do `pip install -e .` and you can develop both
providers and core together so setup for anyone who contributes to providers is
easy
But this has also the side-effect that we cannot use super-standard approach
and standards for packaging defined in modern Python world and PyPA - namely
those two PEPs: https://peps.python.org/pep-0517/ and
https://peps.python.org/pep-0621/ . Our build system for providers is pretty
convoluted as it requires dynamic generation of setup.py and running
setuptools. It works (for a few years now) - but for example if someone - not
release manager - wants to build the package, has to use "breeze" to generate
the sources and run the build.
PEP 517 – A build-system independent format for source trees |
peps.python.org<https://peps.python.org/pep-0517/>
peps.python.org
Python Enhancement Proposals (PEPs)
PEP 621 – Storing project metadata in pyproject.toml |
peps.python.org<https://peps.python.org/pep-0621/>
peps.python.org
Python Enhancement Proposals (PEPs)
Also part of the problem is that provider - related stuff is scattered all-over
the repo in a few places:
* airflow/providers/
* tests/providers
* tests/integration/providers
* tests/system/providers/
* docs/apache-airflow-providers*
Likely some other places too that we will find out :).
Over the last couple of months I did small, incremental refactoring and
restructuring of our build system to make it easier, but finally after
separating out integration tests, we should be ready to make a "big" refactor
to restructure the packages.
I think we are very close to being able to make it in the way that it will be
just "moving" the files around without any changes to them (maybe some in
tests), so that's why I made a POC and wanted to discuss it before investing
more time in making it a reality.
PROPOSAL:
What I would like to achieve is to:
* have each provider to have a separate, complete, standard structure of its
folder
* use non-setuptools based builder that only uses pyproject.toml (for example
flit).
* I would like to keep the "easiness" of development - both with breeze and
local venv for people who are developing providers (especially when they want
to implement changes in several providers at the same time).
* for people using IDEs such as VSCode or INtelliJ or Codespaces or whatever,
it should be straightforward how to work on the common codebase of Airflow and
providers. Actually this is the most difficult part of it.
RESULT:
I developed a fully automated POC of a script that can convert all providers in
one go:
* https://github.com/apache/airflow/pull/28291
[https://opengraph.githubassets.com/64f0a96a5194a8f54a6144bb49ff10901ba687cf6d86179c3b616a03cd9e1b23/apache/airflow/pull/28291]<https://github.com/apache/airflow/pull/28291>
Add script to move providers to a new directory structure by potiuk · Pull
Request #28291 · apache/airflow ·
GitHub<https://github.com/apache/airflow/pull/28291>
github.com
^ Add meaningful description above Read the Pull Request Guidelines for more
information. In case of fundamental code changes, an Airflow Improvement
Proposal (AIP) is needed. In case of a new dependency, check compliance with
the ASF 3rd Party License Policy. In case of backwards incompatible changes
please leave a note in a newsfragment file, named {pr_number}.significant.rst
or {issue_number}.significant.rst, in newsfragments.
Here is also how the repo might look like after I applied the script:
* https://github.com/apache/airflow/pull/28292
[https://opengraph.githubassets.com/222611e3057422172e2fb33e3c93d2f3968610eed959a6294ffedc38d6ed490f/apache/airflow/pull/28292]<https://github.com/apache/airflow/pull/28292>
Draft POC attempt to make a better structure for providers by potiuk · Pull
Request #28292 · apache/airflow ·
GitHub<https://github.com/apache/airflow/pull/28292>
github.com
This is a draft attempt showing the new structure of provider package if we
separte them out to separate provider directory. This is still not a complete
solution - we need to work out a few more things regarding docs building and
breeze integration and package importing so that everything works directly from
the sources, but it should be good to start discussion about the move. ^ Add
meaningful description above Read the Pull Request Guidelines for more
information. In case of fundamental code changes, an Airflow Improvement
Proposal (AIP) is needed. In case of a new dependency, check compliance with
the ASF 3rd Party License Policy. In case of backwards incompatible changes
please leave a note in a newsfragment file, named {pr_number}.significant.rst
or {issue_number}.significant.rst, in newsfragments.
You can check it out and see the new structure in place for all providers.
I also attach the image of the structure I came up with (also see the link here
if you cannot see attachments:
https://gcdnb.pbrd.co/images/R0RtjHbnnqNw.png?o=1
WHAT's DONE:
* providers are fully isolated from each other in a separate, standalone
project inside our repo.
* they have a common structure:
provider/<PROVIDER>
src/
airflow/providers/<PROVIDER>/
operators/
hooks/
docs/
tests/
airflow/providers/<PROVIDER>/
operators/
hooks/
tests/system/
airflow/providers/<PROVIDER>/
...
tests/integration
pyproject.toml
INSTALL.txt
README.rst
provider.yaml
...
* all of the files that are needed to build a package or develop are committed
together with each provider - turning the "providers/<PROVIDER>" into a
complete, separate, closed project. This means that we have quite some
duplication in the project, but those files can be also re-generated by
`pre-commits` as needed as they are automatically generated in most cases.
* there is no setup.py, setup.cfg any more, there is only a pyproject.toml.
Theoretically any compliant PEP 517/621 builder could be used - (almost - you
have to actually choose the tool and make some requirements and not all build
tools support everything). I chose flit as a tool to integrate with and it
works rather nicely. But I am happy to discuss (especially with TP) other
choices we might make.
* we can now automatically build all the packages - I modified our release
tools to use `flit` and it works nicely (and you can also do it now easily
manually too if you do not want to use breeze). I also compared few generated
packages and we are close to say we have 1-1 replacement (there are likely some
embedded data files that need to be more verified/fixed)
* Note that this is purely development-related change. If we make all the
packages contain the same files or very similar ones, there should be
pretty-much 0 impact on production installation of airflow and providers..
WHAT'S LEFT:
* CI /tests do not work as we need a bit more work to make sure those packages
from separate projects are importable in CI directly from sources without
building and installing the packages and making sure that all tests work. It
might require some import trickery though because airflow and airflow.providers
packages overlap.
* The local development "easiness" is not yet addressed. Those projects are
separate from the main airflow. I think this is the most difficult part of the
move actually - to make sure that any new or even existing contributor can
start developing any change to any provider very easily without heavy IDE
configuration. Maybe even it will be difficult to achieve it in some cases (for
example in standard community PyCharm or VSCode or even using Codespaces (that
is available now for free in a capacity that is good for casual users). But I
will invest quite some time to make it as straightforward as possible and
"natural" for sure. For me this is an absolute prerequisite - to make sure
contributions to Airflow and Providers remain easy for first-time users.
* Documentation building does not yet work. restructuring the docs to separate
folders in providers might require some Sphinx trickery - I want to simply make
it possible that you can build the doc of each provider separately, making them
fully standalone packages.
* None of the choices I made are set in stone. I proposed the package structure
as in the attached picture, but I am happy to discuss pros and cons of
different approaches. This refactor is fully automated with the Python script I
created, so we can modify it and update until we decide it's ready. There are
few choices, which I am not sure about.
* I think - if we decide this is a good move - we should make it a "clean-cut"
and migrate all providers at once. The change is very invasive in our tooling -
they depend on the structure to be in place, so it would be terribly
complicated to keep our CI and dev-env to support both approaches in parallel.
But once we solve the "development easiness", this will also be a very "easy"
move. Most of the files will be simply moved and not modified, which will mean
that Git will keep track of them, it will break a number of open PRs, but it
has no impact on "airflow" cherry-picking because all the changes are in
"providers" not in core airflow.
MY ASKS:
If you got that far - congratulations :D. It was really a long one. I wanted us
to discuss some questions:
* Do you like that idea?
* Do you have any concerns?
* Any comments or proposals w/regards to structure/tools?
J.