Regardless of how complex this implementation would be, I am +1 on this.
From a developer's point of view, the biggest plus for me is that the CI would run so much faster - and I think it will only get worse the more dependencies we add. From a user's point of view, the biggest plus is that I can pick from multiple packages/repositories only the ones I really want to use, and I know that any contributor to such a repo can probably answer my questions about its hooks/operators.

I know there would be a lot to do, and this gives me headaches even now, but in the end I think it would be a great change that is necessary in the long run.

- feluelle

Am 08/01/2019 um 17:42 schrieb Jarek Potiuk:
While splitting the monolithic Airflow architecture into pieces sounds
good, there is one problem that might be difficult to tackle (or rather
impossible, unless we change the architecture of Airflow significantly) -
namely dependencies/requirements.

The way Airflow uses operators means they are already closely coupled with
the Airflow core: Airflow has to parse all the operators within the same
Python interpreter/virtual machine as core Airflow. This means a
potentially big problem with dependency/requirement handling if we have
multiple packages. There are enough common/shared dependencies across the
various operators to cause occasional headaches even now, and we already
have quite a challenge handling the dependencies of Airflow and its
operators/hooks while they are all part of the Airflow repo.
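To make the coupling concrete, here is a minimal sketch (a hypothetical DAG
file, using the import paths of the current 1.10 layout): merely parsing it
forces the scheduler to import the operator module, and with it whatever
the operator's hook pulls in, inside the core interpreter.

    # Hypothetical DAG file, shown only to illustrate the parse-time coupling:
    # the scheduler must import the operator module to build the DAG object,
    # so the operator's dependencies have to be installed next to Airflow core.
    from datetime import datetime

    from airflow import DAG
    # Importing the operator also imports the MySQL hook module (and, depending
    # on the Airflow version, the MySQL client library) into the scheduler.
    from airflow.operators.mysql_operator import MySqlOperator

    with DAG(dag_id="parse_time_coupling",
             start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        MySqlOperator(
            task_id="run_query",
            mysql_conn_id="mysql_default",
            sql="SELECT 1",
        )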

Currently the problem is that Airflow sometimes uses outdated dependencies,
or that some random transitive dependency breaks the Airflow installation.
But at least we have a common dependency list that we work against for all
operators. Unfortunately, if we split, the problem will get worse - very
quickly some contrib operators will require different dependencies and will
not be compatible with Airflow, or will break Airflow's behaviour.

Not to mention the problem when you want to use hooks from some other
"area" in your operator. Currently hooks are how you speed up development
of cross-area behaviour: you implement hooks in one "area" and other
"areas" are free, or even encouraged, to use them. For example, exporting
from BigQuery to all cloud storages should in principle depend on hooks for
every single cloud storage package out there (Google, Azure, AWS). This is
even worse than the MySqlToHive case described earlier - very quickly we
would end up with a totally unmanageable mesh of cross-dependencies.
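As a concrete (hypothetical) sketch of that mesh, using today's contrib
import paths: a single transfer operator that can target either GCS or S3
already needs hooks from two different "areas", so after a split it would
have to depend on at least two provider packages at once.

    # Hypothetical operator, shown only to illustrate the cross-"area" mesh:
    # one operator, hooks from several would-be packages.
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults
    from airflow.contrib.hooks.bigquery_hook import BigQueryHook
    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
    from airflow.hooks.S3_hook import S3Hook  # a different "area" entirely


    class BigQueryToAnyStorageOperator(BaseOperator):
        """Export a BigQuery table to either GCS or S3 (illustrative only)."""

        @apply_defaults
        def __init__(self, destination, *args, **kwargs):
            super(BigQueryToAnyStorageOperator, self).__init__(*args, **kwargs)
            self.destination = destination

        def execute(self, context):
            bq_hook = BigQueryHook()
            storage_hook = (GoogleCloudStorageHook()
                            if self.destination == "gcs" else S3Hook())
            # ... run the export via bq_hook, upload the result via storage_hook ...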

I think to really make operators independent of the Airflow core, we would
need to allow the dependencies to be fully isolated - i.e. to allow
operators to have a different set of dependencies than the core. That's
quite impossible with the current Airflow approach, where the same operator
code is parsed in the core, the same code is used during execute in the
worker, and the same code might be used by another operator in the form of
a hook. Unfortunately we are not in the npm
<https://npm.github.io/how-npm-works-docs/index.html> world (as Kamil
Breguła pointed out to me today), where the module loader handles multiple
versions of the same library in the same process.

One other question that bothers me - I believe (please correct me if I am
wrong) some of the operators use core features of Airflow and are tied even
more closely to the core. For example, it is perfectly fine for an operator
to use Airflow's SQLAlchemy ORM classes and run queries/perform updates in
Airflow's metadata database - as far as I know there is a requirement (I
saw this somewhere at least) that Celery or Kubernetes workers need to be
able to open a direct database connection to Airflow's metadata database,
and there is nothing to prevent operators from doing the same. This in
essence means that the operator has to depend on many core
dependencies/requirements (including sqlalchemy, postgres/mysql, ...).
Using Airflow's core features like this could be "forbidden", but that
might break compatibility (if I am right about it).
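To show what I mean, a minimal sketch (the operator itself is made up, but
the session/ORM access pattern is the standard Airflow one): nothing
prevents an operator shipped in a separate package from reaching straight
into the metadata database, which drags sqlalchemy and the database driver
into its dependencies.

    # Hypothetical operator that queries Airflow's own metadata database
    # through the core SQLAlchemy session - exactly the coupling described above.
    from airflow.models import BaseOperator, TaskInstance
    from airflow.utils.db import provide_session
    from airflow.utils.state import State


    class CountFailedTasksOperator(BaseOperator):

        @provide_session
        def execute(self, context, session=None):
            failed = (
                session.query(TaskInstance)
                .filter(TaskInstance.dag_id == self.dag_id,
                        TaskInstance.state == State.FAILED)
                .count()
            )
            self.log.info("Found %d failed task instances in this DAG", failed)
            return failed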

We could imagine a different approach - where an operator is split into
"Proxy" and "Execute" classes: the "Proxy" runs within the core's
interpreter with the core's dependencies, and the "Execute" part runs
within the worker. Then each task could run in its own Docker image/pod on
Kubernetes with its own dependencies. But that looks like a big,
backwards-incompatible change, and it still does not solve
cross-dependencies between different "areas". For handling cross-area
operations we would have to somehow implement communication between
different containers - each having its own dependencies. That would be
possible in Kubernetes by having a single pod with several containers
sharing common data and communicating. Seems possible.
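Purely to illustrate the shape of that idea (none of this exists in Airflow
today, all the names are made up): the part the scheduler parses would
carry no provider dependencies at all and would only describe which image
and payload the worker-side part should run.

    # Illustrative sketch of the "Proxy"/"Execute" split - not an existing
    # Airflow API. The proxy is parsed by the scheduler with core-only
    # dependencies; the execute side would run in its own container image.
    from airflow.models import BaseOperator


    class ProxyOperator(BaseOperator):
        """Scheduler-side half: knows *what* to run, not *how* to run it."""

        def __init__(self, image, payload, *args, **kwargs):
            super(ProxyOperator, self).__init__(*args, **kwargs)
            self.image = image      # per-"area" Docker image with its own deps
            self.payload = payload  # serialized arguments for the execute side

        def execute(self, context):
            # In this model the proxy would only dispatch self.payload to a
            # container built from self.image (e.g. a Kubernetes pod) and wait.
            raise NotImplementedError("dispatch to the per-area container here")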

It's quite an entertaining idea, but it sounds like Airflow 3.0 already and
one that is not really backwards compatible ;).

J.


On Tue, Jan 8, 2019 at 5:37 PM Tim Swast <sw...@google.com.invalid> wrote:

I don't see it solving any problem other than test speed (which is a big
one, yes), and it doesn't reduce the amount of workload on the committers.

It's about distributed ownership. For example, I'm not a committer in
pandas, but I am the primary maintainer of pandas-gbq. You're right that if
the set of committers is the same for all 24 repos, there isn't all that
much benefit beyond testing speed.

Each sub-project would still have to follow the normal Apache voting
process.

Presumably the set of people that care about the sub-packages will be
smaller. I don't know enough about the Apache voting process to know how
that might affect it.

Maybe many of the sub-packages can live outside the Apache org? Pandas
keeps the I/O sub-packages in a different org, for example.

Google could choose to release an airflow-gcp-operators package now and
tell people to |from gcp.airflow.operators import SomeNewOperator|.

That's actually part of my motivation for this proposal. I've got some red
tape to get through, but ideally the proposed airflow-google repository in
AIP-8 would actually live in the GoogleCloudPlatform org.

*Maybe I should decrease the scope of AIP-8 to Google hooks/operators?*

There is nothing stopping someone /currently/ creating their own
operators package.

Hooks still need some support in core, so that connections can be
configured. Also, the fact that so many operators live in the Airflow repo
makes it seem like an operator is less supported / a hack if it doesn't
live there.

How will we ensure that core changes don't break any hooks/operators?
Pandas does this by running tests in the I/O repos against the pandas
master branch, in addition to the supported releases.

How do we support the logging backends for s3/azure/gcp?
I don't see any reason we can't keep doing what we're already doing.


https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45

We'd need to adjust the import path for the hook, but so long as the
upload/download methods remain stable, it'll work the same. The sub-package
will need to ensure it tests the logging code path in addition to testing
DAGs that use the relevant operators.
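A rough sketch of why this keeps working (the airflow_google module name
below is purely hypothetical): the task handler only depends on the hook's
upload/download surface, so the split mostly means updating an import path.

    # Sketch only: the GCS task handler needs nothing from the hook beyond a
    # stable upload/download surface, so the hook could move to a sub-package
    # (the "airflow_google" name is hypothetical) without changing the
    # handler's logic.
    try:
        # hypothetical post-split location
        from airflow_google.hooks.gcs_hook import GoogleCloudStorageHook
    except ImportError:
        # current contrib location
        from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook


    def _get_storage_hook():
        # Instantiated lazily, the same way the task handler does it today.
        return GoogleCloudStorageHook(
            google_cloud_storage_conn_id="google_cloud_default")

    # The handler keeps calling the same methods either way, e.g.
    #   hook.download(bucket, obj) and hook.upload(bucket, obj, local_path).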

  •  Tim Swast
  •  Software Friendliness Engineer
  •  Google Cloud Developer Relations
  •  Seattle, WA, USA


On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor <a...@apache.org> wrote:

Can someone explain to me how having multiple packages will work in
practice?

How will we ensure that core changes don't break any hooks/operators?

How do we support the logging backends for s3/azure/gcp?

What would the release process be for the "sub"-packages?

There is nothing stopping someone /currently/ creating their own operators
package. There is nothing whatsoever special about the |airflow.operators|
package namespace, and for example Google could choose to release an
airflow-gcp-operators package now and tell people to
|from gcp.airflow.operators import SomeNewOperator|.
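For illustration, a minimal sketch of such a standalone package (the
distribution, module and class names are just the hypothetical ones from
the example above):

    # gcp/airflow/operators/__init__.py in a hypothetical
    # "airflow-gcp-operators" distribution - nothing about it is special to
    # Airflow core.
    from airflow.models import BaseOperator


    class SomeNewOperator(BaseOperator):
        """An operator shipped entirely outside the apache-airflow package."""

        def execute(self, context):
            self.log.info("Hello from a standalone operators package")

    # In a DAG file, users would then simply write:
    #   from gcp.airflow.operators import SomeNewOperator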

My view on this is currently -1, as I don't see it solving any problem
other than test speed (which is a big one, yes), and it doesn't reduce the
amount of workload on the committers - rather it increases it, by having a
more complex release process (each sub-project would still have to follow
the normal Apache voting process) and having 24 repos to check for PRs
rather than just 1.

Am I missing something?

("Core" vs "contrib" made sense when Airflow was still under Airbnb, we
should probably just move everything from contrib out to core pre 2.0.0)

-ash

airflowuser wrote on 08/01/2019 15:44:
I think the operator should be placed by its source system.
If it's MySQLToHiveOperator, then it would be placed in the MySQL package.


The BIG question here is whether this serves an actual improvement, like
faster delivery of hook/operator bug-fixes to Airflow users (faster than an
actual Airflow release), or whether this is a merely cosmetic issue.
I assume that this also covers the unnecessary separation of core and
contrib.


Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:
Something to think about is how data transfer operators like the
MysqlToHiveOperator usually rely on 2 hooks. With a package-specific
approach that may mean something like `airflow-hive`, `airflow-mysql`
and `airflow-mysql-hive` packages, where the `airflow-mysql-hive` package
depends on the two other packages.
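For example, a sketch of what the cross-system package's setup.py could
declare (all distribution names hypothetical, as above):

    # setup.py of the hypothetical airflow-mysql-hive distribution: the
    # cross-system transfer package simply depends on the two per-system
    # packages.
    from setuptools import find_packages, setup

    setup(
        name="airflow-mysql-hive",
        version="0.1.0",
        packages=find_packages(),
        install_requires=[
            "apache-airflow>=1.10.0",
            "airflow-mysql",  # would provide the MySQL hook
            "airflow-hive",   # would provide the Hive hooks
        ],
    )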

It's just a matter of having a clear strategy, good naming conventions and
a nice central place in the docs that centralizes the list of approved
packages.

Max

On Mon, Jan 7, 2019 at 9:05 AM Tim Swast sw...@google.com.invalid
wrote:
I've created AIP-8:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
To follow up on the discussion about splitting hooks/operators out of the
core Airflow package at
http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>
I propose packaging based on the target system, informed by the existing
hooks in both core and contrib. This will allow those with the relevant
expertise in each target system to respond to contributions / issues
without having to follow the flood of everything Airflow-related. It will
also decrease the surface area of the core package, helping with
testability and long-term maintenance.

  •  Tim Swast
  •  Software Friendliness Engineer
  •  Google Cloud Developer Relations
  •  Seattle, WA, USA

