A related thought is around the fact that all these libs depend on Airflow itself to get the base classes they derive from (BaseHook and BaseOperator, mostly). It's a bit upside down when the small library depends on a big library. That may be OK as is, but pushing the micro-package logic further would dictate breaking down `airflow-core`, which in turn calls for breaking down `airflow-scheduler` and `airflow-web`.
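To make the "upside down" dependency concrete, here is a rough sketch of what a single-hook micro-package's `setup.py` might look like; the package name and version ranges are purely illustrative, not an agreed convention:

```python
# Hypothetical single-hook micro-package: "airflow-hook-s3" (illustrative name).
# Even this tiny package still has to depend on the full apache-airflow
# distribution, just to be able to subclass BaseHook.
from setuptools import find_packages, setup

setup(
    name="airflow-hook-s3",
    version="0.1.0",
    packages=find_packages(),        # would contain e.g. airflow_hook_s3/s3_hook.py
    install_requires=[
        "apache-airflow>=1.10.0",    # pulls in the whole core only for BaseHook
        "boto3>=1.7.0",              # the actual provider dependency
    ],
)
```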
All this is super disruptive work (it breaks all open PRs), but to me it's a much better place to be once we get there. Apache-wise, that means more [though smaller] releases and more coordination. With more packages, we have to be good at defining the supported version ranges for our dependencies. My inclination would be to have a single repo with multiple `setup.py` files within it, so that PRs can touch multiple packages. We don't have to do all of this, or do it all at once, but the community should agree on a plan. Refactoring the hooks and operators out, a set at a time, seems like a really good start.
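For illustration only, the kind of dependency chain discussed in the replies below (an operator package composed from two hook packages) might look roughly like this; again, the names and version ranges are hypothetical:

```python
# Hypothetical transfer-operator micro-package: "airflow-operator-s3-to-gcs".
# It declares no provider dependencies of its own; boto3 and the GCS client
# would come in transitively through the two hook packages it composes.
from setuptools import find_packages, setup

setup(
    name="airflow-operator-s3-to-gcs",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # Placeholder ranges - agreeing on supported version ranges across
        # many small packages is exactly the coordination cost mentioned above.
        "airflow-hook-s3>=0.1,<1.0",
        "airflow-hook-gcs>=0.1,<1.0",
    ],
)
```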
Max

On Thu, Jan 10, 2019 at 8:44 AM Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

> That's not what I meant. If I apply what I meant to your example, we'd have a single package for each hook, `airflow-hook-s3` and `airflow-hook-gcs`, and a package for `airflow-operator-s3-to-gcs`. The operator package would depend on both hook packages.

> There's no code or test duplication there. If fancy mocking solutions are defined for tests, they should be exposed in the hook packages and can be reused in the operator packages.

> Max

> On Wed, Jan 9, 2019 at 11:00 PM airflowuser <airflowu...@protonmail.com.invalid> wrote:

>> @Max I don't see how this is doable.
>> Consider S3ToGoogleCloudStorageOperator: it uses both S3Hook and GoogleCloudStorageHook.

>> With your suggestion we would have to maintain S3Hook in each separate package per operator/sensor, which means, for example, that if a new parameter is added to any of the hooks you have to add it in dozens of places (+ tests).
>> This is very inconvenient.

>> Sent with ProtonMail Secure Email.

>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

>> > If there's a strict policy of having a single hook and a single operator per package, then the hook package would be the only place where the external dependency is defined, and the operator packages would depend on hook package(s). That would follow the "micro package" philosophy and could work pretty well. Every hook and operator can have its own set of maintainers, test/CI and release cadence.

>> > There can even be packages composing common operators, as in "airflow-hadoop-operators" or "airflow-common-databases-operators". For backward compatibility, during a transition phase, Airflow could depend on a new "airflow-backward-compatibility-operators", though ultimately we should encourage people to come up with the right packages/operators for their environment instead.

>> > Max

>> > On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall <felix.uellend...@gmx.de> wrote:

>> > > Regardless of how complex this implementation would be, I am +1 on this.
>> > > From the developer's point of view, the fact that the CI would run so much faster is the biggest plus for me. I think it will only become worse the more dependencies we add.
>> > > From the user's point of view, the fact that I am able to choose, from multiple packages/repositories, only the ones I really want to use, and that I know any contributor to such a repo can probably answer my questions related to these hooks/operators, is the biggest plus for me.
>> > > I know there would be a lot to do, and this gives me headaches even now, but in the end I think it would be a great change that is necessary in the long run.

>> > > - feluelle

>> > > On 08/01/2019 at 17:42, Jarek Potiuk wrote:

>> > > > While splitting the monolithic Airflow architecture into pieces sounds good, there is one problem that might be difficult to tackle (or rather impossible, unless we change the architecture of Airflow significantly) - namely dependencies/requirements.
>> > > > The way Airflow uses operators is that its operators are already closely coupled with the Airflow core. Airflow has to parse all the operators within the same Python interpreter/virtual machine as the core Airflow. This means a potentially big problem with dependency/requirement handling if we have multiple packages. There are enough common/shared dependencies used by various operators to cause occasional headaches even now. We already have quite a challenge handling the dependencies of Airflow and its operators/hooks when they are part of the Airflow repo.
>> > > > Currently the problem is that Airflow sometimes uses outdated dependencies, or that some random transient dependencies break the Airflow installation. But at least we have a common dependency list that we work against for all operators. Unfortunately, if we split, the problem will get worse - very quickly some contrib operators will require different dependencies and will not be compatible with Airflow, or will break Airflow's behaviour.
>> > > > Not to mention the problem when you want to use hooks from some other "area" in your operator. Currently hooks are the way you can speed up development of cross-area behaviour: you implement hooks in some "area" and other "areas" are free, or even encouraged, to use them. For example, exporting from BigQuery to all cloud storages should in principle depend on hooks for every single cloud storage package out there (Google, Azure, AWS). This is even worse than the MySqlToHive case described earlier - very quickly we would end up with a totally unmanageable mesh of cross-dependencies.
>> > > > I think that to really make operators independent from the Airflow core, we would need to allow the dependencies to be fully isolated - i.e. to allow operators to have a different set of dependencies than the core. That's quite impossible with the current Airflow approach, where the same operator code is parsed in the core, the same code is used during execute in the worker, and the same code might be used by another operator in the form of a hook. Unfortunately we are not in the npm world (https://npm.github.io/how-npm-works-docs/index.html, as Kamil Breguła pointed out to me today), where the module loader handles multiple versions of the same library in the same process.
>> > > > One other question that bothers me - I believe (please correct me if I am wrong) some of the operators use core features of Airflow and are even more tied to the core.
>> > > > For example, it is perfectly fine for the operator to use SQLAlchemy ORM classes of Airflow and run queries / perform updates in the metadata database of Airflow, I believe. As far as I know, there is a requirement (I saw this somewhere at least) that Celery or Kubernetes workers need to be able to open a direct database connection to the metadata database of Airflow, and there is nothing to prevent the operators from doing the same. This in essence means that the operator has to depend on many core dependencies/requirements (including sqlalchemy, postgres/mysql, ...). This could be changed, and using Airflow's core features "forbidden", but it might break compatibility (if I am right about it).
>> > > > We could imagine a different approach - where an operator is split into "Proxy" and "Execute" classes: the "Proxy" runs within the core's interpreter with the core's dependencies, and the "Execute" within the worker. Then each task could run in its own Docker image/pod on Kubernetes with its own dependencies. But that looks like a big, backwards-incompatible change, and it still does not solve cross-dependencies between different "areas". For handling cross-area operations we would somehow have to implement communication between different containers, each having its own dependencies. That would be possible in Kubernetes by having a single pod with several containers sharing common data and communicating. Seems possible.
>> > > > It's quite an entertaining idea, but it sounds like Airflow 3.0 already, and one that is not really backwards compatible ;).
>> > > > J.
>> > > > On Tue, Jan 8, 2019 at 5:37 PM Tim Swast <sw...@google.com.invalid> wrote:

>> > > > > > I don't see it solving any problem other than test speed (which is a big one, yes) but doesn't reduce the amount of workload on the committers.

>> > > > > It's about distributed ownership. For example, I'm not a committer on pandas, but I am the primary maintainer of pandas-gbq. You're right that if the set of committers is the same for all 24 repos, there isn't all that much benefit beyond testing speed.

>> > > > > > Each sub-project would still have to follow the normal Apache voting process.

>> > > > > Presumably the set of people that care about the sub-packages will be smaller. I don't know enough about the Apache voting process to know how that might affect it.
>> > > > > Maybe many of the sub-packages can live outside the Apache org? Pandas keeps the I/O sub-packages in a different org, for example.

>> > > > > > Google could choose to release an airflow-gcp-operators package now and tell people to |from gcp.airflow.operators import SomeNewOperator|.

>> > > > > That's actually part of my motivation for this proposal. I've got some red tape to get through, but ideally the proposed airflow-google repository in AIP-8 would actually live in the GoogleCloudPlatform org.
>> > > > > Maybe I should decrease the scope of AIP-8 to Google hooks/operators?
>> > > > > > There is nothing stopping someone /currently/ creating their own operators package.

>> > > > > Hooks still need some support in core, so that connections can be configured. Also, the fact that so many operators live in Airflow makes it seem like an operator is less supported / a hack if it doesn't live there.

>> > > > > > How will we ensure that core changes don't break any hooks/operators?

>> > > > > Pandas does this by running tests in the I/O repos against the pandas master branch in addition to against supported releases.

>> > > > > > How do we support the logging backends for s3/azure/gcp?

>> > > > > I don't see any reason we can't keep doing what we're already doing:
>> > > > > https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
>> > > > > We'd need to adjust the import path for the hook, but so long as the upload / download methods remain stable, it'll work the same. The sub-package will need to ensure it tests the logging code path in addition to testing DAGs that use the relevant operators.

>> > > > > • Tim Swast
>> > > > > • Software Friendliness Engineer
>> > > > > • Google Cloud Developer Relations
>> > > > > • Seattle, WA, USA

>> > > > > On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor <a...@apache.org> wrote:

>> > > > > > Can someone explain to me how having multiple packages will work in practice?
>> > > > > > How will we ensure that core changes don't break any hooks/operators?
>> > > > > > How do we support the logging backends for s3/azure/gcp?
>> > > > > > What would the release process be for the "sub"-packages?
>> > > > > > There is nothing stopping someone /currently/ creating their own operators package. There is nothing whatsoever special about the |airflow.operators| package namespace, and for example Google could choose to release an airflow-gcp-operators package now and tell people to |from gcp.airflow.operators import SomeNewOperator|.
>> > > > > > My view on this is currently -1, as I don't see it solving any problem other than test speed (which is a big one, yes), and it doesn't reduce the amount of workload on the committers - rather it increases it, by having a more complex release process (each sub-project would still have to follow the normal Apache voting process) and having 24 repos to check for PRs rather than just 1.
>> > > > > > Am I missing something?
>> > > > > > ("Core" vs "contrib" made sense when Airflow was still under Airbnb; we should probably just move everything from contrib out to core pre 2.0.0.)
>> > > > > > -ash

>> > > > > > airflowuser wrote on 08/01/2019 15:44:

>> > > > > > > I think the operator should be placed by the source.
>> > > > > > > If it's MySQLToHiveOperator then it would be placed in the MySQL package.
>> > > > > > > The BIG question here is whether this serves an actual improvement, like faster deployment of hook/operator bug-fixes to Airflow users (faster than an actual Airflow release), or whether this is a merely cosmetic issue.

>> > > > > > > I assume that this also covers the unnecessary separation of core and contrib.
>> > > > > > > Sent with ProtonMail Secure Email.
>> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> > > > > > > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

>> > > > > > > > Something to think about is how data transfer operators like the MysqlToHiveOperator usually rely on 2 hooks. With a package-specific approach that may mean something like `airflow-hive`, `airflow-mysql` and `airflow-mysql-hive` packages, where the `airflow-mysql-hive` package depends on the two other packages.
>> > > > > > > > It's just a matter of having a clear strategy, good naming conventions and a nice central place in the docs that centralizes a list of approved packages.
>> > > > > > > > Max
>> > > > > > > > On Mon, Jan 7, 2019 at 9:05 AM Tim Swast <sw...@google.com.invalid> wrote:

>> > > > > > > > > I've created AIP-8:
>> > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
>> > > > > > > > > To follow up from the discussion about splitting hooks/operators out of the core Airflow package at http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>, I propose packaging based on the target system, informed by the existing hooks in both core and contrib. This will allow those with the relevant expertise in each target system to respond to contributions / issues without having to follow the flood of everything Airflow-related. It will also decrease the surface area of the core package, helping with testability and long-term maintenance.

>> > > > > > > > > • Tim Swast
>> > > > > > > > > • Software Friendliness Engineer
>> > > > > > > > > • Google Cloud Developer Relations
>> > > > > > > > > • Seattle, WA, USA