> I don't see it solving any problem other than test speed (which is a big
> one, yes) but doesn't reduce the amount of workload on the committers.

It's about distributed ownership. For example, I'm not a committer on
pandas, but I am the primary maintainer of pandas-gbq. You're right that
if the set of committers is the same for all 24 repos, there isn't all
that much benefit beyond testing speed.

> Each sub-project would still have to follow the normal Apache voting
> process.

Presumably the set of people that care about the sub-packages will be
smaller. I don't know enough about the Apache voting process to know how
that might affect it. Maybe many of the sub-packages can live outside the
Apache org? Pandas keeps the I/O sub-packages in a different org, for
example.

> Google could choose to release an airflow-gcp-operators package now and
> tell people to |from gcp.airflow.operators import SomeNewOperator|.

That's actually part of my motivation for this proposal. I've got some
red tape to get through, but ideally the proposed airflow-google
repository in AIP-8 would actually live in the GoogleCloudPlatform org.
*Maybe I should decrease the scope of AIP-8 to Google hooks/operators?*

> There is nothing stopping someone /currently/ creating their own
> operators package.

Hooks still need some support in core, so that connections can be
configured. Also, the fact that so many operators live in the Airflow
core makes an operator seem less supported / a hack if it doesn't live
there.
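To make that concrete, here is a minimal sketch of what an out-of-core
operator can look like today. The my_gcp_operators package, the operator,
and the connection id are all made up; the connection lookup is the piece
that still leans on core:

    # my_gcp_operators/operators.py -- a hypothetical third-party package.
    # Nothing requires an operator to live under airflow.operators.
    from airflow.hooks.base_hook import BaseHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class SomeNewOperator(BaseOperator):
        @apply_defaults
        def __init__(self, my_conn_id='my_service_default', *args, **kwargs):
            super(SomeNewOperator, self).__init__(*args, **kwargs)
            self.my_conn_id = my_conn_id

        def execute(self, context):
            # This is the part that still depends on core: connections are
            # configured, stored, and looked up by Airflow itself.
            conn = BaseHook.get_connection(self.my_conn_id)
            self.log.info("Connecting to %s", conn.host)
            # ... talk to the external system here ...

A DAG would then import it with |from my_gcp_operators.operators import
SomeNewOperator|, exactly as you describe.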
> How will we ensure that core changes don't break any hooks/operators?

Pandas does this by running tests in the I/O repos against the pandas
master branch in addition to against supported releases.

> How do we support the logging backends for s3/azure/gcp?

I don't see any reason we can't keep doing what we're already doing:
https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45

We'd need to adjust the import path for the hook, but so long as the
upload / download methods remain stable, it'll work the same. The
sub-package will need to ensure it tests the logging code path in
addition to testing DAGs that use the relevant operators.
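Here is a trimmed-down sketch of what I mean. airflow_google is a
stand-in name for wherever the hook ends up; this is not the current
handler code, just the shape of the dependency:

    # Sketch: a task handler in core using a hook that moved out to a
    # hypothetical airflow_google sub-package. Only the import path
    # changes; core keeps relying on the stable upload() contract.
    import logging
    import tempfile

    class GCSTaskHandlerSketch(logging.Handler):
        def __init__(self, gcp_conn_id='google_cloud_default'):
            super(GCSTaskHandlerSketch, self).__init__()
            self.gcp_conn_id = gcp_conn_id
            self._hook = None

        @property
        def hook(self):
            if self._hook is None:
                # The only change after the split is this import path.
                from airflow_google.hooks.gcs_hook import GoogleCloudStorageHook
                self._hook = GoogleCloudStorageHook(
                    google_cloud_storage_conn_id=self.gcp_conn_id)
            return self._hook

        def gcs_write(self, log, remote_log_location):
            # e.g. "gs://my-bucket/dag/1.log" -> ("my-bucket", "dag/1.log")
            bucket, blob = remote_log_location[len('gs://'):].split('/', 1)
            with tempfile.NamedTemporaryFile(mode='w') as tmpfile:
                tmpfile.write(log)
                tmpfile.flush()
                # The contract core relies on: upload(bucket, object, filename).
                self.hook.upload(bucket, blob, tmpfile.name)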
•  Tim Swast
•  Software Friendliness Engineer
•  Google Cloud Developer Relations
•  Seattle, WA, USA

On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Can someone explain to me how having multiple packages will work in
> practice?
>
> How will we ensure that core changes don't break any hooks/operators?
>
> How do we support the logging backends for s3/azure/gcp?
>
> What would the release process be for the "sub"-packages?
>
> There is nothing stopping someone /currently/ creating their own
> operators package. There is nothing what-so-ever special about the
> |airflow.operators| package namespace, and for example Google could
> choose to release an airflow-gcp-operators package now and tell people
> to |from gcp.airflow.operators import SomeNewOperator|.
>
> My view on this currently is -1, as I don't see it solving any problem
> other than test speed (which is a big one, yes), and it doesn't reduce
> the amount of workload on the committers - rather, it increases it by
> having a more complex release process (each sub-project would still
> have to follow the normal Apache voting process) and having 24 repos to
> check for PRs rather than just 1.
>
> Am I missing something?
>
> ("Core" vs "contrib" made sense when Airflow was still under Airbnb; we
> should probably just move everything from contrib out to core pre-2.0.0.)
>
> -ash
>
> airflowuser wrote on 08/01/2019 15:44:
> > I think the operator should be placed by the source.
> > If it's MySQLToHiveOperator, then it would be placed in the MySQL
> > package.
> >
> > The BIG question here is whether this serves an actual improvement,
> > like faster deployment of hook/operator bug fixes to Airflow users
> > (faster than an actual Airflow release), or whether it is a merely
> > cosmetic issue.
> >
> > I assume that this also covers the unnecessary separation of core and
> > contrib.
> >
> > Sent with ProtonMail Secure Email.
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin
> > <maximebeauche...@gmail.com> wrote:
> >
> >> Something to think about is how data transfer operators like the
> >> MysqlToHiveOperator usually rely on 2 hooks. With a package-specific
> >> approach, that may mean something like `airflow-hive`,
> >> `airflow-mysql`, and `airflow-mysql-hive` packages, where the
> >> `airflow-mysql-hive` package depends on the two other packages.
> >>
> >> It's just a matter of having a clear strategy, good naming
> >> conventions, and a nice central place in the docs that centralizes
> >> the list of approved packages.
> >>
> >> Max
> >>
> >> On Mon, Jan 7, 2019 at 9:05 AM Tim Swast sw...@google.com.invalid
> >> wrote:
> >>
> >>> I've created AIP-8:
> >>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
> >>> to follow up on the discussion about splitting hooks/operators out
> >>> of the core Airflow package at
> >>> http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>
> >>>
> >>> I propose packaging based on the target system, informed by the
> >>> existing hooks in both core and contrib. This will allow those with
> >>> the relevant expertise in each target system to respond to
> >>> contributions / issues without having to follow the flood of
> >>> everything Airflow-related. It will also decrease the surface area
> >>> of the core package, helping with testability and long-term
> >>> maintenance.
> >>>
> >>> • Tim Swast
> >>> • Software Friendliness Engineer
> >>> • Google Cloud Developer Relations
> >>> • Seattle, WA, USA