I like this proposal to create an "airflow-backward-compatibility-operators" package, which can be released independently of "airflow", and then migrate to "micro-packages" after that. I've updated AIP-8 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303> to describe the details of creating an "airflow-backward-compatibility-operators" package.
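For concreteness, here is a rough sketch of what a shim module in that package might look like - re-exporting a moved operator with a deprecation warning. All package and module names below are hypothetical, not a final layout:

    # airflow_backward_compatibility_operators/mysql_to_hive.py (hypothetical)
    import warnings

    # Assumes the operator has moved to a hypothetical airflow-mysql-hive
    # micro-package; the shim only re-exports it and warns on use.
    from airflow_mysql_hive.operators import MySqlToHiveOperator as _MySqlToHiveOperator


    class MySqlToHiveOperator(_MySqlToHiveOperator):
        def __init__(self, *args, **kwargs):
            warnings.warn(
                "Import MySqlToHiveOperator from airflow_mysql_hive.operators "
                "instead; this backward-compatibility shim will be removed.",
                DeprecationWarning,
                stacklevel=2,
            )
            super().__init__(*args, **kwargs)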
I'll leave the question of who will own the micro-packages for the long-term desired state to a later AIP. The gist is that an owner needs to be identified for each hook/operator, and then a similar migration + deprecation will need to be done for each new micro-package.

* Tim Swast
* Software Friendliness Engineer
* Google Cloud Developer Relations
* Seattle, WA, USA

On Thu, Jan 10, 2019 at 8:59 AM Maxime Beauchemin
<maximebeauche...@gmail.com> wrote:

> A related thought is around the fact that all these libs depend on Airflow
> itself to get the base class they are deriving from (BaseHook and
> BaseOperator mostly). It's a bit upside down when the small library
> depends on a big library. That may be OK as is, but pushing the
> micro-package logic would dictate breaking down `airflow-core`, which in
> turn calls for breaking down `airflow-scheduler` and `airflow-web`.
>
> All this is super disruptive work (it breaks all open PRs), but to me it
> is a much better place to be once we get there. Apache-wise that means
> more [though smaller] releases and more coordination. With more packages,
> we have to be good at defining the supported version ranges of
> dependencies. My inclination would be to have a single repo with multiple
> `setup.py` files within it, so that PRs can touch multiple packages.
>
> We don't have to do all of this, or do it all at once, either, but the
> community should agree on a plan.
>
> Refactoring the hooks and operators out, a set at a time, seems like a
> really good start.
>
> Max
>
> On Thu, Jan 10, 2019 at 8:44 AM Maxime Beauchemin
> <maximebeauche...@gmail.com> wrote:
>
>> That's not what I meant. If I apply what I meant to your example, we'd
>> have a single package for each hook, `airflow-hook-s3` and
>> `airflow-hook-gcs`, and a package for `airflow-operator-s3-to-gcs`. The
>> operator package would depend on both hook packages.
>>
>> There's no code or test duplication there. If fancy mocking solutions
>> are defined for tests, they should be exposed in the hook packages and
>> can be reused in the operator packages.
>>
>> Max
>>
>> On Wed, Jan 9, 2019 at 11:00 PM airflowuser
>> <airflowu...@protonmail.com.invalid> wrote:
>>
>>> @Max I don't see how this is doable.
>>> Consider S3ToGoogleCloudStorageOperator.
>>> It uses both S3Hook and GoogleCloudStorageHook.
>>>
>>> With your suggestion we would have to maintain S3Hook in each separate
>>> package per operator/sensor, which means, for example, that if a new
>>> parameter is added to any of the hooks you have to add it in dozens of
>>> places (+tests).
>>> This is very inconvenient.
>>>
>>> Sent with ProtonMail Secure Email.
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin
>>> <maximebeauche...@gmail.com> wrote:
>>>
>>>> If there's a strict policy of having a single hook and a single
>>>> operator per package, then the hook package would be the only place
>>>> where the external dependency is defined, and the operator packages
>>>> would depend on hook package(s). That would follow the
>>>> "micro-package" philosophy and could work pretty well. Every hook and
>>>> operator can have its own set of maintainers, test/CI, and release
>>>> cadence.
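To make that dependency chain concrete, a sketch of what the operator micro-package's setup.py might declare - package names and version ranges are hypothetical:

    # setup.py for a hypothetical airflow-operator-s3-to-gcs micro-package.
    # It declares no boto3/google-cloud dependency of its own; those arrive
    # transitively through the two hook packages.
    from setuptools import find_packages, setup

    setup(
        name="airflow-operator-s3-to-gcs",
        version="0.1.0",
        packages=find_packages(),
        install_requires=[
            "apache-airflow>=1.10,<2.0",  # provides BaseOperator
            "airflow-hook-s3>=0.1,<1.0",
            "airflow-hook-gcs>=0.1,<1.0",
        ],
    )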
>>>> There can even be packages composing common operators, as in
>>>> "airflow-hadoop-operators" or "airflow-common-databases-operators",
>>>> and for backward compatibility, during a transition phase, Airflow
>>>> could depend on a new "airflow-backward-compatibility-operators",
>>>> though ultimately we should encourage people to come up with the
>>>> right packages/operators for their environment instead.
>>>>
>>>> Max
>>>>
>>>> On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall
>>>> <felix.uellend...@gmx.de> wrote:
>>>>
>>>>> Regardless of how complex this implementation would be, I am +1 on
>>>>> this. From the developer's point of view, the fact that the CI would
>>>>> run so much faster is the biggest plus for me. I think it will only
>>>>> get worse the more dependencies we add.
>>>>> From the user's point of view, the biggest plus is that I could
>>>>> choose from multiple packages/repositories only the ones I really
>>>>> want to use, knowing that any contributor to that repo can probably
>>>>> answer my questions about its hooks/operators.
>>>>> I know there would be a lot to do, and this gives me headaches even
>>>>> now, but in the end I think it would be a great change that is
>>>>> necessary in the long run.
>>>>>
>>>>> - feluelle
>>>>>
>>>>> On 08/01/2019 at 17:42, Jarek Potiuk wrote:
>>>>>
>>>>>> While splitting the monolithic Airflow architecture into pieces
>>>>>> sounds good, there is one problem that might be difficult to tackle
>>>>>> (or rather impossible unless we change the architecture of Airflow
>>>>>> significantly) - namely dependencies/requirements.
>>>>>> The way Airflow uses operators is that they are already closely
>>>>>> coupled with the Airflow core. Airflow has to parse all the
>>>>>> operators within the same Python interpreter/virtual machine as the
>>>>>> core. This means a potentially big problem with
>>>>>> dependency/requirement handling if we have multiple packages. There
>>>>>> are enough common/shared dependencies used by various operators to
>>>>>> cause occasional headaches even now. We already have quite a
>>>>>> challenge handling the dependencies of Airflow and its
>>>>>> operators/hooks when they are part of the Airflow repo.
>>>>>> Currently the problem is that Airflow sometimes uses outdated
>>>>>> dependencies, or some random transitive dependencies break the
>>>>>> Airflow installation. But we at least have a common dependency list
>>>>>> that we work against for all operators. Unfortunately, if we split,
>>>>>> the problem will get worse - very quickly some contrib operators
>>>>>> will require different dependencies and will not be compatible with
>>>>>> Airflow, or will break Airflow's behaviour.
>>>>>> Not to mention the problem when you want to use hooks from some
>>>>>> other "area" in your operator. Currently hooks are how you speed up
>>>>>> development of cross-area behaviour: you implement hooks in one
>>>>>> "area", and other "areas" are free or even encouraged to use them.
>>>>>> For example, exporting from BigQuery to all cloud storages should
>>>>>> in principle depend on hooks for every single cloud storage package
>>>>>> out there (Google, Azure, AWS). This is even worse than the
>>>>>> MySqlToHive case described earlier - very quickly we would end up
>>>>>> with a totally unmanageable mesh of cross-dependencies.
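To illustrate the cross-area mesh described above, a sketch of a made-up export operator that reaches for hooks from three different "areas" (the hook import paths are today's core/contrib ones; the operator itself is hypothetical):

    from airflow.models import BaseOperator


    class BigQueryToAnyStorageOperator(BaseOperator):  # hypothetical
        """Export BigQuery results to GCS, S3, or Azure Blob Storage."""

        def __init__(self, destination, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.destination = destination

        def execute(self, context):
            # Each branch pulls in another area's hook (and its transitive
            # dependencies), so splitting packages by area tangles them.
            if self.destination == "gcs":
                from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook as Hook
            elif self.destination == "s3":
                from airflow.hooks.S3_hook import S3Hook as Hook
            else:
                from airflow.contrib.hooks.wasb_hook import WasbHook as Hook
            # ... export via Hook(...)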
>>>>>> I think to really make operators independent from the Airflow
>>>>>> core, we would need to allow the dependencies to be fully isolated
>>>>>> - i.e. to allow operators to have a different set of dependencies
>>>>>> than the core. That's quite impossible with the current Airflow
>>>>>> approach, where the same operator code is parsed in the core, the
>>>>>> same code is used during execute in the worker, and the same code
>>>>>> might be used by another operator in the form of a hook.
>>>>>> Unfortunately we are not in the npm
>>>>>> <https://npm.github.io/how-npm-works-docs/index.html> world (as
>>>>>> Kamil Breguła pointed out to me today), where the module loader
>>>>>> handles multiple versions of the same library in the same process.
>>>>>> One other question that bothers me - I believe (please correct me
>>>>>> if I am wrong) some of the operators use core features of Airflow
>>>>>> and are even more tied to the core. For example, it is perfectly
>>>>>> fine for an operator to use Airflow's SQLAlchemy ORM classes and
>>>>>> run queries/perform updates in the metadata database of Airflow. As
>>>>>> far as I know, there is a requirement (I saw this somewhere at
>>>>>> least) that Celery or Kubernetes workers need to be able to open a
>>>>>> direct database connection to the metadata database of Airflow, and
>>>>>> there is nothing to prevent the operators from doing the same. This
>>>>>> in essence means that the operator has to depend on many core
>>>>>> dependencies/requirements (including sqlalchemy, postgres/mysql,
>>>>>> ...). This could be changed, and using Airflow's core features
>>>>>> could be "forbidden", but it might break compatibility (if I am
>>>>>> right about it).
>>>>>> We could imagine a different approach - where an operator is split
>>>>>> into a "Proxy" and an "Execute" class: the "Proxy" runs within the
>>>>>> core's interpreter with the core's dependencies, and the "Execute"
>>>>>> within the worker. Then each task could run in its own Docker
>>>>>> image/pod on Kubernetes with its own dependencies. But that looks
>>>>>> like a big, backwards-incompatible change, and it still does not
>>>>>> solve cross-dependencies between different "areas". For handling
>>>>>> cross-area operations we would somehow have to implement
>>>>>> communication between different containers - each having its own
>>>>>> dependencies. That would be possible in Kubernetes by having a
>>>>>> single pod with several containers sharing common data and
>>>>>> communicating. Seems possible.
>>>>>> It's quite an entertaining idea, but it sounds like Airflow 3.0
>>>>>> already, and one that is not really backwards compatible ;).
>>>>>> J.
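A very rough sketch of where the Proxy/Execute seam described above might sit - an illustration, not a design, and all names are hypothetical:

    from airflow.models import BaseOperator


    class BigQueryExportProxy(BaseOperator):  # hypothetical
        """Parsed by the scheduler; depends only on Airflow core."""

        def __init__(self, sql, destination_uri, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.sql = sql
            self.destination_uri = destination_uri

        def execute(self, context):
            # Deferred import: only the worker's container image needs to
            # bundle the heavy client libraries behind this hypothetical
            # module; the scheduler never imports them.
            from bigquery_export_impl import run_export
            return run_export(self.sql, self.destination_uri)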
>>>>>> On Tue, Jan 8, 2019 at 5:37 PM Tim Swast <sw...@google.com.invalid>
>>>>>> wrote:
>>>>>>
>>>>>>>> I don't see it solving any problem other than test speed (which
>>>>>>>> is a big one, yes), but it doesn't reduce the amount of workload
>>>>>>>> on the committers.
>>>>>>>
>>>>>>> It's about distributed ownership. For example, I'm not a committer
>>>>>>> on pandas, but I am the primary maintainer of pandas-gbq. You're
>>>>>>> right that if the set of committers is the same for all 24 repos,
>>>>>>> there isn't all that much benefit beyond testing speed.
>>>>>>>
>>>>>>>> Each sub-project would still have to follow the normal Apache
>>>>>>>> voting process.
>>>>>>>
>>>>>>> Presumably the set of people that care about the sub-packages will
>>>>>>> be smaller. I don't know enough about the Apache voting process to
>>>>>>> know how that might affect it.
>>>>>>> Maybe many of the sub-packages can live outside the Apache org?
>>>>>>> Pandas keeps the I/O sub-packages in a different org, for example.
>>>>>>>
>>>>>>>> Google could choose to release an airflow-gcp-operators package
>>>>>>>> now and tell people to `from gcp.airflow.operators import
>>>>>>>> SomeNewOperator`.
>>>>>>>
>>>>>>> That's actually part of my motivation for this proposal. I've got
>>>>>>> some red tape to get through, but ideally the proposed
>>>>>>> airflow-google repository in AIP-8 would actually live in the
>>>>>>> GoogleCloudPlatform org.
>>>>>>> Maybe I should decrease the scope of AIP-8 to Google
>>>>>>> hooks/operators?
>>>>>>>
>>>>>>>> There is nothing stopping someone /currently/ creating their own
>>>>>>>> operators package.
>>>>>>>
>>>>>>> Hooks still need some support in core, so that connections can be
>>>>>>> configured. Also, the fact that so many operators live in the
>>>>>>> Airflow repo makes it seem like an operator is less supported / a
>>>>>>> hack if it doesn't live there.
>>>>>>>
>>>>>>>> How will we ensure that core changes don't break any
>>>>>>>> hooks/operators?
>>>>>>>
>>>>>>> Pandas does this by running tests in the I/O repos against the
>>>>>>> pandas master branch in addition to against supported releases.
>>>>>>>
>>>>>>>> How do we support the logging backends for s3/azure/gcp?
>>>>>>>
>>>>>>> I don't see any reason we can't keep doing what we're already
>>>>>>> doing:
>>>>>>> https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
>>>>>>> We'd need to adjust the import path for the hook, but so long as
>>>>>>> the upload / download method remains stable, it'll work the same.
>>>>>>> The sub-package will need to ensure it tests the logging code path
>>>>>>> in addition to testing DAGs that use the relevant operators.
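To illustrate that logging point: the task handler already resolves its hook lazily, so moving the hook mostly means changing one import path. A sketch, where the new path is hypothetical:

    from airflow.exceptions import AirflowException


    def _get_gcs_log_hook(remote_conn_id):
        """Resolve the GCS hook used for remote task logs."""
        try:
            # Old path: from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
            from airflow_google.hooks.gcs import GoogleCloudStorageHook  # hypothetical
            return GoogleCloudStorageHook(google_cloud_storage_conn_id=remote_conn_id)
        except ImportError:
            raise AirflowException(
                "Could not import the GCS hook; install the (hypothetical) "
                "airflow-google package to use GCS task logging."
            )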
>>>>>>>
>>>>>>> * Tim Swast
>>>>>>> * Software Friendliness Engineer
>>>>>>> * Google Cloud Developer Relations
>>>>>>> * Seattle, WA, USA
>>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor <a...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Can someone explain to me how having multiple packages will work
>>>>>>>> in practice?
>>>>>>>> How will we ensure that core changes don't break any
>>>>>>>> hooks/operators?
>>>>>>>> How do we support the logging backends for s3/azure/gcp?
>>>>>>>> What would the release process be for the "sub"-packages?
>>>>>>>> There is nothing stopping someone /currently/ creating their own
>>>>>>>> operators package. There is nothing whatsoever special about the
>>>>>>>> `airflow.operators` package namespace, and for example Google
>>>>>>>> could choose to release an airflow-gcp-operators package now and
>>>>>>>> tell people to `from gcp.airflow.operators import SomeNewOperator`.
>>>>>>>> My view on this is currently -1, as I don't see it solving any
>>>>>>>> problem other than test speed (which is a big one, yes), and it
>>>>>>>> doesn't reduce the amount of workload on the committers - rather,
>>>>>>>> it increases it by requiring a more complex release process (each
>>>>>>>> sub-project would still have to follow the normal Apache voting
>>>>>>>> process) and leaving 24 repos to check for PRs rather than just 1.
>>>>>>>> Am I missing something?
>>>>>>>> ("Core" vs "contrib" made sense when Airflow was still under
>>>>>>>> Airbnb; we should probably just move everything from contrib out
>>>>>>>> to core pre-2.0.0.)
>>>>>>>>
>>>>>>>> -ash
>>>>>>>>
>>>>>>>> airflowuser wrote on 08/01/2019 15:44:
>>>>>>>>
>>>>>>>>> I think the operator should be placed by the source.
>>>>>>>>> If it's MySQLToHiveOperator then it would be placed in the MySQL
>>>>>>>>> package.
>>>>>>>>> The BIG question here is whether this serves an actual
>>>>>>>>> improvement, like faster delivery of hook/operator bug fixes to
>>>>>>>>> Airflow users (faster than an actual Airflow release), or
>>>>>>>>> whether this is a mere cosmetic issue.
>>>>>>>>>
>>>>>>>>> I assume that this also covers the unnecessary separation of
>>>>>>>>> core and contrib.
>>>>>>>>> Sent with ProtonMail Secure Email.
>>>>>>>>>
>>>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>>> On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin
>>>>>>>>> <maximebeauche...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Something to think about is how data transfer operators like
>>>>>>>>>> the MysqlToHiveOperator usually rely on 2 hooks. With a
>>>>>>>>>> package-specific approach, that may mean something like
>>>>>>>>>> `airflow-hive`, `airflow-mysql`, and `airflow-mysql-hive`
>>>>>>>>>> packages, where the `airflow-mysql-hive` package depends on the
>>>>>>>>>> two other packages.
>>>>>>>>>> It's just a matter of having a clear strategy, good naming
>>>>>>>>>> conventions, and a nice central place in the docs that
>>>>>>>>>> centralizes the list of approved packages.
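A sketch of how that transfer package might compose the two hooks - the `airflow_mysql` / `airflow_hive` import paths are hypothetical:

    from airflow.models import BaseOperator
    from airflow_hive.hooks import HiveCliHook   # from airflow-hive (hypothetical)
    from airflow_mysql.hooks import MySqlHook    # from airflow-mysql (hypothetical)


    class MySqlToHiveOperator(BaseOperator):
        def __init__(self, sql, hive_table, mysql_conn_id="mysql_default",
                     hive_cli_conn_id="hive_cli_default", *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.sql = sql
            self.hive_table = hive_table
            self.mysql_conn_id = mysql_conn_id
            self.hive_cli_conn_id = hive_cli_conn_id

        def execute(self, context):
            mysql = MySqlHook(mysql_conn_id=self.mysql_conn_id)
            hive = HiveCliHook(hive_cli_conn_id=self.hive_cli_conn_id)
            # ... extract rows via the MySQL hook and load them into
            # self.hive_table via the Hive hook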
>>>>>>>>>>
>>>>>>>>>> Max
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 9:05 AM Tim Swast
>>>>>>>>>> <sw...@google.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> I've created AIP-8:
>>>>>>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
>>>>>>>>>>> to follow up on the discussion about splitting hooks/operators
>>>>>>>>>>> out of the core Airflow package at
>>>>>>>>>>> http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>
>>>>>>>>>>> I propose packaging based on the target system, informed by
>>>>>>>>>>> the existing hooks in both core and contrib. This will allow
>>>>>>>>>>> those with the relevant expertise in each target system to
>>>>>>>>>>> respond to contributions/issues without having to follow the
>>>>>>>>>>> flood of everything Airflow-related. It will also decrease the
>>>>>>>>>>> surface area of the core package, helping with testability and
>>>>>>>>>>> long-term maintenance.
>>>>>>>>>>>
>>>>>>>>>>> * Tim Swast
>>>>>>>>>>> * Software Friendliness Engineer
>>>>>>>>>>> * Google Cloud Developer Relations
>>>>>>>>>>> * Seattle, WA, USA