A related thought is around the fact that all these libs depend on Airflow itself to get the base classes they derive from (BaseHook and BaseOperator, mostly). It's a bit upside down when the small library depends on a big library. That may be OK as is, but pushing the micro-package logic further would dictate breaking down `airflow-core`, which in turn calls for breaking down `airflow-scheduler` and `airflow-web`.
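To make the "upside down" dependency concrete, here is a rough sketch of what a single-hook micro-package's `setup.py` might look like; the package name and version ranges are purely illustrative, not an agreed convention:

```python
# Hypothetical single-hook micro-package: "airflow-hook-s3" (illustrative name).
# Even this tiny package still has to depend on the full apache-airflow
# distribution, just to be able to subclass BaseHook.
from setuptools import find_packages, setup

setup(
    name="airflow-hook-s3",
    version="0.1.0",
    packages=find_packages(),        # would contain e.g. airflow_hook_s3/s3_hook.py
    install_requires=[
        "apache-airflow>=1.10.0",    # pulls in the whole core only for BaseHook
        "boto3>=1.7.0",              # the actual provider dependency
    ],
)
```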
All this is super disruptive work (it breaks all open PRs), but to me it's a much better place to be once we get there. Apache-wise, that means more [though smaller] releases and more coordination. With more packages, we have to be good at defining the supported version ranges for our dependencies. My inclination would be to have a single repo with multiple `setup.py` files within it, so that PRs can touch multiple packages. We don't have to do all of this, or do it all at once, but the community should agree on a plan. Refactoring the hooks and operators out, a set at a time, seems like a really good start.
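For illustration only, the kind of dependency chain discussed in the replies below (an operator package composed from two hook packages) might look roughly like this; again, the names and version ranges are hypothetical:

```python
# Hypothetical transfer-operator micro-package: "airflow-operator-s3-to-gcs".
# It declares no provider dependencies of its own; boto3 and the GCS client
# would come in transitively through the two hook packages it composes.
from setuptools import find_packages, setup

setup(
    name="airflow-operator-s3-to-gcs",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # Placeholder ranges - agreeing on supported version ranges across
        # many small packages is exactly the coordination cost mentioned above.
        "airflow-hook-s3>=0.1,<1.0",
        "airflow-hook-gcs>=0.1,<1.0",
    ],
)
```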
Max

On Thu, Jan 10, 2019 at 8:44 AM Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

> That's not what I meant. If I apply what I meant to your example, we'd have a single package for each hook, `airflow-hook-s3` and `airflow-hook-gcs`, and a package for `airflow-operator-s3-to-gcs`. The operator package would depend on both hook packages.

> There's no code or test duplication there. If fancy mocking solutions are defined for tests, they should be exposed in the hook packages and can be reused in the operator packages.

> Max

> On Wed, Jan 9, 2019 at 11:00 PM airflowuser <airflowu...@protonmail.com.invalid> wrote:

>> @Max I don't see how this is doable.
>> Consider S3ToGoogleCloudStorageOperator: it uses both S3Hook and GoogleCloudStorageHook.

>> With your suggestion we would have to maintain S3Hook in each separate package per operator/sensor, which means, for example, that if a new parameter is added to any of the hooks you have to add it in dozens of places (+ tests).
>> This is very inconvenient.

>> Sent with ProtonMail Secure Email.

>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

>> > If there's a strict policy of having a single hook and a single operator per package, then the hook package would be the only place where the external dependency is defined, and the operator packages would depend on hook package(s). That would follow the "micro package" philosophy and could work pretty well. Every hook and operator can have its own set of maintainers, test/CI and release cadence.

>> > There can even be packages composing common operators, as in "airflow-hadoop-operators" or "airflow-common-databases-operators". For backward compatibility, during a transition phase, Airflow could depend on a new "airflow-backward-compatibility-operators", though ultimately we should encourage people to come up with the right packages/operators for their environment instead.

>> > Max

>> > On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall <felix.uellend...@gmx.de> wrote:

>> > > Regardless of how complex this implementation would be, I am +1 on this.
>> > > From the developer's point of view, the fact that the CI would run so much faster is the biggest plus for me. I think it will only become worse the more dependencies we add.
>> > > From the user's point of view, the fact that I am able to choose, from multiple packages/repositories, only the ones I really want to use, and that I know any contributor to such a repo can probably answer my questions related to these hooks/operators, is the biggest plus for me.
>> > > I know there would be a lot to do, and this gives me headaches even now, but in the end I think it would be a great change that is necessary in the long run.

>> > > - feluelle

>> > > On 08/01/2019 at 17:42, Jarek Potiuk wrote:

>> > > > While splitting the monolithic Airflow architecture into pieces sounds good, there is one problem that might be difficult to tackle (or rather impossible, unless we change the architecture of Airflow significantly) - namely dependencies/requirements.
>> > > > The way Airflow uses operators is that its operators are already closely coupled with the Airflow core. Airflow has to parse all the operators within the same Python interpreter/virtual machine as the core Airflow. This means a potentially big problem with dependency/requirement handling if we have multiple packages. There are enough common/shared dependencies used by various operators to cause occasional headaches even now. We already have quite a challenge handling the dependencies of Airflow and its operators/hooks when they are part of the Airflow repo.
>> > > > Currently the problem is that Airflow sometimes uses outdated dependencies, or that some random transient dependencies break the Airflow installation. But at least we have a common dependency list that we work against for all operators. Unfortunately, if we split, the problem will get worse - very quickly some contrib operators will require different dependencies and will not be compatible with Airflow, or will break Airflow's behaviour.
>> > > > Not to mention the problem when you want to use hooks from some other "area" in your operator. Currently hooks are the way you can speed up development of cross-area behaviour: you implement hooks in some "area" and other "areas" are free, or even encouraged, to use them. For example, exporting from BigQuery to all cloud storages should in principle depend on hooks for every single cloud storage package out there (Google, Azure, AWS). This is even worse than the MySqlToHive case described earlier - very quickly we would end up with a totally unmanageable mesh of cross-dependencies.
>> > > > I think that to really make operators independent from the Airflow core, we would need to allow the dependencies to be fully isolated - i.e. to allow operators to have a different set of dependencies than the core. That's quite impossible with the current Airflow approach, where the same operator code is parsed in the core, the same code is used during execute in the worker, and the same code might be used by another operator in the form of a hook. Unfortunately we are not in the npm world (https://npm.github.io/how-npm-works-docs/index.html, as Kamil Breguła pointed out to me today), where the module loader handles multiple versions of the same library in the same process.
>> > > > One other question that bothers me - I believe (please correct me if I am wrong) some of the operators use core features of Airflow and are even more tied to the core.
>> > > > For example, it is perfectly fine for the operator to use SQLAlchemy ORM classes of Airflow and run queries / perform updates in the metadata database of Airflow, I believe. As far as I know, there is a requirement (I saw this somewhere at least) that Celery or Kubernetes workers need to be able to open a direct database connection to the metadata database of Airflow, and there is nothing to prevent the operators from doing the same. This in essence means that the operator has to depend on many core dependencies/requirements (including sqlalchemy, postgres/mysql, ...). This could be changed, and using Airflow's core features "forbidden", but it might break compatibility (if I am right about it).
>> > > > We could imagine a different approach - where an operator is split into "Proxy" and "Execute" classes: the "Proxy" runs within the core's interpreter with the core's dependencies, and the "Execute" within the worker. Then each task could run in its own Docker image/pod on Kubernetes with its own dependencies. But that looks like a big, backwards-incompatible change, and it still does not solve cross-dependencies between different "areas". For handling cross-area operations we would somehow have to implement communication between different containers, each having its own dependencies. That would be possible in Kubernetes by having a single pod with several containers sharing common data and communicating. Seems possible.
>> > > > It's quite an entertaining idea, but it sounds like Airflow 3.0 already, and one that is not really backwards compatible ;).
>> > > > J.
>> > > > On Tue, Jan 8, 2019 at 5:37 PM Tim Swast <sw...@google.com.invalid> wrote:

>> > > > > > I don't see it solving any problem other than test speed (which is a big one, yes) but doesn't reduce the amount of workload on the committers.

>> > > > > It's about distributed ownership. For example, I'm not a committer on pandas, but I am the primary maintainer of pandas-gbq. You're right that if the set of committers is the same for all 24 repos, there isn't all that much benefit beyond testing speed.

>> > > > > > Each sub-project would still have to follow the normal Apache voting process.

>> > > > > Presumably the set of people that care about the sub-packages will be smaller. I don't know enough about the Apache voting process to know how that might affect it.
>> > > > > Maybe many of the sub-packages can live outside the Apache org? Pandas keeps the I/O sub-packages in a different org, for example.

>> > > > > > Google could choose to release an airflow-gcp-operators package now and tell people to |from gcp.airflow.operators import SomeNewOperator|.

>> > > > > That's actually part of my motivation for this proposal. I've got some red tape to get through, but ideally the proposed airflow-google repository in AIP-8 would actually live in the GoogleCloudPlatform org.
>> > > > > Maybe I should decrease the scope of AIP-8 to Google hooks/operators?
>> > > > > > There is nothing stopping someone /currently/ creating their own operators package.

>> > > > > Hooks still need some support in core, so that connections can be configured. Also, the fact that so many operators live in Airflow makes it seem like an operator is less supported / a hack if it doesn't live there.

>> > > > > > How will we ensure that core changes don't break any hooks/operators?

>> > > > > Pandas does this by running tests in the I/O repos against the pandas master branch in addition to against supported releases.

>> > > > > > How do we support the logging backends for s3/azure/gcp?

>> > > > > I don't see any reason we can't keep doing what we're already doing:
>> > > > > https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
>> > > > > We'd need to adjust the import path for the hook, but so long as the upload / download methods remain stable, it'll work the same. The sub-package will need to ensure it tests the logging code path in addition to testing DAGs that use the relevant operators.

>> > > > > • Tim Swast
>> > > > > • Software Friendliness Engineer
>> > > > > • Google Cloud Developer Relations
>> > > > > • Seattle, WA, USA

>> > > > > On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor <a...@apache.org> wrote:

>> > > > > > Can someone explain to me how having multiple packages will work in practice?
>> > > > > > How will we ensure that core changes don't break any hooks/operators?
>> > > > > > How do we support the logging backends for s3/azure/gcp?
>> > > > > > What would the release process be for the "sub"-packages?
>> > > > > > There is nothing stopping someone /currently/ creating their own operators package. There is nothing whatsoever special about the |airflow.operators| package namespace, and for example Google could choose to release an airflow-gcp-operators package now and tell people to |from gcp.airflow.operators import SomeNewOperator|.
>> > > > > > My view on this is currently -1, as I don't see it solving any problem other than test speed (which is a big one, yes), and it doesn't reduce the amount of workload on the committers - rather it increases it, by having a more complex release process (each sub-project would still have to follow the normal Apache voting process) and having 24 repos to check for PRs rather than just 1.
>> > > > > > Am I missing something?
>> > > > > > ("Core" vs "contrib" made sense when Airflow was still under Airbnb; we should probably just move everything from contrib out to core pre 2.0.0.)
>> > > > > > -ash

>> > > > > > airflowuser wrote on 08/01/2019 15:44:

>> > > > > > > I think the operator should be placed by the source.
>> > > > > > > If it's MySQLToHiveOperator then it would be placed in the MySQL package.
>> > > > > > > The BIG question here is whether this serves an actual improvement, like faster deployment of hook/operator bug-fixes to Airflow users (faster than an actual Airflow release), or whether this is a merely cosmetic issue.

>> > > > > > > I assume that this also covers the unnecessary separation of core and contrib.
>> > > > > > > Sent with ProtonMail Secure Email.
>> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> > > > > > > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

>> > > > > > > > Something to think about is how data transfer operators like the MysqlToHiveOperator usually rely on 2 hooks. With a package-specific approach that may mean something like `airflow-hive`, `airflow-mysql` and `airflow-mysql-hive` packages, where the `airflow-mysql-hive` package depends on the two other packages.
>> > > > > > > > It's just a matter of having a clear strategy, good naming conventions and a nice central place in the docs that centralizes a list of approved packages.
>> > > > > > > > Max
>> > > > > > > > On Mon, Jan 7, 2019 at 9:05 AM Tim Swast <sw...@google.com.invalid> wrote:

>> > > > > > > > > I've created AIP-8:
>> > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
>> > > > > > > > > To follow up from the discussion about splitting hooks/operators out of the core Airflow package at http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>, I propose packaging based on the target system, informed by the existing hooks in both core and contrib. This will allow those with the relevant expertise in each target system to respond to contributions / issues without having to follow the flood of everything Airflow-related. It will also decrease the surface area of the core package, helping with testability and long-term maintenance.

>> > > > > > > > > • Tim Swast
>> > > > > > > > > • Software Friendliness Engineer
>> > > > > > > > > • Google Cloud Developer Relations
>> > > > > > > > > • Seattle, WA, USA