@Max I don't see how this is doable. Consider S3ToGoogleCloudStorageOperator: it uses both S3Hook and GoogleCloudStorageHook.
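For reference, a minimal sketch of the shape of such an operator - not the actual contrib implementation, just an abbreviated illustration using 1.10-era import paths and hook method names:

    from tempfile import NamedTemporaryFile

    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
    from airflow.hooks.S3_hook import S3Hook
    from airflow.models import BaseOperator


    class S3ToGoogleCloudStorageOperator(BaseOperator):
        """Copy a single S3 key to a GCS object (abbreviated sketch)."""

        def __init__(self, s3_key, dest_gcs_bucket, dest_gcs_object,
                     aws_conn_id='aws_default',
                     gcp_conn_id='google_cloud_default', *args, **kwargs):
            super(S3ToGoogleCloudStorageOperator, self).__init__(*args, **kwargs)
            self.s3_key = s3_key
            self.dest_gcs_bucket = dest_gcs_bucket
            self.dest_gcs_object = dest_gcs_object
            self.aws_conn_id = aws_conn_id
            self.gcp_conn_id = gcp_conn_id

        def execute(self, context):
            # One operator, two integrations: a "hook per operator package"
            # policy would mean re-publishing S3Hook alongside every
            # operator that reads from S3.
            s3_hook = S3Hook(aws_conn_id=self.aws_conn_id)
            gcs_hook = GoogleCloudStorageHook(
                google_cloud_storage_conn_id=self.gcp_conn_id)
            key = s3_hook.get_key(self.s3_key)
            with NamedTemporaryFile() as tmp:
                key.download_fileobj(tmp)
                tmp.flush()
                gcs_hook.upload(self.dest_gcs_bucket, self.dest_gcs_object,
                                tmp.name)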
With your suggestion we would have to maintain S3Hook in each separate
per-operator/sensor package, which means, for example, that if a new
parameter is added to any of the hooks you have to add it in dozens of
places (+tests). This is very inconvenient.

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, January 9, 2019 9:29 PM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:

> If there's a strict policy of having a single hook and a single operator
> per package, then the hook package would be the only place where the
> external dependency is defined, and the operator packages would depend on
> hook package(s). That would follow the "micro package" philosophy and
> could work pretty well. Every hook and operator can have its own set of
> maintainers, test/CI and release cadence.
>
> There can even be packages composing common operators, as in
> "airflow-hadoop-operators" or "airflow-common-databases-operators", and
> for backward compatibility, during a transition phase, Airflow could
> depend on a new "airflow-backward-compatibility-operators", though
> ultimately we should encourage people to come up with the right
> packages/operators for their environment instead.
>
> Max
>
> On Wed, Jan 9, 2019 at 11:12 AM Felix Uellendall <felix.uellend...@gmx.de>
> wrote:
>
> > Regardless of how complex this implementation would be, I am +1 on this.
> > From the developer's point of view, the fact that the CI would run so
> > much faster is the biggest plus for me. I think it will only become
> > worse the more dependencies we add.
> > From the user's point of view, being able to choose from multiple
> > packages/repositories only the ones I really want to use, and knowing
> > that any contributor of that repo can probably answer my questions
> > related to its hooks/operators, is the biggest plus for me.
> > I know there would be a lot to do, and this gives me headaches even now,
> > but in the end I think it would be a great change that is necessary in
> > the long run.
> >
> > - feluelle
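To make Max's "micro package" layout concrete, here is a hypothetical setup.py for a transfer-operator package; every package name below is invented for illustration. Each external SDK would be pinned exactly once, in its hook package, and operator packages would only compose hooks:

    from setuptools import find_packages, setup

    setup(
        name='airflow-s3-to-gcs-operator',  # hypothetical micro package
        version='0.1.0',
        packages=find_packages(),
        install_requires=[
            'apache-airflow>=1.10.0',
            'airflow-s3-hook',   # hypothetical; would own the boto3 pin
            'airflow-gcs-hook',  # hypothetical; would own the
                                 # google-cloud-storage pin
        ],
    )

The catch, as Jarek's message below spells out, is that all of these packages are still imported into the single interpreter that parses DAGs, so two hook packages pinning incompatible versions of a shared transitive dependency cannot be installed together.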
> > On 08/01/2019 at 17:42, Jarek Potiuk wrote:
> >
> > > While splitting the monolithic Airflow architecture into pieces sounds
> > > good, there is one problem that might be difficult to tackle (or
> > > rather impossible unless we change the architecture of Airflow
> > > significantly) - namely dependencies/requirements.
> > > The way Airflow uses operators is that its operators are already
> > > closely coupled with the Airflow core. Airflow has to parse all the
> > > operators within the same Python interpreter/virtual machine as the
> > > core Airflow. This means a potentially big problem with
> > > dependency/requirement handling if we have multiple packages. There
> > > are enough common/shared dependencies that various operators use to
> > > cause occasional headaches even now. We already have quite a challenge
> > > with handling dependencies of Airflow and its operators/hooks when
> > > they are part of the Airflow repo.
> > > Currently the problem is that Airflow sometimes uses outdated
> > > dependencies, or that some random transitive dependencies break the
> > > Airflow installation. But at least we have a common dependency list
> > > that we work against for all operators. Unfortunately, if we split,
> > > the problem will get worse - very quickly some contrib operators will
> > > require different dependencies and will not be compatible with
> > > Airflow, or will break Airflow's behaviour.
> > > Not to mention the problem when you want to use hooks from some other
> > > "area" in your operator. Currently, hooks are the way you can speed up
> > > development of cross-area behaviour. You implement hooks in some
> > > "area" and other "areas" are free or even encouraged to use them. For
> > > example, exporting from BigQuery to all cloud storages should in
> > > principle depend on hooks for every single cloud storage package out
> > > there (Google, Azure, AWS). This is even worse than the MySqlToHive
> > > case described earlier - very quickly we would end up with a totally
> > > unmanageable mesh of cross-dependencies.
> > > I think to really make operators independent from the Airflow core, we
> > > would need to allow the dependencies to be fully isolated - i.e. to
> > > allow operators to have a different set of dependencies than the core.
> > > That's quite impossible with the current Airflow approach, where the
> > > same operator code is parsed in the core, the same code is used during
> > > execute in the worker, and the same code might be used by another
> > > operator in the form of a hook. Unfortunately we are not in the npm
> > > (https://npm.github.io/how-npm-works-docs/index.html) world (as Kamil
> > > Breguła pointed out to me today), where the module loader handles
> > > multiple versions of the same library in the same process.
> > > One other question that bothers me - I believe (please correct me if I
> > > am wrong) some of the operators use core features of Airflow and are
> > > even more tightly tied to the core. For example, it is perfectly fine
> > > for an operator to use Airflow's SQLAlchemy ORM classes and run
> > > queries/perform updates in Airflow's metadata database, I believe - as
> > > far as I know there is a requirement (I saw this somewhere at least)
> > > that Celery or Kubernetes workers need to be able to open a direct
> > > database connection to Airflow's metadata database, and there is
> > > nothing to prevent operators from doing it. This in essence means that
> > > the operator has to depend on many core dependencies/requirements
> > > (including sqlalchemy, postgres/mysql, ...). Using Airflow's core
> > > features could be "forbidden", but that might break compatibility (if
> > > I am right about it).
> > > We could imagine a different approach - where an operator is split
> > > into "Proxy" and "Execute" classes: the "Proxy" runs within the core's
> > > interpreter with the core's dependencies, and the "Execute" within the
> > > worker's. Then each task could run in its own Docker image/pod on
> > > Kubernetes with its own dependencies. But that looks like a big,
> > > backwards-incompatible change, and it still does not solve
> > > cross-dependencies between different "areas". For handling cross-area
> > > operations we would have to somehow implement communication between
> > > different containers - each having its own dependencies. That would be
> > > possible in Kubernetes by having a single pod with several containers
> > > sharing common data and communicating. Seems possible.
> > > It's quite an entertaining idea, but it sounds like Airflow 3.0
> > > already, and one that is not really backwards compatible ;).
> > > J.
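A rough sketch of the "Proxy"/"Execute" split Jarek floats above - entirely hypothetical, as no such classes exist in Airflow:

    class S3ToGcsProxy(object):
        """Parsed in the scheduler's interpreter with core-only dependencies.

        Carries just the serializable metadata the core needs for
        scheduling; it never imports boto3 or the GCS client library.
        """

        def __init__(self, task_id, s3_key, gcs_bucket, gcs_object):
            self.task_id = task_id
            self.payload = {'s3_key': s3_key, 'gcs_bucket': gcs_bucket,
                            'gcs_object': gcs_object}


    class S3ToGcsExecute(object):
        """Runs in its own worker container/pod with its own dependency
        set, which is how its dependencies could differ from the core's."""

        def run(self, payload):
            # Heavy imports happen only here, inside the isolated worker.
            from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
            from airflow.hooks.S3_hook import S3Hook
            s3_hook = S3Hook()
            gcs_hook = GoogleCloudStorageHook()
            # ... copy payload['s3_key'] to the target bucket/object ...

As Jarek notes, the hard part is not this split itself but the cross-area case, where Proxy and Execute classes for different "areas" would have to communicate across container boundaries.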
> > > On Tue, Jan 8, 2019 at 5:37 PM Tim Swast sw...@google.com.invalid
> > > wrote:
> > >
> > > > > I don't see it solving any problem other than test speed (which is
> > > > > a big one, yes) but doesn't reduce the amount of workload on the
> > > > > committers.
> > > >
> > > > It's about distributed ownership. For example, I'm not a committer
> > > > in pandas, but I am the primary maintainer of pandas-gbq. You're
> > > > right that if the set of committers is the same for all 24 repos,
> > > > there isn't all that much benefit beyond testing speed.
> > > >
> > > > > Each sub-project would still have to follow the normal Apache
> > > > > voting process.
> > > >
> > > > Presumably the set of people that care about the sub-packages will
> > > > be smaller. I don't know enough about the Apache voting process to
> > > > know how that might affect it.
> > > > Maybe many of the sub-packages can live outside the Apache org?
> > > > Pandas keeps the I/O sub-packages in a different org, for example.
> > > >
> > > > > Google could choose to release an airflow-gcp-operators package
> > > > > now and tell people to |from gcp.airflow.operators import
> > > > > SomeNewOperator|.
> > > >
> > > > That's actually part of my motivation for this proposal. I've got
> > > > some red tape to get through, but ideally the proposed
> > > > airflow-google repository in AIP-8 would actually live in the
> > > > GoogleCloudPlatform org.
> > > > Maybe I should decrease the scope of AIP-8 to Google hooks/operators?
> > > >
> > > > > There is nothing stopping someone /currently/ from creating their
> > > > > own operators package.
> > > >
> > > > Hooks still need some support in core, so that connections can be
> > > > configured. Also, the fact that so many operators live in the
> > > > Airflow repo makes it seem like an operator is less supported / a
> > > > hack if it doesn't live there.
> > > >
> > > > > How will we ensure that core changes don't break any
> > > > > hooks/operators?
> > > >
> > > > Pandas does this by running tests in the I/O repos against the
> > > > pandas master branch in addition to against supported releases.
> > > >
> > > > > How do we support the logging backends for s3/azure/gcp?
> > > >
> > > > I don't see any reason we can't keep doing what we're already doing:
> > > > https://github.com/apache/airflow/blob/5d75028d2846ed27c90cc4009b6fe81046752b1e/airflow/utils/log/gcs_task_handler.py#L45
> > > > We'd need to adjust the import path for the hook, but so long as the
> > > > upload / download method remains stable, it'll work the same. The
> > > > sub-package will need to ensure it tests the logging code path in
> > > > addition to testing DAGs that use the relevant operators.
> > > >
> > > > • Tim Swast
> > > > • Software Friendliness Engineer
> > > > • Google Cloud Developer Relations
> > > > • Seattle, WA, USA
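The handler Tim links keeps the hook import local to a method, which is why only the import path would need to change after a split. Roughly - abbreviated from the linked gcs_task_handler.py, with the class hierarchy and error handling omitted; the post-split path named in the comment is hypothetical:

    from airflow import configuration as conf


    class GCSTaskHandler(object):
        def _build_hook(self):
            remote_conn_id = conf.get('core', 'REMOTE_LOG_CONN_ID')
            # After a split, only this import would move - e.g. to a
            # hypothetical airflow_google.hooks.gcs_hook module - while
            # the upload/download surface the handler uses stays stable.
            from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
            return GoogleCloudStorageHook(
                google_cloud_storage_conn_id=remote_conn_id)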
> > > > On Tue, Jan 8, 2019 at 7:55 AM Ash Berlin-Taylor <a...@apache.org>
> > > > wrote:
> > > >
> > > > > Can someone explain to me how having multiple packages will work
> > > > > in practice?
> > > > > How will we ensure that core changes don't break any
> > > > > hooks/operators?
> > > > > How do we support the logging backends for s3/azure/gcp?
> > > > > What would the release process be for the "sub"-packages?
> > > > > There is nothing stopping someone /currently/ from creating their
> > > > > own operators package. There is nothing whatsoever special about
> > > > > the |airflow.operators| package namespace, and for example Google
> > > > > could choose to release an airflow-gcp-operators package now and
> > > > > tell people to |from gcp.airflow.operators import SomeNewOperator|.
> > > > > My view on this currently is -1, as I don't see it solving any
> > > > > problem other than test speed (which is a big one, yes), but it
> > > > > doesn't reduce the amount of workload on the committers - rather
> > > > > it increases it by having a more complex release process (each
> > > > > sub-project would still have to follow the normal Apache voting
> > > > > process) and having 24 repos to check for PRs rather than just 1.
> > > > > Am I missing something?
> > > > > ("Core" vs "contrib" made sense when Airflow was still under
> > > > > Airbnb; we should probably just move everything from contrib out
> > > > > to core pre 2.0.0)
> > > > > -ash
> > > > >
> > > > > airflowuser wrote on 08/01/2019 15:44:
> > > > >
> > > > > > I think the operator should be placed by the source.
> > > > > > If it's MySQLToHiveOperator then it would be placed in the MySQL
> > > > > > package.
> > > > > > The BIG question here is whether this serves an actual
> > > > > > improvement, like faster deployment of hook/operator bug-fixes
> > > > > > to Airflow users (faster than an actual Airflow release), or
> > > > > > whether this is a merely cosmetic issue.
> > > > > > I assume that this also covers the unnecessary separation of
> > > > > > core and contrib.
> > > > > > Sent with ProtonMail Secure Email.
> > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > On Monday, January 7, 2019 10:16 PM, Maxime Beauchemin
> > > > > > <maximebeauche...@gmail.com> wrote:
> > > > > >
> > > > > > > Something to think about is how data transfer operators like
> > > > > > > the MysqlToHiveOperator usually rely on 2 hooks. With a
> > > > > > > package-specific approach that may mean something like
> > > > > > > `airflow-hive`, `airflow-mysql` and `airflow-mysql-hive`
> > > > > > > packages, where the `airflow-mysql-hive` package depends on
> > > > > > > the two other packages.
> > > > > > > It's just a matter of having a clear strategy, good naming
> > > > > > > conventions and a nice central place in the docs that
> > > > > > > centralizes a list of approved packages.
> > > > > > > Max
> > > > > > > On Mon, Jan 7, 2019 at 9:05 AM Tim Swast sw...@google.com.invalid
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I've created AIP-8:
> > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=100827303
> > > > > > > > To follow up on the discussion about splitting
> > > > > > > > hooks/operators out of the core Airflow package at
> > > > > > > > http://mail-archives.apache.org/mod_mbox/airflow-dev/201809.mbox/<308670db-bd2a-4738-81b1-3f6fb312c...@apache.org>
> > > > > > > > I propose packaging based on the target system, informed by
> > > > > > > > the existing hooks in both core and contrib. This will allow
> > > > > > > > those with the relevant expertise in each target system to
> > > > > > > > respond to contributions / issues without having to follow
> > > > > > > > the flood of everything Airflow-related. It will also
> > > > > > > > decrease the surface area of the core package, helping with
> > > > > > > > testability and long-term maintenance.
> > > > > > > > • Tim Swast
> > > > > > > > • Software Friendliness Engineer
> > > > > > > > • Google Cloud Developer Relations
> > > > > > > > • Seattle, WA, USA
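On Ash's point that nothing is special about the |airflow.operators| namespace, combined with Tim's caveat that hooks still need some support in core: a minimal sketch of a hook shipped entirely outside Airflow (all names hypothetical). It can live in any package, but credential lookup still round-trips through core's connection machinery:

    from airflow.hooks.base_hook import BaseHook


    class MyServiceHook(BaseHook):
        """Hypothetical third-party hook living outside the Airflow repo."""

        def __init__(self, my_conn_id='my_service_default'):
            self.my_conn_id = my_conn_id

        def get_conn(self):
            # get_connection() is the core dependency that cannot be
            # packaged away: it resolves the connection from Airflow's
            # metadata database (or environment variables).
            conn = self.get_connection(self.my_conn_id)
            # Hand conn.host / conn.login / conn.password to whatever
            # client library this hook wraps (omitted here).
            return conn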