I agree we can merge providers and services (especially as they are not "cloud providers" any more :)

From the discussion above, I think these are the specific proposals:

- *fundamentals* - all the operators/hooks/sensors that are the "Core" of Airflow (base, dbapi) and allow you to run basic examples, implement the basic logic of Airflow (subdags, branch etc.) + generic operators that are the base for others (like generic transfer/sql). The fundamentals will not have a separate package; they will simply stay in "airflow/operators". The fundamentals are usually tightly coupled with the Airflow core, so it makes little sense to separate them out like the other packages.
- *protocols* - integration with protocols that can be implemented by any software (SFTP/mail/etc.)
- *software* - integration with other software that is proprietary or open-source that you can install on-premises (or in the cloud)
- *providers* - integration with cloud providers/services (PAAS/SAAS)

Re: Fokko - For me the software vs. protocols distinction is quite clear. A *protocol* is something that can be implemented by anyone (like SFTP), whereas *software* is something that is delivered by some particular implementations only (like Postgres). But if more people have problems with it, we can merge those two.

Such a split also makes sense on a more abstract level and builds on top of the "ownership" of the components and "stewardship" over that part of the code.

- *fundamentals* - are really part of Apache Airflow. Stewards - core Airflow team.
- *protocols* - are not owned by anyone, they are public and the implementation is fully "open". There are no particular stewards (no need). Users of particular protocols should mainly maintain those and add support for different versions of the protocols.
- *software* - both API and software are controlled by someone outside of Airflow (commercial or open-source project), but the deployment of that software is "owned" by the user installing Airflow. The "stewards" might also be the users, but the controlling party (Oracle for example) might be interested in maintaining those operators as well.
- *providers* - API/software/deployments are fully controlled by a 3rd party. Here most likely the "provider" will be interested in maintaining the operators (and for example, like Google, provide integration guidelines <https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit?usp=drive_web&ouid=112320280470690058978> for their hooks/operators/sensors)

In all cases Airflow Committers will still be responsible for merging the code/reviews etc., but this way it will be clear who can be interested in maintaining the packages.

I am also going to discuss the package names and namespaces in a separate thread about backporting. It is connected to this discussion, but it is more about implementation details and backporting needs.

J.

On Tue, Oct 29, 2019 at 2:39 PM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Also, ansible has something similar:
> https://github.com/ansible/ansible/tree/devel/lib/ansible/modules
>
> Generally, I have been inspired by how Terraform and Ansible have implemented it, and they can serve as an inspiration to us.
>
> On Tue, Oct 29, 2019 at 12:51 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> > Also providers and SAAS could be merged (taking inspiration from Terraform here: https://www.terraform.io/docs/providers/index.html - ignore the menu on the left, that is just for Docs layout, which we could do too -- Docs grouping doesn't have to match code grouping 100%)
> >
> > I would favour fewer sub-packages than more. My only reason for suggesting providers was to make it clear when looking at the code what the purpose of a module is. If "everything" lived under airflow.providers.{$major_cloud,core} I would be okay with that.
> >
> > Can we talk in specifics here too?
> > What package namespaces are you suggesting?
> >
> > -ash
> >
> > On 29 October 2019 12:02:54 GMT, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:
> >
> > Thanks Jarek for clearing that up.
> >
> > Personally I would omit the Apache one. We should not step into the same fallacy as before, with not being sure if it was in contrib or not. I would even consider merging software and protocols, as it is not entirely clear what a protocol is or not. In the end, everything is a protocol; it might be a high-level (FTP) or a low-level protocol (FS).
> >
> > Cheers, Fokko
> >
> > On Tue, Oct 29, 2019 at 12:45, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > Yep. We should definitely discuss the split!
> >
> > For me these are the criteria:
> >
> > - fundamentals - all the operators/hooks/sensors that are the "Core" of Airflow (base, dbapi) and allow you to run basic examples, implement the basic logic of Airflow (subdags, branch etc.) + generic operators that are the base for others (like generic transfer/sql)
> > - providers - integration with cloud providers (PAAS)
> > - apache - integrations with other Apache Software Foundation projects
> > - software - integration with other software that is proprietary or open-source that you can install on-premises (or in the cloud)
> > - protocols - integration with protocols that can be implemented by any software (SFTP/mail/etc.)
> > - services - integration with SAAS solutions
> >
> > From the above list I only have doubts about the "apache" one - the question is whether as part of the Apache Community we want to somehow group those.
> >
> > J.
> >
> > On Tue, Oct 29, 2019 at 11:19 AM Bas Harenslak <basharens...@godatadriven.com> wrote:
> >
> > 1. Sounds good to me
> > 2. Also fine
> > 3. We should have some consensus here. E.g.
> > I’m not sure what groups “fundamentals” and “software” are meant to be :-)
> >
> > While we’re at it: we should really move the BaseOperator out of models. The BaseOperator has no representation in the DB and should be placed together with other scripts where it belongs, i.e. something like airflow.operators.base_operator.
> >
> > Bas
> >
> > On 29 Oct 2019, at 10:43, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > After some consideration and seeing the actual move in practice I wanted to propose a 3rd amendment ;) to AIP-21. I have a few observations from seeing the discussions and observing the actual moving process. I have the following proposals:
> >
> > *1) Between-providers transfer operators should be kept at the "target" rather than "source"*
> >
> > If we end up splitting operators by groups (AIP-8 and the proposed Backporting to Airflow 1.10), I think it makes more sense to keep transfer operators in the "target" package. For example the "S3 to GCS" operator in the "providers/google" package - simply because it is more likely that the individuals working on the pure "GCP" services will also be more interested in getting the data from other cloud providers, and likely they will even have some transfer services that can be used for that purpose (rather than using a worker to transfer the data) - in the particular S3 -> GCS case we have GCP's https://cloud.google.com/storage-transfer/docs/overview which allows transferring data from any other cloud provider to GCS. The same applies if we imagine Athena -> BigQuery, for example. At least that's the feeling I have. I can imagine that the kind of "stewardship" over those groups of operators can be somewhat influenced and maybe even performed by those cloud providers themselves.
> > Corresponding hooks of course should be in different "groups".
> >
> > *2) One-side provider-neutral transfer operators should be kept at the "provider" regardless if they are target or source.*
> >
> > For example GCS -> SFTP or SFTP -> GCS. There the hook for SFTP should be in the "core" package but both operators should be in "providers/google". The reason is quite the same as above - the "stewardship" over all the operators can be done by the "provider" group.
> >
> > *3) Grouping non-provider operators/hooks according to their purpose.*
> >
> > I think it is also the right time to move the other operators/hooks to different groups within core. We already have some reasonable and nice groups proposed in the new documentation by Kamil https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html and it only makes sense to move those now (Fundamentals, ASF: Apache Software Foundation, Azure: Microsoft Azure, AWS: Amazon Web Services, GCP: Google Cloud Platform, Service integrations, Software integrations, Protocol integrations). I think it would make sense to use the same approach in the code. We could have fundamentals/asf/azure(microsoft/azure?)/aws(amazon/aws?)/google/services/software/protocols packages.
> >
> > There will probably be a few exceptions, but we can handle them on a case-by-case basis.
> >
> > J.
> >
> > On Fri, Oct 11, 2019 at 3:11 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > Hello everyone. I updated AIP-21 and updated examples.
> >
> > Point D. of AIP-21 is now as follows:
> >
> > *D.* Group operators/sensors/hooks in *airflow/providers/<PROVIDER>*/operators(sensors, hooks).
> >
> > Each provider can define its own internal structure of that package.
> > For example in case of the "google" provider the packages will be further grouped by "gcp", "gsuite", "core" sub-packages.
> >
> > In case of transfer operators where two providers are involved, the transfer operators will be moved to the "source" of the transfer. When there is only one provider as target but the source is a database or another non-provider source, the operator is put in the target provider.
> >
> > Non-cloud provider ones are moved to airflow/operators(sensors/hooks). *Drop the prefix.*
> >
> > Examples:
> >
> > *AWS operator:*
> >
> > - *airflow/contrib/operators/sns_publish_operator.py* becomes *airflow/providers/aws/operators/sns_publish_operator.py*
> >
> > *Google GCP operator:*
> >
> > - *airflow/contrib/operators/dataproc_operator.py* becomes *airflow/providers/google/gcp/operators/dataproc_operator.py*
> >
> > *Previously GCP-prefixed operator:*
> >
> > - *airflow/contrib/operators/gcp_bigtable_operator.py* becomes *airflow/providers/google/gcp/operators/bigtable_operator.py*
> >
> > *Transfer from GCP:*
> >
> > - *airflow/contrib/operators/gcs_to_s3_operator.py* becomes *airflow/providers/google/gcp/operators/gcs_to_s3_operator.py*
> >
> > *MySQL to GCS:*
> >
> > - *airflow/contrib/operators/mysql_to_gcs_operator.py* becomes *airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py*
> >
> > *SSH operator:*
> >
> > - *airflow/contrib/operators/ssh_operator.py* becomes *airflow/operators/ssh_operator.py*
> >
> > On Fri, Oct 4, 2019 at 6:22 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > Yeah. I think the important point is that the latest doc changes by Kamil index all available operators and hooks nicely and make them easy to find.
> >
> > That also includes (as of today) an automated CI check that newly added operators and hooks are added to the documentation:
> >
> > https://github.com/apache/airflow/commit/104a151d6a19b1ba1281cb00c66a2c3409e1bb13
> >
> > J.
> >
> > On Fri, Oct 4, 2019 at 5:21 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > It's not obvious to me why an S3ToMsSqlOperator in the aws package is "silly". Why do you say it made sense to create a MsSqlFromS3Operator?
> >
> > Basically all of these operators could be thought of as "move data from A to B" or "move data to B from A". I think what feels natural to each individual will depend on what their frame of reference is, and where their main focus is. If you are largely focused on MsSql then I can understand that it's natural to think "What MsSql operators are there?" and to not see S3ToMsSqlOperator as one of those MsSql operators. That's exactly the point I made with my earlier response; I was so focused on BigQuery that I didn't think to look under Cloud Storage documentation for the GoogleCloudStorageToBigQueryOperator.
> >
> > I think it is too hard to draw a very distinct line between what is just "storage" and what is more. There are going to be fuzzy edge cases, so picking a single convention is going to be much less hassle in my view. As long as that convention is well documented and the documentation is improved so that it's easier to find all operators that relate to BigQuery or MsSql etc. in one place (as is being done by Kamil) then that is the best we can do.
> >
> > Chris
> >
> > On Fri, Oct 4, 2019 at 10:55 AM Daniel Standish <dpstand...@gmail.com> wrote:
> >
> > One case popped up for us recently, where it made sense to make a MsSql*From*S3Operator.
> >
> > I think using "source" makes sense in general, but in this case calling this a S3ToMsSqlOperator and putting it under AWS seems silly, even though you could say s3 is "source" here.
> >
> > I think in most of these cases we say "let's use source" because source is where the actual work is done and destination is just storage.
> >
> > Does a guideline saying "ignore storage" or "storage is secondary in object location" make sense?
> >
> > On Fri, Oct 4, 2019 at 6:42 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > It looks like we have general consensus about putting transfer operators into the "source provider" package. That's great for me as well.
> >
> > Since I will be updating AIP-21 to reflect the "google" vs. "gcp" case, I will also update it to add this decision.
> >
> > If no-one objects (Lazy Consensus <https://community.apache.org/committers/lazyConsensus.html>) till Monday 7th of October, 3.20 CEST, we will update AIP-21 with information that transfer operators should be placed in the "source" provider module.
> >
> > J.
> >
> > On Tue, Sep 24, 2019 at 1:34 PM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 7:42 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 1:22 PM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> >
> > On Mon, Sep 23, 2019 at 7:04 PM Chris Palmer <ch...@crpalmer.com> wrote:
> >
> > Is there a reason why we can't use symlinks to have copies of the files show up in both subpackages? So that `gcs_to_s3.py` would be under both `aws/operators/` and `gcp/operators`. I could imagine there may be technical reasons why this is a bad idea, but just thought I would ask.
> >
> > Symlinks are not supported by git.
> >
> > Why do you say that?
> > This blog post <https://www.mokacoding.com/blog/symliks-in-git/> details how you can use them, and the caveats with regards to needing relative links not absolute. The example repo he links to at the end includes a symlink which worked fine for me when I cloned it. But maybe not relevant given the below:
> >
> > We still have to check if python packages can have links, but I'm afraid of this mechanism. This is not popular and may cause unexpected consequences.
> >
> > Likewise, someone who spends 99% of their time working in AWS and using all the operators in that subpackage, might not think to look in the GCP package the first time they need a GCS to S3 operator. I'm admittedly terrible at documentation, but if duplicating the files via symlinks isn't an option, then is there an easy way we could duplicate the documentation for those operators so they are easily findable in both doc sections?
> >
> > Recently, I updated the documentation: https://airflow.readthedocs.io/en/latest/integration.html
> > We have a list of all integrations for AWS, Azure, GCP. If an operator concerns two cloud providers, it is listed in both places. That's good for documentation; the DRY rule is only valid for source code. I am working on documentation for other operators. My work is part of this ticket: https://issues.apache.org/jira/browse/AIRFLOW-5431
> >
> > This updated documentation looks great, definitely heading in a direction that makes it easier and addresses my concerns. (Although it took me a while to realize those tables can be scrolled horizontally!).
> >
> > I'm working on a redesign of the documentation theme.
> > It's part of AIP-11
> >
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-11+Create+a+Landing+Page+for+Apache+Airflow
> >
> > We are currently at the stage of collecting comments from the first phase - we sent materials to the community, but also conducted tests with real users
> >
> > https://lists.apache.org/thread.html/6fa1cdceb97ed17752978a8d4202bf1ff1a86c6b50bbc9d09f694166@%3Cdev.airflow.apache.org%3E
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129

--

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129