Also, Ansible has something similar: https://github.com/ansible/ansible/tree/devel/lib/ansible/modules
Generally, I have been inspired by how Terraform and Ansible have implemented this; both can serve as an inspiration for us.

On Tue, Oct 29, 2019 at 12:51 PM Ash Berlin-Taylor <a...@apache.org> wrote:

Also, providers and SaaS could be merged (taking inspiration from Terraform here: https://www.terraform.io/docs/providers/index.html - ignore the menu on the left, that is just the docs layout, which we could adopt too -- docs grouping doesn't have to match code grouping 100%).

I would favour fewer sub-packages rather than more. My only reason for suggesting providers was to make it clear, when looking at the code, what the purpose of a module is. If "everything" lived under airflow.providers.{$major_cloud,core} I would be okay with that.

Can we talk in specifics here too? What package namespaces are you suggesting?

-ash

On 29 October 2019 12:02:54 GMT, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:

Thanks Jarek for clearing that up.

Personally I would omit the Apache one. We should not fall into the same trap as before, where it was never clear whether something belonged in contrib or not. I would even consider merging software and protocols, as it is not entirely clear what counts as a protocol. In the end, everything is a protocol, whether high-level (FTP) or low-level (FS).

Cheers, Fokko

On Tue, 29 Oct 2019 at 12:45, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Yep. We should definitely discuss the split! For me these are the criteria:

- fundamentals - all the operators/hooks/sensors that are the "core" of Airflow (base, dbapi), allow you to run the basic examples, implement basic Airflow logic (subdags, branching, etc.), plus generic operators that serve as a base for others (like the generic transfer/SQL operators)
- providers - integrations with cloud providers (PaaS)
- apache - integrations with other Apache Software Foundation projects
- software - integrations with other software, proprietary or open-source, that you can install on-premises (or in the cloud)
- protocols - integrations with protocols that can be implemented by any software (SFTP, mail, etc.)
- services - integrations with SaaS solutions

From the above list I only have doubts about the "apache" one - the question is whether, as part of the Apache community, we want to group those separately.

J.

On Tue, Oct 29, 2019 at 11:19 AM Bas Harenslak <basharens...@godatadriven.com> wrote:

1. Sounds good to me
2. Also fine
3. We should have some consensus here. E.g. I’m not sure what the groups “fundamentals” and “software” are meant to be :-)

While we’re at it: we should really move the BaseOperator out of models. The BaseOperator has no representation in the DB and should be placed together with the other code where it belongs, i.e. something like airflow.operators.base_operator.

Bas
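To make the grouping discussed above concrete, here is a minimal sketch of what the split could look like on disk. The group names follow Jarek's list; every module path below is an assumption chosen for illustration, not something agreed in this thread.

```python
# Illustrative sketch of the package grouping proposed above.
# All module paths are assumptions, not paths agreed in the thread.
proposed_layout = {
    "fundamentals": [
        "airflow/operators/bash_operator.py",
        "airflow/operators/python_operator.py",
    ],
    "providers": [
        "airflow/providers/google/gcp/operators/bigquery_operator.py",
        "airflow/providers/aws/operators/sns_publish_operator.py",
    ],
    "apache": ["airflow/apache/hive/operators/hive_operator.py"],
    "software": ["airflow/software/docker/operators/docker_operator.py"],
    "protocols": ["airflow/protocols/sftp/operators/sftp_operator.py"],
    "services": ["airflow/services/slack/operators/slack_operator.py"],
}

# Print the hypothetical layout, one group per block.
for group, modules in proposed_layout.items():
    print(f"{group}:")
    for path in modules:
        print(f"  {path}")
```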
On 29 Oct 2019, at 10:43, Jarek Potiuk <jarek.pot...@polidea.com> wrote:

After some consideration, and after seeing the actual move in practice, I wanted to propose a 3rd amendment ;) to AIP-21. I have a few observations from following the discussions and watching the actual moving process. I have the following proposals:

1) Between-providers transfer operators should be kept at the "target" rather than the "source"

If we end up splitting operators into groups (AIP-8 and the proposed backporting to Airflow 1.10), I think it makes more sense to keep transfer operators in the "target" package - for example, the "S3 to GCS" operator in the "providers/google" package - simply because the people working on the pure "GCP" services are also the ones most likely to be interested in getting data in from other cloud providers, and they will likely even have transfer services that can be used for that purpose (rather than using a worker to transfer the data). In the particular S3 -> GCS case we have GCP's https://cloud.google.com/storage-transfer/docs/overview, which can transfer data from any other cloud provider to GCS. The same would apply to, say, Athena -> BigQuery. At least that's the feeling I have. I can imagine that this kind of "stewardship" over those groups of operators could be influenced, and maybe even performed, by the cloud providers themselves. The corresponding hooks should of course stay in their own groups.

2) One-side provider-neutral transfer operators should be kept at the "provider", regardless of whether it is the target or the source

For example GCS -> SFTP or SFTP -> GCS. There the hook for SFTP should be in the "core" package, but both operators should be in "providers/google". The reason is much the same as above - the "stewardship" over all those operators can be done by the "provider" group.

3) Grouping non-provider operators/hooks according to their purpose

I think this is also the right time to move the other operators/hooks into different groups within core. We already have some reasonable and nice groups proposed in the new documentation by Kamil, https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html, and it only makes sense to move those now (Fundamentals, ASF: Apache Software Foundation, Azure: Microsoft Azure, AWS: Amazon Web Services, GCP: Google Cloud Platform, Service integrations, Software integrations, Protocol integrations). I think it would make sense to use the same approach in the code: we could have fundamentals/asf/azure(microsoft/azure?)/aws(amazon/aws?)/google/services/software/protocols packages.

There will probably be a few exceptions, but we can handle them on a case-by-case basis.

J.
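A small sketch to make the placement rules in proposals 1) and 2) above concrete. The provider names and the helper function are assumptions chosen for illustration; the rules encoded here follow this proposal (target-side placement for between-provider transfers), not the earlier "source-side" wording of AIP-21.

```python
# Illustrative only: the placement rules from proposals 1) and 2) above,
# written out as a tiny helper. Provider names and paths are assumptions.
PROVIDERS = {"google", "aws", "azure"}


def transfer_operator_package(source: str, target: str) -> str:
    """Return the package that would own a <source>-to-<target> transfer operator."""
    if source in PROVIDERS and target in PROVIDERS:
        # 1) between-providers transfers live with the *target* provider
        return f"airflow/providers/{target}/operators"
    if target in PROVIDERS:
        # 2) provider-neutral source (e.g. SFTP, MySQL) -> the provider side owns it
        return f"airflow/providers/{target}/operators"
    if source in PROVIDERS:
        # 2) provider source -> provider-neutral target: still the provider side
        return f"airflow/providers/{source}/operators"
    # neither side is a cloud provider: stays in core
    return "airflow/operators"


print(transfer_operator_package("aws", "google"))   # S3 -> GCS:  airflow/providers/google/operators
print(transfer_operator_package("sftp", "google"))  # SFTP -> GCS: airflow/providers/google/operators
print(transfer_operator_package("google", "sftp"))  # GCS -> SFTP: airflow/providers/google/operators
```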
On Fri, Oct 11, 2019 at 3:11 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Hello everyone. I updated AIP-21 and updated the examples.

Point D. of AIP-21 is now as follows:

D. Group operators/sensors/hooks in airflow/providers/<PROVIDER>/operators (sensors, hooks).

Each provider can define its own internal structure for that package. For example, in the case of the "google" provider the packages will be further grouped into "gcp", "gsuite" and "core" sub-packages.

In the case of transfer operators where two providers are involved, the transfer operator is moved to the "source" of the transfer. When there is only one provider as the target, but the source is a database or another non-provider source, the operator is put into the target provider.

Non-cloud-provider ones are moved to airflow/operators (sensors, hooks). Drop the prefix.

Examples:

AWS operator:
- airflow/contrib/operators/sns_publish_operator.py becomes airflow/providers/aws/operators/sns_publish_operator.py

Google GCP operator:
- airflow/contrib/operators/dataproc_operator.py becomes airflow/providers/google/gcp/operators/dataproc_operator.py

Previously GCP-prefixed operator:
- airflow/contrib/operators/gcp_bigtable_operator.py becomes airflow/providers/google/gcp/operators/bigtable_operator.py

Transfer from GCP:
- airflow/contrib/operators/gcs_to_s3_operator.py becomes airflow/providers/google/gcp/operators/gcs_to_s3_operator.py

MySQL to GCS:
- airflow/contrib/operators/mysql_to_gcs_operator.py becomes airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py

SSH operator:
- airflow/contrib/operators/ssh_operator.py becomes airflow/operators/ssh_operator.py
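As a side note on the mechanics of such a move (this is purely a hypothetical sketch, not something specified in AIP-21 or in this thread): the old contrib paths in the examples above could be kept importable during a deprecation period with a thin shim module left at the old location. The new module path and the class name below are assumptions used for illustration.

```python
# Hypothetical contents of airflow/contrib/operators/gcs_to_s3_operator.py
# after the move in the examples above. Paths and class name are assumptions.
import warnings

# Re-export from the new location so existing DAGs importing the old path keep working.
from airflow.providers.google.gcp.operators.gcs_to_s3_operator import (  # noqa: F401
    GoogleCloudStorageToS3Operator,
)

warnings.warn(
    "airflow.contrib.operators.gcs_to_s3_operator is deprecated; import from "
    "airflow.providers.google.gcp.operators.gcs_to_s3_operator instead.",
    DeprecationWarning,
    stacklevel=2,
)
```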
On Fri, Oct 4, 2019 at 6:22 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Yeah. I think the important point is that the latest doc changes by Kamil index all available operators and hooks nicely and make them easy to find. That also includes (as of today) automated CI checking that newly added operators and hooks are added to the documentation: https://github.com/apache/airflow/commit/104a151d6a19b1ba1281cb00c66a2c3409e1bb13

J.

On Fri, Oct 4, 2019 at 5:21 PM Chris Palmer <ch...@crpalmer.com> wrote:

It's not obvious to me why an S3ToMsSqlOperator in the aws package is "silly". Why do you say it made sense to create a MsSqlFromS3Operator?

Basically all of these operators could be thought of as "move data from A to B" or "move data to B from A". I think what feels natural to each individual will depend on their frame of reference and where their main focus is. If you are largely focused on MsSql, then I can understand that it's natural to ask "What MsSql operators are there?" and not to see S3ToMsSqlOperator as one of those MsSql operators. That's exactly the point I made in my earlier response: I was so focused on BigQuery that I didn't think to look under the Cloud Storage documentation for the GoogleCloudStorageToBigQueryOperator.

I think it is too hard to draw a distinct line between what is just "storage" and what is more. There are going to be fuzzy edge cases, so picking a single convention is going to be much less hassle in my view. As long as that convention is well documented, and the documentation is improved so that it's easier to find all operators that relate to BigQuery or MsSql etc. in one place (as is being done by Kamil), then that is the best we can do.

Chris

On Fri, Oct 4, 2019 at 10:55 AM Daniel Standish <dpstand...@gmail.com> wrote:

One case popped up for us recently where it made sense to create a MsSqlFromS3Operator. I think using "source" makes sense in general, but in this case calling it an S3ToMsSqlOperator and putting it under AWS seems silly, even though you could say S3 is the "source" here.

I think in most of these cases we say "let's use source" because the source is where the actual work is done and the destination is just storage. Does a guideline saying "ignore storage" or "storage is secondary in object location" make sense?

On Fri, Oct 4, 2019 at 6:42 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

It looks like we have general consensus about putting transfer operators into the "source provider" package. That's great for me as well.

Since I will be updating AIP-21 to reflect the "google" vs. "gcp" case, I will also update it to add this decision.

If no-one objects (Lazy Consensus: https://community.apache.org/committers/lazyConsensus.html) until Monday, 7th of October, 3.20 CEST, we will update AIP-21 with the information that transfer operators should be placed in the "source" provider module.

J.

On Tue, Sep 24, 2019 at 1:34 PM Kamil Breguła <kamil.breg...@polidea.com> wrote, replying inline to an exchange with Chris Palmer <ch...@crpalmer.com> from Sep 23:

Chris Palmer: Is there a reason why we can't use symlinks to have copies of the files show up in both subpackages, so that `gcs_to_s3.py` would be under both `aws/operators/` and `gcp/operators/`? I could imagine there may be technical reasons why this is a bad idea, but I thought I would ask.

Kamil Breguła: Symlinks are not supported by git.

Chris Palmer: Why do you say that? This blog post, https://www.mokacoding.com/blog/symliks-in-git/, details how you can use them, and the caveats with regard to needing relative rather than absolute links. The example repo linked at the end includes a symlink which worked fine for me when I cloned it. But maybe not relevant given the below:

Kamil Breguła: We still have to check whether Python packages can contain symlinks, but I'm afraid of this mechanism. It is not popular and may cause unexpected consequences.
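For what it's worth, a filesystem symlink is not the only way to make the same operator importable from two subpackages: a thin Python re-export module achieves the same effect. This is purely an illustrative sketch under the `aws/operators/` and `gcp/operators/` layout discussed above; the module paths and class name are assumptions, and nothing like this was agreed in the thread.

```python
# Hypothetical airflow/aws/operators/gcs_to_s3.py that simply re-exports the
# implementation living under the GCP subpackage, so the operator can be
# imported from both aws/operators/ and gcp/operators/ without a symlink.
# Module paths and class name are assumptions for illustration only.
from airflow.gcp.operators.gcs_to_s3 import (  # noqa: F401
    GoogleCloudStorageToS3Operator,
)

__all__ = ["GoogleCloudStorageToS3Operator"]
```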
Chris Palmer: Likewise, someone who spends 99% of their time working in AWS and using all the operators in that subpackage might not think to look in the GCP package the first time they need a GCS-to-S3 operator. I'm admittedly terrible at documentation, but if duplicating the files via symlinks isn't an option, is there an easy way we could duplicate the documentation for those operators so they are easily findable in both doc sections?

Kamil Breguła: Recently I updated the documentation: https://airflow.readthedocs.io/en/latest/integration.html. We have a list of all integrations for AWS, Azure and GCP. If an operator concerns two cloud providers, it is repeated in both places. That is good for documentation - the DRY rule only applies to source code. I am working on documentation for the other operators; my work is part of this ticket: https://issues.apache.org/jira/browse/AIRFLOW-5431

Chris Palmer: This updated documentation looks great, definitely heading in a direction that makes it easier and addresses my concerns. (Although it took me a while to realize those tables can be scrolled horizontally!)

Kamil Breguła: I'm working on a redesign of the documentation theme. It's part of AIP-11: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-11+Create+a+Landing+Page+for+Apache+Airflow

We are currently at the stage of collecting comments from the first phase - we sent materials to the community, and we also conducted tests with real users: https://lists.apache.org/thread.html/6fa1cdceb97ed17752978a8d4202bf1ff1a86c6b50bbc9d09f694166@%3Cdev.airflow.apache.org%3E