Is there a reason why we can't use symlinks to have copies of the files
show up in both subpackages? So that `gcs_to_s3.py` would be under both
`aws/operators/` and `gcp/operators`. I could imagine there may be
technical reasons why this is a bad idea, but just thought I would ask.
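
To make the symlink idea concrete, here is a minimal sketch in a throwaway
directory. The class name `GcsToS3Operator`, the module names, and the
`load_module` helper are placeholders I made up for illustration, not the real
Airflow layout. It also hints at one possible downside: if the same file is
loaded under two different module names, Python ends up with two distinct class
objects.

```python
import importlib.util
import os
import tempfile
from pathlib import Path


def load_module(name, path):
    """Import the file at `path` under the module name `name`."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Throwaway directory tree standing in for the two provider subpackages.
root = Path(tempfile.mkdtemp())
gcp_ops = root / "gcp" / "operators"
aws_ops = root / "aws" / "operators"
gcp_ops.mkdir(parents=True)
aws_ops.mkdir(parents=True)

# The real file lives under the source provider (hypothetical contents).
real_file = gcp_ops / "gcs_to_s3.py"
real_file.write_text("class GcsToS3Operator:\n    pass\n")

# The destination provider gets a relative symlink to the same file.
link = aws_ops / "gcs_to_s3.py"
os.symlink(os.path.relpath(real_file, aws_ops), link)

# Both paths resolve to the same file on disk.
same_file = os.path.samefile(real_file, link)

# Loading the file under two module names executes it twice, so the
# two classes are separate objects even though the source is shared.
gcp_module = load_module("gcp_gcs_to_s3", real_file)
aws_module = load_module("aws_gcs_to_s3", link)
same_class = gcp_module.GcsToS3Operator is aws_module.GcsToS3Operator
```

Here `same_file` is true but `same_class` is false, so anything relying on
class identity (for example `isinstance` checks) could behave differently
depending on which path was imported. Symlinks may also need extra care on
platforms where they are not enabled by default.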

If that is not possible, then as a counter to Felix's argument of "without
knowing the source you don't need to know where it can be transferred to",
I would equally say that without knowing where you are trying to transfer to
you don't need to know the source.

In particular, my worry about putting them in the 'source' package is that
they become harder to find for those who are looking for them. That
convention would need to be well documented and highlighted.

It is also worth noting that this isn't just a problem across cloud
providers. For example, I have been working with BigQuery and wanted to see
what operators are available, so I went to the docs, found the GCP section
under Integrations, and looked at the BigQuery section
<https://airflow.apache.org/integration.html#bigquery>. I used a bunch of
the operators there, but some time later was surprised to find that there
didn't seem to be an existing operator to load a file from Google Cloud
Storage to BigQuery, so I started to think about how I could use the
available BQ operators to do that.

Of course there is a GCS to BQ operator, but it is found under the Cloud
Storage <https://airflow.apache.org/integration.html#cloud-storage> section
of the documentation, and I didn't initially look there. Yes, I was probably
being a bit dumb not looking further, but my point is that I work almost
entirely in BigQuery. 90% of my tasks are creating BQ tables and running BQ
queries, so I was very focused on the "destination" in my case.

Likewise, someone who spends 99% of their time working in AWS and using all
the operators in that subpackage might not think to look in the GCP
package the first time they need a GCS to S3 operator. I'm admittedly
terrible at documentation, but if duplicating the files via symlinks isn't
an option, is there an easy way we could duplicate the documentation
for those operators so they are easy to find in both doc sections?
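
For comparison, Tomasz's suggestion below (keep the operator in the source
package and make it importable from the destination) could look roughly like
the sketch here. The module names and `GcsToS3Operator` are placeholders, and
the modules are built in memory only so the example runs on its own. In a real
package the destination side would presumably be a one-line import in a file
under the destination subpackage.

```python
import sys
import types

# Stand-in for the module under the source provider (hypothetical name).
gcp_module = types.ModuleType("gcp_operators_gcs_to_s3")


class GcsToS3Operator:
    """Placeholder for the real transfer operator."""


# Attach the class to the source module and register the module.
GcsToS3Operator.__module__ = gcp_module.__name__
gcp_module.GcsToS3Operator = GcsToS3Operator
sys.modules[gcp_module.__name__] = gcp_module

# Stand-in for the destination-side module. It defines nothing itself,
# it only re-exports the class that lives in the source module.
aws_module = types.ModuleType("aws_operators_gcs_to_s3")
aws_module.GcsToS3Operator = gcp_module.GcsToS3Operator
sys.modules[aws_module.__name__] = aws_module

# Users can import from either side and get the very same class.
from gcp_operators_gcs_to_s3 import GcsToS3Operator as FromSource
from aws_operators_gcs_to_s3 import GcsToS3Operator as FromDestination

same_class = FromSource is FromDestination
```

With a re-export there is only one class object, and its `__module__` still
points at the source package, so it stays clear where the operator really
lives.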

Chris




On Sun, Sep 22, 2019 at 7:57 AM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> I'm not really in favor of the cross_transfer package. It sounds really
> technical, and if you're new to the project, I would not know what to
> expect in this package.
>
> We had something related in the past with Kubernetes, we solved it like this:
>
> https://github.com/apache/airflow/blob/master/airflow/contrib/example_dags/example_kubernetes_operator.py#L66-L69
>
> But this isn't really nice of course. When doing the initdb the examples
> shouldn't be loaded in my opinion, this would solve the whole issue.
>
> WDYT?
>
> Cheers, Fokko
>
> On Sat, Sep 21, 2019 at 22:53, Tomasz Urbaszek <
> tomasz.urbas...@polidea.com> wrote:
>
> > I also think that transfer operators should be put in origin package.
> > Maybe it is also worth to consider to make import available in
> > “destination” for example by import? This would make it easier for user
> > to find a right operator.
> >
> > T.
> >
> >
> > On Sat, 21 Sep 2019 at 22:04, Felix Uellendall <felue...@pm.me.invalid>
> > wrote:
> >
> > > In my opinion the source of the transfer operation is what matters -
> > > without knowing the source you don't need to know where it can be
> > > transferred to.
> > >
> > > So I prefer to put those „cross transfer“ operators to its source. For
> > > example: GoogleApiToS3 -> gcp
> > >
> > > Best Regards,
> > > Felix
> > >
> > > Sent from ProtonMail Mobile
> > >
> > > On Sat, Sep 21, 2019 at 21:52, Jarek Potiuk <jarek.pot...@polidea.com>
> > > wrote:
> > >
> > > > I have a question: Should we put all transfer operators between
> > > > service providers into a separate "cross_transfer" package?
> > > >
> > > > *Context:*
> > > >
> > > > We had one unresolved point when we decided about AIP-21 - where to put
> > > > transfer operators between service providers. In the middle of
> > > > implementing it, it turned out that we need to make some decisions as it
> > > > has some undesirable side effects if we just move the transfer operators
> > > > to core without any structure. Detailed discussion in this PR:
> > > > https://github.com/apache/airflow/pull/6147
> > > >
> > > > We can solve it easily by choosing "cross_transfer" package for all
> > > > transfer operators that are crossing "service provider" boundary.
> > > >
> > > > This way we will have "gcp" (or maybe even "alphabet" soon), "aws",
> > > > "azure" etc. and "cross_transfer" for all the S3->GCP, AWS->S3 etc.
> > > >
> > > > What do you think? Anyone strongly against this? Or maybe we can follow
> > > > lazy consensus rule for this? Or maybe someone can come up with a better
> > > > name :) ?
> > > >
> > > > J.
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> >
> > --
> >
> > Tomasz Urbaszek
> > Polidea <https://www.polidea.com/> | Junior Software Engineer
> >
> > M: +48 505 628 493 <+48505628493>
> > E: tomasz.urbas...@polidea.com <tomasz.urbasz...@polidea.com>
> >
> > Unique Tech
> > Check out our projects! <https://www.polidea.com/our-work>
> >
>
