Also providers and SAAS could be merged (taking inspiration from Terraform
here: https://www.terraform.io/docs/providers/index.html - ignore the menu on
the left, that is just for Docs layout, which we could do too -- Docs grouping
doesn't have to match code grouping 100%)
I would favour fewer sub-packages rather than more. My only reason for suggesting
providers was to make it clear when looking at the code what the purpose of a
module is. If "everything" lived under airflow.providers.{$major_cloud,core} or
I would be okay with that.
Can we talk in specifics here too? What package namespaces are you suggesting?
-ash
On 29 October 2019 12:02:54 GMT, "Driesprong, Fokko" <[email protected]>
wrote:
Thanks Jarek for clearing that up.
Personally I would omit the Apache one. We should not fall into the same
trap as before, where it was unclear whether something belonged in contrib
or not. I would even consider merging software and protocols, as it is not
entirely clear what is a protocol and what is not. In the end, everything is
a protocol, whether high-level (FTP) or low-level (FS).
Cheers, Fokko
Op di 29 okt. 2019 om 12:45 schreef Jarek Potiuk <[email protected]>:
Yep. We should definitely discuss the split!
For me these are the criteria:
- fundamentals - all the operators/hooks/sensors that are the "Core" of
Airflow (base, dbapi), let you run basic examples, implement the basic
logic of Airflow (subdags, branch etc.), plus generic operators that are a
base for others (like generic transfer/sql)
- providers - integration with cloud providers - (PAAS)
- apache - integrations - with other ApacheSoftwareFoundation projects
- software - Integration with other software that is proprietary or
open-source that you can install on-premises (or in the cloud)
- protocols - integration with protocols that can be implemented by any
software (SFTP/mail/etc.)
- services - Integration with SAAS solutions
From the above list I only have doubts about the "apache" one - question is
whether as part of Apache Community we want to somehow group those.
J.
On Tue, Oct 29, 2019 at 11:19 AM Bas Harenslak <[email protected]> wrote:
1. Sounds good to me
2. Also fine
3. We should have some consensus here. E.g. I’m not sure what groups
“fundamentals” and “software” are meant to be :-)
While we’re at it: we should really move the BaseOperator out of models.
The BaseOperator has no representation in the DB and should be placed
together with other scripts where it belongs, i.e. something like
airflow.operators.base_operator.
Bas
On 29 Oct 2019, at 10:43, Jarek Potiuk <[email protected]> wrote:
After some consideration and seeing the actual move in practice I wanted to
propose a 3rd amendment ;) to AIP-21.
I have a few observations from seeing the discussions and observing the
actual moving process. I have the following proposals:
*1) Between-providers transfer operators should be kept at the "target"
rather than "source"*
If we end up with splitting operators by groups (AIP-8 and the proposed
Backporting to Airflow 1.10), I think it makes more sense to keep transfer
operators in the "target" package. For example, the "S3 to GCS" operator
would go in the "providers/google" package, simply because it is more likely
that the individuals working on the pure "GCP" services will also be more
interested in getting the data from other cloud providers, and likely they
will even have some transfer services that can be used for that purpose
(rather than using a worker to transfer the data). In the particular
S3 -> GCS case we have GCP's
https://cloud.google.com/storage-transfer/docs/overview which allows
transferring data from any other cloud provider to GCS. The same applies if
we imagine Athena -> BigQuery, for example. At least that's the feeling I
have.
I can imagine that the kind of "stewardship" over those groups of operators
can be somewhat influenced and maybe even performed by those cloud
providers themselves. Corresponding hooks of course should be in different
"groups".
2) *One-side provider-neutral transfer operators should be kept at the
"provider" regardless if they are target or source.*
For example GCS -> SFTP or SFTP -> GCS. There the hook for SFTP should be in
the "core" package but both operators should be in "providers/google". The
reason is quite the same as above - the "stewardship" over all the
operators can be done by the "provider" group.
*3) Grouping non-provider operators/hooks according to their purpose.*
I think it is also the right time to move the other operators/hooks to
different groups within core. We already have some reasonable and nice
groups proposed in the new documentation by Kamil
https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html and
it only makes sense to move those now (Fundamentals, ASF: Apache Software
Foundation, Azure: Microsoft Azure, AWS: Amazon Web Services, GCP: Google
Cloud Platform, Service integrations, Software integrations, Protocol
integrations). I think it would make sense to use the same approach in the
code. We could have
fundamentals/asf/azure(microsoft/azure?)/aws(amazon/aws?)/google/services/software/protocols
packages.
There will probably be a few exceptions but we can handle them on a
case-by-case basis.
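To make proposals 1) and 2) concrete, here is a minimal Python sketch of the
placement rule as I read it. The function name and the use of None for a
provider-neutral side (SFTP, a database) are illustrative assumptions, and the
fallback to "airflow/operators" for the no-provider case is borrowed from
Point D of AIP-21 rather than from this amendment.

```python
from typing import Optional


def transfer_operator_package(
    source_provider: Optional[str], target_provider: Optional[str]
) -> str:
    """Return the package a transfer operator would live in.

    Hypothetical helper, not part of Airflow. ``None`` marks a side that is
    provider-neutral (for example SFTP or a database).
    """
    if target_provider is not None:
        # Proposal 1: with two providers involved, the "target" wins.
        # Proposal 2 (provider as target): the single provider wins.
        return f"airflow/providers/{target_provider}"
    if source_provider is not None:
        # Proposal 2 (provider as source): the single provider still wins.
        return f"airflow/providers/{source_provider}"
    # Assumption: no provider on either side, so the operator stays in the
    # non-provider package named in Point D of AIP-21.
    return "airflow/operators"


if __name__ == "__main__":
    print(transfer_operator_package("aws", "google"))  # S3 -> GCS
    print(transfer_operator_package(None, "google"))   # SFTP -> GCS
    print(transfer_operator_package("google", None))   # GCS -> SFTP
```

All three calls resolve to the "google" package, which is the stewardship
argument in a nutshell.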
J.
On Fri, Oct 11, 2019 at 3:11 PM Jarek Potiuk <[email protected]> wrote:
Hello everyone. I updated AIP-21 and updated examples.
Point D. of AIP-21 is now as follows:
*D.* Group operators/sensors/hooks in
*airflow/providers/<PROVIDER>*/operators(sensors, hooks).
Each provider can define its own internal structure of that package. For
example in case of "google" provider the packages will be further grouped
by "gcp", "gsuite", "core" sub-packages.
In case of transfer operators where two providers are involved, the
transfer operators will be moved to "source" of the transfer. When there
is only one provider as target but source is a database or another
non-provider source, the operator is put to the target provider.
Non-cloud provider ones are moved to airflow/operators(sensors/hooks).
*Drop the prefix.*
Examples:
AWS operator:
- *airflow/contrib/operators/sns_publish_operator.py*
becomes *airflow/providers/aws/operators/sns_publish_operator.py*
Google GCP operator:
- *airflow/contrib/operators/dataproc_operator.py*
becomes *airflow/providers/google/gcp/operators/dataproc_operator.py*
Previously GCP-prefixed operator:
- *airflow/contrib/operators/gcp_bigtable_operator.py*
becomes *airflow/providers/google/gcp/operators/bigtable_operator.py*
Transfer from GCP:
- *airflow/contrib/operators/gcs_to_s3_operator.py*
becomes *airflow/providers/google/gcp/operators/gcs_to_s3_operator.py*
MySQL to GCS:
- *airflow/contrib/operators/mysql_to_gcs_operator.py*
becomes *airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py*
SSH operator:
- *airflow/contrib/operators/ssh_operator.py*
becomes *airflow/operators/ssh_operator.py*
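For anyone who wants to see what the examples above mean for imports, here is
a small Python sketch. The table only restates the proposed moves listed in
this mail (they are proposals, not necessarily the final layout), and the
names AIP21_EXAMPLES, to_module and new_module_for are made up for
illustration.

```python
# Proposed moves, copied from the examples in this mail (old path -> new path).
AIP21_EXAMPLES = {
    "airflow/contrib/operators/sns_publish_operator.py":
        "airflow/providers/aws/operators/sns_publish_operator.py",
    "airflow/contrib/operators/dataproc_operator.py":
        "airflow/providers/google/gcp/operators/dataproc_operator.py",
    "airflow/contrib/operators/gcp_bigtable_operator.py":
        "airflow/providers/google/gcp/operators/bigtable_operator.py",
    "airflow/contrib/operators/gcs_to_s3_operator.py":
        "airflow/providers/google/gcp/operators/gcs_to_s3_operator.py",
    "airflow/contrib/operators/mysql_to_gcs_operator.py":
        "airflow/providers/google/gcp/operators/mysql_to_gcs_operator.py",
    "airflow/contrib/operators/ssh_operator.py":
        "airflow/operators/ssh_operator.py",
}


def to_module(path: str) -> str:
    """Turn a file path such as 'a/b/c.py' into the dotted module 'a.b.c'."""
    if not path.endswith(".py"):
        raise ValueError(f"not a Python file: {path}")
    return path[: -len(".py")].replace("/", ".")


def new_module_for(old_path: str) -> str:
    """Look up the proposed new import path for an old contrib file."""
    return to_module(AIP21_EXAMPLES[old_path])


if __name__ == "__main__":
    for old, new in AIP21_EXAMPLES.items():
        print(f"{to_module(old)} -> {to_module(new)}")
```

Note how the "gcp_" prefix disappears in the Bigtable case, which is the
"Drop the prefix" rule in action.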
On Fri, Oct 4, 2019 at 6:22 PM Jarek Potiuk <[email protected]> wrote:
Yeah. I think the important point is that the latest doc changes by Kamil
index all available operators and hooks nicely and make them easy to find.
That also includes (as of today) an automated CI check that newly added
operators and hooks are also added to the documentation:
https://github.com/apache/airflow/commit/104a151d6a19b1ba1281cb00c66a2c3409e1bb13
J.
On Fri, Oct 4, 2019 at 5:21 PM Chris Palmer <[email protected]> wrote:
It's not obvious to me why an S3ToMsSqlOperator in the aws package is
"silly". Why do you say it made sense to create a MsSqlFromS3Operator?
Basically all of these operators could be thought of as "move data from A
to B" or "move data to B from A". I think what feels natural to each
individual will depend on what their frame of reference is, and where their
main focus is. If you are largely focused on MsSql then I can understand
that it's natural to think "What MsSql operators are there?" and to
not see S3ToMsSqlOperator as one of those MsSql operators. That's exactly
the point I made with my earlier response; I was so focused on BigQuery that
I didn't think to look under Cloud Storage documentation for the
GoogleCloudStorageToBigQueryOperator.
I think it is too hard to draw a very distinct line between what is just
"storage" and what is more. There are going to be fuzzy edge cases, so
picking a single convention is going to be much less hassle in my view. As
long as that convention is well documented and the documentation is
improved so that it's easier to find all operators that relate to BigQuery
or MsSql etc. in one place (as is being done by Kamil) then that is the best
we can do.
Chris
On Fri, Oct 4, 2019 at 10:55 AM Daniel Standish <[email protected]>
wrote:
One case popped up for us recently, where it made sense to make a
MsSql*From*S3Operator.
I think using "source" makes sense in general, but in this case calling
this a S3ToMsSqlOperator and putting it under AWS seems silly, even though
you could say s3 is "source" here.
I think in most of these cases we say "let's use source" because source is
where the actual work is done and destination is just storage.
Does a guideline saying "ignore storage" or "storage is secondary in object
location" make sense?
On Fri, Oct 4, 2019 at 6:42 AM Jarek Potiuk <[email protected]>
wrote:
It looks like we have general consensus about putting transfer operators
into the "source provider" package.
That's great for me as well.
Since I will be updating AIP-21 to reflect the "google" vs. "gcp" case, I
will also update it to add this decision.
If no-one objects (Lazy Consensus
<https://community.apache.org/committers/lazyConsensus.html>) till
Monday 7th of October, 3.20 CEST, we will update AIP-21 with information
that transfer operators should be placed in the "source" provider module.
J.
On Tue, Sep 24, 2019 at 1:34 PM Kamil Breguła <[email protected]> wrote:
On Mon, Sep 23, 2019 at 7:42 PM Chris Palmer <[email protected]> wrote:
On Mon, Sep 23, 2019 at 1:22 PM Kamil Breguła <[email protected]> wrote:
On Mon, Sep 23, 2019 at 7:04 PM Chris Palmer <[email protected]> wrote:
Is there a reason why we can't use symlinks to have copies of the files
show up in both subpackages? So that `gcs_to_s3.py` would be under both
`aws/operators/` and `gcp/operators`. I could imagine there may be
technical reasons why this is a bad idea, but just thought I would ask.
Symlinks are not supported by git.
Why do you say that? This blog post
<https://www.mokacoding.com/blog/symliks-in-git/> details how you can use
them, and the caveats with regards to needing relative links not absolute.
The example repo he links to at the end includes a symlink which worked
fine for me when I cloned it. But maybe not relevant given the below:
We still have to check if python packages can have links, but I'm
afraid of this mechanism. This is not popular and may cause
unexpected
consequences.
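As a rough way to check the "can python packages have links" question, the
sketch below builds two throwaway packages in a temp directory, symlinks one
module into the other package with a relative link, and imports it under both
names. The package names gcp_ops and aws_ops and the class GcsToS3Operator
are placeholders invented for this test, not Airflow code, and creating
symlinks may need extra privileges on Windows.

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path


def demo_symlinked_module():
    """Return (linked_import_works, same_class_object) for a symlinked module."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for pkg in ("gcp_ops", "aws_ops"):  # hypothetical package names
            (root / pkg).mkdir()
            (root / pkg / "__init__.py").write_text("")
        (root / "gcp_ops" / "gcs_to_s3.py").write_text(
            "class GcsToS3Operator:\n    pass\n"
        )
        # Relative link target, resolved from the directory holding the link.
        os.symlink(
            os.path.join("..", "gcp_ops", "gcs_to_s3.py"),
            root / "aws_ops" / "gcs_to_s3.py",
        )
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            real = importlib.import_module("gcp_ops.gcs_to_s3")
            linked = importlib.import_module("aws_ops.gcs_to_s3")
            works = hasattr(linked, "GcsToS3Operator")
            same = real.GcsToS3Operator is linked.GcsToS3Operator
            return works, same
        finally:
            # Clean up so the throwaway packages do not linger in this process.
            sys.path.remove(str(root))
            for name in ("gcp_ops", "aws_ops",
                         "gcp_ops.gcs_to_s3", "aws_ops.gcs_to_s3"):
                sys.modules.pop(name, None)


if __name__ == "__main__":
    works, same = demo_symlinked_module()
    print(f"import through symlink works: {works}")
    print(f"both imports give the same class object: {same}")
```

The import through the link succeeds, but Python loads the file twice under
two module names, so the two classes are distinct objects and isinstance
checks across the two import paths would not match. That is one example of
the unexpected consequences mentioned above.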
Likewise, someone who spends 99% of their time working in AWS and using all
the operators in that subpackage, might not think to look in the GCP
package the first time they need a GCS to S3 operator. I'm admittedly
terrible at documentation, but if duplicating the files via symlinks isn't
an option, then is there an easy way we could duplicate the documentation
for those operators so they are easily findable in both doc sections?
Recently, I updated the documentation:
https://airflow.readthedocs.io/en/latest/integration.html
We have a list of all integrations in AWS, Azure, GCP. If an operator
concerns two cloud providers, it is repeated in both places. That's good for
documentation. The DRY rule is only valid for source code.
I am working on documentation for other operators.
My work is part of this ticket:
https://issues.apache.org/jira/browse/AIRFLOW-5431
This updated documentation looks great, definitely heading in a direction
that makes it easier and addresses my concerns. (Although it took me a
while to realize those tables can be scrolled horizontally!).
I'm working on a redesign of the documentation theme. It's part of AIP-11
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-11+Create+a+Landing+Page+for+Apache+Airflow
We are currently at the stage of collecting comments from the first
phase - we sent materials to the community, but also conducted tests
with real users
https://lists.apache.org/thread.html/6fa1cdceb97ed17752978a8d4202bf1ff1a86c6b50bbc9d09f694166@%3Cdev.airflow.apache.org%3E
--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129