I am all for it, Kamil! Super happy to treat Apache projects in the same way as "proprietary" providers :). Does anyone else have other comments?
J.

On Mon, Nov 11, 2019 at 2:17 PM Kamil Breguła <kamil.breg...@polidea.com> wrote:

> I looked at this list and I'm only worried about two operators:
>
> airflow.contrib.operators.vertica_to_hive
> airflow.contrib.operators.s3_to_hive
>
> If we want the operators to be grouped according to destination, then these operators should be in the apache package. It is the members of the Apache community who will care most about these operators being of high quality. Apache can be treated equally with other large cloud providers, such as GCP and AWS. I can imagine that a new Apache product will appear and will want to promote itself the same way as products of cloud providers are promoted: by creating a large number of integrations that allow you to copy data into its operating range.
>
> There's another case: building a strong Apache community. As members of the Apache community, we should promote Apache products to ensure that the community develops well, and therefore also promote the integration of our product with other products.
>
> On Mon, Nov 11, 2019 at 12:28 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > Just to select the "packages" for this update. Does anyone have objections to this structure (details, including transfer operators, in https://docs.google.com/spreadsheets/d/17zA5t2JVxnDdg5Cs1Cg_Mb1GXvGctmesfg2L089QSOk/edit#gid=0)?
> >
> > *Fundamentals (no change)*
> >
> > providers
> >     google
> >         cloud
> >         gsuite
> >         marketing_platform
> >     amazon
> >         aws
> >     microsoft
> >         azure
> >     apache
> >         cassandra
> >         druid
> >         hadoop
> >         hive
> >         pig
> >         pinot
> >         spark
> >         sqoop
> >     mysql
> >     jira
> >     databricks
> >     datadog
> >     dingding
> >     discord
> >     cloudant
> >     jenkins
> >     opsgenie
> >     qubole
> >     salesforce
> >     segment
> >     slack
> >     snowflake
> >     vertica
> >     zendesk
> >     celery
> >     docker
> >     bash
> >     kubernetes
> >     mssql
> >     mongodb
> >     mysql
> >     openfaas
> >     oracle
> >     papermill
> >     postgres
> >     presto
> >     python
> >     redis
> >     samba
> >     sqlite
> >     imap
> >     ssh
> >     filesystem
> >     sftp
> >     ftp
> >     http
> >     grpc
> >     smtp
> >     jdbc
> >     winrm
> >
> > On Fri, Nov 8, 2019 at 5:47 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > > Let me then cancel this vote and I will restart it next week.
> > >
> > > Yeah. It's a bit like re-opening Pandora's box, but now that we know that we can do it, and we are unblocked in moving to google (which is now the biggest move in progress), we can spend more time on getting a better (and more final) consensus.
> > >
> > > I decided to go through the list from the docs (once again, Kamil, great that you did it) and prepared this spreadsheet showing the structure. I went through ALL the operators and put them in the right place where our current rules place them.
> > >
> > > After this exercise, I think the following makes sense:
> > > - put all the stuff except fundamentals in *"providers"* (everything in "providers" will be potentially backportable).
> > > - group Apache projects under *"apache"*, similar to google/amazon/microsoft (a different kind of ownership, but still an ownership).
> > > - for the rest, I think what we can do is really to put the operators in folders per "service/company" (without sub-packages). That includes sftp/ssh/ftp etc. (should we group [ftp and sftp] or [ssh and sftp]?). There is no "ownership" there and no reason to group them. That will put "operators/hooks/sensors" at different levels in the directory tree, but we already have that for fundamentals and I am not too worried about that. We do not have to have everything at the same level.
> > > - I put transfer operators according to the rule where the "to" side is more important unless the other side is a public protocol (so sftp -> gcs and gcs -> sftp both go to google/gcp). I did not have any doubt where to put which transfer operator, so this is a good sign:
> > >
> > > https://docs.google.com/spreadsheets/d/17zA5t2JVxnDdg5Cs1Cg_Mb1GXvGctmesfg2L089QSOk/edit#gid=0
> > >
> > > Can you please take a look and express your opinions here so that we can have the final vote next week (for those who are not yet tired of the discussion ;)).
> > >
> > > J.
> > >
> > > On Fri, Nov 8, 2019 at 4:38 PM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > >
> > >> Yes, that makes sense.
> > >>
> > >> On Fri, Nov 8, 2019 at 3:22 PM Kamil Breguła <kamil.breg...@polidea.com> wrote:
> > >>
> > >> > In the case of Hadoop, it is published by Apache, so it can be in the apache directory. This will mimic the grouping presented in the documentation.
> > >> >
> > >> > https://airflow.readthedocs.io/en/latest/operators-and-hooks-ref.html#software-operators-and-hooks
> > >> >
> > >> > On Fri, Nov 8, 2019 at 3:47 PM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > >> > >
> > >> > > I think we should keep the vote open at least until mid next week to have more thought and input on this one.
> > >> > >
> > >> > > In general, I am happy with the approach, but operators/hooks and sensors shouldn't be a provider. "hadoop" can be its own provider and hdfs can be a part of it.
> > >> > >
> > >> > > providers/
> > >> > >     google
> > >> > >         cloud
> > >> > >             operators
> > >> > >             hooks
> > >> > >             sensors
> > >> > >         gsuite
> > >> > >             operators
> > >> > >             ...
> > >> > >     amazon
> > >> > >         aws
> > >> > >             operators
> > >> > >             ...
> > >> > >     microsoft
> > >> > >         azure
> > >> > >             operators
> > >> > >             ...
> > >> > >     hadoop
> > >> > >         hdfs
> > >> > >             operators
> > >> > >             ...
> > >> > >
> > >> > > We can also define what a "provider" is, so we know what to add to it in the future. SSH/FTP/SFTP belong to the same family group. Do we want to have separate providers for each one of them?
> > >> > >
> > >> > > Regards,
> > >> > > Kaxil
> > >> > >
> > >> > > On Fri, Nov 8, 2019 at 9:08 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > >> > >
> > >> > > > I really like making everything a provider. That's a great idea! This way everything "backportable" will have to be in the "providers" package. Really nice and clean separation (and less mess in "airflow").
> > >> > > > And we will not have to have any artificial grouping (we can still group them at the documentation level).
> > >> > > >
> > >> > > > We do not need "backport" in the name. I think it's more of a technical detail of naming the package, which we can work out while reviewing PRs, and we can agree on the final naming of the released packages at the PMC level (PMCs will have to vote on releasing those).
> > >> > > >
> > >> > > > The thinking is that the intention is really for them to be backported only to 1.10: we are not going (yet) to use the packages in Airflow 2.*. So I thought that by naming them "backport" we can express that intent more clearly.
> > >> > > >
> > >> > > > So let me clarify the structure of folders we are going to have if we follow it (I just added some examples), including the already agreed changes from AIP-21:
> > >> > > >
> > >> > > > providers/
> > >> > > >     google
> > >> > > >         cloud
> > >> > > >             operators
> > >> > > >             hooks
> > >> > > >             sensors
> > >> > > >         gsuite
> > >> > > >             operators
> > >> > > >             ...
> > >> > > >     amazon
> > >> > > >         aws
> > >> > > >             operators
> > >> > > >             ...
> > >> > > >     microsoft
> > >> > > >         azure
> > >> > > >             operators
> > >> > > >             ...
> > >> > > >     operators
> > >> > > >         sqlite.py
> > >> > > >         oracle.py
> > >> > > >         docker.py
> > >> > > >     hooks
> > >> > > >         hdfs.py
> > >> > > >         sqlite.py
> > >> > > >     sensors
> > >> > > >         http.py
> > >> > > >         sql.py
> > >> > > >
> > >> > > > J.
> > >> > > >
> > >> > > > On Fri, Nov 8, 2019 at 9:43 AM Ash Berlin-Taylor <a...@apache.org> wrote:
> > >> > > >
> > >> > > > > Do we need to include `-backport`? What was the thinking behind that?
> > >> > > > >
> > >> > > > > I think software and protocol should be merged.
> > >> > > > > I would also say _everything_ is a provider, so airflow.providers.ssh.SSHOperator, for instance, is what I would prefer.
> > >> > > > >
> > >> > > > > -a
> > >> > > > >
> > >> > > > > On 8 November 2019 08:32:42 GMT, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > >> > > > > >One more day to go. I would love to see some opinions on this AIP-21 update :).
> > >> > > > > >
> > >> > > > > >Executive summary:
> > >> > > > > >
> > >> > > > > >* we will be moving a number of integrations to sub-packages of airflow.
> > >> > > > > >* they will be backportable to 1.10.*. There will be an 'apache-airflow-[package]-backport' PyPI package, installable with Python 3, that will make Airflow 2.0 operators/hooks etc. available with 1.10.* operators.
> > >> > > > > >* the current proposal for sub-packages is "protocols/software/providers/" (but if you think merging protocols and software makes sense, please express your opinion).
> > >> > > > > >* we are not moving "fundamental" operators/hooks etc.
> > >> > > > > >* Airflow 2.0 is still going to be installed as a single package with all operators (so we are not yet implementing AIP-8).
> > >> > > > > >
> > >> > > > > >J.
> > >> > > > > >
> > >> > > > > >On Wed, Nov 6, 2019 at 10:07 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > >> > > > > >
> > >> > > > > >> I think all these cases are valid, but maybe I was not super clear. It's only the transfer operators that we need to decide where to put, not hooks.
> > >> > > > > >> Usually the complexity of communication with particular storages is (or at least should be) in the Hooks rather than Operators.
> > >> > > > > >>
> > >> > > > > >> Operators should be just thin wrappers over the logic in the hooks. Hooks are going to stay where they belong: S3 Hooks in amazon, GCS Hooks in google.cloud, GoogleSheet Hooks in google.gsuite.
> > >> > > > > >>
> > >> > > > > >> Since we actually have a mono-repo, it will be no problem (and no cross-dependency problem) to have the S3 -> GCS operator in google and use hooks from both google and amazon.
> > >> > > > > >>
> > >> > > > > >> I hope this alleviates your concern, Daniel?
> > >> > > > > >>
> > >> > > > > >> J.
> > >> > > > > >>
> > >> > > > > >>> What about GoogleSheetsToS3? GoogleSheetsToGCS? These you would put in the target, i.e. the storage? But GoogleSheetsToSftp would be in the google sheets operators file? The complexity, and the shared code, are in the gsheet component, not in the storage destination.
> > >> > > > > >>>
> > >> > > > > >>> On Tue, Nov 5, 2019 at 5:46 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > >> > > > > >>>
> > >> > > > > >>> > Hello Airflow Community,
> > >> > > > > >>> >
> > >> > > > > >>> > This email calls for a vote to update AIP-21 Changes in import paths <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-21%3A+Changes+in+import+paths> with the changes described below. The vote will last till Saturday 8th 2am CEST (72 hours). Committers have a binding vote, but everyone from the community is encouraged to cast an advisory vote.
> > >> > > > > >>> >
> > >> > > > > >>> > *Summary*:
> > >> > > > > >>> >
> > >> > > > > >>> > The proposal is to update AIP-21 to move all non-core operators/hooks/sensors (and related files) to sub-packages within airflow (protocols/software/providers) or (software/providers). I am also happy to merge protocols+software, so if you have a strong opinion on it, please state it with your vote and we can decide based on majority.
> > >> > > > > >>> >
> > >> > > > > >>> > Those packages will be separately released (schedule/process TBD) and will be backportable to the 1.10.* Airflow series, so that users can install them and start using new Airflow 2.0 operators in their Python 3 Airflow 1.10 environments (only Python 3.5+ is supported).
> > >> > > > > >>> >
> > >> > > > > >>> > We will proceed with migrating the providers package to the already agreed paths without waiting for the final vote (following the current version of AIP-21). Since we have a working POC, we know the agreed paths will work for us.
> > >> > > > > >>> >
> > >> > > > > >>> > *Previous discussions:*
> > >> > > > > >>> >
> > >> > > > > >>> > - https://lists.apache.org/thread.html/b07a93c9114e3d3c55d4ee514955bac79bc012c7a00db627c6b4c55f@%3Cdev.airflow.apache.org%3E
> > >> > > > > >>> > - https://lists.apache.org/thread.html/e25ddc546e367a4af3e594fecbd4431959bd5a89045e748e4206e7ff@%3Cdev.airflow.apache.org%3E
> > >> > > > > >>> >
> > >> > > > > >>> > *More Details*:
> > >> > > > > >>> >
> > >> > > > > >>> > 1) Information that we are going in the direction of AIP-8 but not yet reaching it, focusing on separating out backportable packages installable in Airflow releases 1.10.*.
> > >> > > > > >>> > Airflow 2.0 will still be installed as a whole and all the source will be kept in one repo, but we now have a way to build backportable packages for groups of operators. POC available here: https://github.com/apache/airflow/pull/6507 (based on Ash's https://github.com/ashb/airflow-submodule-test)
> > >> > > > > >>> >
> > >> > > > > >>> > 2) We move all integrations to new packages (keeping deprecated import aliases in the old places). The following split applies (according to "stewardship" over the integrations):
> > >> > > > > >>> >
> > >> > > > > >>> > - *fundamentals* - core of Airflow - they are really part of Apache Airflow. Stewards: the core Airflow team. Not backportable/separated out.
> > >> > > > > >>> > - *protocols* - are not owned by anyone; they are public and the implementation is fully "open". There are no particular stewards (no need). Users of particular protocols should mainly maintain those and add support for different versions of the protocols.
> > >> > > > > >>> > - *software* - both API and software are controlled by someone outside of Airflow (a commercial or open-source project), but the deployment of that software is "owned" by the user installing Airflow.
> > >> > > > > >>> >   The stewards might also be the users, but the controlling party (Oracle, for example) might be interested in maintaining those operators as well.
> > >> > > > > >>> > - *providers* - API/software/deployments are fully controlled by a 3rd party. Here, most likely the "provider" will be interested in maintaining the operators (and, for example, like Google, provide integration guidelines <https://docs.google.com/document/d/1_rTdJSLCt0eyrAylmmgYc3yZr-_h51fVlnvMmWqhCkY/edit?usp=drive_web&ouid=112320280470690058978> for their hooks/operators/sensors).
> > >> > > > > >>> >
> > >> > > > > >>> > 3) Between-providers transfer operators should be kept at the "target" rather than the "source". For example, S3 -> GCS should be in the "google" provider, but GCS -> S3 should be in "amazon".
> > >> > > > > >>> >
> > >> > > > > >>> > 4) One-side provider transfer operators should be kept at the "provider" regardless of whether it is the target or the source. For example, GCS -> SFTP or SFTP -> GCS should be in the "google" provider.
> > >> > > > > >>> >
> > >> > > > > >>> > 5) If in doubt, we will discuss individual cases separately.
> > >> > > > > >>> >
> > >> > > > > >>> > J.
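As a rough illustration of points 3 and 4 of the proposal (this is not code from the thread or from Airflow), the placement rule for transfer operators could be sketched as a small lookup. The `PROVIDER_OF` and `PUBLIC_PROTOCOLS` tables and the `transfer_operator_home` function are hypothetical names made up for this sketch, and the classification of each service is an assumption.

```python
from typing import Optional

# Hypothetical classification, for illustration only; the thread does not
# define these tables anywhere.
PROVIDER_OF = {
    "gcs": "google",
    "s3": "amazon",
    "hive": "apache",
    "vertica": "vertica",
}
PUBLIC_PROTOCOLS = {"sftp", "ftp", "ssh", "http"}


def transfer_operator_home(source: str, target: str) -> Optional[str]:
    """Return the provider package that would hold a source -> target operator.

    Returns None when the two rules do not decide the case, which
    corresponds to point 5 (discuss individual cases separately).
    """
    source_is_protocol = source in PUBLIC_PROTOCOLS
    target_is_protocol = target in PUBLIC_PROTOCOLS
    if source_is_protocol and target_is_protocol:
        # Neither side has a provider, so the rules give no answer.
        return None
    if target_is_protocol:
        # Point 4: only one side is a provider, so it wins even as the source.
        return PROVIDER_OF.get(source)
    # Point 3 (and point 4 with a protocol source): the target side wins.
    return PROVIDER_OF.get(target)
```

Under these assumptions, `transfer_operator_home("s3", "gcs")` gives "google" and `transfer_operator_home("gcs", "sftp")` also gives "google", matching the examples in the proposal.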
--

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
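Point 2 of the original proposal mentions keeping deprecated import aliases in the old places. One possible way to do that, sketched here under assumptions and not taken from the thread or the POC, is to export a subclass under the old name that emits a `DeprecationWarning` when used. `ExampleOperator`, `deprecated_alias`, and both import paths below are hypothetical stand-ins.

```python
import warnings


class ExampleOperator:
    """Stand-in for an operator at its new location (hypothetical)."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id


def deprecated_alias(new_cls: type, old_path: str, new_path: str) -> type:
    """Build a subclass of new_cls that warns whenever it is instantiated."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "{} is deprecated; use {} instead.".format(old_path, new_path),
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this shim
        )
        new_cls.__init__(self, *args, **kwargs)

    return type(new_cls.__name__, (new_cls,), {"__init__": __init__})


# What a module at the old import path could export (paths are made up).
OldExampleOperator = deprecated_alias(
    ExampleOperator,
    "airflow.contrib.operators.example_operator.ExampleOperator",
    "airflow.providers.example.operators.example.ExampleOperator",
)
```

Existing code that instantiates `OldExampleOperator` keeps working and still gets an `ExampleOperator` instance, while the warning nudges users towards the new path.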