Re: Generic Transfer Operator

2020-09-06 Thread Tomasz Urbaszek
I checked with our Beam team: DirectRunner is supported by the Python SDK and requires no JVM. That's the main reason I think it's worth considering :) A hard dependency on the JVM would probably be a no-go for us. https://beam.apache.org/documentation/runners/direct/ Tomek On Sun, Sep 6, 2020 at

Re: Generic Transfer Operator

2020-09-06 Thread Daniel Imberman
Oof ok yeah. I hadn't realized that Beam had a hard JVM requirement. I think that initially offering a local or block storage based solution with easy extensions for users is totally in line with the Airflow philosophy. I think that offering alternative transfer operators in providers is a great idea!

Re: Generic Transfer Operator

2020-09-06 Thread Ash Berlin-Taylor
No strong opinion - but it seems like generic is the easiest for us to code (as we have most of it already via hooks?) and adopt (and doesn't place a hard requirement on Beam/JVM, even if JVM would only be runtime. Still) This is possibly where Airflow has a core TransferOperator, and providers

Re: Generic Transfer Operator

2020-09-06 Thread Jarek Potiuk
+1. I'd also propose considering "both" rather than "vs". They do not have to be implemented at the same time, nor even by the same people. Those could even be done in two AIPs and we could vote whether we implement one, or both. J. On Sun, Sep 6, 2020 at 5:20 PM Tomasz Urbaszek wrote: > T

Re: Generic Transfer Operator

2020-09-06 Thread Tomasz Urbaszek
Thanks, Ash, for pointing to https://pypi.org/project/smart-open/ This one looks really interesting for blob storage transfers! As stated in the initial design doc, I don't think we should focus on best performance but rather on versatility. Currently, we have many AtoB operators that do not yield t

Re: PR Descriptions

2020-09-06 Thread Jarek Potiuk
+10 On Sun, Sep 6, 2020 at 12:56 PM Kaxil Naik wrote: > > Hi all, > > I have brought this topic on multiple occasions earlier too on the mailing > list. I sincerely request all the contributors and Committers (including > myself) that we add PR descriptions. > > This helps the community understan

Re: Generic Transfer Operator

2020-09-06 Thread Ash Berlin-Taylor
For background: in the past I had an S3 to S3 transfer using smart_open (since we wanted to split one giant ~300GB file into smaller parts) and it took about 10 minutes, so even "large" uses can work fine in Airflow - no JVM required. -ash On 6 September 2020 12:01:24 BST, Tomasz Urbaszek wrote: >I
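The kind of streaming split described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the transfer mentioned above: `split_stream`, `open_part`, `part_size`, and `chunk_size` are hypothetical names, and the sketch works on any binary file-like objects so it uses only the standard library.

```python
from typing import BinaryIO, Callable


def split_stream(
    src: BinaryIO,
    open_part: Callable[[int], BinaryIO],
    part_size: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src into consecutive parts of at most part_size bytes.

    open_part(i) is a hypothetical callback that returns a writable
    binary stream for part number i. Returns the number of parts written.
    """
    if part_size <= 0 or chunk_size <= 0:
        raise ValueError("part_size and chunk_size must be positive")

    parts = 0      # number of parts opened so far
    written = 0    # bytes written to the current part
    dst = None     # current output stream, opened lazily
    try:
        while True:
            # Never read past the boundary of the current part.
            chunk = src.read(min(chunk_size, part_size - written))
            if not chunk:
                break
            if dst is None:
                dst = open_part(parts)
                parts += 1
            dst.write(chunk)
            written += len(chunk)
            if written == part_size:
                # Current part is full: close it and start a new one
                # on the next non-empty read.
                dst.close()
                dst = None
                written = 0
    finally:
        if dst is not None:
            dst.close()
    return parts
```

With smart_open installed, `src` could presumably be something like `smart_open.open("s3://source-bucket/big-file", "rb")` and `open_part` could return `smart_open.open(f"s3://dest-bucket/part-{i}", "wb")`; the bucket and key names here are placeholders. Only one chunk is held in memory at a time, so the file size does not affect memory use.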

Re: Generic Transfer Operator

2020-09-06 Thread Tomasz Urbaszek
I think using the DirectRunner as the default, with the option to specify another setup, is a win-win. However, I have a few doubts about the Beam-based approach: 1. Dependency management. If I do `pip install apache-airflow[gcp]` will it install `apache-beam[gcp]`? What if there's a version clash betwee

PR Descriptions

2020-09-06 Thread Kaxil Naik
Hi all, I have brought up this topic on the mailing list on multiple occasions before. I sincerely request that all contributors and committers (including myself) add PR descriptions. This helps the community understand what the PRs do and abides by the ASF motto about "Community above C