I checked with our Beam team: DirectRunner is supported by the
Python SDK and requires no JVM. That's the main reason I think it's
worth considering :) A hard dependency on the JVM would probably be a
no-go for us.
https://beam.apache.org/documentation/runners/direct/
Tomek
On Sun, Sep 6, 2020 at
Oof ok yeah. I hadn't realized that Beam had a hard JVM requirement. I
think that initially offering a local or block storage based solution with
easy extensions for users is totally in line with the Airflow philosophy. I
think that offering alternative transfer operators in providers is a great
idea!
No strong opinion - but it seems like generic is the easiest for us to code (as
we have most of it already via hooks?) and adopt (it doesn't place a hard
requirement on Beam/JVM, which matters even if the JVM would only be needed
at runtime).
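
To make the "generic via hooks" option concrete, here is a minimal sketch of what a hook-to-hook transfer could look like. Every name in it (`SourceHook`, `DestinationHook`, `open_read`, `open_write`, `generic_transfer`, `InMemoryHook`) is hypothetical and invented for illustration; none of it is an existing Airflow API, and real hooks expose different methods.

```python
import io
from typing import BinaryIO, Dict, Protocol


class SourceHook(Protocol):
    """Hypothetical interface: anything that can open a path for reading."""

    def open_read(self, path: str) -> BinaryIO: ...


class DestinationHook(Protocol):
    """Hypothetical interface: anything that can open a path for writing."""

    def open_write(self, path: str) -> BinaryIO: ...


def generic_transfer(
    source: SourceHook,
    src_path: str,
    destination: DestinationHook,
    dst_path: str,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Stream bytes from source to destination in chunks; return bytes copied."""
    copied = 0
    with source.open_read(src_path) as reader, destination.open_write(dst_path) as writer:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            writer.write(chunk)
            copied += len(chunk)
    return copied


class InMemoryHook:
    """Toy hook backed by a dict, standing in for a real storage hook."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def open_read(self, path: str) -> BinaryIO:
        return io.BytesIO(self.blobs[path])

    def open_write(self, path: str) -> BinaryIO:
        hook = self

        class _Writer(io.BytesIO):
            def close(self) -> None:
                # Persist the buffer into the dict when the writer is closed.
                if not self.closed:
                    hook.blobs[path] = self.getvalue()
                super().close()

        return _Writer()
```

The point of the sketch is only that the transfer logic never needs to know which storage sits on either side, so a provider would only have to supply the two `open_*` methods.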
This is possibly where Airflow has a core TransferOperator, and
providers
+1. I'd also propose to consider "both" rather than "vs". They do not
have to be implemented at the same time, nor even by the same people.
Those could even be done in two AIPs and we could vote on whether we implement
one or both.
J.
On Sun, Sep 6, 2020 at 5:20 PM Tomasz Urbaszek wrote:
> T
Thanks, Ash, for pointing to https://pypi.org/project/smart-open/ This
one looks really interesting for blob storage transfers!
As stated in the initial design doc I don't think we should focus on
best performance but rather on versatility. Currently, we have many
AtoB operators that do not yield t
+10
On Sun, Sep 6, 2020 at 12:56 PM Kaxil Naik wrote:
>
> Hi all,
>
> I have brought up this topic on multiple occasions earlier too on the mailing
> list. I sincerely request all the contributors and Committers (including
> myself) that we add PR descriptions.
>
> This helps the community understan
For background: in the past I had an S3 to S3 transfer using smart_open (since
we wanted to split one giant ~300GB file into smaller parts) and it took about
10 minutes, so even "large" uses can work fine in Airflow - no JVM required.
-ash
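
A rough sketch of the kind of streaming split described above, for illustration only. It is not the code from that job: `split_stream` and its parameters are hypothetical names, and the `opener` argument defaults to the built-in `open` so the sketch runs without any third-party package.

```python
from typing import BinaryIO, Callable, List


def split_stream(
    src: BinaryIO,
    part_uri: Callable[[int], str],
    part_size: int,
    opener: Callable[..., BinaryIO] = open,
    chunk_size: int = 1024 * 1024,
) -> List[str]:
    """Copy src into consecutive parts of at most part_size bytes each.

    Only chunk_size bytes are held in memory at a time, so the total
    size of the input does not matter. Returns the URIs that were written.
    """
    written: List[str] = []
    index = 0
    while True:
        # Read the first chunk before opening a part, so no empty part is created.
        chunk = src.read(min(chunk_size, part_size))
        if not chunk:
            break
        uri = part_uri(index)
        with opener(uri, "wb") as dst:
            remaining = part_size
            while chunk:
                dst.write(chunk)
                remaining -= len(chunk)
                if remaining == 0:
                    break  # this part is full, start the next one
                chunk = src.read(min(chunk_size, remaining))
        written.append(uri)
        index += 1
    return written
```

Assuming smart_open is installed, passing `opener=smart_open.open`, a source opened with `smart_open.open("s3://...", "rb")`, and a `part_uri` that returns `s3://` URIs should give the S3 to S3 variant; the bucket and key names are left out here because they would be made up.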
On 6 September 2020 12:01:24 BST, Tomasz Urbaszek wrote:
>I
I think using the direct runner as the default, with the option to specify
another setup, is a win-win. However, I have a few doubts about the
Beam-based approach:
1. Dependency management. If I do `pip install apache-airflow[gcp]`
will it install `apache-beam[gcp]`? What if there's a version clash
betwee
Hi all,
I have brought up this topic on multiple occasions earlier too on the mailing
list. I sincerely request all the contributors and Committers (including
myself) that we add PR descriptions.
This helps the community understand what the PRs do and abides by the ASF
motto about "Community above C