For background: in the past I had an S3-to-S3 transfer using smart_open (since
we wanted to split one giant ~300GB file into smaller parts) and it took about
10 minutes, so even "large" uses can work fine in Airflow - no JVM required.
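
A minimal sketch of that kind of streaming split, not the code from the
original job. It assumes the file is line-oriented and that parts are cut on
line boundaries; the function name, the `dest_template` parameter and the part
naming are hypothetical. It uses the builtin `open` so it runs locally; passing
`smart_open.open` as `open_fn` should let the same loop read and write `s3://`
URIs.

```python
def split_into_parts(src_uri, dest_template, max_part_bytes, open_fn=open):
    """Stream src_uri line by line into numbered parts of at most
    max_part_bytes each (a single oversized line still gets its own part).

    open_fn defaults to the builtin open; smart_open.open could be passed
    instead so both URIs may point at S3 (assumption, not the original job).
    """
    part_paths = []
    out = None
    written = 0
    try:
        with open_fn(src_uri, "rb") as src:
            for line in src:
                # Roll over to a new part when nothing is open yet, or when
                # the current part already has data and this line would
                # push it past the size limit.
                if out is None or (written > 0 and written + len(line) > max_part_bytes):
                    if out is not None:
                        out.close()
                    path = dest_template.format(len(part_paths))
                    out = open_fn(path, "wb")
                    part_paths.append(path)
                    written = 0
                out.write(line)
                written += len(line)
    finally:
        # Close the last part even if reading or writing failed midway.
        if out is not None:
            out.close()
    return part_paths
```

Only one line is held in memory at a time, which is why a plain Python worker
can get through a file far larger than its RAM.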

-ash

On 6 September 2020 12:01:24 BST, Tomasz Urbaszek <[email protected]> wrote:
>I think using direct runner as default with the option to specify
>other setup is a win-win. However, there are a few doubts I have about
>Beam based approach:
>
>1. Dependency management. If I do `pip install apache-airflow[gcp]`
>will it install `apache-beam[gcp]`? What if there's a version clash
>between dependencies?
>
>2. The initial approach using `DataSource` concept allowed users to
>use it in any operator (not only transfer ones). In case of relying on
>Beam we are losing this.
>
>3. I'm not a Beam expert but it seems to not support any data lineage
>solution?
>
>
>On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
><[email protected]> wrote:
>>
>> I think there are absolutely use-cases for both. I’m totally fine
>with saying “for small/medium use-cases, we come with an in-house
>system. However for larger cases, you’ll require spark/Flink/S3. That’s
>totally in line with PLENTY of use-cases. This would be especially cool
>when matched with fast-follow as we could EVEN potentially tie in data
>locality.
>>
>> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett
><[email protected]> wrote:
>> I believe - for not large data - the direct runner is wholly doable,
>which
>> seems in line with airflow patterns. I have, and have spoken with
>several
>> others that have, been productive with that runner.
>>
>> For much larger transfers, the generic operator could accept
>parameters for
>> submitting the compute to an actual runner. Though, imagining that
>> (needing a runner) would not be the primary use case for such an
>operator.
>>
>>
>> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <[email protected]>
>wrote:
>>
>> > Austin, you are right, Beam covers all (and more) important IOs.
>> > However, using Apache Beam to design a generic transfer operator
>> > requires Airflow users to have additional resources that will be
>used
>> > as a runner (Spark, Flink, etc.). Unless you suggest using
>> > DirectRunner?
>> >
>> > Can you please tell us more how exactly you think we can use Beam
>for
>> > those Airflow transfer operators?
>> >
>> > Best,
>> > Tomek
>> >
>> >
>> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
>> > <[email protected]> wrote:
>> > >
>> > > Are there IOs that would be desired for a generic transfer
>operator that
>> > > don't exist in:
>https://beam.apache.org/documentation/io/built-in/ <-
>> > > there is pretty solid coverage?
>> > >
>> > > Beam is getting to the point where even python beam can leverage
>the java
>> > > IOs, which increases the range of IOs (and performance).
>> > >
>> > >
>> > >
>> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk
><[email protected]>
>> > > wrote:
>> > >
>> > > > But I believe those two ideas are separate ones as Tomek
>explained :)
>> > > >
>> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk
><[email protected]
>> > >
>> > > > wrote:
>> > > >
>> > > > > I love the idea of connecting the projects more closely!
>> > > > >
>> > > > > I've been helping recently as a consultant in improving the
>Apache
>> > Beam
>> > > > > build infrastructure (in many parts based on my Airflow
>experience
>> > and
>> > > > > Github Actions - even recently they adopted the "cancel"
>action I
>> > > > developed
>> > > > > for Apache Airflow).
>https://github.com/apache/beam/pull/12729
>> > > > >
>> > > > > Synergies in Apache projects are cool.
>> > > > >
>> > > > > J.
>> > > > >
>> > > > >
>> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > >> Agree on keeping those separate, just intervened as I believe it's a
>> > > > >> great idea. But let's keep @beam and @spark to a separate thread.
>> > > > >>
>> > > > >>
>> > > > >> Gerard Casas Saez
>> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>> > > > >>
>> > > > >>
>> > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <
>> > [email protected]>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Daniel is right we have few Apache Beam committers in
>Polidea so
>> > we
>> > > > >> > will ask for advice. However, I would be highly in favor
>of
>> > having it
>> > > > >> > as Gerard suggested as @beam decorator. This is something
>we
>> > should
>> > > > >> > put into another AIP together with the mentioned @spark
>decorator.
>> > > > >> >
>> > > > >> > Our proposition of transfer operators was mainly to create
>> > something
>> > > > >> > Airflow-native that works out of the box and allows us to
>simplify
>> > > > >> > read/write from external sources. Thus, it requires no
>external
>> > > > >> > dependency other than the library to communicate with the
>API. In
>> > the
>> > > > >> > case of Beam we need more than that I think.
>> > > > >> >
>> > > > >> > Additionally, the ideas of Source and Destination play
>nicely with
>> > > > >> > data lineage and may bring more interest to this feature
>of
>> > Airflow.
>> > > > >> >
>> > > > >> > Cheers,
>> > > > >> > Tomek
>> > > > >> >
>> > > > >> >
>> > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik
><[email protected]>
>> > > > wrote:
>> > > > >> > >
>> > > > >> > > Nice. Just a note here, we will need to make sure that those
>> > > > >> > > "Source" and "Destination" are serializable.
>> > > > >> > >
>> > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <
>> > > > [email protected]
>> > > > >> >
>> > > > >> > > wrote:
>> > > > >> > >
>> > > > >> > > > Interesting! Beam also could potentially allow transfers within
>> > > > >> > > > Dask/any other system with a java/python SDK? I think @jarek and
>> > > > >> > > > Polidea do a lot of work with Beam as well so I’d love their
>> > > > >> > > > thoughts if this is a good use-case.
>> > > > >> > > >
>> > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <
>> > > > >> > [email protected]>
>> > > > >> > > > wrote:
>> > > > >> > > > I would be highly in favour of having a generic Beam
>operator.
>> > > > >> Similar
>> > > > >> > > > to @spark_task decorator. Something where you can
>easily
>> > define
>> > > > and
>> > > > >> > wrap a
>> > > > >> > > > beam pipeline and convert it to an Airflow operator.
>> > > > >> > > >
>> > > > >> > > > Gerard Casas Saez
>> > > > >> > > > Twitter | Cortex | @casassaez
><http://twitter.com/casassaez>
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <
>> > > > >> > > > [email protected]>
>> > > > >> > > > wrote:
>> > > > >> > > >
>> > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp.
>> > > > >> > > > > if not doing transforms, it might be rather straightforward to
>> > > > >> > > > > rely on the ecosystem of connectors in that Apache Project to
>> > > > >> > > > > use as the foundations for a generic transfer operator.
>> > > > >> > > > >
>> > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <
>> > > > >> > [email protected]>
>> > > > >> > > > > wrote:
>> > > > >> > > > >
>> > > > >> > > > > > +1
>> > > > >> > > > > >
>> > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <
>> > > > >> > > > > > [email protected]>
>> > > > >> > > > > > wrote:
>> > > > >> > > > > >
>> > > > >> > > > > > > Hello all,
>> > > > >> > > > > > > since there have been no new comments shared in
>the POC
>> > doc
>> > > > >> > > > > > > <
>> > > > >> > > > > > >
>> > > > >> > > > > >
>> > > > >> > > > >
>> > > > >> > > >
>> > > > >> >
>> > > > >>
>> > > >
>> >
>https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit
>> > > > >> > > > > > > >
>> > > > >> > > > > > > for a couple of days, then I will proceed with
>creating
>> > an
>> > > > AIP
>> > > > >> > for
>> > > > >> > > > this
>> > > > >> > > > > > > feature, if that is ok with everybody.
>> > > > >> > > > > > > Best regards,
>> > > > >> > > > > > > Kamil
>> > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek
><
>> > > > >> > > > [email protected]
>> > > > >> > > > > >
>> > > > >> > > > > > > wrote:
>> > > > >> > > > > > >
>> > > > >> > > > > > > > I like the approach as it introduces another interesting
>> > > > >> > > > > > > > operators' interface standardization. It would be awesome
>> > > > >> > > > > > > > to hear more opinions :)
>> > > > >> > > > > > > >
>> > > > >> > > > > > > > Cheers,
>> > > > >> > > > > > > > Tomek
>> > > > >> > > > > > > >
>> > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <
>> > > > >> > > > > [email protected]
>> > > > >> > > > > > >
>> > > > >> > > > > > > > wrote:
>> > > > >> > > > > > > >
>> > > > >> > > > > > > > > I like the idea a lot. Similar things have
>been
>> > > > discussed
>> > > > >> > before
>> > > > >> > > > > but
>> > > > >> > > > > > > the
>> > > > >> > > > > > > > > proposal is I think rather pragmatic and
>solves a
>> > real
>> > > > >> > problem
>> > > > >> > > > (and
>> > > > >> > > > > > it
>> > > > >> > > > > > > > does
>> > > > >> > > > > > > > > not seem to be too complex to implement)
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > There is some discussion about it already in the document
>> > > > >> > > > > > > > > (please chime in if you are interested), but here are a few
>> > > > >> > > > > > > > > points why I like it:
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > - performance and optimization is not a
>focus for
>> > that.
>> > > > >> For
>> > > > >> > > > generic
>> > > > >> > > > > > > stuff
>> > > > >> > > > > > > > > it is usually to write "optimal" solution
>but once
>> > you
>> > > > >> admit
>> > > > >> > you
>> > > > >> > > > > are
>> > > > >> > > > > > > not
>> > > > >> > > > > > > > > going to focus for optimisation, you come
>with
>> > simpler
>> > > > and
>> > > > >> > easier
>> > > > >> > > > > to
>> > > > >> > > > > > > use
>> > > > >> > > > > > > > > solutions
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > - on the other hand - it uses very
>"Python'y"
>> > approach
>> > > > >> with
>> > > > >> > using
>> > > > >> > > > > > > > > Airflow's familiar concepts (connection,
>transfer)
>> > and
>> > > > has
>> > > > >> > the
>> > > > >> > > > > > > potential
>> > > > >> > > > > > > > of
>> > > > >> > > > > > > > > plugging in into 100s of hooks we have
>already
>> > easily -
>> > > > >> > > > leveraging
>> > > > >> > > > > > all
>> > > > >> > > > > > > > the
>> > > > >> > > > > > > > > "providers" richness of Airflow.
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > - it aims to be easy to do "quick start" -
>if you
>> > have a
>> > > > >> > number
>> > > > >> > > > of
>> > > > >> > > > > > > > > different sources/targets and as a data
>scientist
>> > you
>> > > > >> would
>> > > > >> > like
>> > > > >> > > > to
>> > > > >> > > > > > > > quickly
>> > > > >> > > > > > > > > start transferring data between them - you
>can do it
>> > > > >> easily
>> > > > >> > with
>> > > > >> > > > > > only
>> > > > >> > > > > > > > > basic python knowledge and simple DAG
>structure.
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > - it should be possible to plug it in into
>our new
>> > > > >> functional
>> > > > >> > > > > > approach
>> > > > >> > > > > > > as
>> > > > >> > > > > > > > > well as future lineage discussions as it
>makes
>> > > > connection
>> > > > >> > between
>> > > > >> > > > > > > sources
>> > > > >> > > > > > > > > and targets
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > - it opens up possibilities of adding simple
>and
>> > > > flexible
>> > > > >> > data
>> > > > >> > > > > > > > > transformation on-transfer. Not a
>replacement for
>> > any of
>> > > > >> the
>> > > > >> > > > > external
>> > > > >> > > > > > > > > services that Airflow should use (Airflow is
>an
>> > > > >> > orchestrator, not
>> > > > >> > > > > > data
>> > > > >> > > > > > > > > processing solution) but for the kind of
>quick-start
>> > > > >> > scenarios I
>> > > > >> > > > > > > foresee
>> > > > >> > > > > > > > it
>> > > > >> > > > > > > > > might be most useful, being able to apply
>simple
>> > data
>> > > > >> > > > > transformation
>> > > > >> > > > > > on
>> > > > >> > > > > > > > the
>> > > > >> > > > > > > > > fly by data scientist might be a big plus.
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > Suggestion: Pandas DataFrame as the format of the "data"
>> > > > >> > > > > > > > > component
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > Kamil - you should have access now.
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > J.
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil
>Olszewski <
>> > > > >> > > > > > > > > [email protected]>
>> > > > >> > > > > > > > > wrote:
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > > Hello all,
>> > > > >> > > > > > > > > > in Polidea we have come up with an idea
>for a
>> > generic
>> > > > >> > transfer
>> > > > >> > > > > > > operator
>> > > > >> > > > > > > > > > that would be able to transport data
>between two
>> > > > >> > destinations
>> > > > >> > > > of
>> > > > >> > > > > > > > various
>> > > > >> > > > > > > > > > types (file, database, storage, etc.) -
>please
>> > find
>> > > > the
>> > > > >> > link
>> > > > >> > > > > with a
>> > > > >> > > > > > > > short
>> > > > >> > > > > > > > > > doc with POC
>> > > > >> > > > > > > > > > <
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > >
>> > > > >> > > > > > >
>> > > > >> > > > > >
>> > > > >> > > > >
>> > > > >> > > >
>> > > > >> >
>> > > > >>
>> > > >
>> >
>https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing
>> > > > >> > > > > > > > > > >
>> > > > >> > > > > > > > > > where we can discuss the design initially.
>Once we
>> > > > come
>> > > > >> to
>> > > > >> > the
>> > > > >> > > > > > > initial
>> > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki -
>can I
>> > ask
>> > > > for
>> > > > >> > > > > permission
>> > > > >> > > > > > to
>> > > > >> > > > > > > > do
>> > > > >> > > > > > > > > so
>> > > > >> > > > > > > > > > (my id is 'kamil.olszewski')? I believe
>that
>> > during
>> > > > the
>> > > > >> > > > > discussion
>> > > > >> > > > > > we
>> > > > >> > > > > > > > > > should definitely aim for this feature to
>be
>> > released
>> > > > >> only
>> > > > >> > > > after
>> > > > >> > > > > > > > Airflow
>> > > > >> > > > > > > > > > 2.0 is out.
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > > > What do you think about this idea? Would
>you find
>> > such
>> > > > >> an
>> > > > >> > > > > operator
>> > > > >> > > > > > > > > helpful
>> > > > >> > > > > > > > > > in your pipelines? Maybe you already use a
>similar
>> > > > >> > solution or
>> > > > >> > > > > know
>> > > > >> > > > > > > > > > packages that could be used to implement
>it?
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > > > Best regards,
>> > > > >> > > > > > > > > > --
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > > > Kamil Olszewski
>> > > > >> > > > > > > > > > Polidea <https://www.polidea.com> |
>Software
>> > Engineer
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > > > M: +48 503 361 783
>> > > > >> > > > > > > > > > E: [email protected]
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > > > Unique Tech
>> > > > >> > > > > > > > > > Check out our projects! <
>> > > > >> https://www.polidea.com/our-work>
>> > > > >> > > > > > > > > >
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > --
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > Jarek Potiuk
>> > > > >> > > > > > > > > Polidea <https://www.polidea.com/> |
>Principal
>> > Software
>> > > > >> > Engineer
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
>> > > > >> > > > > > > > > [image: Polidea] <https://www.polidea.com/>
>> > > > >> > > > > > > > >
>> > > > >> > > > > > > >
>> > > > >> > > > > > >
>> > > > >> > > > > > >
>> > > > >> > > > > > > --
>> > > > >> > > > > > >
>> > > > >> > > > > > > Kamil Olszewski
>> > > > >> > > > > > > Polidea <https://www.polidea.com> | Software
>Engineer
>> > > > >> > > > > > >
>> > > > >> > > > > > > M: +48 503 361 783
>> > > > >> > > > > > > E: [email protected]
>> > > > >> > > > > > >
>> > > > >> > > > > > > Unique Tech
>> > > > >> > > > > > > Check out our projects! <
>> > https://www.polidea.com/our-work>
>> > > > >> > > > > > >
>> > > > >> > > > > >
>> > > > >> > > > > >
>> > > > >> > > > > > --
>> > > > >> > > > > >
>> > > > >> > > > > > Jarek Potiuk
>> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal
>Software
>> > > > >> Engineer
>> > > > >> > > > > >
>> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
>> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
>> > > > >> > > > > >
>> > > > >> > > > >
>> > > > >> >
>> > > > >> >
>> > > > >> >
>> > > > >> > --
>> > > > >> >
>> > > > >> > Tomasz Urbaszek
>> > > > >> > Polidea | Software Engineer
>> > > > >> >
>> > > > >> > M: +48 505 628 493
>> > > > >> > E: [email protected]
>> > > > >> >
>> > > > >> > Unique Tech
>> > > > >> > Check out our projects!
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > > > --
>> > > > >
>> > > > > Jarek Potiuk
>> > > > > Polidea <https://www.polidea.com/> | Principal Software
>Engineer
>> > > > >
>> > > > > M: +48 660 796 129 <+48660796129>
>> > > > > [image: Polidea] <https://www.polidea.com/>
>> > > > >
>> > > > >
>> > > >
>> > > > --
>> > > >
>> > > > Jarek Potiuk
>> > > > Polidea <https://www.polidea.com/> | Principal Software
>Engineer
>> > > >
>> > > > M: +48 660 796 129 <+48660796129>
>> > > > [image: Polidea] <https://www.polidea.com/>
>> > > >
>> >
