I think using the direct runner as the default, with the option to specify
another setup, is a win-win. However, I have a few doubts about the
Beam-based approach:

1. Dependency management. If I do `pip install apache-airflow[gcp]`
will it install `apache-beam[gcp]`? What if there's a version clash
between dependencies?

2. The initial approach using the `DataSource` concept allowed users to
use it in any operator (not only transfer ones). If we rely on
Beam, we lose this.

3. I'm not a Beam expert, but it does not seem to support any data lineage solution?
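
To make point 2 a bit more concrete, here is a rough sketch of what I
mean. Only the `DataSource` name comes from the discussion; its `read()`
method, `InMemorySource`, `transfer` and `count_rows` are made up for
illustration and are not an actual Airflow or Beam API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, List


class DataSource(ABC):
    """Hypothetical interface: anything that can yield records."""

    @abstractmethod
    def read(self) -> Iterable[Any]:
        ...


class InMemorySource(DataSource):
    """Toy source standing in for a file, a table, a bucket, etc."""

    def __init__(self, rows: List[Any]) -> None:
        self.rows = rows

    def read(self) -> Iterable[Any]:
        return iter(self.rows)


def transfer(source: DataSource, write: Callable[[Any], None]) -> int:
    """Roughly what a transfer operator would do: copy every record."""
    count = 0
    for row in source.read():
        write(row)
        count += 1
    return count


def count_rows(source: DataSource) -> int:
    """A non-transfer task (e.g. a simple check) reusing the same source."""
    return sum(1 for _ in source.read())


source = InMemorySource([{"id": 1}, {"id": 2}, {"id": 3}])
destination: List[Any] = []
copied = transfer(source, destination.append)
checked = count_rows(source)
```

The same `source` object feeds both the transfer and the unrelated check,
which is the kind of reuse I would not want to lose.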


On Sun, Sep 6, 2020 at 6:15 AM Daniel Imberman
<[email protected]> wrote:
>
> I think there are absolutely use-cases for both. I’m totally fine with saying
> “for small/medium use-cases, we come with an in-house system. However, for
> larger cases, you’ll require Spark/Flink/S3.” That’s totally in line with
> PLENTY of use-cases. This would be especially cool when matched with
> fast-follow, as we could EVEN potentially tie in data locality.
>
> via Newton Mail 
> [https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.50&pv=10.15.6&source=email_footer_2]
> On Sat, Sep 5, 2020 at 5:11 PM, Austin Bennett <[email protected]> 
> wrote:
> I believe - for not large data - the direct runner is wholly doable, which
> seems in line with airflow patterns. I have, and have spoken with several
> others that have, been productive with that runner.
>
> For much larger transfers, the generic operator could accept parameters for
> submitting the compute to an actual runner. Though I imagine that
> (needing a runner) would not be the primary use case for such an operator.
>
>
> On Tue, Sep 1, 2020, 11:52 PM Tomasz Urbaszek <[email protected]> wrote:
>
> > Austin, you are right, Beam covers all (and more) important IOs.
> > However, using Apache Beam to design a generic transfer operator
> > requires Airflow users to have additional resources that will be used
> > as a runner (Spark, Flink, etc.). Unless you suggest using
> > DirectRunner?
> >
> > Can you please tell us more about how exactly you think we can use Beam for
> > those Airflow transfer operators?
> >
> > Best,
> > Tomek
> >
> >
> > On Wed, Sep 2, 2020 at 12:37 AM Austin Bennett
> > <[email protected]> wrote:
> > >
> > > Are there IOs that would be desired for a generic transfer operator that
> > > don't exist in: https://beam.apache.org/documentation/io/built-in/ <-
> > > there is pretty solid coverage?
> > >
> > > Beam is getting to the point where even python beam can leverage the java
> > > IOs, which increases the range of IOs (and performance).
> > >
> > >
> > >
> > > On Tue, Sep 1, 2020 at 3:24 PM Jarek Potiuk <[email protected]>
> > > wrote:
> > >
> > > > But I believe those two ideas are separate ones as Tomek explained :)
> > > >
> > > > On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <[email protected]>
> > > > wrote:
> > > >
> > > > > I love the idea of connecting the projects more closely!
> > > > >
> > > > > I've been helping recently as a consultant in improving the Apache Beam
> > > > > build infrastructure (in many parts based on my Airflow experience and
> > > > > GitHub Actions - even recently they adopted the "cancel" action I developed
> > > > > for Apache Airflow). https://github.com/apache/beam/pull/12729
> > > > >
> > > > > Synergies in Apache projects are cool.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> > > > > <[email protected]> wrote:
> > > > >
> > > > > >> Agree on keeping those separate, just intervened as I believe it's a great
> > > > > >> idea. But let's keep @beam and @spark to a separate thread.
> > > > >>
> > > > >>
> > > > >> Gerard Casas Saez
> > > > >> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > > > >>
> > > > >>
> > > > > >> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <[email protected]>
> > > > > >> wrote:
> > > > >>
> > > > > >> > Daniel is right, we have a few Apache Beam committers in Polidea, so we
> > > > > >> > will ask for advice. However, I would be highly in favor of having it,
> > > > > >> > as Gerard suggested, as a @beam decorator. This is something we should
> > > > > >> > put into another AIP together with the mentioned @spark decorator.
> > > > > >> >
> > > > > >> > Our proposition of transfer operators was mainly to create something
> > > > > >> > Airflow-native that works out of the box and allows us to simplify
> > > > > >> > read/write from external sources. Thus, it requires no external
> > > > > >> > dependency other than the library to communicate with the API. In the
> > > > > >> > case of Beam we need more than that, I think.
> > > > > >> >
> > > > > >> > Additionally, the ideas of Source and Destination play nicely with
> > > > > >> > data lineage and may bring more interest to this feature of Airflow.
> > > > >> >
> > > > >> > Cheers,
> > > > >> > Tomek
> > > > >> >
> > > > >> >
> > > > > >> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <[email protected]> wrote:
> > > > > >> > >
> > > > > >> > > Nice. Just a note here: we will need to make sure that those "Source"
> > > > > >> > > and "Destination" objects are serializable.
> > > > >> > >
> > > > > >> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <[email protected]>
> > > > > >> > > wrote:
> > > > >> > >
> > > > > >> > > > Interesting! Beam also could potentially allow transfers within Dask/any
> > > > > >> > > > other system with a Java/Python SDK? I think @jarek and Polidea do a lot of
> > > > > >> > > > work with Beam as well, so I’d love their thoughts on whether this is a good
> > > > > >> > > > use-case.
> > > > >> > > >
> > > > > >> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez <[email protected]>
> > > > > >> > > > wrote:
> > > > > >> > > > I would be highly in favour of having a generic Beam operator, similar
> > > > > >> > > > to the @spark_task decorator: something where you can easily define and wrap a
> > > > > >> > > > Beam pipeline and convert it to an Airflow operator.
> > > > >> > > >
> > > > >> > > > Gerard Casas Saez
> > > > >> > > > Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
> > > > >> > > >
> > > > >> > > >
> > > > > >> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett <[email protected]>
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if not
> > > > > >> > > > > doing transforms, it might be rather straightforward to rely on the ecosystem
> > > > > >> > > > > of connectors in that Apache project to use as the foundation for a
> > > > > >> > > > > generic transfer operator.
> > > > >> > > > >
> > > > > >> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <[email protected]>
> > > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > +1
> > > > >> > > > > >
> > > > > >> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski <[email protected]>
> > > > > >> > > > > > wrote:
> > > > >> > > > > >
> > > > > >> > > > > > > Hello all,
> > > > > >> > > > > > > since there have been no new comments shared in the POC doc
> > > > > >> > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
> > > > > >> > > > > > > for a couple of days, I will proceed with creating an AIP for this
> > > > > >> > > > > > > feature, if that is OK with everybody.
> > > > >> > > > > > > Best regards,
> > > > >> > > > > > > Kamil
> > > > > >> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek <[email protected]>
> > > > > >> > > > > > > wrote:
> > > > >> > > > > > >
> > > > > >> > > > > > > > I like the approach as it introduces another interesting
> > > > > >> > > > > > > > standardization of the operators' interface. It would be awesome to hear
> > > > > >> > > > > > > > more opinions :)
> > > > >> > > > > > > >
> > > > >> > > > > > > > Cheers,
> > > > >> > > > > > > > Tomek
> > > > >> > > > > > > >
> > > > > >> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk <[email protected]>
> > > > > >> > > > > > > > wrote:
> > > > >> > > > > > > >
> > > > > >> > > > > > > > > I like the idea a lot. Similar things have been discussed before, but the
> > > > > >> > > > > > > > > proposal is, I think, rather pragmatic and solves a real problem (and it
> > > > > >> > > > > > > > > does not seem to be too complex to implement).
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > There is some discussion about it already in the document (please
> > > > > >> > > > > > > > > chime in, those of you who are interested), but here are a few points
> > > > > >> > > > > > > > > why I like it:
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - performance and optimization are not a focus for that. For generic stuff
> > > > > >> > > > > > > > > it is usually hard to write an "optimal" solution, but once you admit you
> > > > > >> > > > > > > > > are not going to focus on optimisation, you come up with simpler and
> > > > > >> > > > > > > > > easier-to-use solutions
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - on the other hand, it uses a very "Python'y" approach, using
> > > > > >> > > > > > > > > Airflow's familiar concepts (connection, transfer), and has the potential
> > > > > >> > > > > > > > > of plugging easily into the 100s of hooks we already have, leveraging all
> > > > > >> > > > > > > > > the "providers" richness of Airflow.
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - it aims to make a "quick start" easy: if you have a number of
> > > > > >> > > > > > > > > different sources/targets and as a data scientist you would like to
> > > > > >> > > > > > > > > quickly start transferring data between them, you can do it easily with
> > > > > >> > > > > > > > > only basic Python knowledge and a simple DAG structure.
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - it should be possible to plug it into our new functional approach
> > > > > >> > > > > > > > > as well as future lineage discussions, as it makes a connection between
> > > > > >> > > > > > > > > sources and targets
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > - it opens up possibilities of adding simple and flexible data
> > > > > >> > > > > > > > > transformation on-transfer. Not a replacement for any of the external
> > > > > >> > > > > > > > > services that Airflow should use (Airflow is an orchestrator, not a data
> > > > > >> > > > > > > > > processing solution), but for the kind of quick-start scenarios I foresee
> > > > > >> > > > > > > > > it might be most useful; being able to apply simple data transformations
> > > > > >> > > > > > > > > on the fly by a data scientist might be a big plus.
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Suggestion: Pandas DataFrame as the format of the "data" component
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Kamil - you should have access now.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > J.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski <[email protected]>
> > > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > > >> > > > > > > > > > Hello all,
> > > > > >> > > > > > > > > > in Polidea we have come up with an idea for a generic transfer operator
> > > > > >> > > > > > > > > > that would be able to transport data between two destinations of various
> > > > > >> > > > > > > > > > types (file, database, storage, etc.) - please find the link to a short
> > > > > >> > > > > > > > > > doc with a POC
> > > > > >> > > > > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>
> > > > > >> > > > > > > > > > where we can discuss the design initially. Once we come to an initial
> > > > > >> > > > > > > > > > conclusion I can create an AIP on cWiki - can I ask for permission to do
> > > > > >> > > > > > > > > > so (my id is 'kamil.olszewski')? I believe that during the discussion we
> > > > > >> > > > > > > > > > should definitely aim for this feature to be released only after Airflow
> > > > > >> > > > > > > > > > 2.0 is out.
> > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > What do you think about this idea? Would you find such an operator
> > > > > >> > > > > > > > > > helpful in your pipelines? Maybe you already use a similar solution or
> > > > > >> > > > > > > > > > know packages that could be used to implement it?
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Best regards,
> > > > >> > > > > > > > > > --
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Kamil Olszewski
> > > > > >> > > > > > > > > > Polidea <https://www.polidea.com> | Software Engineer
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > M: +48 503 361 783
> > > > >> > > > > > > > > > E: [email protected]
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Unique Tech
> > > > > >> > > > > > > > > > Check out our projects! <https://www.polidea.com/our-work>
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Jarek Potiuk
> > > > > >> > > > > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > M: +48 660 796 129 <+48660796129>
> > > > >> > > > > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > --
> > > > >> > > > > > >
> > > > >> > > > > > > Kamil Olszewski
> > > > >> > > > > > > Polidea <https://www.polidea.com> | Software Engineer
> > > > >> > > > > > >
> > > > >> > > > > > > M: +48 503 361 783
> > > > >> > > > > > > E: [email protected]
> > > > >> > > > > > >
> > > > >> > > > > > > Unique Tech
> > > > >> > > > > > > Check out our projects! <
> > https://www.polidea.com/our-work>
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > --
> > > > >> > > > > >
> > > > >> > > > > > Jarek Potiuk
> > > > >> > > > > > Polidea <https://www.polidea.com/> | Principal Software
> > > > >> Engineer
> > > > >> > > > > >
> > > > >> > > > > > M: +48 660 796 129 <+48660796129>
> > > > >> > > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> >
> > > > >> > Tomasz Urbaszek
> > > > >> > Polidea | Software Engineer
> > > > >> >
> > > > >> > M: +48 505 628 493
> > > > >> > E: [email protected]
> > > > >> >
> > > > >> > Unique Tech
> > > > >> > Check out our projects!
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > >
> > > > > M: +48 660 796 129 <+48660796129>
> > > > > [image: Polidea] <https://www.polidea.com/>
> > > > >
> > > > >
> > > >
> > > > --
> > > >
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
> > > >
> >
