But I believe those two ideas are separate ones, as Tomek explained :)

On Wed, Sep 2, 2020 at 12:03 AM Jarek Potiuk <[email protected]> wrote:
> I love the idea of connecting the projects more closely!
>
> I've been helping recently as a consultant in improving the Apache Beam
> build infrastructure (in many parts based on my Airflow experience and
> GitHub Actions - even recently they adopted the "cancel" action I developed
> for Apache Airflow). https://github.com/apache/beam/pull/12729
>
> Synergies in Apache projects are cool.
>
> J.
>
> On Tue, Sep 1, 2020 at 11:16 PM Gerard Casas Saez
> <[email protected]> wrote:
>
>> Agree on keeping those separate, just intervened as I believe it's a great
>> idea. But let's keep @beam and @spark to a separate thread.
>>
>> Gerard Casas Saez
>> Twitter | Cortex | @casassaez <http://twitter.com/casassaez>
>>
>> On Tue, Sep 1, 2020 at 2:14 PM Tomasz Urbaszek <[email protected]>
>> wrote:
>>
>> > Daniel is right, we have a few Apache Beam committers in Polidea so we
>> > will ask for advice. However, I would be highly in favor of having it
>> > as Gerard suggested as @beam decorator. This is something we should
>> > put into another AIP together with the mentioned @spark decorator.
>> >
>> > Our proposition of transfer operators was mainly to create something
>> > Airflow-native that works out of the box and allows us to simplify
>> > read/write from external sources. Thus, it requires no external
>> > dependency other than the library to communicate with the API. In the
>> > case of Beam we need more than that I think.
>> >
>> > Additionally, the ideas of Source and Destination play nicely with
>> > data lineage and may bring more interest to this feature of Airflow.
>> >
>> > Cheers,
>> > Tomek
>> >
>> > On Tue, Sep 1, 2020 at 9:31 PM Kaxil Naik <[email protected]> wrote:
>> > >
>> > > Nice. Just a note here, we will need to make sure that those "Source"
>> > > and "Destination" are serializable.
>> > >
>> > > On Tue, Sep 1, 2020, 20:00 Daniel Imberman <[email protected]>
>> > > wrote:
>> > >
>> > > > Interesting!
>> > > > Beam also could potentially allow transfers within Dask/any
>> > > > other system with a Java/Python SDK? I think @jarek and Polidea do a
>> > > > lot of work with Beam as well, so I'd love their thoughts on whether
>> > > > this is a good use case.
>> > > >
>> > > > On Tue, Sep 1, 2020 at 11:46 AM, Gerard Casas Saez
>> > > > <[email protected]> wrote:
>> > > > I would be highly in favour of having a generic Beam operator. Similar
>> > > > to @spark_task decorator. Something where you can easily define and
>> > > > wrap a Beam pipeline and convert it to an Airflow operator.
>> > > >
>> > > > On Tue, Sep 1, 2020 at 12:44 PM Austin Bennett
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > Are you guys familiar with Beam <https://beam.apache.org>? Esp. if
>> > > > > not doing transforms, it might be rather straightforward to rely on
>> > > > > the ecosystem of connectors in that Apache project to use as the
>> > > > > foundations for a generic transfer operator.
>> > > > >
>> > > > > On Tue, Sep 1, 2020 at 11:05 AM Jarek Potiuk <[email protected]>
>> > > > > wrote:
>> > > > >
>> > > > > > +1
>> > > > > >
>> > > > > > On Tue, Sep 1, 2020 at 1:35 PM Kamil Olszewski
>> > > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > > Hello all,
>> > > > > > > since there have been no new comments shared in the POC doc
>> > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit>
>> > > > > > > for a couple of days, I will proceed with creating an AIP for
>> > > > > > > this feature, if that is ok with everybody.
>> > > > > > > Best regards,
>> > > > > > > Kamil
>> > > > > > >
>> > > > > > > On Thu, Aug 27, 2020 at 10:50 AM Tomasz Urbaszek
>> > > > > > > <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > I like the approach as it introduces another interesting
>> > > > > > > > standardization of the operators' interface. It would be
>> > > > > > > > awesome to hear more opinions :)
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > Tomek
>> > > > > > > >
>> > > > > > > > On Wed, Aug 19, 2020 at 8:10 PM Jarek Potiuk
>> > > > > > > > <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > I like the idea a lot. Similar things have been discussed
>> > > > > > > > > before but the proposal is I think rather pragmatic and
>> > > > > > > > > solves a real problem (and it does not seem to be too
>> > > > > > > > > complex to implement)
>> > > > > > > > >
>> > > > > > > > > There is some discussion about it already in the document
>> > > > > > > > > (please chime in if you are interested) but here are a few
>> > > > > > > > > points why I like it:
>> > > > > > > > >
>> > > > > > > > > - performance and optimization are not a focus for that. For
>> > > > > > > > > generic stuff it is usually to write "optimal" solution but
>> > > > > > > > > once you admit you are not going to focus on optimisation,
>> > > > > > > > > you come up with simpler and easier to use solutions
>> > > > > > > > >
>> > > > > > > > > - on the other hand - it uses a very "Python'y" approach,
>> > > > > > > > > using Airflow's familiar concepts (connection, transfer),
>> > > > > > > > > and has the potential of plugging easily into the 100s of
>> > > > > > > > > hooks we have already - leveraging all the "providers"
>> > > > > > > > > richness of Airflow.
>> > > > > > > > >
>> > > > > > > > > - it aims to be easy to do "quick start" - if you have a
>> > > > > > > > > number of different sources/targets and as a data scientist
>> > > > > > > > > you would like to quickly start transferring data between
>> > > > > > > > > them - you can do it easily with only basic Python knowledge
>> > > > > > > > > and a simple DAG structure.
>> > > > > > > > >
>> > > > > > > > > - it should be possible to plug it into our new functional
>> > > > > > > > > approach as well as future lineage discussions, as it makes
>> > > > > > > > > a connection between sources and targets
>> > > > > > > > >
>> > > > > > > > > - it opens up possibilities of adding simple and flexible
>> > > > > > > > > data transformation on-transfer. Not a replacement for any
>> > > > > > > > > of the external services that Airflow should use (Airflow is
>> > > > > > > > > an orchestrator, not a data processing solution), but for
>> > > > > > > > > the kind of quick-start scenarios where I foresee it might
>> > > > > > > > > be most useful, being able to apply simple data
>> > > > > > > > > transformation on the fly by a data scientist might be a big
>> > > > > > > > > plus.
>> > > > > > > > >
>> > > > > > > > > Suggestion: pandas DataFrame as the format of the "data"
>> > > > > > > > > component
>> > > > > > > > >
>> > > > > > > > > Kamil - you should have access now.
>> > > > > > > > >
>> > > > > > > > > J.
>> > > > > > > > >
>> > > > > > > > > On Tue, Aug 18, 2020 at 6:53 PM Kamil Olszewski
>> > > > > > > > > <[email protected]> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hello all,
>> > > > > > > > > > in Polidea we have come up with an idea for a generic
>> > > > > > > > > > transfer operator that would be able to transport data
>> > > > > > > > > > between two destinations of various types (file, database,
>> > > > > > > > > > storage, etc.) - please find the link with a short doc
>> > > > > > > > > > with POC
>> > > > > > > > > > <https://docs.google.com/document/d/1o7Ph7RRNqLWkTbe7xkWjb100eFaK1Apjv27LaqHgNkE/edit?usp=sharing>
>> > > > > > > > > > where we can discuss the design initially. Once we come to
>> > > > > > > > > > the initial conclusion I can create an AIP on cWiki - can
>> > > > > > > > > > I ask for permission to do so (my id is
>> > > > > > > > > > 'kamil.olszewski')? I believe that during the discussion
>> > > > > > > > > > we should definitely aim for this feature to be released
>> > > > > > > > > > only after Airflow 2.0 is out.
>> > > > > > > > > >
>> > > > > > > > > > What do you think about this idea? Would you find such an
>> > > > > > > > > > operator helpful in your pipelines? Maybe you already use
>> > > > > > > > > > a similar solution or know packages that could be used to
>> > > > > > > > > > implement it?
>> > > > > > > > > >
>> > > > > > > > > > Best regards,
>> > > > > > > > > > --
>> > > > > > > > > > Kamil Olszewski
>> > > > > > > > > > Polidea <https://www.polidea.com> | Software Engineer
>> > > > > > > > > >
>> > > > > > > > > > M: +48 503 361 783
>> > > > > > > > > > E: [email protected]
>> > > > > > > > > >
>> > > > > > > > > > Unique Tech
>> > > > > > > > > > Check out our projects! <https://www.polidea.com/our-work>
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Jarek Potiuk
>> > > > > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
>> > > > > > > > >
>> > > > > > > > > M: +48 660 796 129 <+48660796129>
>> > > > > > > > > [image: Polidea] <https://www.polidea.com/>
>> >
>> > --
>> > Tomasz Urbaszek
>> > Polidea | Software Engineer
>> >
>> > M: +48 505 628 493
>> > E: [email protected]
>> >
>> > Unique Tech
>> > Check out our projects!
--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
