I think it would be great to hear from others what they think - but here
are my thoughts:

> 1) CWL-Airflow package is more of a converter. We created a special class
> CWLDAG (inherited from DAG), which allows us to set the path to the CWL
> file (a separate YAML file). Once parsed by Airflow the newly created DAG
> will have the structure similar to the workflow described in CWL file. The
> actual dag file that you keep in the dags folder is still a python file,
> but it would be nice if we could parse CWL files directly from the dags
> folder without any intermediate python files.
>

I think keeping the converted .py files is a far better idea from the
Airflow POV. Using Python as the "source of truth" is the core of Airflow,
and I think what you describe would require two things: 1) the conversion
would have to be perfect and always produce an exact 1-1 equivalent DAG
(more on that in the oozie-2-airflow part below, and on why O2A is
intentionally not a perfect converter); 2) it would require some kind of
abstract representation of the workflow.
We are working on an abstract representation of the DAG in the DAG
serialization effort, but it is still very much Python-driven (it's a
product of running Python code), and I think we would not want to be tied
to the CWL definition, so keeping Python code as the common source of truth
seems better for us.


> 2) We are trying to keep our package as simple as possible, so it
> shouldn’t cause any dependency issues in the future. We more depend on the
> changes in the CWL standard and its reference implementation (cwltool)
> rather than changes in Airflow. We are actively using XCom to transfer some
> metadata between the tasks, but, as far as I understand, it’s one of the
> “core ideas” of Airflow, so it shouldn’t be dramatically changed in the
> future version.
>

That's also why keeping Python code as the common "interface" seems like a
good idea. I think it is the most "stable" part of how Airflow defines
workflows, because the goal is to minimize changes for users who already
have plenty of DAGs. So even if we introduce some incompatibilities in 2.0,
maybe we will even provide some tools to automatically convert 1.10 DAGs
into 2.0 DAGs (something I have thought about already but haven't yet
discussed in the community). The internal/more abstract representations of
DAGs are prone to change (and they will change for sure in 2.0).


> 3) I would like to see CWL-Airflow be actively used by other people.
> Currently, CWL community is more centered around scientific pipelines,
> whereas Airflow is closer to commercial or business-related workflows. I
> believe that CWL as a standard can be used in both of these areas.
> Additionally, CWL standard is more suitable for data-driven workflow
> management model, which, as far as I understand, Airflow lacks at the
> moment. As for the release cycle of CWL and Airflow, I'm not ready to
> answer it now.


I still do not see what the benefits of the Airflow team maintaining a CWL
converter would be :). I think we will always think in the "Python" way.
That's one of the main advantages of Airflow, I think - especially in a
world of rapid prototyping and fast iteration on workflows. People writing
the DAGs (data scientists) do not have to learn yet-another-standard; they
use a familiar language - Python - and ready-to-use building blocks -
Operators - to define their workflows. I think the whole Airflow community
is centered around that concept. I do not think the Airflow community would
like to swap that paradigm for something else (I actually love that
paradigm). But I might be wrong, of course, so I will let others chime in.


> 4) Agree, it makes sense, but then we will discuss how to become a part of
> that ecosystem of packages:)


From what I see now, here is what I think we can do rather than donating
the code to Airflow: we are currently working on a new "landing page" for
Airflow (current prototype here:
https://airflow-website-s5z26d2t7a-ew.a.run.app/), so maybe we can simply
add a page there about "integration with other workflow engines" and list
Oozie2Airflow, CWLToAirflow and any future "To Airflow" converters. I can
talk to the team that works on it, and I think it's generally a good idea
to have such a page.

> 5) I think every spec has its pros and cons and it will take ages to
> include all of them. Moreover, it will become a nightmare for those who
> will try to use that system. That's why we started with CWL as the most
> promising one. Of course, it's only from our point of view:) Our main
> intention was to choose that standard that will be the easiest for not
> programmers (in our case biologists) to create new pipelines by themselves.


Yeah. That, I think, is the biggest conceptual difference. Our main group
of users are Data Scientists who actually use Python as their primary way
to talk to computers. I think this is a huge paradigm difference, and it is
why combining the CWL and Airflow paradigms is next to impossible: we
simply have different foundational assumptions about where the DAGs'
"source of truth" is. I think using Python as the main workflow language is
just great for those people and I don't think it will go away.


> 6) Unfortunately, I cannot answer this question now. I need to get more
> information about oozie-to-airflow approach.
>

Speaking of oozie-2-airflow, as one of its creators, here is the
repo: https://github.com/GoogleCloudPlatform/oozie-to-airflow

And some basic high-level concepts for O2A:

   - The world becomes a better place - one XML less at a time. This was
   our motto. From what we know, Oozie XML workflows are pretty much
   universally hated (well, maybe just disliked) in the data processing
   world. But there are people who have 1000s of Oozie workflows, so we
   wanted to give people a tool to easily convert many XMLs to Airflow DAGs.
   - It's a one-way conversion. The aim is to take Oozie XML workflows as
   the source, convert them to DAGs and continue working on the generated
   DAGs.
   - We accept that we are not perfect. We know we can convert many Oozie
   workflows and they will work, but we have a number of limitations
   (https://github.com/GoogleCloudPlatform/oozie-to-airflow#common-known-limitations).
   We also have a list of issues
   (https://github.com/GoogleCloudPlatform/oozie-to-airflow/issues) that are
   still open and need solving for some of the more sophisticated features
   of Oozie. But this is OK: the DAGs generated by O2A can be further
   manually modified and evolved as Python DAGs to add missing features.
   - The Python code generated by O2A is generated with a focus on
   readability. Those Python files are well written, formatted and
   structured as if a human programmer wrote them, so that it is easy to
   take them forward and modify them manually. For example, we can then add
   all the Airflow-specific features Maxime mentioned (queues, xcom,
   callbacks, etc.). So we do not have to be perfect.
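
To make the one-way codegen idea a bit more concrete, here is a minimal
sketch. It is NOT the actual O2A code: the Step tuple, the
render_dag_source function and the sample workflow are all made up for
illustration, and I assume the source workflow has already been parsed and
that step names are valid Python identifiers. The generated file uses the
Airflow 1.10 import path for BashOperator. The sketch only produces the text
of a DAG file; it does not import or run Airflow.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """One node of a hypothetical, already-parsed source workflow."""
    name: str            # assumed to be a valid Python identifier
    command: str         # shell command the step runs
    upstream: List[str]  # names of steps that must finish first


def render_dag_source(dag_id: str, steps: List[Step]) -> str:
    """Render a plain, human-editable Airflow DAG file as a string."""
    lines = [
        "from datetime import datetime",
        "",
        "from airflow import DAG",
        # Import path as used in Airflow 1.10.
        "from airflow.operators.bash_operator import BashOperator",
        "",
        f"dag = DAG({dag_id!r}, start_date=datetime(2019, 1, 1), "
        "schedule_interval=None)",
        "",
    ]
    # One readable operator assignment per source step.
    for step in steps:
        lines.append(
            f"{step.name} = BashOperator(task_id={step.name!r}, "
            f"bash_command={step.command!r}, dag=dag)"
        )
    lines.append("")
    # Dependencies are emitted last, in the usual `a >> b` style.
    for step in steps:
        for parent in step.upstream:
            lines.append(f"{parent} >> {step.name}")
    return "\n".join(lines) + "\n"


steps = [
    Step("extract", "echo extract", []),
    Step("transform", "echo transform", ["extract"]),
]
source = render_dag_source("converted_workflow", steps)
print(source)
```

The point of emitting ordinary Python text is that, once it is written to
the dags folder, the file becomes the source of truth and can be edited by
hand to add whatever the converter does not cover.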

 J.


> Best regards,
> Michael
>
> On 2019/11/11 09:54:21, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > Hello Andrey,
> >
> > I think both myself and Maxime - we asked some important questions. If
> you
> > want to proceed with the donation, I think it would be great if you let
> us
> > know what do you think about the issues we mentioned. I know also Michael
> > whom I met at the workshops in Berlin - was very interested in this - so
> > maybe you can take part in the discussion.  If you are willing to donate
> > the code and continue the discussion on it , I think we have to start
> well
> > ... discussing :).
> >
> > I just copied our point below to make it easier to answer both of us at
> the
> > same time:
> >
> > Jarek:
> >
> > 1) is the CWL package more of a converter of CWL to Python DAG files
> (that
> > can then be scheduled as usual) or whether it is running alongside of the
> > scheduler and schedules tasks and operators separately using different
> > scheduling engine?. As a reference there is an
> > https://github.com/GoogleCloudPlatform/oozie-to-airflow converter from
> > Oozie XML to airflow DAGs. I think the biggest advantage of Airflow is
> > being able to modify and iterate quickly using python code so having
> > a Python DAG generated from CWL might be a good idea - even if it is not
> > perfect, the user can still modify and extend it manually later rather than
> > relying on all the features of CWL being implemented.
> >
> > 2) I'd also like to understand what dependencies it introduces on Airflow
> -
> > whether it relies on certain internals of Airflow that could make
> Airflow's
> > evolution more difficult? Also we have a roadmap for Airflow 2.0 already
> > and there are certain incompatibilities implemented, more is planned
> > already (and more to come not planned yet). Is the CWL importer 1.10
> > compatible or both 1.10 and (current state of)  2.0? Have you been
> > following some of the discussions with 2.0 and are you aware of some
> > potential incompatibilities?
> >
> > 3) What are the benefits you see to have Airflow CWL package  managed by
> > the Airflow community rather than CWL one? It could work both ways - it
> > could be managed by either of the communities (as usual in case of such
> > imports), but I think it has to be carefully weighted who maintains it
> > eventually - it all depends on how much one could rely on other, what is
> > the release cycle of CWL new versions  vs. Airflow versions etc. Could
> you
> > share your thought process and why you think it should be part of
> Airflow ?
> >
> > Maxime:
> >
> > 4) Personally I like the idea of an ecosystem of packages (and repos)
> > managed
> > and maintained by their specialist. That way they can have their own CI,
> > their own release processes and cycles, and "namespaced" notifications.
> If
> > anything I'd rather push in the direction of breaking Airflow into many
> > smaller packages (core, scheduler, web, ...) as opposed to tacking other
> > projects on top of it.
> >
> > 5) Also arguably Airflow's DSL may be more "common" than CWL. Clearly CWL
> > has
> > more focussed intentions around creating something universal, but to me
> > that doesn't necessarily make it more legitimate or common than other
> specs
> > (Oozie, Azkaban , Informatica, ...) and should be treated similarly
> (would
> > we want to include extensions to all these as part of Airflow?).
> >
> > 6) I also prefer the codegen/migration approach (I think the
> > `oozie-to-airflow` tool does that) to allow a path that resolves the
> common
> > denominator limitations. How can this tooling expose features that are
> > proper to Airflow (pools, priority weights, xcoms, callbacks!, ...)?
> >
> > J.
> >
> > On Thu, Oct 31, 2019 at 1:57 AM Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> > > As someone who has spent a lot of time acting as a maintainer, a code
> > > "donation" seems like dangerous gift to accept.
> > >
> > > Personally I like the idea of an ecosystem of packages (and repos)
> managed
> > > and maintained by their specialist. That way they can have their own
> CI,
> > > their own release processes and cycles, and "namespaced"
> notifications. If
> > > anything I'd rather push in the direction of breaking Airflow into many
> > > smaller packages (core, scheduler, web, ...) as opposed to tacking
> other
> > > projects on top of it.
> > >
> > > Also arguably Airflow's DSL may be more "common" than CWL. Clearly CWL
> has
> > > more focussed intentions around creating something universal, but to me
> > > that doesn't necessarily make it more legitimate or common than other
> specs
> > > (Oozie, Azkaban , Informatica, ...) and should be treated similarly
> (would
> > > we want to include extensions to all these as part of Airflow?).
> > >
> > > I also prefer the codegen/migration approach (I think the
> > > `oozie-to-airflow` tool does that) to allow a path that resolves the
> common
> > > denominator limitations. How can this tooling expose features that are
> > > proper to Airflow (pools, priority weights, xcoms, callbacks!, ...)?
> > >
> > > Max
> > >
> > > On Wed, Oct 30, 2019 at 12:32 PM Andrey Kartashov <por...@porter.st>
> > > wrote:
> > >
> > > > My name is Andrey and I'm developer behind CWL-Airflow.
> > > > This message is a follow-up to a Slack conversation. I copy-pasted some
> messages
> > > > from there here.
> > > >
> > > >
> > > > >> Slack chat:
> > > >
> > > > When I've met CWL team there were no pipeline managers to support it.
> > > I've
> > > > picked up Airflow to just prove the concept that it is possible.
> > > >
> > > > The same time I was looking for a pipeline manager to use  for
> > > > bioinformatic analysis and asked tons of questions from Airflow team
> as a
> > > > result special note in documentation: "Beyond the Horizon".
> > > Nevertheless, I
> > > > adopted Airflow for our bioinformatic use
> > > >
> > > > There are more than 200 different pipeline managers, and to believe
> that
> > > > in nearest future there will the only one and perfect one sounds
> > > > impossible. So, to exchange pipeline logic between different pipeline
> > > > managers and people it is good to have a standard (CWL is a perfect
> > > fit)
> > > > like JavaScript standard and different executers, browsers...
> > > >
> > > > Apache taverna (pipeline manager) is working on adopting CWL for a
> while
> > > > now, we have  code it is already working.
> > > >
> > > > So yes, CWL-Airflow is developed and the use is simple it extends
> Airflow
> > > > DAG class. However it is still required to put .py file with DAG
> (CWLDAG
> > > in
> > > > our case) to the dag directory. I would like just to put .cwl file
> into
> > > DAG
> > > > directory to simplify the usage
> > > >
> > > > I'm ready to develop what is necessary, but I'm not quite sure (I'm
> not a
> > > > big expert in airflow code) which way to go, plugin or some native
> core
> > > > code, or ...
> > > >
> > > > The project by itself lives
> https://github.com/Barski-lab/cwl-airflow,
> > > > there are tons of CWL tests
> > > > https://ci.commonwl.org/job/airflow-conformance/
> > > >
> > >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
