Hi everyone,

Sorry for the late response:) I'm Michael. I wrote the post that Michael R. 
Crusoe mentioned at the very beginning of this conversation. Both Andrey and I 
work on CWL-Airflow. I will try to answer some of your questions.

1) The CWL-Airflow package is more of a converter. We created a special
class, CWLDAG (inherited from DAG), which lets us set the path to the CWL
file (a separate YAML file). Once parsed by Airflow, the newly created DAG
has a structure similar to the workflow described in the CWL file. The
actual DAG file that you keep in the dags folder is still a Python file, but
it would be nice if we could parse CWL files directly from the dags folder
without any intermediate Python files.
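
For illustration, here is a rough sketch of what such an intermediate Python
DAG file could look like (the module path and CWLDAG arguments below are my
assumptions for the example, not the exact CWL-Airflow API):

    # dags/bam_to_bigwig.py -- hypothetical intermediate DAG file
    # NOTE: the import path and the constructor arguments are illustrative
    # assumptions, not the exact CWL-Airflow API.
    from cwl_airflow.cwldag import CWLDAG  # assumed module path

    # CWLDAG inherits from airflow.models.DAG and builds its task graph from
    # the steps declared in the referenced CWL workflow file.
    dag = CWLDAG(
        dag_id="bam_to_bigwig",
        cwl_workflow="/data/cwl/bam-to-bigwig.cwl",  # path to the CWL (YAML) file
    )

With a file like this in the dags folder, Airflow picks up the DAG as usual;
the goal above is to be able to skip even this small wrapper.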

2) We are trying to keep our package as simple as possible, so it shouldn't
cause any dependency issues in the future. We depend more on changes in the
CWL standard and its reference implementation (cwltool) than on changes in
Airflow. We actively use XCom to transfer some metadata between tasks, but,
as far as I understand, XCom is one of the "core ideas" of Airflow, so it
shouldn't change dramatically in future versions.
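
As a simplified sketch of the XCom pattern I mean (not our actual operator
code; the task names and values are made up, but the xcom_push/xcom_pull
calls are standard Airflow):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def align_reads(**context):
        # push metadata (e.g. the location of a produced file) for downstream tasks
        context["ti"].xcom_push(key="bam_path", value="/tmp/sample.bam")

    def make_bigwig(**context):
        # pull the metadata pushed by the upstream task
        bam_path = context["ti"].xcom_pull(task_ids="align_reads", key="bam_path")
        print("converting %s to bigWig" % bam_path)

    with DAG("xcom_example", start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        align = PythonOperator(task_id="align_reads",
                               python_callable=align_reads,
                               provide_context=True)
        bigwig = PythonOperator(task_id="make_bigwig",
                                python_callable=make_bigwig,
                                provide_context=True)
        align >> bigwig

Since XCom values go through Airflow's metadata database rather than any
operator internals, I expect this pattern to keep working across versions.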

3) I would like to see CWL-Airflow actively used by other people. Currently,
the CWL community is centered more around scientific pipelines, whereas
Airflow is closer to commercial or business-related workflows. I believe that
CWL as a standard can be used in both of these areas. Additionally, the CWL
standard is better suited to a data-driven workflow management model, which,
as far as I understand, Airflow lacks at the moment. As for the release
cycles of CWL and Airflow, I'm not ready to answer that now.

4) Agreed, it makes sense, but then we will need to discuss how to become a
part of that ecosystem of packages:)

5) I think every spec has its pros and cons, and it would take ages to
include all of them. Moreover, it would become a nightmare for anyone trying
to use such a system. That's why we started with CWL as the most promising
one. Of course, that's only our point of view:) Our main intention was to
choose the standard that would be easiest for non-programmers (in our case,
biologists) to create new pipelines by themselves.

6) Unfortunately, I cannot answer this question now. I need to get more
information about the oozie-to-airflow approach.

Best regards,
Michael

On 2019/11/11 09:54:21, Jarek Potiuk <jarek.pot...@polidea.com> wrote: 
> Hello Andrey,
> 
> I think both Maxime and I asked some important questions. If you want to
> proceed with the donation, it would be great if you let us know what you
> think about the issues we mentioned. I know that Michael - whom I met at
> the workshops in Berlin - was also very interested in this, so maybe you
> can take part in the discussion. If you are willing to donate the code and
> continue the discussion on it, I think we have to start, well...
> discussing :).
> 
> I just copied our point below to make it easier to answer both of us at the
> same time:
> 
> Jarek:
> 
> 1) Is the CWL package more of a converter of CWL to Python DAG files (that
> can then be scheduled as usual), or does it run alongside the scheduler and
> schedule tasks and operators separately using a different scheduling
> engine? As a reference, there is an
> https://github.com/GoogleCloudPlatform/oozie-to-airflow converter from
> Oozie XML to Airflow DAGs. I think the biggest advantage of Airflow is
> being able to modify and iterate quickly using Python code, so having
> a Python DAG generated from CWL might be a good idea - even if it is not
> perfect, a user can still modify and extend it manually later rather than
> relying on all the features of CWL being implemented.
> 
> 2) I'd also like to understand what dependencies it introduces on Airflow -
> whether it relies on certain internals of Airflow that could make Airflow's
> evolution more difficult. Also, we already have a roadmap for Airflow 2.0;
> certain incompatibilities are already implemented, more are planned, and
> more will come that are not planned yet. Is the CWL importer compatible
> with 1.10 only, or with both 1.10 and the (current state of) 2.0? Have you
> been following some of the discussions around 2.0, and are you aware of
> some potential incompatibilities?
> 
> 3) What benefits do you see in having the Airflow CWL package managed by
> the Airflow community rather than the CWL one? It could work both ways - it
> could be managed by either of the communities (as usual in the case of such
> imports), but I think it has to be carefully weighed who maintains it
> eventually - it all depends on how much one could rely on the other, what
> the release cycle of new CWL versions vs. Airflow versions is, etc. Could
> you share your thought process and why you think it should be part of
> Airflow?
> 
> Maxime:
> 
> 4) Personally I like the idea of an ecosystem of packages (and repos)
> managed
> and maintained by their specialist. That way they can have their own CI,
> their own release processes and cycles, and "namespaced" notifications. If
> anything I'd rather push in the direction of breaking Airflow into many
> smaller packages (core, scheduler, web, ...) as opposed to tacking other
> projects on top of it.
> 
> 5) Also arguably Airflow's DSL may be more "common" than CWL. Clearly CWL
> has
> more focussed intentions around creating something universal, but to me
> that doesn't necessarily make it more legitimate or common than other specs
> (Oozie, Azkaban , Informatica, ...) and should be treated similarly (would
> we want to include extensions to all these as part of Airflow?).
> 
> 6) I also prefer the codegen/migration approach (I think the
> `oozie-to-airflow` tool does that) to allow a path that resolves the common
> denominator limitations. How can this tooling expose features that are
> proper to Airflow (pools, priority weights, xcoms, callbacks!, ...)?
> 
> J.
> 
> On Thu, Oct 31, 2019 at 1:57 AM Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> 
> > As someone who has spent a lot of time acting as a maintainer, a code
> > "donation" seems like a dangerous gift to accept.
> >
> > Personally I like the idea of an ecosystem of packages (and repos) managed
> > and maintained by their specialist. That way they can have their own CI,
> > their own release processes and cycles, and "namespaced" notifications. If
> > anything I'd rather push in the direction of breaking Airflow into many
> > smaller packages (core, scheduler, web, ...) as opposed to tacking other
> > projects on top of it.
> >
> > Also arguably Airflow's DSL may be more "common" than CWL. Clearly CWL has
> > more focussed intentions around creating something universal, but to me
> > that doesn't necessarily make it more legitimate or common than other specs
> > (Oozie, Azkaban , Informatica, ...) and should be treated similarly (would
> > we want to include extensions to all these as part of Airflow?).
> >
> > I also prefer the codegen/migration approach (I think the
> > `oozie-to-airflow` tool does that) to allow a path that resolves the common
> > denominator limitations. How can this tooling expose features that are
> > proper to Airflow (pools, priority weights, xcoms, callbacks!, ...)?
> >
> > Max
> >
> > On Wed, Oct 30, 2019 at 12:32 PM Andrey Kartashov <por...@porter.st>
> > wrote:
> >
> > > My name is Andrey and I'm the developer behind CWL-Airflow.
> > > This message is a follow-up to a Slack conversation. I copy-pasted some
> > > messages from there here.
> > >
> > >
> > > >> Slack chat:
> > >
> > > When I met the CWL team, there were no pipeline managers that supported
> > > it. I picked up Airflow just to prove the concept that it is possible.
> > >
> > > At the same time I was looking for a pipeline manager to use for
> > > bioinformatic analysis, and I asked tons of questions of the Airflow
> > > team; as a result there is a special note in the documentation: "Beyond
> > > the Horizon". Nevertheless, I adopted Airflow for our bioinformatic use.
> > >
> > > There are more than 200 different pipeline managers, and to believe that
> > > in the near future there will be only one, perfect one sounds
> > > impossible. So, to exchange pipeline logic between different pipeline
> > > managers and people, it is good to have a standard (CWL is a perfect
> > > fit) - like the JavaScript standard with its different executors and
> > > browsers...
> > >
> > > Apache Taverna (a pipeline manager) has been working on adopting CWL for
> > > a while now; we have code and it is already working.
> > >
> > > So yes, CWL-Airflow is developed and its use is simple: it extends the
> > > Airflow DAG class. However, it is still required to put a .py file with
> > > the DAG (CWLDAG in our case) into the dags directory. I would like to
> > > just put a .cwl file into the dags directory to simplify the usage.
> > >
> > > I'm ready to develop whatever is necessary, but I'm not quite sure (I'm
> > > not a big expert in the Airflow code) which way to go: a plugin, some
> > > native core code, or ...
> > >
> > > The project itself lives at https://github.com/Barski-lab/cwl-airflow,
> > > and there are tons of CWL tests at
> > > https://ci.commonwl.org/job/airflow-conformance/
> > >
> >
> 
> 
> -- 
> 
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
> 
> M: +48 660 796 129
> 
