On Sat, Nov 16, 2019 at 2:25 PM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> >
> > 1) The more DAGs I have in a dags folder, the longer it takes to
> > parse them all. Taking into account that in my case I also have to parse
> > CWL files, it takes even more time for such a simple operation. So I was
> > wondering whether there is any common solution to this issue. I was also
> > thinking that I could use your plugins mechanism to integrate some
> > additional functionality, such as parsing CWL files directly, without
> > making any changes in the core of Airflow.
> >
>
> As a follow-up to the "political" decision, I would say the best
> solution is to treat CWL-Airflow as a separate "converter" rather
> than integrate it closely with Airflow. I would imagine that you
> have a separate folder with CWL files and a daemon watching that
> folder, starting the conversion process whenever any of the CWL files
> change and creating Python DAG files in Airflow's dags folder. That seems
> very loosely coupled and relies only on the basic behaviour of Airflow.
> It can also be easily combined with the Git-sync solution for Kubernetes
> or another way of synchronising DAGs.
>
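The watcher daemon described above could be sketched roughly like this. The conversion step here is a stub I made up for illustration; a real implementation would invoke the CWL-Airflow converter (or similar) instead of writing a placeholder module:

```python
import time
from pathlib import Path

def convert_cwl_to_dag(cwl_path: Path, dags_folder: Path) -> Path:
    """Stand-in for the real CWL-to-DAG conversion step.

    A real daemon would call the CWL-Airflow converter here; this stub
    just writes a placeholder Python module into the dags folder.
    """
    dag_file = dags_folder / (cwl_path.stem + "_dag.py")
    dag_file.write_text(f"# DAG generated from {cwl_path.name}\n")
    return dag_file

def scan_once(cwl_folder: Path, dags_folder: Path, seen: dict) -> list:
    """Convert every new or modified *.cwl file; return converted DAG paths."""
    converted = []
    for cwl_path in sorted(cwl_folder.glob("*.cwl")):
        mtime = cwl_path.stat().st_mtime
        if seen.get(cwl_path) != mtime:  # new or changed since last scan
            converted.append(convert_cwl_to_dag(cwl_path, dags_folder))
            seen[cwl_path] = mtime
    return converted

def watch(cwl_folder: Path, dags_folder: Path, interval: float = 5.0) -> None:
    """Daemon loop: poll the CWL folder and keep the dags folder in sync."""
    seen: dict = {}
    while True:
        scan_once(cwl_folder, dags_folder, seen)
        time.sleep(interval)
```

Because conversion only happens on mtime changes, the generated DAG files stay stable between edits, which keeps Airflow's DAG parsing cheap and plays nicely with git-sync or any other distribution mechanism downstream.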
>
Decoupling makes sense. At Sunbird <https://www.sunbird.org/>, we have
developed a machine learning workbench, daggit
<https://github.com/project-sunbird/sunbird-ml-workbench>, which relies on
Airflow underneath. Our approach is very similar: we drop all DAG specs in a
folder and load those graphs in memory via the converter. The spec is similar
to CWL, but we would like to develop a DSL for composing DAGs (not just a spec
for reading/writing DAGs, but for modifying them and referencing subgraphs,
edges, and nodes). A concept note on the need for a DSL is here
<https://docs.google.com/document/d/1-l0EZveZJAxcRNdTOvNzVArq6V37Aj24H0CGd76Nh80/edit?usp=sharing>.
We closely watch Airflow and would like to leverage all its benefits
(such as the k8s integration, for example). Airflow is but *one* of the
frameworks required to fulfil the promises of a machine learning workbench.
In that sense, clean, well-separated frameworks coming together makes a
whole lot of sense, rather than one project doing it all -- my observations!
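A minimal sketch of the "drop specs in a folder, load those graphs in memory" idea. The spec shape below is invented purely for illustration and is not daggit's actual schema:

```python
# Toy, hypothetical DAG spec format (NOT daggit's real schema): each
# node lists the nodes it depends on. The loader builds an in-memory
# graph and returns a dependency-respecting execution order.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def load_graph(spec: dict) -> list:
    """Build a graph from a declarative spec and return a valid node order."""
    ts = TopologicalSorter()
    for node, meta in spec["nodes"].items():
        ts.add(node, *meta.get("depends_on", []))
    return list(ts.static_order())

spec = {
    "dag_id": "ml_pipeline",
    "nodes": {
        "ingest":   {},
        "features": {"depends_on": ["ingest"]},
        "train":    {"depends_on": ["features"]},
        "evaluate": {"depends_on": ["train", "features"]},
    },
}
```

A converter built on this pattern would then map each node to an Airflow operator; the point of the sketch is only that the spec-to-graph step needs nothing from Airflow itself, which is what keeps the two loosely coupled.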


>
> > 2) I'm working on running CWL pipelines in Kubernetes through Airflow, and
> > one of the problems I have to deal with is sharing directories between
> > the pods. It looks like Kubernetes doesn't provide a direct solution to
> > this problem and mostly relies on the platform where it is installed. I
> > would appreciate it if you could direct me to the proper
> > discussions/threads where people solve similar problems.
> >
>
> There are currently two ways of sharing DAGs - persistent volume claims and
> git sync. Generally the approach is that you need third-party distributed
> storage to share the DAGs; the synchronisation mechanism is not (yet)
> built into Airflow. There is AIP-5 Remote DAG Fetcher (
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher
> ), where it has been discussed at length, and there is the accompanying
> discussion thread (shorter than the discussion in the doc):
> https://lists.apache.org/thread.html/224d1e7d1b11e0b8314075f21b1b81708749f2899f4cce5af295e8a8@%3Cdev.airflow.apache.org%3E
> but I don't think anyone in the community is working actively on AIP-5
> currently. I think the consensus in the community is that Airflow solves
> scheduling but does not solve distribution - so it delegates
> distributing files to dedicated solutions (and you can choose whichever
> solution you already have for the task). This is really targeted at
> "corporate" deployments, where corporates usually already have some
> distributed storage in place. Rather than force a single "distribution"
> solution on them, the assumption is that Airflow will use whatever solution
> is deployed at that company. As a next step, we also have plans to get rid
> of file distribution completely in Airflow 2.0. Provided that we implement
> full DAG serialisation, this problem will be gone: all the DAG data will be
> stored in the database, and hopefully no more volume sharing will be needed.
>
> Here you can find a simple description of using PVCs with Airflow on
> Kubernetes:
> https://medium.com/@ramandumcs/how-to-run-apache-airflow-on-kubernetes-1cb809a8c7ea
> Git sync is also nice, but it requires a shared Git repository where the
> DAGs are kept.
>
> There are other solutions. The Composer team, for example, uses 'gcsfuse' -
> user-space synchronisation from a GCS bucket to a local pod volume (they
> have two containers in a pod - gcsfuse as a sidecar to the Airflow worker,
> scheduler, and UI, sharing a single volume). Then it is just a matter of
> putting the generated DAGs into a GCS bucket (your daemon could do exactly
> that). You can use similar solutions for other dedicated "artifact"
> sharing. For example, we've implemented a similar sidecar for Nexus, where
> production DAG files were shared as Nexus artifacts.
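As a rough illustration of the sidecar pattern described above, a pod might look like the sketch below. All image names, the bucket name, and the mount paths are hypothetical placeholders, not the Composer team's actual configuration:

```yaml
# Sketch: a sync sidecar sharing a DAG volume with an Airflow worker.
# Images, bucket, and paths are made-up placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker
spec:
  volumes:
    - name: dags
      emptyDir: {}            # shared scratch volume for the pod
  containers:
    - name: worker
      image: apache/airflow:1.10.6        # hypothetical image tag
      volumeMounts:
        - name: dags
          mountPath: /usr/local/airflow/dags
    - name: dag-sync                      # sidecar keeping DAGs in sync
      image: example.com/gcsfuse-sync:latest   # placeholder image
      env:
        - name: GCS_BUCKET
          value: my-dags-bucket           # placeholder bucket name
      volumeMounts:
        - name: dags
          mountPath: /dags
```

The key design point is that the worker never talks to the bucket directly; the sidecar owns synchronisation, so swapping GCS for Nexus (or any other artifact store) only changes the sidecar container.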
>
>
> > Thanks a lot,
> > Michael
> >
> >
> >
> >
> >
> >
> >
> > On 2019/11/15 10:17:30, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > I am also -1. But I am happy to help with surfacing the CWL integration
> > > - both in the new package (together with Oozie-to-Airflow and maybe
> > > other converters) and by having it easily installable as an external
> > > package. I will talk to Andrey separately about this so that we do not
> > > clutter the dev list.
> > >
> > > J.
> > >
> > > On Fri, Nov 15, 2019 at 7:37 AM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > After all the exploration of this topic here in this thread, I'm a
> > pretty
> > > > hard -1 on this one.
> > > >
> > > > I think CWL and CWL-Airflow are great projects, but they can't rely
> on
> > the
> > > > Airflow community to evolve/maintain/package this integration.
> > > >
> > > > Personally I think that, generally and *within reason* (winking at
> > > > the npm communities ;), smaller, targeted, and loosely coupled
> > > > packages [and their corresponding smaller repositories with their own
> > > > sets of maintainers] are better than bigger monoliths. Some reasons:
> > > > * separation of concerns
> > > > * faster, more targeted builds and test suites
> > > > * independent release cycles
> > > > * clearer ownership
> > > > * independent and adapted level of rigor / styling / standards
> > > > * more targeted notifications for people watching repos
> > > > * ...
> > > >
> > > > Max
> > > >
> > > > On Thu, Nov 14, 2019 at 12:33 PM Andrey Kartashov <por...@porter.st>
> > > > wrote:
> > > >
> > > > >
> > > > >
> > > > > > I looked at
> > > > > > https://cwl-airflow.readthedocs.io/en/1.0.18/readme/how_it_works.html#what-s-inside
> > > > > > to understand what CWL is, and that's where I took the descriptor
> > > > > > + job from (in Key Concepts).
> > > > > >
> > > > >
> > > > > Oh, this is an old one, but even the new one probably does not
> > > > > reflect the real picture.
> > > > >
> > > > >
> > > > > > OK. So I finally understand the problem you want to solve - "to
> > > > > > make Airflow more accessible to people who already use CWL or who
> > > > > > will find it easier to write DAGs in CWL". I still think this does
> > > > > > not necessarily have to be solved by donating CWL code to Airflow
> > > > > > (see below).
> > > > > >
> > > > >
> > > > > I think there are many ways.
> > > > >
> > > > >
> > > > > > OK. So what you are basically saying is that you think the
> > > > > > Airflow community has more capacity than the CWL community to
> > > > > > maintain a CWL converter.
> > > > >
> > > > > My understanding is that the CWL community is just developing a
> > > > > common standard (CWL), not converters :). As the CWL-Airflow
> > > > > developer, I can say the Airflow community definitely has far more
> > > > > capacity than me alone :)
> > > > >
> > > > > > I am not so sure about it (precisely because of the lost
> > > > > > opportunities). But maybe a better solution is to ask in the
> > > > > > Airflow community whether there are people who could join the
> > > > > > CWL-Airflow converter project and grow the community there.
> > > > >
> > > > > That's probably a good start - just to check and see the interest.
> > > > >
> > > > > > I would not say for the whole community, but I would not feel
> > > > > > comfortable, as a community, taking responsibility for the
> > > > > > converter without prior knowledge and a detailed understanding of
> > > > > > CWL - especially since it serves a rather small group of users (at
> > > > > > least initially). But I find CWL very interesting as an idea, and
> > > > > > maybe there are some people in the community who would love to
> > > > > > contribute to your project? Suggestion - maybe just ask, here and
> > > > > > in Slack, whether there is enough interest in contributing to
> > > > > > CWL-Airflow, rather than donating the code to Airflow? Just
> > > > > > promote your project in the community and ask for help.
> > > > >
> > > > > I tried but have not gotten any feedback :) - but then, I'm not a
> > > > > promoter or a salesman.
> > > > >
> > > > >
> > > > > >
> > > > > > I can see this as the best of both worlds - if you find a few
> > > > > > people who would like to help and get familiar with it, and they
> > > > > > are also part of the Airflow community, then we gain collective
> > > > > > knowledge about it - and eventually that might lead to
> > > > > > incorporating it into Airflow itself, once our community gets more
> > > > > > familiar with CWL. I think this is the best way to achieve the
> > > > > > final goal of incorporating CWL as part of Airflow.
> > > > > >
> > > > >
> > > > > Works for me
> > > > >
> > > > >
> > > > > > In the meantime, I am happy to help make Airflow more "CWL
> > > > > > friendly" for users - both from a documentation and a Helm chart
> > > > > > POV.
> > > > > >
> > > > >
> > > > > Thank you, I appreciate that. How do we proceed?
> > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> >
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
> [image: Polidea] <https://www.polidea.com/>
>
