On Sat, Nov 16, 2019 at 2:25 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > 1) The more DAGs I have in a dags folder, the longer it takes to parse
> > them all. Taking into account that in my case I also have to parse CWL
> > files, it takes even more time for such a simple operation. So I was
> > wondering whether there is any common solution to this issue. I was also
> > thinking that I could use your plugins mechanism to integrate some
> > additional functionality, such as parsing CWL files directly, without
> > making any changes in the core of Airflow.
>
> As a follow-up from the "political" decision, I would say the best
> solution is to treat CWL-Airflow as a separate "converter" rather than
> integrate it closely with Airflow. I would imagine that you have a
> separate folder with CWL files and a daemon watching that folder, starting
> the conversion process whenever any of the CWL files changes and creating
> Python DAG files in Airflow's dag folder. That is very loosely coupled and
> relies only on the basic behaviour of Airflow. It can then also be easily
> combined with the Git-sync solution for Kubernetes, or another way of
> synchronising DAGs.

Decoupling makes sense. At Sunbird <https://www.sunbird.org/>, we have
developed a machine learning workbench, daggit
<https://github.com/project-sunbird/sunbird-ml-workbench>, which relies on
Airflow underneath. Our approach is very similar: we drop all DAG specs in a
folder and load those graphs in memory via the converter. The spec is
similar to CWL, but we would like to develop a DSL for composing DAGs (not
just a spec for reading and writing DAGs, but for modifying them and
referencing subgraphs, edges, and nodes). A concept note on the need for a
DSL is here
<https://docs.google.com/document/d/1-l0EZveZJAxcRNdTOvNzVArq6V37Aj24H0CGd76Nh80/edit?usp=sharing>.
We closely watch Airflow and would like to leverage all its benefits (such
as the Kubernetes integration, for example). Airflow is but *one* of the
frameworks required to fulfil the promises of a machine learning workbench.
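The converter-daemon approach Jarek describes above can be sketched roughly
as follows. This is a minimal polling sketch, not CWL-Airflow's actual
implementation: the `convert_cwl_to_dag` step is a hypothetical placeholder
where the real CWL-to-DAG conversion would be invoked.

```python
import time
from pathlib import Path


def convert_cwl_to_dag(cwl_path: Path, dag_dir: Path) -> Path:
    """Hypothetical conversion step: in practice this would call the
    CWL-Airflow converter. Here it just writes a stub DAG file that
    records which CWL file it was generated from."""
    dag_path = dag_dir / (cwl_path.stem + "_dag.py")
    dag_path.write_text(
        f"# Auto-generated from {cwl_path.name} - do not edit by hand\n"
    )
    return dag_path


def scan_and_convert(cwl_dir: Path, dag_dir: Path, seen: dict) -> list:
    """Convert every new or modified .cwl file; return the DAG paths written."""
    converted = []
    for cwl_path in sorted(cwl_dir.glob("*.cwl")):
        mtime = cwl_path.stat().st_mtime
        if seen.get(cwl_path) != mtime:  # new or changed since the last scan
            converted.append(convert_cwl_to_dag(cwl_path, dag_dir))
            seen[cwl_path] = mtime
    return converted


def watch(cwl_dir: Path, dag_dir: Path, interval: float = 5.0) -> None:
    """Poll the CWL folder forever, regenerating DAG files on change."""
    seen: dict = {}
    while True:
        scan_and_convert(cwl_dir, dag_dir, seen)
        time.sleep(interval)
```

Because the daemon only writes ordinary `.py` files into the dag folder,
Airflow itself never needs to know CWL exists, which is exactly the loose
coupling being argued for.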
In that sense, clean, well-separated frameworks coming together makes a
whole lot of sense, rather than one guy doing it all -- my observations!

> > 2) I'm working on running CWL pipelines in Kubernetes through Airflow,
> > and one of the problems I have to deal with is sharing directories
> > between the pods. It looks like Kubernetes doesn't provide a direct
> > solution to this problem and mostly relies on the platform where it is
> > installed. I would appreciate it if you could direct me to the proper
> > discussions/threads where people solve similar problems.
>
> There are currently two ways of sharing DAGs: persistent volume claims
> and Git sync. Generally the approach is that you need third-party
> distributed storage to share the DAGs; the synchronisation mechanism is
> not (yet) built into Airflow. There is AIP-5 Remote DAG Fetcher
> (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher),
> where this has been discussed at length, and there is the accompanying
> discussion thread (shorter than the discussion in the doc):
> https://lists.apache.org/thread.html/224d1e7d1b11e0b8314075f21b1b81708749f2899f4cce5af295e8a8@%3Cdev.airflow.apache.org%3E
> - but I don't think anyone in the community is actively working on AIP-5
> at the moment. I think the consensus in the community is that Airflow
> solves scheduling but does not solve distribution - it delegates
> distributing files to dedicated solutions (and you can choose whichever
> solution you already have for the task). This is really targeted at
> "corporate" deployments, where companies usually already have some
> distributed storage in place. Rather than force a single "distribution"
> solution on them, the assumption is that Airflow will use whatever
> solution is already deployed at that company. As a next step, we also
> plan to get rid of this completely in Airflow 2.0: provided we implement
> full DAG serialisation, this problem will be gone.
> All the DAG data will be stored in the database, and hopefully no more
> volume sharing will be needed.
>
> Here you can find a simple description of using PVCs with Airflow on
> Kubernetes:
> https://medium.com/@ramandumcs/how-to-run-apache-airflow-on-kubernetes-1cb809a8c7ea
> Git sync is also nice, but it requires a shared Git repo where the DAGs
> are kept.
>
> There are other solutions. The Composer team, for example, uses
> 'gcsfuse', a user-space synchronisation from a GCS bucket to a local pod
> volume (they have two containers in a pod: gcsfuse as a side-car to the
> Airflow worker, scheduler, and UI, sharing a single volume). Then it is a
> matter of putting the generated DAGs into a GCS bucket (your daemon could
> do just that). You can use similar solutions for other dedicated
> "artifact" sharing; for example, we implemented a similar side-car pod
> for Nexus, where production DAG files were shared as Nexus artifacts.
>
> > Thanks a lot,
> > Michael
> >
> > On 2019/11/15 10:17:30, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > I am also -1. But I am happy to help with surfacing the CWL
> > > integration - both in the new package (together with Oozie-2-airflow
> > > and maybe other converters) and by having it easily installable as an
> > > external package. I will talk to Andrey separately about this so that
> > > we do not clutter the devlist.
> > >
> > > J.
> > >
> > > On Fri, Nov 15, 2019 at 7:37 AM Maxime Beauchemin <
> > > maximebeauche...@gmail.com> wrote:
> > >
> > > > After all the exploration of this topic here in this thread, I'm a
> > > > pretty hard -1 on this one.
> > > >
> > > > I think CWL and CWL-Airflow are great projects, but they can't rely
> > > > on the Airflow community to evolve/maintain/package this
> > > > integration.
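Returning to Jarek's gcsfuse suggestion above - having the converter
daemon push generated DAG files into the GCS bucket that gcsfuse mounts
into the pods - a rough sketch might look like this. The bucket name and
`dags/` prefix are assumptions; the `google.cloud.storage` calls
(`Client`, `bucket`, `blob`, `upload_from_filename`) are the standard
google-cloud-storage client API, which has to be installed and
authenticated separately.

```python
from pathlib import Path


def blob_name_for(dag_path: Path, dag_root: Path, prefix: str = "dags") -> str:
    """Map a local DAG file to its object name in the bucket, preserving
    any sub-folder structure under the local dag root."""
    return f"{prefix}/{dag_path.relative_to(dag_root).as_posix()}"


def upload_dags(dag_root: Path, bucket_name: str) -> None:
    """Push every generated DAG file to the GCS bucket that gcsfuse
    mounts into the Airflow pods. Requires the google-cloud-storage
    package and application-default credentials."""
    from google.cloud import storage  # third-party: pip install google-cloud-storage

    bucket = storage.Client().bucket(bucket_name)
    for dag_path in sorted(dag_root.rglob("*.py")):
        blob = bucket.blob(blob_name_for(dag_path, dag_root))
        blob.upload_from_filename(str(dag_path))
```

With gcsfuse mounting the same bucket into each pod, an upload like this is
all the "distribution" the daemon has to do; the sync back down to the
workers is handled entirely by the side-car.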
> > > > Personally, I think that generally and *within reason* (winking at
> > > > the npm communities ;)) smaller, targeted, and loosely coupled
> > > > packages [and their corresponding smaller repositories with their
> > > > own sets of maintainers] are better than bigger monoliths. Some
> > > > reasons:
> > > > * separation of concerns
> > > > * faster, more targeted builds and test suites
> > > > * independent release cycles
> > > > * clearer ownership
> > > > * independent and adapted levels of rigor / styling / standards
> > > > * more targeted notifications for people watching repos
> > > > * ...
> > > >
> > > > Max
> > > >
> > > > On Thu, Nov 14, 2019 at 12:33 PM Andrey Kartashov <por...@porter.st>
> > > > wrote:
> > > >
> > > > > > I looked at
> > > > > > https://cwl-airflow.readthedocs.io/en/1.0.18/readme/how_it_works.html#what-s-inside
> > > > > > to understand what CWL is, and that's where I took the
> > > > > > descriptor + job (in Key Concepts).
> > > > >
> > > > > Oh, this is an old one, but even the new one probably does not
> > > > > reflect the real picture.
> > > > >
> > > > > > OK. So, as I finally understand, the problem you want to solve
> > > > > > is "to make Airflow more accessible to people who already use
> > > > > > CWL or who will find it easier to write DAGs in CWL". I still
> > > > > > think this does not necessarily have to be solved by donating
> > > > > > CWL code to Airflow (see below).
> > > > >
> > > > > I think there are many ways.
> > > > >
> > > > > > OK. So what you basically say is that you think the Airflow
> > > > > > community has more capacity than the CWL community to maintain
> > > > > > a CWL converter.
> > > > >
> > > > > My understanding is that the CWL community just develops a common
> > > > > standard (CWL), not converters :).
> > > > > For me, the CWL-Airflow developer, the Airflow community
> > > > > definitely has far more capacity than me alone :)
> > > > >
> > > > > > I am not so sure about it (precisely because of the lost
> > > > > > opportunities). But maybe a better solution is to ask in the
> > > > > > Airflow community whether there are people who could join the
> > > > > > CWL-Airflow converter and grow the community there.
> > > > >
> > > > > That is probably a good start, just to check and see the interest.
> > > > >
> > > > > > I would not say for the whole community, but I would not feel
> > > > > > comfortable, as a community, taking responsibility for the
> > > > > > converter without prior knowledge and a detailed understanding
> > > > > > of CWL. Especially as it is for a rather small group of users
> > > > > > (at least initially). But I find CWL a very interesting idea,
> > > > > > and maybe there are some people in the community who would love
> > > > > > to contribute to your project? Suggestion: maybe just ask -
> > > > > > here and in Slack - whether there is enough interest in
> > > > > > contributing to CWL-Airflow, rather than donating the code to
> > > > > > Airflow? Just promote your project in the community and ask for
> > > > > > help.
> > > > >
> > > > > I tried but have not got any feedback :) but I'm not a promoter
> > > > > or a seller.
> > > > >
> > > > > > I can see this as the best of both worlds: if you find a few
> > > > > > people who would like to help and get familiar with it, and
> > > > > > they are also part of the Airflow community, and we gain
> > > > > > collective knowledge about it - then eventually it might lead
> > > > > > to incorporating it into Airflow itself, once our community
> > > > > > gets more familiar with CWL.
> > > > > > I think this is the best way to achieve the final goal of
> > > > > > finally incorporating CWL as part of Airflow.
> > > > >
> > > > > Works for me.
> > > > >
> > > > > > In the meantime, I am happy to help make Airflow more "CWL
> > > > > > friendly" for users - both from a documentation and a Helm
> > > > > > chart POV.
> > > > >
> > > > > Thank you, I appreciate that. How do we proceed?
> > >
> > > --
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>