Indeed - I think it is a recurring pattern I see when learning from
different users. We have seen a number of people writing specific solutions
to their problems with Airflow underneath. Airflow's big strengths are:

a) distributed execution model built-in
b) solid and well tested integrations with multiple external services
c) total flexibility in how graphs are defined - having Python as the
common "language" of graphs makes Airflow a fantastic tool for anyone
trying to generate Python DAGs, not only to write them by hand.

We saw it with oozie-2-airflow - it was rather easy to convert any idea
that came from a slightly different world into a proper and nicely written
Python DAG. We had pretty much no limitations/constraints, and in the end
we came up with great DAGs - maintainable, easy to understand and monitor -
that were very well suited for the Airflow "execution engine".

Somewhat surprisingly and counter-intuitively, Python seems to be a very
good "interface" for expressing DAGs in an extremely flexible way - one
that is not easy to match with any declarative way of defining DAGs. It can
act as a true common denominator for all the various ways of expressing
DAGs - one that is maybe not well "scientifically" defined, but when it
comes to the execution model, I think flexibility beats "specification".

Learning from different users who do similar things - another example is
databand.ai. They convert ML-specific Python DAGs at a higher level of
abstraction (with their own DSL) into Airflow DAGs and use Airflow as an
execution engine, which is pretty cool.

I think at Airflow we should focus on doing one thing well - making Airflow
accessible to others who want to write their own specific DSLs - and
concentrate on doing the "execution piece" well. With Airflow 2.0 we are
going in the right direction: simplifying the dependencies and setup with
DAG serialisation, focusing on performance and simplicity, and keeping the
great integrations we have (while making it easier for people to migrate
from 1.0).

J.


On Sun, Nov 17, 2019 at 3:22 AM Soma S Dhavala <soma.dhav...@gmail.com>
wrote:

> On Sat, Nov 16, 2019 at 2:25 PM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > >
> > > 1) The more dags I have in a dags folder, the longer it takes to
> > > parse them all. Taking into account that in my case I also have to
> > > parse CWL files, it takes even more time for such a simple operation.
> > > So I was wondering whether there is any common solution to this
> > > issue. I was also thinking about whether I could use your Plugins
> > > mechanism to integrate some additional functionality, such as parsing
> > > CWL files directly, without making any changes in the core of Airflow.
> > >
> >
> > As a follow-up to the "political" decision, I would say the best
> > solution will be to treat CWL-Airflow as a separate "converter" rather
> > than to integrate it closely with Airflow. I would imagine that you
> > have a separate folder with CWL files and a daemon watching that
> > folder, starting the conversion process whenever any of the CWL files
> > changes and creating Python DAG files in Airflow's dag folder. That is
> > very loosely coupled and relies on the basic behaviour of Airflow.
> > Also, it can then be easily combined with the Git-sync solution for
> > Kubernetes or another way of synchronising DAGs.
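(Editorially, a rough sketch of such a watcher daemon - polling mtimes rather than using inotify, to stay dependency-free. `convert_cwl_to_dag` is a hypothetical placeholder for the actual CWL-to-Python conversion step; it is passed in, not part of Airflow or CWL-Airflow:)

```python
# Sketch of a converter daemon: poll a folder of CWL files and regenerate
# a Python DAG file whenever a CWL file appears or changes.
# convert_cwl_to_dag is a hypothetical callback doing the real conversion.
import os
import time


def scan(folder):
    """Return {path: mtime} for all .cwl files directly in folder."""
    state = {}
    for name in os.listdir(folder):
        if name.endswith(".cwl"):
            path = os.path.join(folder, name)
            state[path] = os.stat(path).st_mtime
    return state


def changed_files(old_state, new_state):
    """Paths that are new, or whose mtime moved since the last scan."""
    return [p for p, m in new_state.items() if old_state.get(p) != m]


def watch(cwl_folder, dag_folder, convert_cwl_to_dag, poll_seconds=5.0):
    state = {}
    while True:
        new_state = scan(cwl_folder)
        for path in changed_files(state, new_state):
            convert_cwl_to_dag(path, dag_folder)  # writes a .py DAG file
        state = new_state
        time.sleep(poll_seconds)
```

Because the daemon only ever writes ordinary .py files into the dags folder, Airflow's normal parsing, Git-sync or volume-based distribution all keep working unchanged.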
> >
> Decoupling makes sense. At Sunbird <https://www.sunbird.org/>, we have
> developed a machine learning workbench, daggit
> <https://github.com/project-sunbird/sunbird-ml-workbench>, which relies
> on Airflow underneath. The approach is very similar: we drop all
> dag-specs in a folder and load those graphs in-mem via the converter. The
> spec is similar to CWL, but we would like to develop a DSL for composing
> DAGs (not just a spec for reading/writing DAGs, but for modifying and
> referencing subgraphs, edges and nodes). A concept note on the need for
> such a DSL is here:
> https://docs.google.com/document/d/1-l0EZveZJAxcRNdTOvNzVArq6V37Aj24H0CGd76Nh80/edit?usp=sharing
> We closely watch Airflow and would like to leverage all its benefits
> (such as the k8s integration, for example). Airflow is but *one* of the
> frameworks required to fulfil the promises of a machine learning
> workbench. In that sense, clean, well-separated frameworks coming
> together makes a whole lot of sense, rather than one guy doing it all --
> my observations!
>
>
> >
> > > 2) I'm working on running CWL pipelines in Kubernetes through
> > > Airflow, and one of the problems I have to deal with is sharing
> > > directories between the PODs. It looks like Kubernetes doesn't
> > > provide a direct solution to this problem and mostly relies on the
> > > platform where it is installed. I would appreciate it if you directed
> > > me to the proper discussions/threads where people solve similar
> > > problems.
> > >
> >
> > There are currently two ways of sharing DAGs - persistent volume
> > claims and git sync. Generally the approach is that you need 3rd-party
> > distributed storage to share the dags; the synchronisation mechanism is
> > not (yet) built into Airflow. There is AIP-5 Remote DAG Fetcher
> > (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher)
> > where this has been discussed at length, and there is the accompanying
> > discussion thread (shorter than the discussion in the doc):
> > https://lists.apache.org/thread.html/224d1e7d1b11e0b8314075f21b1b81708749f2899f4cce5af295e8a8@%3Cdev.airflow.apache.org%3E
> > - but I don't think anyone in the community is actively working on
> > AIP-5 currently. I think the consensus in the community is that Airflow
> > solves scheduling but does not solve distribution - it delegates
> > distributing files to dedicated solutions (and you can choose whichever
> > solution you already have for the task). This is really targeted at
> > "corporate" deployments, where corporates usually already have some
> > distributed storage in place. Rather than force a single "distribution"
> > solution on them, the assumption is that Airflow will use whatever
> > solution is deployed at that company. As a next step, we also have
> > plans to get rid of file distribution completely in Airflow 2.0.
> > Provided that we implement full DAG serialisation, this problem will be
> > gone: all the DAG data will be stored in the database and hopefully no
> > more volume sharing will be needed.
> >
> > Here you can find a simple description of using PVCs with Airflow on
> > Kubernetes:
> > https://medium.com/@ramandumcs/how-to-run-apache-airflow-on-kubernetes-1cb809a8c7ea
> > Git Sync is also nice - but it requires a shared Git repo where DAGs
> > are kept.
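(Editorially, for reference, an illustrative fragment of such a Git-sync setup - a git-sync side-car sharing an emptyDir volume with the Airflow worker. The repo URL, image tags and mount paths are placeholders, not a definitive deployment:)

```yaml
# Illustrative only: git-sync side-car pulling a DAG repo into a volume
# shared with the Airflow worker. URL, tags and paths are placeholders.
spec:
  volumes:
    - name: dags
      emptyDir: {}
  containers:
    - name: git-sync
      image: k8s.gcr.io/git-sync:v3.1.1
      env:
        - name: GIT_SYNC_REPO
          value: "https://example.com/your-org/your-dags.git"
        - name: GIT_SYNC_DEST
          value: "repo"
      volumeMounts:
        - name: dags
          mountPath: /tmp/git          # git-sync's default root
    - name: airflow-worker
      image: apache/airflow:1.10.6     # placeholder image tag
      volumeMounts:
        - name: dags
          mountPath: /usr/local/airflow/dags
```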
> >
> > There are other solutions - the Composer team, for example, uses
> > 'gcsfuse' - a user-space synchronisation from a GCS bucket to a local
> > pod volume (they have two containers in a pod - gcsfuse as a side-car
> > to the Airflow worker, scheduler and UI, sharing a single volume).
> > Then it is a matter of putting the generated DAGs into a GCS bucket
> > (your daemon could do just that). And you can use similar solutions for
> > other dedicated "artifact" sharing - for example, we've implemented a
> > similar side-car pod for Nexus, where production DAG files were shared
> > as Nexus artifacts.
> >
> >
> > > Thanks a lot,
> > > Michael
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 2019/11/15 10:17:30, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > > I am also -1. But I am happy to help with surfacing the CWL
> > > > integration - both in the new package (together with
> > > > Oozie-2-airflow and maybe other converters) and by having it easily
> > > > installable as an external package. I will talk to Andrey
> > > > separately about this so that we do not clutter the devlist.
> > > >
> > > > J.
> > > >
> > > > On Fri, Nov 15, 2019 at 7:37 AM Maxime Beauchemin <
> > > > maximebeauche...@gmail.com> wrote:
> > > >
> > > > > After all the exploration of this topic here in this thread, I'm a
> > > pretty
> > > > > hard -1 on this one.
> > > > >
> > > > > I think CWL and CWL-Airflow are great projects, but they can't rely
> > on
> > > the
> > > > > Airflow community to evolve/maintain/package this integration.
> > > > >
> > > > > Personally I think that, generally and *within reason* (winking
> > > > > at the npm communities ;), smaller, targeted and loosely coupled
> > > > > packages [and their corresponding smaller repositories with their
> > > > > own sets of maintainers] are better than bigger monoliths. Some
> > > > > reasons:
> > > > > * separation of concerns
> > > > > * faster, more targeted builds and test suites
> > > > > * independent release cycles
> > > > > * clearer ownership
> > > > > * independent and adapted level of rigor / styling / standards
> > > > > * more targeted notifications for people watching repos
> > > > > * ...
> > > > >
> > > > > Max
> > > > >
> > > > > On Thu, Nov 14, 2019 at 12:33 PM Andrey Kartashov <
> por...@porter.st>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > > I looked at the
> > > > > > > https://cwl-airflow.readthedocs.io/en/1.0.18/readme/how_it_works.html#what-s-inside
> > > > > > > to understand what CWL is, and that's where I took the
> > > > > > > descriptor + job (in Key Concepts).
> > > > > > >
> > > > > >
> > > > > > Oh, this is an old one - but even the new one probably does not
> > > > > > reflect the real picture.
> > > > > >
> > > > > >
> > > > > > > OK. So, as I finally understand it, the problem you want to
> > > > > > > solve is "to make Airflow more accessible to people who
> > > > > > > already use CWL or who will find it easier to write dags in
> > > > > > > CWL". I still think this does not necessarily have to be
> > > > > > > solved by donating CWL code to Airflow (see below).
> > > > > > >
> > > > > >
> > > > > > I think there are many ways.
> > > > > >
> > > > > >
> > > > > > > Ok. So what you are basically saying is that you think the
> > > > > > > Airflow community has more capacity than the CWL community to
> > > > > > > maintain a CWL converter.
> > > > > >
> > > > > > My understanding is that the CWL community is just developing a
> > > > > > common standard (CWL), not converters :). And speaking as the
> > > > > > CWL-Airflow developer, the Airflow community definitely has far
> > > > > > more capacity than me alone :)
> > > > > >
> > > > > > > I am not so sure about it (precisely because of the lost
> > > > > > > opportunities). But maybe a better solution is to ask in the
> > > > > > > Airflow community whether there are people who could join the
> > > > > > > CWL-Airflow converter project and grow the community there.
> > > > > > >
> > > > > >
> > > > > > That is probably a good start - just to check and see the
> > > > > > interest.
> > > > > >
> > > > > > > I would not say for the whole community, but I would not feel
> > > > > > > comfortable, as a community, taking responsibility for the
> > > > > > > converter without prior knowledge and a detailed understanding
> > > > > > > of CWL. Especially since it is for a rather small group of
> > > > > > > users (at least initially). But I find CWL very interesting as
> > > > > > > an idea, and maybe there are some people in the community who
> > > > > > > would love to contribute to your project? Suggestion - maybe
> > > > > > > just ask, here and in Slack, whether there is enough interest
> > > > > > > in contributing to CWL-Airflow, rather than donating the code
> > > > > > > to Airflow? Just promote your project in the community and ask
> > > > > > > for help.
> > > > > >
> > > > > > I tried but have not got any feedback :) - but I'm not a
> > > > > > promoter or seller.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > I can see this as the best of both worlds - if you find a few
> > > > > > > people who would like to help and get familiar with it, and
> > > > > > > they are also part of the Airflow community, then we get
> > > > > > > collective knowledge about it - and that might eventually lead
> > > > > > > to incorporating it into Airflow itself once our community
> > > > > > > gets more familiar with CWL. I think this is the best way to
> > > > > > > achieve the final goal of eventually incorporating CWL as part
> > > > > > > of Airflow.
> > > > > > >
> > > > > >
> > > > > > Works for me
> > > > > >
> > > > > >
> > > > > > > In the meantime - I am happy to help make Airflow more
> > > > > > > "CWL-friendly" for users - both from the documentation and the
> > > > > > > Helm chart POV.
> > > > > > >
> > > > > >
> > > > > > Thank you, I appreciate that. How do we proceed?
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
