At Project Sunbird, we built daggit <https://github.com/project-sunbird/sunbird-ml-workbench>, an open source ML-as-a-Service platform on top of Airflow. While Airflow and other ML platforms have taken a *code-as-configuration* approach, we prefer to have users declaratively specify their ML Apps via YAML/JSON. We then have to parse those ML App specs and programmatically write the DAGs that Airflow can understand.
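
As a rough illustration of the spec-to-DAG step described above, here is a minimal sketch (not daggit's actual code) that turns a JSON spec into the source text of a DAG file using only the standard library. The spec fields, the `render_dag_source` helper, and the choice of `BashOperator` are all assumptions made for the example, and the import paths in the generated file follow the Airflow 1.x layout. As far as I know, Airflow's safe-mode file discovery only parses files whose text contains both `airflow` and `DAG`, so the generated file keeps those strings.

```python
import json

# Hypothetical ML App spec. The field names (dag_id, schedule, tasks, ...) are
# illustrative assumptions, not daggit's actual schema.
SPEC_JSON = """
{
  "dag_id": "iris_training",
  "schedule": "@daily",
  "tasks": [
    {"id": "load_data", "command": "python load.py", "upstream": []},
    {"id": "train", "command": "python train.py", "upstream": ["load_data"]},
    {"id": "evaluate", "command": "python evaluate.py", "upstream": ["train"]}
  ]
}
"""


def render_dag_source(spec):
    """Return the text of a Python DAG file for the given spec dict."""
    lines = [
        "from datetime import datetime",
        "",
        # Import paths follow the Airflow 1.x layout; adjust for other versions.
        "from airflow import DAG",
        "from airflow.operators.bash_operator import BashOperator",
        "",
        "dag = DAG(",
        "    dag_id={!r},".format(spec["dag_id"]),
        "    start_date=datetime(2020, 1, 1),",
        "    schedule_interval={!r},".format(spec["schedule"]),
        ")",
        "",
    ]
    # One operator per task; the task id doubles as the variable name,
    # so it has to be a valid Python identifier.
    for task in spec["tasks"]:
        if not task["id"].isidentifier():
            raise ValueError("task id must be a valid Python name: " + task["id"])
        lines.append(
            "{0} = BashOperator(task_id={0!r}, bash_command={1!r}, dag=dag)".format(
                task["id"], task["command"]
            )
        )
    lines.append("")
    # Wire dependencies: each upstream task runs before the task that lists it.
    for task in spec["tasks"]:
        for parent in task["upstream"]:
            lines.append("{} >> {}".format(parent, task["id"]))
    return "\n".join(lines) + "\n"


spec = json.loads(SPEC_JSON)
dag_source = render_dag_source(spec)
print(dag_source)
```

Writing `dag_source` to a `.py` file in the DAGs folder would be the remaining step; generating text rather than building DAG objects in memory keeps the output inspectable, at the cost of having to keep the template in sync with the Airflow version in use.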
Pain points: programmatically creating DAGs is a drag. Certain keywords have to appear in the auto-generated DAG file; otherwise the DagBag won't be filled. I'm not sure whether this has changed in Airflow versions after 1.9.

On Thu, Feb 20, 2020 at 8:42 AM Daniel Imberman <[email protected]> wrote:

> Thank you everyone for this feedback! I will organize these (and other) ideas and look forward to the conversation it starts!
>
> On Wed, Feb 19, 2020 at 9:54 AM, Ben Tallman <[email protected]> wrote:
> I don’t really have time to unpack a lot here, but we use airflow to extensively orchestrate Databricks Notebook based jobs. To date, we haven’t really exposed the notebook visualizations in the Airflow UI, but instead provide deep links to the job output.
>
> We spent a not insignificant amount of time building handlers into our operators that take convention based XCom data and pass it from job to job through the pipeline. In many cases, these aren’t ML jobs though, but they are Notebook style pipelines and we use XCom in this way to break the jobs up between notebooks.
>
> Thanks,
> Ben
>
> --
> Ben Tallman
> Chief Technology Officer
>
> M Science LLC
> 101 SW Main Street, Suite 350
> Portland, OR 97204
> 503-433-1552 (o/m)
> [email protected]
> mscience.com <https://mscience.com>
> ________________________________
> From: Maxime Beauchemin <[email protected]>
> Sent: Wednesday, February 19, 2020 9:30:30 AM
> To: [email protected] <[email protected]>
> Subject: Re: Airflow and Machine Learning
>
> I'd have a lot of thoughts to unpack here, but top of mind is a deeper integration with [jupyter] notebooks and/or hosted notebooks-type systems. Notebooks [with papermill <https://github.com/nteract/papermill>] can be parameterized predictably, and notebook files provide rich log outputs (organized by cells, can show data samples, charts, ...).
> For many ML practitioners, it seems like a system that can execute and orchestrate notebooks is a large chunk of what they need.
>
> Maybe a special [deeply integrated] notebook operator that can 1) bootstrap a specified docker image, 2) visualize ipynb in place of logs in the Airflow UI. On top of that maybe an Airflow plugin that enables people to execute or schedule notebooks without crafting a DAG, though there's probably a need for control mechanisms to be in place in that case.
>
> Max
>
> On Wed, Feb 19, 2020 at 8:47 AM Dan Davydov <[email protected]> wrote:
>
> > Twitter uses Airflow primarily for ML, to create automated pipelines for retraining data, but also for more ad-hoc training jobs.
> >
> > The biggest gaps are on the experimentation side. It takes too long for a new user to set up and run a pipeline and then iterate on it. This problem is a bit more unique to ML than other domains because 1) training jobs can take a very long time to run, and 2) users have the need to launch multiple experiments in parallel for the same model pipeline.
> >
> > Biggest Gaps:
> > - Too much boilerplate to write DAGs compared to Dagster/etc, and difficulty in message passing (XCom). There was a proposal recently to improve this in Airflow which should be entering AIP soon.
> > - Lack of pipeline isolation which hurts model experimentation (being able to run a DAG, modify it, and run it again without affecting the previous run), lack of isolation of DAGs from Airflow infrastructure (inability to redeploy Airflow infra without also redeploying DAGs) also hurts.
> > - Lack of multi-tenancy; it's hard for customers to quickly launch an ad-hoc pipeline, the overhead of setting up a cluster and all of its dependencies is quite high
> > - Lack of integration with data visualization plugins (e.g. plugins for rendering data related to a task when you click a task instance in the UI).
> > - Lack of simpler abstractions for users with limited knowledge of Airflow or even Python to build simple pipelines (not really an Airflow problem, but rather the need for a good abstraction that sits on top of Airflow, like a drag-and-drop pipeline builder)
> >
> > FWIW my personal feeling is that a fair number of companies in the ML space are moving to alternate solutions like TFX Pipelines due to the focus these platforms have on ML (ML pipelines are first-class citizens), and support from Google. Would be great if we could change that. The ML orchestration/tooling space is definitely evolving very rapidly and there are also promising new entrants.
> >
> > On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy <[email protected]> wrote:
> >
> > > Hello Daniel,
> > >
> > > In my company we use Airflow to update our ML models and to predict.
> > >
> > > As we use kubernetesOperator to trigger jobs, all ML DAGs are similar, and ML/data science engineers can reuse a template and choose which type of machine they need (highcpu, highmem, GPU or not, etc.)
> > >
> > > We have a process in place, described in the second part of this article (Industrializing machine learning pipeline):
> > >
> > > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> > >
> > > Hope this helps.
> > >
> > > Germain.
> > >
> > > On 19/02/2020 16:42, "Daniel Imberman" <[email protected]> wrote:
> > >
> > > Hello everyone!
> > >
> > > I’m working on a few proposals to make Apache Airflow more friendly for ML/Data science use-cases, and I wanted to reach out in hopes of hearing from people that are using/wish to use Airflow for ML. If you have any opinions on the subject, I’d love to hear what you’re all working on!
> > >
> > > Current questions I’m looking into:
> > >
> > > 1. How do you use Airflow for your ML?
> > >    Has it worked out well for you?
> > > 2. Are there any features that would improve your experience of building models on Airflow?
> > > 3. Have you built anything on top of or around Airflow to aid you in this process?
> > >
> > > Thank you so much for your time!
