Re: Airflow and Machine Learning

Evgeny Shulman Wed, 19 Feb 2020 23:36:36 -0800

Hey Everybody
(Fully agree with Dan's post. These are the main pain points we see and are 
trying to fix. Here is our reply on the thread topic.)

We have numerous ML engineers who use our open source project (DBND) with 
Airflow for their everyday work. We help them create and monitor ML/data 
pipelines of varying complexity and infra requirements. After 1.5 years of 
doing that as a company, and a few years of doing it as part of a big 
enterprise organization before we started Databand, these are the main pain 
points we think about when it comes to Airflow:

A. DAG Versioning - ML teams change DAGs constantly. The first limitation they 
hit is not being able to review the history of previous DAG runs against the 
exact version of the DAG that executed. Our plugin 
'dbnd-airflow-versioned-dag' is our approach to that: we save every specific 
version of the DAG and show it in the Airflow UI. This is important in ML use 
cases because of the data science experimentation cycle and the need to trace 
exactly what code/data went into a model.
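
To make the idea of a "DAG version" concrete, here is a minimal sketch. It is 
an illustration only, not how dbnd-airflow-versioned-dag is implemented; the 
function name dag_version and the choice of what goes into the fingerprint 
(task ids, edges, source text) are assumptions made for this example.

```python
import hashlib
import json


def dag_version(tasks, edges, source_code=""):
    """Return a short, deterministic fingerprint for a DAG definition.

    tasks: iterable of task ids.
    edges: iterable of (upstream_id, downstream_id) pairs.
    source_code: optional text of the code the tasks run (hypothetical input).
    """
    payload = {
        # Sort so that declaration order does not change the fingerprint.
        "tasks": sorted(tasks),
        "edges": sorted([list(edge) for edge in edges]),
        "source": source_code,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


# Same structure declared in a different order: same version.
v1 = dag_version(["extract", "train"], [("extract", "train")])
v2 = dag_version(["train", "extract"], [("extract", "train")])

# Adding a task changes the structure, so the version changes.
v3 = dag_version(
    ["extract", "train", "evaluate"],
    [("extract", "train"), ("train", "evaluate")],
)
```

Storing such a fingerprint next to each run is one way a UI could later show 
the exact DAG shape that produced a given model.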

B. A better version of the backfill command - We had to reimplement the 
BackfillJob class to be able to run specific DAG versions.

C. Running the same DAG in different environments - People want to run the same 
DAG locally and on GCP/AWS without changing all the code. We do that by 
abstracting Spark/Python/Docker code execution so we can easily switch from one 
infra to another: all infra logic is wrapped in generic gateway "operators" 
that make extensive use of existing Airflow hooks and operators.
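
A rough sketch of the gateway pattern, in plain Python so it runs without 
Airflow installed. Everything here is hypothetical (SparkJob, ENGINES, 
build_command), and "local" and "yarn" are just stand-ins for "my laptop" and 
"a cluster"; it is not DBND's actual code. The point is only that the job 
description stays the same while the environment picks the engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SparkJob:
    """Environment-independent description of a Spark job (hypothetical)."""
    main_class: str
    jar: str


def local_engine(job: SparkJob) -> List[str]:
    # Run Spark in-process on the developer machine.
    return ["spark-submit", "--master", "local[*]",
            "--class", job.main_class, job.jar]


def yarn_engine(job: SparkJob) -> List[str]:
    # Submit to a YARN-managed cluster.
    return ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster",
            "--class", job.main_class, job.jar]


# The "gateway": one lookup table from environment name to engine.
ENGINES: Dict[str, Callable[[SparkJob], List[str]]] = {
    "local": local_engine,
    "yarn": yarn_engine,
}


def build_command(job: SparkJob, env: str) -> List[str]:
    """Build the submit command for `job` in environment `env`."""
    try:
        engine = ENGINES[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None
    return engine(job)


job = SparkJob(main_class="com.example.Train", jar="train.jar")
local_cmd = build_command(job, "local")
cluster_cmd = build_command(job, "yarn")
```

In a real deployment each engine would delegate to an existing Airflow hook or 
operator instead of returning a command list; the pipeline author would still 
only touch the job description and the environment name.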

D. Data passing & versioning - Being able to pass data from Operator to 
Operator and to version that data, while keeping DAGs & sub-DAGs easy to 
author. Pipelines grow in complexity very quickly. It will be hard to agree on 
what the "right" SDK to implement here is. Airflow is very much "built by 
engineers for engineers": DAGs are created to be executed as scheduled 
production jobs. It's going to be a long journey to reach a common conclusion 
on what needs to be done at a higher level around task/data management. Some 
people from the Airflow community went and started new orchestration companies 
after they didn't manage to bring about a significant change in the data model 
of Airflow.

Our biggest wish list item in Airflow as advanced users: 

* A low-level API to generate and run DAGs *. 
So far there are numerous extensions, and all of them solve this by creating 
another dag.py file that generates the DAG. But neither the Scheduler nor the 
UI can fully support that. The moment the scheduler and the UI open up to 
"versioned DAGs", a lot of nice DSLs and extensions will emerge. Data analysts 
will get more GUI-driven tools to generate DAGs, ML engineers will be able to 
run and iterate on their algorithms, and data engineers will be able to 
implement their DAG DSL/SDK the way they see fit for their company.
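
As a toy illustration of what "generate and run a DAG through a low-level 
API" could mean, here is a sketch that takes an in-memory spec and runs it in 
dependency order, passing each task the outputs of its upstream tasks. This is 
not an existing Airflow API; run_dag, the spec format and the task functions 
are all made up for this example, and it assumes Python 3.9+ for the standard 
graphlib module.

```python
from graphlib import TopologicalSorter


def run_dag(spec, callables):
    """Run tasks in dependency order and return all task outputs.

    spec: dict mapping task id -> list of upstream task ids.
    callables: dict mapping task id -> function taking a dict of
               {upstream_id: upstream_output} and returning an output.
    """
    results = {}
    # static_order() yields every task after all of its upstream tasks.
    for task_id in TopologicalSorter(spec).static_order():
        upstream = {u: results[u] for u in spec.get(task_id, [])}
        results[task_id] = callables[task_id](upstream)
    return results


# A DSL or GUI tool could generate this spec instead of writing a dag.py file.
spec = {
    "extract": [],
    "train": ["extract"],
    "evaluate": ["train"],
}

callables = {
    "extract": lambda up: [1, 2, 3],
    "train": lambda up: sum(up["extract"]) / len(up["extract"]),
    "evaluate": lambda up: up["train"] > 1,
}

results = run_dag(spec, callables)
```

The spec is plain data, so it can be stored, diffed and versioned 
independently of the scheduler that eventually runs it.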

Most users of DBND author their ML pipelines without knowing that Airflow is 
orchestrating behind the scenes. They submit Python/Spark/Notebooks without 
knowing that the DAG is going to be run through the Airflow subsystem. Only 
when they see the Airflow webserver do they discover that Airflow is there. 
And this is the way it should be. ML developers don't like new frameworks; 
they just want to see data flowing from task to task, and ways to push work to 
production with minimal "external" code involved.

Evgeny.

On 2020/02/19 16:46:44, Dan Davydov <[email protected]> wrote: 
> Twitter uses Airflow primarily for ML, to create automated pipelines for
> retraining data, but also for more ad-hoc training jobs.
> 
> The biggest gaps are on the experimentation side. It takes too long for a
> new user to set up and run a pipeline and then iterate on it. This problem
> is a bit more unique to ML than other domains because 1) training jobs can
> take a very long time to run, and 2) users have the need to launch multiple
> experiments in parallel for the same model pipeline.
> 
> Biggest Gaps:
> - Too much boilerplate to write DAGs compared to Dagster/etc, and
> difficulty in message passing (XCom). There was a proposal recently to
> improve this in Airflow which should be entering AIP soon.
> - Lack of pipeline isolation which hurts model experimentation (being able
> to run a DAG, modify it, and run it again without affecting the previous
> run), lack of isolation of DAGs from Airflow infrastructure (inability to
> redeploy Airflow infra without also redeploying DAGs) also hurts.
> - Lack of multi-tenancy; it's hard for customers to quickly launch an
> ad-hoc pipeline because the overhead of setting up a cluster and all of its
> dependencies is quite high.
> - Lack of integration with data visualization plugins (e.g. plugins for
> rendering data related to a task when you click a task instance in the UI).
> - Lack of simpler abstractions for users with limited knowledge of Airflow
> or even python to build simple pipelines (not really an Airflow problem,
> but rather the need for a good abstraction that sits on top of Airflow like
> a drag-and-drop pipeline builder)
> 
> FWIW my personal feeling is that a fair number of companies in the ML space
> are moving to alternate solutions like TFX Pipelines due to the focus these
> platforms have on ML (ML pipelines are first-class citizens), and
> support from Google. Would be great if we could change that. The ML
> orchestration/tooling space is definitely evolving very rapidly and there
> are promising new entrants as well.
> 
> On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy
> <[email protected]> wrote:
> 
> > Hello Daniel,
> >
> > In my company we use airflow to update our ML models and to predict.
> >
> > As we use kubernetesOperator to trigger jobs, all our ML DAGs are similar,
> > and ML/Data science engineers can reuse a template and choose which type
> > of machine they need (highcpu, highmem, GPU or not, etc.)
> >
> > We have a process in place, described in the second part of this article
> > (Industrializing machine learning pipeline):
> > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> >
> > Hope this helps.
> >
> > Germain.
> >
> > On 19/02/2020 16:42, "Daniel Imberman" <[email protected]> wrote:
> >
> >     Hello everyone!
> >
> >     I’m working on a few proposals to make Apache Airflow more friendly
> > for ML/Data science use-cases, and I wanted to reach out in hopes of
> > hearing from people that are using/wish to use Airflow for ML. If you have
> > any opinions on the subject, I’d love to hear what you’re all working on!
> >
> >     Current questions I’m looking into:
> >
> >      1. How do you use Airflow for your ML? Has it worked out well for you?
> >      2. Are there any features that would improve your experience of
> > building models on Airflow?
> >      3. Have you built anything on top of or around Airflow to aid
> > you in this process?
> >
> >     Thank you so much for your time!
> >
> >
> >
>
