Re: Airflow and Machine Learning

Dan Davydov Wed, 19 Feb 2020 08:48:16 -0800

Twitter uses Airflow primarily for ML, to create automated pipelines for
retraining data, but also for more ad-hoc training jobs.


The biggest gaps are on the experimentation side. It takes too long for a
new user to set up and run a pipeline and then iterate on it. This problem
is more pronounced in ML than in other domains because 1) training jobs can
take a very long time to run, and 2) users need to launch multiple
experiments in parallel for the same model pipeline.

Biggest Gaps:
- Too much boilerplate to write DAGs compared to Dagster, etc., and
difficulty in message passing (XCom). There was a proposal recently to
improve this in Airflow, which should be entering AIP soon.
- Lack of pipeline isolation, which hurts model experimentation (being able
to run a DAG, modify it, and run it again without affecting the previous
run). Lack of isolation of DAGs from Airflow infrastructure (inability to
redeploy Airflow infra without also redeploying DAGs) also hurts.
- Lack of multi-tenancy; it's hard for customers to quickly launch an
ad-hoc pipeline because the overhead of setting up a cluster and all of its
dependencies is quite high.
- Lack of integration with data visualization plugins (e.g. plugins for
rendering data related to a task when you click a task instance in the UI).
- Lack of simpler abstractions for users with limited knowledge of Airflow
or even Python to build simple pipelines (not really an Airflow problem,
but rather the need for a good abstraction that sits on top of Airflow, like
a drag-and-drop pipeline builder).
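
A rough sketch of the first point (boilerplate and message passing), in
plain Python with no Airflow imports. Everything here is hypothetical: the
FakeXCom class, the task functions, and the runner functions are made-up
stand-ins that only illustrate the shape of the two styles. They are not
Airflow's actual API and not the contents of the proposal mentioned above.

```python
# Hypothetical illustration only: none of these names are Airflow's real API.
from typing import Any, Dict, List, Tuple


class FakeXCom:
    """In-memory stand-in for a cross-task message store."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Any] = {}

    def push(self, task_id: str, key: str, value: Any) -> None:
        self._store[(task_id, key)] = value

    def pull(self, task_id: str, key: str) -> Any:
        return self._store[(task_id, key)]


# Style 1: explicit push/pull. Each task must know the upstream task's id
# and the key it used, so the data dependency is hidden in strings.
def extract_explicit(xcom: FakeXCom) -> None:
    xcom.push("extract", "rows", [1, 2, 3])


def train_explicit(xcom: FakeXCom) -> None:
    rows = xcom.pull("extract", "rows")
    xcom.push("train", "model", sum(rows))


def run_explicit() -> Any:
    xcom = FakeXCom()
    extract_explicit(xcom)
    train_explicit(xcom)
    return xcom.pull("train", "model")


# Style 2: functional. Tasks are plain functions; return values carry the
# data, and the call graph itself is the dependency graph.
def extract() -> List[int]:
    return [1, 2, 3]


def train(rows: List[int]) -> int:
    return sum(rows)


def run_functional() -> int:
    return train(extract())
```

Both runners compute the same value; the second simply has less wiring
(task ids, keys) for a pipeline author to get wrong.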

FWIW my personal feeling is that a fair number of companies in the ML space
are moving to alternate solutions like TFX Pipelines due to the focus these
platforms have on ML (ML pipelines are first-class citizens), and
support from Google. Would be great if we could change that. The ML
orchestration/tooling space is definitely evolving very rapidly and there
are promising new entrants as well.

On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy
<germain.tan...@dailymotion.com.invalid> wrote:

> Hello Daniel,
>
> In my company we use Airflow to update our ML models and to run predictions.
>
> As we use kubernetesOperator to trigger jobs, our ML DAGs are all similar, and
> ML/data science engineers can reuse a template and choose which type of
> machine they need (highcpu, highmem, GPU or not, etc.).
>
> We have a process in place, described in the second part of this article
> (Industrializing machine learning pipeline):
> https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
>
> Hope this helps.
>
> Germain.
>
> On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com> wrote:
>
>     Hello everyone!
>
>     I’m working on a few proposals to make Apache Airflow more friendly
> for ML/Data science use-cases, and I wanted to reach out in hopes of
> hearing from people that are using/wish to use Airflow for ML. If you have
> any opinions on the subject, I’d love to hear what you’re all working on!
>
>     Current questions I’m looking into:
>
>      1. How do you use Airflow for your ML? Has it worked out well for you?
>      2. Are there any features that would improve your experience of
> building models on Airflow?
>      3. Have you built anything on top of or around Airflow to aid
> you in this process?
>
>     Thank you so much for your time!
>
