I have a lot of thoughts to unpack here, but top of mind is deeper
integration with [Jupyter] notebooks and/or hosted notebook-type systems.
Notebooks [with papermill <https://github.com/nteract/papermill>] can be
parameterized predictably, and notebook files provide rich log outputs
(organized by cells, can show data samples, charts, ...). For many ML
practitioners, it seems like a system that can execute and orchestrate
notebooks is a large chunk of what they need.
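
A rough sketch of the parameterization point, assuming the papermill CLI is
installed: each run boils down to an input notebook, an output notebook, and
repeated `-p NAME VALUE` pairs. The helper name, notebook paths, and parameter
names below are hypothetical, chosen only for illustration.

```python
def papermill_command(input_nb, output_nb, parameters):
    """Build a papermill CLI invocation as an argv list.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    cmd = ["papermill", input_nb, output_nb]
    # Sort so the same parameters always yield the same command line.
    for name, value in sorted(parameters.items()):
        # "-p NAME VALUE" passes one parameter to the notebook.
        cmd += ["-p", name, str(value)]
    return cmd


if __name__ == "__main__":
    # Made-up notebook and parameters for a single training run.
    print(papermill_command(
        "train.ipynb",
        "runs/train_2020-02-19.ipynb",
        {"learning_rate": 0.01, "epochs": 5},
    ))
```

The executed copy written to the output path is what would act as the rich,
cell-by-cell log for that run.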

Maybe add a special [deeply integrated] notebook operator that can 1) bootstrap
a specified Docker image and 2) render the executed ipynb in place of logs in
the Airflow UI. On top of that, maybe an Airflow plugin that lets people
execute or schedule notebooks without crafting a DAG, though some control
mechanisms would probably need to be in place in that case.
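
A minimal sketch of what step 1) of such an operator might end up running,
assuming Docker is available on the worker and the image already has papermill
installed. The function name, the /work mount point, and the image name are
hypothetical; none of this is an existing Airflow API.

```python
import shlex


def docker_notebook_command(image, host_dir, input_nb, output_nb, parameters):
    """Build a `docker run` argv that executes a notebook with papermill.

    Assumption: `image` ships with papermill on its PATH.
    """
    cmd = [
        "docker", "run", "--rm",       # remove the container when it exits
        "-v", f"{host_dir}:/work",     # mount the notebook directory
        "-w", "/work",                 # run from the mounted directory
        image,
        "papermill", input_nb, output_nb,
    ]
    for name, value in sorted(parameters.items()):
        cmd += ["-p", name, str(value)]
    return cmd


if __name__ == "__main__":
    argv = docker_notebook_command(
        "my-ml-image:latest",          # hypothetical image name
        "/data/notebooks",
        "train.ipynb",
        "out/train.ipynb",
        {"epochs": 3},
    )
    # shlex.join (Python 3.8+) quotes the argv for display in a shell.
    print(shlex.join(argv))
```

An operator built this way would then need to pick up the output notebook and
render it in the UI, which is the part that has no off-the-shelf equivalent
here.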

Max

On Wed, Feb 19, 2020 at 8:47 AM Dan Davydov <ddavy...@twitter.com.invalid>
wrote:

> Twitter uses Airflow primarily for ML, to create automated pipelines for
> retraining data, but also for more ad-hoc training jobs.
>
> The biggest gaps are on the experimentation side. It takes too long for a
> new user to set up and run a pipeline and then iterate on it. This problem
> is a bit more unique to ML than other domains because 1) training jobs can
> take a very long time to run, and 2) users have the need to launch multiple
> experiments in parallel for the same model pipeline.
>
> Biggest Gaps:
> - Too much boilerplate to write DAGs compared to Dagster/etc, and
> difficulty in message passing (XCom). There was a proposal recently to
> improve this in Airflow which should be entering AIP soon.
> - Lack of pipeline isolation, which hurts model experimentation (being able
> to run a DAG, modify it, and run it again without affecting the previous
> run). Lack of isolation of DAGs from Airflow infrastructure (inability to
> redeploy Airflow infra without also redeploying DAGs) also hurts.
> - Lack of multi-tenancy; it's hard for customers to quickly launch an
> ad-hoc pipeline because the overhead of setting up a cluster and all of its
> dependencies is quite high.
> - Lack of integration with data visualization plugins (e.g. plugins for
> rendering data related to a task when you click a task instance in the UI).
> - Lack of simpler abstractions for users with limited knowledge of Airflow
> or even python to build simple pipelines (not really an Airflow problem,
> but rather the need for a good abstraction that sits on top of Airflow like
> a drag-and-drop pipeline builder)
>
> FWIW my personal feeling is that a fair number of companies in the ML space
> are moving to alternative solutions like TFX Pipelines due to the focus these
> platforms have on ML (ML pipelines are first-class citizens) and
> support from Google. Would be great if we could change that. The ML
> orchestration/tooling space is definitely evolving very rapidly, and there
> are promising new entrants as well.
>
> On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy
> <germain.tan...@dailymotion.com.invalid> wrote:
>
> > Hello Daniel,
> >
> > In my company we use airflow to update our ML models and to predict.
> >
> > As we use kubernetesOperator to trigger jobs, all ML DAGs are similar, and
> > ML/data science engineers can reuse a template and choose which type of
> > machine they need (highcpu, highmem, GPU or not, etc.).
> >
> > We have a process in place, described in the second part of this article
> > (Industrializing machine learning pipeline):
> >
> > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> >
> > Hope this helps.
> >
> > Germain.
> >
> > On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com>
> > wrote:
> >
> >     Hello everyone!
> >
> >     I’m working on a few proposals to make Apache Airflow more friendly
> > for ML/Data science use-cases, and I wanted to reach out in hopes of
> > hearing from people that are using/wish to use Airflow for ML. If you have
> > any opinions on the subject, I’d love to hear what you’re all working on!
> >
> >     Current questions I’m looking into:
> >
> >      1. How do you use Airflow for your ML? Has it worked out well for you?
> >      2. Are there any features that would improve your experience of
> > building models on Airflow?
> >      3. Have you built anything on top of or around Airflow to aid
> > you in this process?
> >
> >     Thank you so much for your time!
> >
> >
> >
>
