Hey everybody,

(Fully agree with Dan's post. Those are the main pain points we see and are trying to fix. Here is our reply on the thread topic.)

We have many ML engineers who use our open source project (DBND) with Airflow in their everyday work. We help them create and monitor ML/data pipelines of varying complexity and infrastructure requirements. After 1.5 years of doing this as a company, and a few years of doing it inside a big enterprise organization before we started Databand, these are the main pain points we think about when it comes to Airflow:

A. DAG versioning - ML teams change DAGs constantly. The first limitation they hit is that they cannot review historical information about previous DAG runs based on the exact version of the DAG that executed. Our plugin 'dbnd-airflow-versioned-dag' is our approach to that: we save every specific version of the DAG and show it in the Airflow UI. This is important in ML use cases because of the data science experimentation cycle and the need to trace exactly what code/data went into a model.

B. A better version of the backfill command - We had to reimplement the BackfillJob class to be able to run specific DAG versions.

C. Running the same DAG in different environments - People want to run the same DAG locally and on GCP/AWS without changing all the code. We did that by abstracting Spark/Python/Docker code execution so we can easily switch from one infrastructure to another: all infra logic is wrapped in generic gateway "operators" that make extensive use of existing Airflow hooks and operators.

D. Data passing & versioning - Being able to pass data from operator to operator and version that data, and being able to do that with easy authoring of DAGs and sub-DAGs. Pipelines grow in complexity very quickly. It will be hard to agree on the "right" SDK to implement here. Airflow is very much "built by engineers for engineers"; DAGs are created to be executed as scheduled production jobs. It's going to be a long journey to reach a common conclusion on what needs to be done at a higher level around task/data management.
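To make point A a bit more concrete, here is a minimal sketch in plain Python of what "tying a run to the exact DAG version" can mean. This is not how 'dbnd-airflow-versioned-dag' is implemented, and it uses no Airflow or DBND API; TaskSpec, dag_version and RunHistory are hypothetical names made up for this illustration. The idea is only that a DAG definition gets a deterministic fingerprint, and every run is stored against that fingerprint.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical, simplified description of one task in a DAG."""
    task_id: str
    command: str
    upstream: tuple = ()


def dag_version(tasks):
    """Return a short, deterministic fingerprint of a DAG definition.

    Tasks and their upstream lists are sorted first, so the order in
    which tasks are declared does not change the version.
    """
    canonical = sorted(
        (
            {
                "task_id": t.task_id,
                "command": t.command,
                "upstream": sorted(t.upstream),
            }
            for t in tasks
        ),
        key=lambda d: d["task_id"],
    )
    payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class RunHistory:
    """Hypothetical store: keeps each DAG version and which run used it."""
    versions: dict = field(default_factory=dict)  # version -> tuple of TaskSpec
    runs: dict = field(default_factory=dict)      # run_id -> version

    def record_run(self, run_id, tasks):
        version = dag_version(tasks)
        # Keep the first definition seen for this version.
        self.versions.setdefault(version, tuple(tasks))
        self.runs[run_id] = version
        return version

    def definition_for(self, run_id):
        """Return the exact DAG definition that a past run executed."""
        return self.versions[self.runs[run_id]]


if __name__ == "__main__":
    v1 = [
        TaskSpec("prepare", "python prepare.py"),
        TaskSpec("train", "python train.py", ("prepare",)),
    ]
    # Same DAG, but the training command changed between experiments.
    v2 = [
        TaskSpec("prepare", "python prepare.py"),
        TaskSpec("train", "python train.py --epochs 20", ("prepare",)),
    ]
    history = RunHistory()
    history.record_run("run_1", v1)
    history.record_run("run_2", v2)
    for run_id in ("run_1", "run_2"):
        print(run_id, history.runs[run_id], history.definition_for(run_id))
```

With something like this in place, looking at "run_1" after the DAG has changed still shows the task definitions that actually ran, which is the property ML teams ask for when they need to trace what went into a model.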

Some people from the Airflow community went and started new orchestration companies after they didn't manage to get a significant change into Airflow's data model.

Our biggest wish-list item for Airflow as advanced users:

* A low-level API to generate and run DAGs *

So far there are numerous extensions, and all of them solve this by creating another dag.py file that generates the DAG. But neither the scheduler nor the UI can fully support that. The moment the scheduler, together with the UI, is open to "versioned DAGs", a lot of nice DSLs and extensions will emerge. Data analysts will get more GUI-driven tools to generate DAGs, ML engineers will be able to run and iterate on their algorithms, and data engineers will be able to implement their DAG DSL/SDK the way they see fit for their company.

Most users of DBND author their ML pipelines without knowing that Airflow is orchestrating behind the scenes. They submit Python/Spark/notebooks without knowing that the DAG is going to be run through the Airflow subsystem. Only when they see the Airflow webserver do they start to discover that Airflow is there. And this is the way it should be. ML developers don't like new frameworks; they just like to see data flowing from task to task, and ways to push work to production with minimal "external" code involved.

Evgeny.

On 2020/02/19 16:46:44, Dan Davydov <[email protected]> wrote:
> Twitter uses Airflow primarily for ML, to create automated pipelines for
> retraining data, but also for more ad-hoc training jobs.
>
> The biggest gaps are on the experimentation side. It takes too long for a
> new user to set up and run a pipeline and then iterate on it. This problem
> is a bit more unique to ML than other domains because 1) training jobs can
> take a very long time to run, and 2) users have the need to launch multiple
> experiments in parallel for the same model pipeline.
>
> Biggest Gaps:
> - Too much boilerplate to write DAGs compared to Dagster/etc, and
> difficulty in message passing (XCom). There was a proposal recently to
> improve this in Airflow which should be entering AIP soon.
> - Lack of pipeline isolation which hurts model experimentation (being able
> to run a DAG, modify it, and run it again without affecting the previous
> run), lack of isolation of DAGs from Airflow infrastructure (inability to
> redeploy Airflow infra without also redeploying DAGs) also hurts.
> - Lack of multi-tenancy; it's hard for customers to quickly launch an
> ad-hoc pipeline, the overhead of setting up a cluster and all of its
> dependencies is quite high
> - Lack of integration with data visualization plugins (e.g. plugins for
> rendering data related to a task when you click a task instance in the UI).
> - Lack of simpler abstractions for users with limited knowledge of Airflow
> or even python to build simple pipelines (not really an Airflow problem,
> but rather the need for a good abstraction that sits on top of Airflow like
> a drag-and-drop pipeline builder)
>
> FWIW my personal feeling is that a fair number of companies in the ML space
> are moving to alternate solutions like TFX Pipelines due to the focus these
> platforms have on ML (ML pipelines are first-class citizens), and
> support from Google. Would be great if we could change that. The ML
> orchestration/tooling space is definitely evolving very rapidly and there
> are also new promising entrants as well.
>
> On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy
> <[email protected]> wrote:
>
> > Hello Daniel,
> >
> > In my company we use airflow to update our ML models and to predict.
> >
> > As we use kubernetesOperator to trigger jobs, all ML DAGs are similar, and
> > ML/data science engineers can reuse a template and choose which type of
> > machine they need (highcpu, highmem, GPU or not, etc.)
> >
> > We have a process in place, described in the second part of this article
> > (Industrializing machine learning pipeline):
> > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> >
> > Hope this helps.
> >
> > Germain.
> >
> > On 19/02/2020 16:42, "Daniel Imberman" <[email protected]> wrote:
> >
> > Hello everyone!
> >
> > I’m working on a few proposals to make Apache Airflow more friendly
> > for ML/Data science use-cases, and I wanted to reach out in hopes of
> > hearing from people that are using/wish to use Airflow for ML. If you have
> > any opinions on the subject, I’d love to hear what you’re all working on!
> >
> > Current questions I’m looking into:
> >
> > 1. How do you use Airflow for your ML? Has it worked out well for you?
> > 2. Are there any features that would improve your experience of
> > building models on Airflow?
> > 3. Have you built anything on top of airflow/around Airflow to aid
> > you in this process?
> >
> > Thank you so much for your time!
