I appreciate where you're coming from on wanting to enhance productivity
for different types of users, but as a cluster administrator, I _really_
don't want to be running software that's managing its own Docker builds,
virtualenvs, zip uploads, etc.! It will almost certainly not do so in a way
compliant or consistent with our policies. If anything goes wrong,
disentangling what happened and how to fix it will be different from all of
our other software builds, which already have defined processes in place.
Also, "self-building" software that works this way often leaves ops teams
like mine looking like the "bad guys" by solving the easy local
development/single-instance/single-user case, and then failing to work "the
same way" in a more restricted shared cluster.

I think this is exactly why a structured API is good, though. For example,
requiring DAGs to implement a manifest/retrieval API, where a basic
implementation is a hash plus a local filesystem path, a provided-but-optional
implementation is a git commit plus checkout credentials, and a third-party
module implements a multi-user Python notebook integration.
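
To make that concrete, here is a rough sketch of what such an interface
could look like. Every name in it is hypothetical - this is the shape of
the contract I have in mind, not anything Airflow defines today:

import abc
import hashlib

class DagSource(abc.ABC):
    """Hypothetical contract: how Airflow discovers and fetches DAG code."""

    @abc.abstractmethod
    def manifest(self) -> dict:
        """Identity/version info, e.g. {"dag_id": ..., "version": ...}."""

    @abc.abstractmethod
    def retrieve(self) -> str:
        """Materialize the DAG code locally; return a path for the parser."""

class LocalDagSource(DagSource):
    """Basic implementation: a content hash plus a local filesystem path."""

    def __init__(self, dag_id: str, path: str):
        self.dag_id = dag_id
        self.path = path

    def manifest(self) -> dict:
        with open(self.path, "rb") as f:
            return {"dag_id": self.dag_id,
                    "version": hashlib.sha256(f.read()).hexdigest()}

    def retrieve(self) -> str:
        return self.path  # a git-backed source would clone/checkout here

A git-backed implementation would report a commit SHA as its version and do
the checkout in retrieve(); the notebook integration would be just another
subclass.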

Basically, I am not opposed to "monitor code for a Dockerfile change,
rebuild/push the image, and trigger the DAG" being something you could
build on top of Airflow. I think that even a basic commitment to making
DAG functionality API-driven would enable a third party to do exactly that.
I would not want to see that functionality baked into Airflow's core,
though, because it's spanning so many problem domains that involve extra
security and operational concerns. As-is, we want to get more of those
concerns under our control, rather than Airflow's. (E.g., we'd rather
notify Airflow of DAG changes as part of a deployment pipeline than have it
constantly poll for updated definitions, which drives up resource
utilization and invites inconsistency.)
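
For illustration only - the endpoint below is made up, since what I care
about is the shape of the integration rather than any real Airflow API -
the last step of a deployment pipeline could then be:

import requests

def notify_dag_change(dag_id: str, version: str, artifact_url: str) -> None:
    # Push-based: the pipeline tells Airflow what changed and where to
    # fetch it, instead of Airflow polling the DAG folder for changes.
    resp = requests.post(
        "https://airflow.internal.example.com/api/dag-sources/refresh",
        json={"dag_id": dag_id, "version": version,
              "artifact_url": artifact_url},
        timeout=10,
    )
    resp.raise_for_status()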

On Mon, Feb 24, 2020 at 5:16 PM Jarek Potiuk <[email protected]>
wrote:

> I think what would help a lot to solve the problems of env/deps/DAGs is
> the wheel packaging we started to talk about in another thread.
>
> I personally think what Argo does is too much "cloud native" - you have
> to build the image, push it to a registry, get it pulled by the engine,
> execute, etc. I've been working with an ML tool we developed internally
> at NoMagic.ai which did quite the opposite - it packaged all the Python
> code developed locally into a .zip file, stored it in GCS, and submitted
> it to be executed - it was unpacked and executed on a remote machine. We
> actually evaluated Airflow back then (and that's how I learned about the
> project) but it was too much overhead for our case.
>
> The first approach - Argo's - is generally slow and "complex" if you
> just change one line of code. The second approach - the default Airflow
> approach - is bad when you have new binary or Python dependencies. I
> think both solutions are valid, but for different cases of code
> deployment. And neither of them handles the in-between case where only
> new Python dependencies are added/modified. I think with Airflow we can
> target a "combined" solution which handles all the cases well.
>
> I think the problem is pretty independent of whether the changes are
> stored in a git repo or not. I can imagine that iteration is done based
> on just (shared) file system changes or through git - it is just a matter
> of whether you want to "freeze" the state in git or not. As a software
> engineer I always prefer git - but I perfectly understand that for a data
> scientist, git commits and resolving potential conflicts can be a burden,
> especially when you want to iterate quickly and experiment. For this
> quick-iteration case, everything should work quickly and transparently,
> without unnecessary overhead, when you *just* change a Python file or
> *just* bump a Python dependency version.
>
> What could work is to start treating the three types of changes
> differently, rather than trying to handle them all the same way. I think
> this can all be automated by Airflow's DAG submission mechanism,
> addressing the needs of individuals, teams, and companies at different
> levels:
>
> - super-fast iteration by individual people (at the DAG-code level)
> - the ability for different people in the same team, using the same
> deployment, to use different Python dependencies (at the dependency level)
> - the ability to keep the environment evolving with new binary
> dependencies, eventually landing in the prod environment (at the
> environment/container level)
>
> Working it out backwards from the heaviest to lightest changes:
>
> 1) whenever we have a binary dependency change (.so, binary, new external
> dependency to be added) we should change the Dockerfile that installs it.
> We don't currently handle that case, but I can imagine that with prod
> image support, and the ability to add a thin layer of new binaries on top
> of the base Airflow image, this should be easy to automate - when you
> iterate locally, you should be able to build the image automatically,
> push the new version to the registry, and restart the whole Airflow
> deployment to use the new binary. It should be a rare case, and it
> impacts the whole deployment (for example the whole dev or staging
> deployment) - i.e. everyone in the Airflow installation gets the new
> image as a base. The drawback here is that everyone who is using the same
> Airflow deployment is impacted.
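>
> To illustrate, that thin layer could be a Dockerfile as small as this (a
> sketch - it assumes the prod image work lands as an apache/airflow base
> image, and libsnappy is just a stand-in example of a new binary
> dependency):
>
> FROM apache/airflow:2.0.0
> USER root
> # the one new binary dependency, layered on top of the shared base image
> RUN apt-get update \
>     && apt-get install -y --no-install-recommends libsnappy-dev \
>     && rm -rf /var/lib/apt/lists/*
> USER airflow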
>
> 2) whenever we have a Python dependency change (new version, new library
> added, etc.) those deps could be pre-packaged in a binary .whl file
> (incrementally) and submitted together with the DAG (and stored in a
> shared cache). This could be driven by a new requirements.txt file and
> changes to it. We should be able to detect the changes, and the scheduler
> should be able to install the new requirements in a dynamically created
> virtualenv (and produce a binary wheel), parse the DAG in the context of
> that virtualenv, and submit the DAG together with the new wheel package -
> so that the wheel package can be picked up by the workers, which create a
> venv for it and run the tasks within that venv. The benefit of this is
> that you do not have to update the whole deployment - i.e. people in the
> same team, using the same Airflow deployment, can use different
> dependencies without impacting each other. The whl packages/venvs can be
> cached on the workers/scheduler and eventually, when you commit and
> release such dependency changes, they can be embedded in the shared
> Docker image from point 1). The whl file does not have to be stored in a
> database - just storing the "requirements.txt" per DagRun is enough; we
> can always rebuild the whl from it when needed if we have no cache.
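>
> Mechanically, that could be as simple as this (a sketch of what the
> scheduler/worker would automate - the shared cache layout is made up):
>
> import hashlib, pathlib, subprocess, venv
>
> req = pathlib.Path("requirements.txt")
> req_hash = hashlib.sha256(req.read_bytes()).hexdigest()[:12]
> cache = pathlib.Path("/shared/wheel-cache") / req_hash
>
> # build wheels once per requirements.txt content hash, then reuse them
> if not cache.exists():
>     subprocess.run(["pip", "wheel", "-r", str(req), "-w", str(cache)],
>                    check=True)
>
> # a worker builds a throwaway venv and installs offline from the cache
> env_dir = pathlib.Path(f"/tmp/venv-{req_hash}")
> venv.create(env_dir, with_pip=True)
> subprocess.run([str(env_dir / "bin" / "pip"), "install", "--no-index",
>                 "--find-links", str(cache), "-r", str(req)], check=True)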
>
> 3) whenever it's a change to just your own code (or the code of any
> imported files you use) - there is no need to build the whl packages. You
> can package your own code in a .zip (or another .whl) and submit it for
> execution (and store it per DagRun). This can also be automated by the
> scheduler - it can traverse the whole Python import structure while
> parsing the file and package all the dependent files. This can be done
> using Airflow's current .zip support plus new versioning. We would have
> to add a feature so that each such DAG gets a different version even if
> the same dag_id is used. We already plan this - as a feature for 2.0, to
> support different versions of each DAG (and we have a discussion about it
> tomorrow!). The benefit here is that even if two people are modifying the
> same DAG file, they can run it and iterate on it independently. It will
> be fast, and the packaged ".zip" file is the "track" of what exactly was
> tried. You often do not have to store it in git, since these will often
> be experiments.
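>
> The packaging step for this case is cheap, since Airflow already knows
> how to parse DAGs out of .zip archives - roughly (a sketch; the file
> names are made up):
>
> import zipfile
>
> # bundle the DAG file plus the local modules it imports; Airflow expects
> # the DAG file at the root of the archive
> with zipfile.ZipFile("my_dag-v42.zip", "w") as zf:
>     zf.write("my_dag.py")
>     zf.write("lib/transforms.py")  # found by traversing the imports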
>
> I think we cannot have "one solution" for all the use cases - but by
> treating those three cases differently, we can do very well for the ML
> case. And we could automate it all - we could detect what kind of change
> the user did locally and act appropriately.
>
> J.
>
>
>
> On Mon, Feb 24, 2020 at 4:54 PM Ash Berlin-Taylor <[email protected]> wrote:
>
> >
> > > DAG state (currently stored directly in the DB)
> > Can you expand on this point, James? What is the problem or limitation
> > here? And would those be solved by expanding the APIs to allow this to
> > be set by some external process?
> >
> > On Feb 24 2020, at 3:45 pm, James Meickle
> > <[email protected]> wrote:
> > > I really agree with most of what was posted above but particularly
> > > love what Evgeny wrote about having a DAG API. As an end user, I
> > > would love to be able to provide different implementations of core
> > > DAG functionality, similar to how the Executor can already be
> > > subclassed. Some key behavior points I have either personally had to
> > > override/work around, or would like to if it were easier:
> > >
> > > - DAG schedule (currently defined by the DAG after it has been
> > > parsed)
> > > - DAG discovery (currently always a scheduler subprocess)
> > > - DAG state (currently stored directly in the DB)
> > > - Loading content for discovered DAGs (currently always on-disk,
> > > causes versioning problems)
> > > - Parsing content from discovered DAGs (currently always a scheduler
> > > subprocess, causes performance problems)
> > > - Providing DAG result transfer/persistence (currently only XCom,
> > > causes many problems)
> > >
> > > Breaking DAG functionality into a set of related APIs would allow
> > > Airflow to still have a good "out of the box" experience, or a
> > > simple git-based deployment mode, while unlocking a lot of capability
> > > for users with more sophisticated needs.
> > >
> > > For example, we're using Argo Workflows nowadays, which is more
> > > Kubernetes native but also more limited. I could easily envision a
> > > DAG implementation where Airflow stores historical executions; stores
> > > git commits and retrieves DAG definitions as needed; launches Argo
> > > Workflows to perform both DAG parsing _and_ task execution; and
> > > stores results in Airflow. This would turn Airflow into the system of
> > > record (permanent history), and Argo into the ephemeral execution
> > > layer (delete after a few days). End users could submit from Airflow
> > > even without Kubernetes access, and wouldn't need Kubernetes access
> > > to view results or logs.
> > >
> > > Another example: we have notebook users who often need to schedule
> > > pipelines, which they do in ad hoc ways. Instead, they could import
> > > Airflow and define a DAG in Python, then call a command to remotely
> > > execute it. This would run on our "staging" Airflow cluster, with
> > > access to staging credentials, but as a DAG named something like
> > > "username.git_repo.notebook_name". Each run would freeze the current
> > > notebook (like to S3), execute it, and store results via papermill;
> > > this would let users launch multiple iterations of a notebook (like
> > > looping over a parameter list), run them for a while, and pick the
> > > one that does what they need.
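> > >
> > > The per-run execution step could look roughly like this (a sketch -
> > > the bucket names and parameters are made up):
> > >
> > > import papermill as pm
> > >
> > > # execute the frozen notebook once per parameter set; papermill
> > > # writes each executed copy, with its outputs, back to S3
> > > for run_id, lr in [("run1", 0.01), ("run2", 0.001)]:
> > >     pm.execute_notebook(
> > >         "s3://ml-experiments/frozen/train.ipynb",
> > >         f"s3://ml-experiments/results/train-{run_id}.ipynb",
> > >         parameters={"learning_rate": lr},
> > >     )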
> > >
> > > In general, there's been no end to the frustration of DAGs being
> > > tightly coupled to a specific on-disk layout, specific Python
> > > packaging, etc., and I'd love to be able to cleanly write alternative
> > > implementations.
> > >
> > > On Fri, Feb 21, 2020 at 4:44 PM Massy Bourennani
> > > <[email protected]> wrote:
> > >
> > > > Hi all,
> > > > We are using Airflow to put batch ML models in production (only the
> > > > prediction is done).
> > > > [image: image.png]
> > > >
> > > > Above is an example of a DAG we are using to execute an ML model
> > > > written in Python (scikit-learn):
> > > >
> > > > 1. At first, data is extracted from BigQuery. The SQL query is in a
> > > > DS-owned GitHub repository
> > > > 2. Then, the trained model, which is serialized with pickle, is
> > > > taken from GitHub
> > > > 3. After that, an Apache Beam job is started on Dataflow, which is
> > > > responsible for reading the input files, downloading the model from
> > > > GCS, deserializing it, and using it to predict the score of each
> > > > data point. Below you can see the expected interface of every model
> > > > 4. At the end, the results are saved in BigQuery/GCS
> > > >
> > > > class WhateverModel:
> > > >     def predict(self, batch: collections.abc.Iterable) -> collections.abc.Iterable:
> > > >         """
> > > >         :param batch: a collection of dicts
> > > >         :type batch: collection(dict)
> > > >         :return: a collection of dicts. there should be a score in
> > > >             one of the fields of every dict (every datapoint)
> > > >         """
> > > >         pass
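> > > >
> > > > As a purely hypothetical illustration of that contract, an
> > > > implementation wrapping an unpickled scikit-learn classifier might
> > > > look like:
> > > >
> > > > class ChurnModel:
> > > >     def __init__(self, estimator):
> > > >         self.estimator = estimator  # e.g. an unpickled sklearn model
> > > >
> > > >     def predict(self, batch):
> > > >         rows = list(batch)
> > > >         proba = self.estimator.predict_proba(
> > > >             [row["features"] for row in rows])
> > > >         for row, p in zip(rows, proba):
> > > >             row["score"] = float(p[1])  # score field on each dict
> > > >         return rows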
> > > >
> > > > key points:
> > > > * every input - the SQL query used to extract the dataset, the ML
> > > > model, and the custom packages used by the model (used to set up
> > > > the Dataflow workers) - is in GitHub. So we can go from one version
> > > > to another fairly easily; all we need are some qualifiers: GitHub
> > > > repo, path to file, and tag version.
> > > > * DS are free to use whatever Python library they want, thanks to
> > > > Apache Beam, which provides a way to initialize workers with custom
> > > > packages.
> > > >
> > > > weak points:
> > > > * whenever DS update one of the inputs (SQL query, Python packages,
> > > > or model), we, DE, need to update the DAG
> > > > * it's really specific to batch Python ML models
> > > >
> > > > Below is a code snippet of a DAG instantiation:
> > > >
> > > > with Py3BatchInferenceDAG(
> > > >         dag_id='mydags.execute_my_ml_model',
> > > >         sql_repo_name='app/data-learning',
> > > >         sql_file_path='data_learning/my_ml_model/sql/predict.sql',
> > > >         sql_tag_name='my_ml_model_0.0.12',
> > > >         model_repo_name='app/data-models',
> > > >         model_file_path='repository/my_ml_model/4/model.pkl',
> > > >         model_tag_name='my_ml_model_0.0.4',
> > > >         python_packages_repo_name='app/data-learning',
> > > >         python_packages_tag_name='my_ml_model_0.0.9',
> > > >         python_packages_paths=[
> > > >             'data_learning/my_ml_model/python_packages/package_one/'],
> > > >         params={  # setup of the Dataflow workers
> > > >             'custom_commands': [
> > > >                 'apt-get update',
> > > >                 'gsutil cp -r gs://bucket/package_one /tmp/',
> > > >                 'pip install /tmp/package_one/',
> > > >             ],
> > > >             'required_packages': [
> > > >                 'dill==0.2.9', 'numpy', 'pandas', 'scikit-learn',
> > > >                 'google-cloud-storage',
> > > >             ],
> > > >         },
> > > >         external_dag_requires=[
> > > >             'mydags.dependency_one$ds',
> > > >             'mydags.dependency_two$ds',
> > > >         ],
> > > >         destination_project_dataset_table='mydags.my_ml_model${{ ds_nodash }}',
> > > >         schema_path=Common.get_file_path(__file__, "schema.yaml"),
> > > >         start_date=datetime(2019, 12, 10),
> > > >         schedule_interval=CronPresets.daily()) as dag:
> > > >     pass
> > > >
> > > > I hope it helps,
> > > > Regards
> > > > Massy
> > > >
> > > > On Thu, Feb 20, 2020 at 8:36 AM Evgeny Shulman
> > > > <[email protected]> wrote:
> > > >
> > > > > Hey everybody,
> > > > > (Fully agreed with Dan's post. These are the main pain points we
> > > > > see and are trying to fix. Here is our reply on the thread topic.)
> > > > >
> > > > > We have numerous ML engineers who use our open source project
> > > > > (DBND) with Airflow for their everyday work. We help them create
> > > > > and monitor ML/data pipelines of different complexity levels and
> > > > > infra requirements. After 1.5 years of doing that as a company,
> > > > > and a few years doing it as part of a big enterprise organization
> > > > > before we started Databand, these are the main pain points we
> > > > > think about when it comes to Airflow:
> > > > >
> > > > > A. DAG versioning - ML teams change DAGs constantly. The first
> > > > > limitation they hit is not being able to review historical
> > > > > information from previous DAG runs based on the exact version of
> > > > > the DAG that executed. Our plugin 'dbnd-airflow-versioned-dag' is
> > > > > our approach to that: we save, and show in the Airflow UI, every
> > > > > specific version of the DAG. This is important in ML use cases
> > > > > because of the data science experimentation cycle and the need to
> > > > > trace exactly what code/data went into a model.
> > > > >
> > > > > B. A better version of the backfill command - we had to
> > > > > reimplement the BackfillJob class to be able to run specific DAG
> > > > > versions.
> > > > >
> > > > > C. Running the same DAG in different environments - people want
> > > > > to run the same DAG locally and on GCP/AWS without changing all
> > > > > the code. We have done that by abstracting Spark/Python/Docker
> > > > > code execution so we can easily switch from one infra to another,
> > > > > wrapping all the infra logic in generic gateway "operators" with
> > > > > extensive use of existing Airflow hooks and operators.
> > > > >
> > > > > D. Data passing & versioning - being able to pass data from
> > > > > operator to operator and to version that data, with easy
> > > > > authoring of DAGs & sub-DAGs. Pipelines grow in complexity very
> > > > > quickly, and it will be hard to agree on the "right" SDK to
> > > > > implement here. Airflow is very much "built by engineers for
> > > > > engineers"; DAGs are created to be executed as scheduled
> > > > > production jobs. It's going to be a long journey to reach a
> > > > > common conclusion on what needs to be done at a higher level
> > > > > around task/data management. Some people from the Airflow
> > > > > community went and started new orchestration companies after they
> > > > > didn't manage to effect a significant change in the data model of
> > > > > Airflow.
> > > > >
> > > > > Our biggest wish-list item in Airflow as advanced users: *a
> > > > > low-level API to generate and run DAGs*. So far there are
> > > > > numerous extensions, and all of them solve this by creating
> > > > > another dag.py file with the DAG generation, but neither the
> > > > > scheduler nor the UI can support that fully. The moment the
> > > > > scheduler and the UI are open to "versioned DAGs", a lot of nice
> > > > > DSLs and extensions will emerge from that: data analysts will get
> > > > > more GUI-driven tools to generate DAGs, ML engineers will be able
> > > > > to run and iterate on their algorithms, and data engineers will
> > > > > be able to implement their DAG DSL/SDK the way they see fit for
> > > > > their company.
> > > > >
> > > > > Most users of DBND author their ML pipelines without knowing
> > > > > that Airflow is orchestrating behind the scenes. They submit
> > > > > Python/Spark/notebooks without knowing that the DAG is going to
> > > > > be run through the Airflow subsystem. Only when they see the
> > > > > Airflow webserver do they start to discover that there is
> > > > > Airflow. And this is the way it should be. ML developers don't
> > > > > like new frameworks; they just like to see data flowing from task
> > > > > to task, and ways to push work to production with minimal
> > > > > "external" code involved.
> > > > >
> > > > > Evgeny.
> > > > > On 2020/02/19 16:46:44, Dan Davydov <[email protected]>
> > > > > wrote:
> > > > > > Twitter uses Airflow primarily for ML, to create automated
> > > > > > pipelines for retraining data, but also for more ad-hoc
> > > > > > training jobs.
> > > > > >
> > > > > > The biggest gaps are on the experimentation side. It takes too
> > > > > > long for a new user to set up and run a pipeline and then
> > > > > > iterate on it. This problem is a bit more unique to ML than
> > > > > > other domains because 1) training jobs can take a very long
> > > > > > time to run, and 2) users need to launch multiple experiments
> > > > > > in parallel for the same model pipeline.
> > > > > >
> > > > > > Biggest gaps:
> > > > > > - Too much boilerplate to write DAGs compared to Dagster etc.,
> > > > > > and difficulty in message passing (XCom). There was a proposal
> > > > > > recently to improve this in Airflow which should be entering
> > > > > > AIP soon.
> > > > > > - Lack of pipeline isolation, which hurts model
> > > > > > experimentation (being able to run a DAG, modify it, and run it
> > > > > > again without affecting the previous run); the lack of
> > > > > > isolation of DAGs from Airflow infrastructure (the inability to
> > > > > > redeploy Airflow infra without also redeploying DAGs) also
> > > > > > hurts.
> > > > > > - Lack of multi-tenancy; it's hard for customers to quickly
> > > > > > launch an ad-hoc pipeline, and the overhead of setting up a
> > > > > > cluster and all of its dependencies is quite high.
> > > > > > - Lack of integration with data visualization plugins (e.g.
> > > > > > plugins for rendering data related to a task when you click a
> > > > > > task instance in the UI).
> > > > > > - Lack of simpler abstractions for users with limited knowledge
> > > > > > of Airflow, or even Python, to build simple pipelines (not
> > > > > > really an Airflow problem, but rather the need for a good
> > > > > > abstraction that sits on top of Airflow, like a drag-and-drop
> > > > > > pipeline builder).
> > > > > >
> > > > > > FWIW, my personal feeling is that a fair number of companies
> > > > > > in the ML space are moving to alternate solutions like TFX
> > > > > > Pipelines due to the focus these platforms have on ML (ML
> > > > > > pipelines are first-class citizens) and the support from
> > > > > > Google. It would be great if we could change that. The ML
> > > > > > orchestration/tooling space is definitely evolving very
> > > > > > rapidly, and there are also new promising entrants.
> > > > > >
> > > > > > On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hello Daniel,
> > > > > > > In my company we use Airflow to update our ML models and to
> > > > > > > predict. As we use the KubernetesPodOperator to trigger jobs,
> > > > > > > each ML DAG is similar, and ML/data science engineers can
> > > > > > > reuse a template and choose which type of machine they need
> > > > > > > (high-CPU, high-mem, GPU or not, etc.)
> > > > > > >
> > > > > > > We have a process in place, described in the second part of
> > > > > > > this article (Industrializing machine learning pipeline):
> > > > > > > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> > > > > > >
> > > > > > > Hope this helps.
> > > > > > > Germain.
> > > > > > > On 19/02/2020 16:42, "Daniel Imberman"
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hello everyone!
> > > > > > > I’m working on a few proposals to make Apache Airflow more
> > > > > > > friendly for ML/data science use cases, and I wanted to reach
> > > > > > > out in hopes of hearing from people who are using, or wish to
> > > > > > > use, Airflow for ML. If you have any opinions on the subject,
> > > > > > > I’d love to hear what you’re all working on!
> > > > > > >
> > > > > > > Current questions I’m looking into:
> > > > > > > 1. How do you use Airflow for your ML? Has it worked out
> > > > > > > well for you?
> > > > > > > 2. Are there any features that would improve your experience
> > > > > > > of building models on Airflow?
> > > > > > > 3. Have you built anything on top of Airflow, or around
> > > > > > > Airflow, to aid you in this process?
> > > > > > >
> > > > > > > Thank you so much for your time!
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129
>
