I certainly will take a look. I figured that something like that might be available already :). It seems it might be very useful for our development environment indeed (especially since our Kubernetes tests are hard to run locally). It might also be interesting to see if this can work in the cases you described! I will definitely take a look!
J.

On Tue, Feb 25, 2020 at 4:36 PM James Meickle <jmeic...@quantopian.com.invalid> wrote:

Jarek, have you looked at garden.io? We've been experimenting with that for our local, Kubernetes-native development. It's hard to describe, but imagine: Docker Compose, but cross-git-repo, with builds occurring in Kubernetes and deployments managed by Helm charts.

With this workflow, rather than building locally and pushing, you rsync local changes (even uncommitted) into the cluster, build there, and push to an in-cluster repo there. Since it only sends the diff, and is running on arbitrary compute, it can potentially be much faster. It also has a hot reload feature for updating in-container code without any rebuild, in the case of "code-only" changes. Yet it's still using the same Dockerfile and Helm chart as a production deployment, just not going through a "control-heavy" CI tool (for example you can deploy without running test suites).

Something like that (or Tilt, another similar tool) could be a really nice workflow for iterating quickly on Airflow DAGs, since they allow multiple types of change (fast code-only change, slower Docker rebuild change, Airflow infra change via Helm chart) even for non-credentialed users.

On Mon, Feb 24, 2020 at 7:13 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

Agree, some kind of benchmarking is needed indeed. And I think this discussion is great BTW :)

It would be great if we could achieve the same consistent environment through the whole lifecycle of a DAG. But I am not sure it can be achieved - I got quite disappointed with the iteration speed you can achieve with the docker/image approach in general, and with its actual complexity - especially when you want to iterate independently on multiple variations of the same code. Surely it can be improved and optimised - but there are certain limitations and, I think, even different basic characteristics of the different lifecycle stages of a DAG. I believe those different "usage patterns" predetermine different solutions for different parts of a DAG lifecycle.

Again, starting backwards:

In a production environment, all you need is stability and repeatability. You want everything audited and recorded; it should change slooooowly and in a controllable way. You should always be able to roll back a change and possibly perform a staged rollout. You want to deploy DAGs deliberately and have people in operations control it. Everyone should use the same environment. I think this is mostly where James comes from. And here I totally agree we need a fixed, controllably released Docker image for everyone with pre-baked dependencies, and a stable set of DAGs that are deployed from Git or Nexus or something like that. High uptime needed. No doubts about this - I would be the last person to mess with that.

However, this "production" is actually only the last stage of the DAG lifecycle. When you iterate on your DAGs as part of your team, you want to iterate rather quickly and test different dependencies and common code that you reuse within the team.
You rarely change binary dependencies (but you do, and when you do, you do it for the whole team). You have a relatively stable set of dependencies that your team uses, but you update those deps occasionally - and then you want to deploy a new image for everyone with pre-baked dependencies so that everyone has a very similar environment to start with. Your environment is usually shared via Git so that everyone can update to the latest version. However, you do not need all the tools to be able to roll back and deploy in a controllable way, nor staging. You can afford occasional break-downs (move fast, break things) under the condition that you can fix things quickly for everyone when they fail. Occasionally your team members can add new dependencies and contribute to the environment. When ready, whatever you have as stable in this environment you want to be able to promote to the prod environment. You need to track execution of that code and match it with the results via some automated tools.

When you are iterating on individual models as a member of a team, as an ML researcher, you want to use a base similar to your team members' - but you want to experiment. A lot. Fast. Without being too limited. Add a new library or change its version in minutes rather than hours and see how it works. And you want to do it without impacting others in your team. You want to play and throw away the code you just tried. This happens a lot. You do not want to keep track of this code anywhere other than your local environment. Likely you have 10 versions of the same DAG copied to different files and you iterate on them separately (including dependencies and imported code), as you have to wait some hours for results and you do not want to stop and wait - you want to experiment with multiple such runs in parallel. And put them in the trash way more often than not. And you do not need a fully controllable, back-and-forth way of tracking your code changes such as git. However, if you find that your experiment succeeded, you want to be able to retrieve the very code that was used for it. But this is super rare, and you do not want to store it permanently after your experiments have completed - because it only adds noise. So you need some "temporary traceability".

My approach is only really valid for the last two cases, not at all for the first one. I think it would be great if Airflow supported those cases as well as it supports the first one. One might argue that the "Production" approach is the most important, but I would disagree with that. The more efficiently a team of ML researchers can churn out multiple models - fast and with low frustration - the better the "Production" models there will be to run on a daily basis.

J.

On Tue, Feb 25, 2020 at 12:29 AM Dan Davydov <ddavy...@twitter.com.invalid> wrote:

I see things the same way as James. That being said, I have not worked with Docker very much (maybe Daniel Imberman can comment?), so I could have some blindspots. I have heard latency concerns expressed by several people, for example (can't remember in which areas).

The main thing that draws me to the Docker solution is that wheels give only partial isolation, and Docker gives "full" isolation.
When isolation is partial, users lose trust in the reliability of the system; e.g. for some packages wheel-level isolation is enough, and for others it's not, and to me it doesn't feel reasonable to expect that the user understands and thinks through whether each change they are making requires one level of isolation or another (especially if the users are less technical). Even for the ad-hoc iteration use-case, it's quite important for things to work the way users expect, and as a user, if one time my binary dependencies don't get packaged correctly, I will lose trust in the workflow I was relying on to do the right thing for me. Docker also feels cleaner to me since it handles isolation completely, whereas a wheel-based solution still needs to handle binary dependencies using e.g. Docker.

If it does turn out that Docker is just too slow, or there are technical challenges that are too hard to solve (e.g. modifying one line of code in your DAG causes the whole/the majority of a Docker image to get rebuilt), then we probably will need to do something more like Jarek is talking about, but it definitely feels like a hack to me. I would love to see prototypes for each solution and some benchmarks, personally.

I think in terms of next steps after this discussion completes, a design doc/AIP evaluating the Docker vs non-Docker options probably makes sense.

On Mon, Feb 24, 2020 at 5:48 PM James Meickle <jmeic...@quantopian.com.invalid> wrote:

I appreciate where you're coming from on wanting to enhance productivity for different types of users, but as a cluster administrator, I _really_ don't want to be running software that's managing its own Docker builds, virtualenvs, zip uploads, etc.! It will almost certainly not do so in a way compliant or consistent with our policies. If anything goes wrong, disentangling what happened and how to fix it will be different from all of our other software builds, which already have defined processes in place. Also, "self-building" software that works this way often leaves ops teams like mine looking like the "bad guys" by solving the easy local development/single-instance/single-user case, and then failing to work "the same way" in a more restricted shared cluster.

I think this is exactly why a structured API is good, though. For example, requiring DAGs to implement a manifest/retrieval API, where a basic implementation is a hash plus a local filesystem path, a provided but optional implementation is a git commit and checkout creds, and a third-party module implements a multi-user Python notebook integration.

Basically, I am not opposed to "monitor code for a Dockerfile change, rebuild/push the Dockerfile, and trigger the DAG" being something you could build on top of Airflow. I think that even some basic commitment to making DAG functionality API-driven would enable a third party to do exactly that. I would not want to see that functionality baked into Airflow's core, though, because it spans so many problem domains that involve extra security and operational concerns.
As-is, we want to get more of those concerns under our control, rather than Airflow's. (e.g. we'd rather notify Airflow of DAG changes as part of a deployment pipeline, rather than having it constantly poll for updated definitions, which leads to resource utilization and inconsistency.)

On Mon, Feb 24, 2020 at 5:16 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

I think what would help a lot to solve the problems of env/deps/DAGs is the wheel packaging we started to talk about in another thread.

I personally think what Argo does is too much "cloud native" - you have to build the image, push it to a registry, get it pulled by the engine, execute, etc. I've been working with an ML tool we developed internally at NoMagic.ai which did something quite the opposite - it packaged all the Python code developed locally into a .zip file, stored it in GCS, and submitted it to be executed - it was unpacked and executed on a remote machine. We actually evaluated Airflow back then (and that's how I learned about the project) but it was too much overhead for our case.

The first approach - Argo - is generally slow and "complex" if you just change one line of code. The second approach - the default Airflow approach - is bad when you have new binary or Python dependencies. I think both solutions are valid, but for different cases of code deployment. And neither of them handles the in-between case where we have only new Python dependencies added/modified. I think with Airflow we can target a "combined" solution which will be able to handle all cases well.

I think the problem is pretty independent of whether we store the changes in a git repo or not. I can imagine that iteration is done based on just (shared) file system changes or through git - it is just a matter of whether you want to "freeze" the state in git or not. As a software engineer I always prefer git - but I perfectly understand that, as a data scientist, git commits and solving potential conflicts might be a burden, especially when you want to iterate quickly and experiment. But I understand that for this quick-iteration case everything should work quickly and transparently, without unnecessary overhead, when you *just* change a Python file or *just* bump a Python dependency version.

What could work is if we start treating the three types of changes differently and not try to handle them all in the same way. I think this can all be automated by Airflow's DAG submission mechanism.
I think it can address several things, at various levels - the needs of individuals, teams, and companies:

- super-fast speed of iteration by individual people (on a DAG-code level)
- ability to use different Python dependencies by different people in the same team using the same deployment (on a dependency level)
- ability to keep the environment evolving with new binary dependencies and eventually landing in the prod environment (on an environment/container level)

Working it out backwards, from the heaviest to the lightest changes:

1) Whenever we have a binary dependency change (.so, binary, new external dependency to be added) we should change the Dockerfile that installs it. We don't currently handle that case, but I can imagine that with prod image support, and being able to modify a thin layer of new binaries added on top of the base Airflow image, this should be easy to automate - when you iterate locally you should be able to build the image automatically, send a new version to the registry and restart the whole Airflow deployment to use this new binary. It should be a rare case, and it should impact the whole deployment (for example the whole dev or staging deployment) - i.e. everyone in the Airflow installation should get the new image as a base. The drawback here is that everyone who is using the same Airflow deployment is impacted.

2) Whenever we have a Python dependency change (new version, new library added etc.), those deps could be pre-packaged in a binary .whl file (incremental) and submitted together with the DAG (and stored in a shared cache). This could be done via a new requirements.txt file and changes to it. We should be able to detect the changes, and the scheduler should be able to install the new requirements in a dynamically created virtualenv (and produce a binary wheel), parse the DAG in the context of that virtualenv and submit the DAG together with the new wheel package - so that the wheel package can be picked up by the workers to execute the tasks, create a venv for that and run tasks with this venv. The benefit of this is that you do not have to update the whole deployment for that - i.e. people in the same team, using the same Airflow deployment, can use different dependencies without impacting each other. The whl packages/venvs can be cached on the workers/scheduler and eventually, when you commit and release such dependency changes, they can be embedded in the shared Docker image from point 1). This whl file does not have to be stored in a database - just storing the "requirements.txt" per DagRun is enough - we can always rebuild the whl from that when needed if we have no cache.

3) Whenever it's a change to just your own code (or code of any imported files you used) - there is no need to package and build the whl packages. You can package your own code in a .zip (or another .whl) and submit it for execution (and store it per DagRun). This can also be automated by the scheduler - it can traverse the whole Python structure while parsing the file and package all dependent files. This can be done using the current .zip support of Airflow + new versioning. We would have to add a feature so that each such DAG is a different version even if the same dag_id is used. We already plan it - as a feature for 2.0 to support different versions of each DAG (and we have a discussion about it tomorrow!). Here the benefit is that even if two people are modifying the same DAG file, they can run and iterate on it independently. It will be fast, and the packaged ".zip" file is the "track" of what was exactly tried. Often you do not have to store it in git, as those will often be experiments.

I think we cannot have "one solution" for all the use cases - but by treating those three cases differently, we can do very well for the ML case. And we could automate it all - we could detect what kind of change the user did locally and act appropriately.

J.
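A minimal sketch of what that change detection and dependency packaging could look like (all helper names here are hypothetical, not an existing Airflow API; this only illustrates routing each kind of change to a different packaging path):

import hashlib
import subprocess
import venv
from pathlib import Path

def classify_change(dag_dir: Path, previous_hashes: dict) -> str:
    """Guess which of the three change types happened since the last submission."""
    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else ""

    if digest(dag_dir / "Dockerfile") != previous_hashes.get("Dockerfile"):
        return "binary"       # case 1: rebuild the image, push it, roll the deployment
    if digest(dag_dir / "requirements.txt") != previous_hashes.get("requirements.txt"):
        return "python-deps"  # case 2: build wheels/venv and ship them with the DAG
    return "code-only"        # case 3: just zip the DAG code per DagRun

def build_parse_env(dag_dir: Path, env_dir: Path) -> Path:
    """Case 2: create a throwaway virtualenv from requirements.txt and pre-build wheels."""
    venv.create(env_dir, with_pip=True)
    pip = env_dir / "bin" / "pip"  # POSIX layout
    subprocess.check_call([str(pip), "install", "-r", str(dag_dir / "requirements.txt")])
    # Wheels could be cached per requirements.txt hash and attached to the DagRun.
    subprocess.check_call([str(pip), "wheel", "-r", str(dag_dir / "requirements.txt"),
                           "-w", str(env_dir / "wheels")])
    return env_dir / "bin" / "python"  # interpreter the scheduler could parse the DAG with

The point is only that a submission step could route each kind of change differently, reserving the heavy image rebuild for the rare binary changes.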
On Mon, Feb 24, 2020 at 4:54 PM Ash Berlin-Taylor <a...@apache.org> wrote:

> DAG state (currently stored directly in the DB)

Can you expand on this point, James? What is the problem or limitation here? And would those be solved by expanding on the APIs to allow this to be set by some external process?

On Feb 24 2020, at 3:45 pm, James Meickle <jmeic...@quantopian.com.INVALID> wrote:

I really agree with most of what was posted above, but particularly love what Evgeny wrote about having a DAG API. As an end user, I would love to be able to provide different implementations of core DAG functionality, similar to how the Executor can already be subclassed. Some key behavior points I either have personally had to override/work around, or would like to if it were easier:

- DAG schedule (currently defined by the DAG after it has been parsed)
- DAG discovery (currently always a scheduler subprocess)
- DAG state (currently stored directly in the DB)
- Loading content for discovered DAGs (currently always on-disk, causes versioning problems)
- Parsing content from discovered DAGs (currently always a scheduler subprocess, causes performance problems)
- Providing DAG result transfer/persistence (currently only XCom, causes many problems)

Breaking DAG functionality into a set of related APIs would allow Airflow to still have a good "out of the box" experience, or a simple git-based deployment mode, while unlocking a lot of capability for users with more sophisticated needs.
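A rough sketch of what one such seam - discovery plus retrieval - could look like (interfaces and names here are hypothetical, not an existing Airflow API):

import abc
import hashlib
from pathlib import Path
from typing import Iterable

class DagSource(abc.ABC):
    """Hypothetical pluggable API for DAG discovery and retrieval."""

    @abc.abstractmethod
    def list_dags(self) -> Iterable[str]:
        """Discovery: return manifest references for the DAG definitions that exist."""

    @abc.abstractmethod
    def fetch(self, ref: str) -> bytes:
        """Retrieval: return the exact DAG file contents for a manifest reference."""

class LocalFolderDagSource(DagSource):
    """Basic implementation: a content hash plus a local filesystem path."""

    def __init__(self, folder: str):
        self.folder = Path(folder)

    def list_dags(self) -> Iterable[str]:
        for path in self.folder.glob("*.py"):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()[:12]
            yield f"{digest}:{path.name}"

    def fetch(self, ref: str) -> bytes:
        _digest, name = ref.split(":", 1)
        return (self.folder / name).read_bytes()

# Other implementations could resolve a git commit with checkout credentials, or a
# multi-user notebook service, without touching the scheduler itself. Schedule, state
# and result persistence would be further interfaces in the same spirit.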
For example, we're using Argo Workflows nowadays, which is more Kubernetes-native but also more limited. I could easily envision a DAG implementation where Airflow stores historical executions; stores git commits and retrieves DAG definitions as needed; launches Argo Workflows to perform both DAG parsing _and_ task execution; and stores results in Airflow. This would turn Airflow into the system of record (permanent history), and Argo into the ephemeral execution layer (delete after a few days). End users could submit from Airflow even without Kubernetes access, and wouldn't need Kubernetes access to view results or logs.

Another example: we have notebook users who often need to schedule pipelines, which they do in ad hoc ways. Instead they could import Airflow and define a DAG in Python, then call a command to remotely execute it. This would run on our "staging" Airflow cluster, with access to staging credentials, but as a DAG named something like "username.git_repo.notebook_name". Each run would freeze the current notebook (like to S3), execute it, and store results via papermill; this would let users launch multiple iterations of a notebook (like looping over a parameter list), run them for a while, and pick the one that does what they need.

In general, there's been no end to the frustration of DAGs being tightly coupled to a specific on-disk layout, specific Python packaging, etc., and I'd love to be able to cleanly write alternative implementations.
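A very rough sketch of that notebook flow (DAG name, S3 paths and parameters are hypothetical; papermill's execute_notebook is real):

from datetime import datetime

import papermill as pm
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_frozen_notebook(**context):
    # Execute the frozen copy of the notebook; the executed output is the run artifact.
    pm.execute_notebook(
        input_path="s3://experiments/username/notebook_name/frozen.ipynb",
        output_path="s3://experiments/username/notebook_name/%s-out.ipynb" % context["ds"],
        parameters={"run_date": context["ds"], "learning_rate": 0.01},
    )

with DAG(
    dag_id="username.git_repo.notebook_name",
    start_date=datetime(2020, 2, 1),
    schedule_interval=None,  # triggered remotely, not on a cron schedule
) as dag:
    PythonOperator(
        task_id="execute_notebook",
        python_callable=run_frozen_notebook,
        provide_context=True,
    )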
there > > > > should > > > > > > be a > > > > > > > > score in one of the fields of every dict (every datapoint) """ > > pass > > > > > > > > > > > > > > > > key points: > > > > > > > > * every input: SQL query used to extract the dataset, ML model, > > > > > custom > > > > > > > > packages used by the model (used to setup Dataflow workers) is > > in > > > > > > Github. > > > > > > > > So we can go from one version to another and use it fairly > > easily, > > > > > all > > > > > > we > > > > > > > > need are some qualifiers: GitHub repo, path to file, and tag > > > > version. > > > > > > > > * DS are free to use whatever python library they want thanks > > to > > > > > Apache > > > > > > > > Beam that provides a way to initialize workers with custom > > > > packages. > > > > > > > > > > > > > > > > weak points: > > > > > > > > * Whenever DS update one of the input (SQL query, python > > packages, > > > > or > > > > > > > > model) we, DE, need to update the DAG > > > > > > > > * It's really specific to Batch Python ML models > > > > > > > > > > > > > > > > Below is a code snippet of a DAG instantiation > > > > > > > > with Py3BatchInferenceDAG(dag_id='mydags.execute_my_ml_model', > > > > > > > > sql_repo_name='app/data-learning', > > > > > > > > sql_file_path='data_learning/my_ml_model/sql/predict.sql', > > > > > > > > sql_tag_name='my_ml_model_0.0.12', > > > > > > > > model_repo_name='app/data-models', > > > > > > > > model_file_path='repository/my_ml_model/4/model.pkl', > > > > > > > > model_tag_name='my_ml_model_0.0.4', > > > > > > > > python_packages_repo_name='app/data-learning', > > > > > > > > python_packages_tag_name='my_ml_model_0.0.9', > > > > > > > > > > > > > > > > > > > > > > > > > python_packages_paths=['data_learning/my_ml_model/python_packages/package_one/'], > > > > > > > > params={ ########## setup of Dataflow workers > > > > > > > > 'custom_commands': ['apt-get update', > > > > > > > > 'gsutil cp -r gs://bucket/package_one /tmp/', > > > > > > > > 'pip install /tmp/package_one/' > > > > > > > > ], > > > > > > > > 'required_packages': ['dill==0.2.9', 'numpy', 'pandas', > > > > > 'scikit-learn' > > > > > > > > ,'google-cloud-storage'] > > > > > > > > }, > > > > > > > > external_dag_requires=[ > > > > > > > > 'mydags.dependency_one$ds', > > > > > > > > 'mydags.dependency_two$ds' > > > > > > > > ], > > > > > > > > destination_project_dataset_table='mydags.my_ml_model${{ > > ds_nodash > > > > > }}', > > > > > > > > schema_path=Common.get_file_path(__file__, "schema.yaml"), > > > > > > > > start_date=datetime(2019, 12, 10), > > > > > > > > schedule_interval=CronPresets.daily()) as dag: > > > > > > > > pass > > > > > > > > > > > > > > > > I hope it helps, > > > > > > > > Regards > > > > > > > > Massy > > > > > > > > > > > > > > > > On Thu, Feb 20, 2020 at 8:36 AM Evgeny Shulman < > > > > > > evgeny.shul...@databand.ai> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hey Everybody > > > > > > > > > (Fully agreed on Dan's post. These are the main pain points > > we > > > > > > see/trying > > > > > > > > > to fix. Here is our reply on the thread topic) > > > > > > > > > > > > > > > > > > We have numerous ML engineers that use our open source > > project > > > > > (DBND) > > > > > > > > > with Airflow for their everyday work. We help them create and > > > > > monitor > > > > > > > > > ML/DATA pipelines of different complexity levels and infra > > > > > > requirements. 
After 1.5 years doing that as a company now, and a few years doing it as part of a big enterprise organization before we started Databand, these are the main pain points we think about when it comes to Airflow:

A. DAG Versioning - ML teams change DAGs constantly. The first limitation they see is being able to review historical information of previous DAG runs based on the exact version of the DAG that executed. Our plugin 'dbnd-airflow-versioned-dag' is our approach to that. We save and show in the Airflow UI every specific version of the DAG. This is important in ML use cases because of the data science experimentation cycle and the need to trace exactly what code/data went into a model.

B. A better version of the backfill command - We had to reimplement the BackfillJob class to be able to run specific DAG versions.

C. Running the same DAG in different environments - People want to run the same DAG locally and on GCP/AWS without changing all the code. We have done that by abstracting Spark/Python/Docker code execution so we can easily switch from one infra to another. We did that by wrapping all infra logic in generic gateway "operators" with extensive use of existing Airflow hooks and operators.

D. Data passing & versioning - being able to pass data from Operator to Operator and version the data, and being able to do that with easy authoring of DAGs & sub-DAGs. Pipelines grow in complexity very quickly. It will be hard to agree on what the "right" SDK to implement here is. Airflow is very much "built by engineers for engineers"; DAGs are created to be executed as scheduled production jobs. It's going to be a long journey to get to a common conclusion on what needs to be done at a higher level around task/data management. Some people from the Airflow community went and started new orchestration companies after they didn't manage to make a significant change in the data model of Airflow.

Our biggest wish list item in Airflow as advanced users: *A low-level API to generate and run DAGs*. So far there are numerous extensions, and all of them solve this by creating another dag.py file with the DAG generation. But neither the Scheduler nor the UI can support that fully. The moment the scheduler and UI are open to "versioned DAGs", a lot of nice DSLs and extensions will emerge out of that. Data analysts will get more GUI-driven tools to generate DAGs, ML engineers will be able to run and iterate on their algorithms, and data engineers will be able to implement their DAG DSL/SDK the way they see fit for their company.

Most users of DBND author their ML pipelines without knowing that Airflow is orchestrating behind the scenes. They submit Python/Spark/Notebooks without knowing that the DAG is going to be run through the Airflow subsystem. Only when they see the Airflow webserver do they start to discover that there is Airflow. And this is the way it should be. ML developers don't like new frameworks; they just like to see data flowing from task to task, and ways to push work to production with minimal "external" code involved.

Evgeny.
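The "generic gateway operator" idea in point C above could look roughly like the following (a simplified sketch against Airflow 1.10-era imports; the target switch, image name and the assumption that each step is a module exposing main() are illustrative, not the DBND implementation):

import importlib

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.python_operator import PythonOperator

def gateway_task(task_id, module, dag, target="local", image="mycompany/ml-jobs:latest"):
    """Create the same logical step either as a local task or as a pod on the cluster."""
    if target == "local":
        # Local/dev: run the step in-process so iteration stays fast.
        return PythonOperator(
            task_id=task_id,
            python_callable=lambda: importlib.import_module(module).main(),
            dag=dag,
        )
    # Remote: run exactly the same module inside a container on Kubernetes.
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id.replace("_", "-"),
        namespace="ml-jobs",
        image=image,
        cmds=["python", "-m", module],
        get_logs=True,
        dag=dag,
    )

A DAG file could then call gateway_task(...) identically for local runs and for the GCP/AWS deployment, with only the target flipped by configuration.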
On 2020/02/19 16:46:44, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:

Twitter uses Airflow primarily for ML, to create automated pipelines for retraining data, but also for more ad-hoc training jobs.

The biggest gaps are on the experimentation side. It takes too long for a new user to set up and run a pipeline and then iterate on it. This problem is a bit more unique to ML than other domains because 1) training jobs can take a very long time to run, and 2) users have the need to launch multiple experiments in parallel for the same model pipeline.

Biggest Gaps:
- Too much boilerplate to write DAGs compared to Dagster/etc, and difficulty in message passing (XCom). There was a proposal recently to improve this in Airflow which should be entering AIP soon.
- Lack of pipeline isolation, which hurts model experimentation (being able to run a DAG, modify it, and run it again without affecting the previous run); lack of isolation of DAGs from Airflow infrastructure (inability to redeploy Airflow infra without also redeploying DAGs) also hurts.
- Lack of multi-tenancy; it's hard for customers to quickly launch an ad-hoc pipeline, and the overhead of setting up a cluster and all of its dependencies is quite high.
- Lack of integration with data visualization plugins (e.g. plugins for rendering data related to a task when you click a task instance in the UI).
- Lack of simpler abstractions for users with limited knowledge of Airflow, or even of Python, to build simple pipelines (not really an Airflow problem, but rather the need for a good abstraction that sits on top of Airflow, like a drag-and-drop pipeline builder).

FWIW my personal feeling is that a fair number of companies in the ML space are moving to alternate solutions like TFX Pipelines due to the focus these platforms have on ML (ML pipelines are first-class citizens), and support from Google. Would be great if we could change that. The ML orchestration/tooling space is definitely evolving very rapidly, and there are also new promising entrants as well.

On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy <germain.tan...@dailymotion.com.invalid> wrote:

Hello Daniel,
In my company we use Airflow to update our ML models and to predict. As we use the KubernetesPodOperator to trigger jobs, each ML DAG is similar, and ML/data science engineers can reuse a template and choose which type of machine they need (highcpu, highmem, GPU or not, etc.)

We have a process in place described in the second part of this article (Industrializing machine learning pipeline):
https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f

Hope this helps.
Germain.
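A reusable template along those lines might look roughly like this (machine profiles and resource values are illustrative, and the exact KubernetesPodOperator arguments - in particular the resources spec - differ between Airflow versions):

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical machine profiles a DS can pick from; values are illustrative only.
MACHINE_PROFILES = {
    "highcpu": {"request_cpu": "16", "request_memory": "16Gi"},
    "highmem": {"request_cpu": "4", "request_memory": "64Gi"},
    "gpu": {"request_cpu": "8", "request_memory": "32Gi", "limit_gpu": "1"},
}

def ml_job(task_id, image, machine, dag):
    """Reusable template: same operator for every model, only the resource profile changes."""
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id.replace("_", "-"),
        namespace="ml",
        image=image,
        resources=MACHINE_PROFILES[machine],  # resource spec keys depend on the Airflow version
        get_logs=True,
        dag=dag,
    )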
On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com> wrote:

Hello everyone!
I'm working on a few proposals to make Apache Airflow more friendly for ML/Data science use-cases, and I wanted to reach out in hopes of hearing from people that are using/wish to use Airflow for ML. If you have any opinions on the subject, I'd love to hear what you're all working on!

Current questions I'm looking into:
1. How do you use Airflow for your ML? Has it worked out well for you?
2. Are there any features that would improve your experience of building models on Airflow?
3. Have you built anything on top of Airflow/around Airflow to aid you in this process?

Thank you so much for your time!

-- 
Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129