Jarek, have you looked at garden.io? We've been experimenting with it for our local, Kubernetes-native development. It's hard to describe, but imagine Docker Compose, except cross-git-repo, with builds running in Kubernetes and deployments managed by Helm charts.
With this workflow, rather than building locally and pushing, you rsync local changes (even uncommitted ones) into the cluster, build there, and push to an in-cluster registry. Since it only sends the diff, and the build runs on arbitrary compute, it can potentially be much faster. It also has a hot-reload feature for updating in-container code without any rebuild, in the case of "code-only" changes. Yet it still uses the same Dockerfile and Helm chart as a production deployment, just without going through a "control-heavy" CI tool (for example, you can deploy without running test suites). Something like that (or Tilt, another similar tool) could be a really nice workflow for iterating quickly on Airflow DAGs, since these tools allow multiple types of change (fast code-only change, slower Docker rebuild, Airflow infra change via Helm chart) even for non-credentialed users.

On Mon, Feb 24, 2020 at 7:13 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Agree, some kind of benchmarking is needed indeed. And I think this discussion is great BTW :)
>
> It would be great if we could achieve the same consistent environment through the whole lifecycle of a DAG. But I am not sure it can be achieved - I got quite disappointed with the iteration speed you can achieve with the docker/image approach in general, and with the actual complexity of it - especially when you want to iterate independently on multiple variations of the same code. Surely it can be improved and optimised - but there are certain limitations and, I think, even fundamentally different characteristics of the different lifecycle stages of a DAG. I believe those different "usage patterns" predetermine different solutions for different parts of a DAG lifecycle.
>
> Again, starting backwards:
>
> In a production environment, all you need is stability and repeatability. You want everything audited and recorded; it should change slooooowly and in a controllable way. You should always be able to roll back a change and possibly perform a staging rollout. You want to deploy DAGs deliberately and have people in operations control it. Everyone should use the same environment. I think this is mostly where James comes from. And here I totally agree we need a fixed and controllably released Docker image for everyone with pre-baked dependencies, and a stable set of DAGs deployed from Git or Nexus or something like that. High uptime is needed. No doubts about this - I would be the last person to mess with that.
>
> However, this "production" is actually only the last stage of the DAG lifecycle. When you iterate on your DAGs as part of your team, you want to iterate rather quickly and test different dependencies and common code that you reuse across the team. You rarely change binary dependencies (but you do, and when you do - you do it for the whole team); you have a relatively stable set of dependencies that your team uses, but you update those deps occasionally - and then you want to deploy a new image for everyone with pre-baked dependencies so that everyone has a very similar environment to start with. Your environment is usually shared via Git so that everyone can update to the latest version. However, you do not need all the tools to roll back and deploy stuff in a controllable way, nor staging. You can afford occasional breakdowns (move fast, break things) on the condition that you can fix things quickly for everyone when they fail.
> Occasionally, your team members can add new dependencies and contribute to the environment. When ready, you want to be able to promote whatever is stable in this environment to the prod environment. You need to track execution of that code and match it with the results via some automated tools.
>
> When you are iterating on individual models as an ML researcher on a team - you want to use a base similar to your team members', but you want to experiment. A lot. Fast. Without being too limited. Add a new library or change its version in minutes rather than hours and see how it works. And you want to do it without impacting others in your team. You want to play and throw away the code you just tried. This happens a lot. You do not want to keep track of this code other than in your local environment. Likely you have 10 versions of the same DAG copied to different files and you iterate on them separately (including dependencies and imported code), as you have to wait some hours for results and you do not want to stop and wait for them - you want to experiment with multiple such runs in parallel. And put them in the trash way more often than not. And you do not need a fully controllable, easily back-and-forth way of tracking your code changes such as git. However, if you find that your experiment succeeded, you want to be able to retrieve the very code that was used for it. But this is super rare, and you do not want to store it permanently after your experiments complete - because it only adds noise. So you need some "temporary traceability".
>
> My approach is only really valid for the last two cases, not at all for the first one. I think it would be great if Airflow supported those cases as well as it supports the first one. One might argue that the "Production" approach is the most important, but I would disagree with that. The more efficiently a team of ML researchers can churn out multiple models - fast and with low frustration - the better the "Production" models there will be to run on a daily basis.
>
> J.
>
> On Tue, Feb 25, 2020 at 12:29 AM Dan Davydov <ddavy...@twitter.com.invalid> wrote:
> >
> > I see things the same way as James. That being said, I have not worked with Docker very much (maybe Daniel Imberman can comment?), so I could have some blindspots. I have heard latency concerns expressed by several people, for example (can't remember in which areas).
> >
> > The main thing that draws me to the Docker solution is that wheels give only partial isolation, and Docker gives "full" isolation. When isolation is partial, users lose trust in the reliability of the system, e.g. for some packages wheel-level isolation is enough, and for others it's not, and to me it doesn't feel reasonable to expect the user to understand and think through whether each change they are making requires one level of isolation or another (especially if the users are less technical). Even for the ad-hoc iteration use-case, it's quite important for things to work the way users expect, and as a user, if my binary dependencies don't get packaged correctly even once, I will lose trust in the workflow I was using to do the right thing for me. Docker also feels cleaner to me since it handles isolation completely, whereas a wheel-based solution still needs to handle binary dependencies using e.g. Docker.
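
For what it's worth, the level of isolation Dan calls "partial" is roughly what the stock PythonVirtualenvOperator already gives you: Python dependencies are pinned per task, but anything native still comes from whatever the worker image happens to provide. A minimal sketch, assuming a recent 1.10.x install (the DAG id, callable and version pin are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator


def score_batch():
    # Imports must live inside the callable: it runs in a freshly built virtualenv.
    import sklearn  # noqa: F401
    print("scoring with an isolated scikit-learn")


with DAG("isolation_demo", start_date=datetime(2020, 2, 1), schedule_interval=None) as dag:
    PythonVirtualenvOperator(
        task_id="score",
        python_callable=score_batch,
        requirements=["scikit-learn==0.22.1"],  # Python-level deps are isolated per task...
        system_site_packages=False,
        # ...but any native/.so dependency still comes from the worker's base image,
        # which is the "partial isolation" gap Dan is pointing at.
    )

That gap between "my pins are isolated" and "my .so files are whatever the image had" is exactly where the Docker-level approach earns its keep.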
> > If it does turn out that Docker is just too slow, or there are technical challenges that are too hard to solve (e.g. modifying one line of code in your DAG causes the whole/the majority of a Docker image to get rebuilt), then we probably will need to do something more like Jarek is talking about, but it definitely feels like a hack to me. Personally, I would love to see prototypes for each solution and some benchmarks.
> >
> > I think in terms of next steps after this discussion completes, probably a design doc/AIP evaluating the Docker vs non-Docker options makes sense.
> >
> > On Mon, Feb 24, 2020 at 5:48 PM James Meickle <jmeic...@quantopian.com.invalid> wrote:
> > >
> > > I appreciate where you're coming from on wanting to enhance productivity for different types of users, but as a cluster administrator, I _really_ don't want to be running software that's managing its own Docker builds, virtualenvs, zip uploads, etc.! It will almost certainly not do so in a way compliant or consistent with our policies. If anything goes wrong, disentangling what happened and how to fix it will be different from all of our other software builds, which already have defined processes in place. Also, "self-building" software that works this way often leaves ops teams like mine looking like the "bad guys" by solving the easy local development/single-instance/single-user case, and then failing to work "the same way" in a more restricted shared cluster.
> > >
> > > I think this is exactly why a structured API is good, though. For example, requiring DAGs to implement a manifest/retrieval API, where a basic implementation is a hash plus a local filesystem path, a provided but optional implementation is a git commit and checkout creds, and a third-party module implements a multi-user Python notebook integration.
> > >
> > > Basically, I am not opposed to "monitor code for a Dockerfile change, rebuild/push the image, and trigger the DAG" being something you could build on top of Airflow. I think that even some basic commitment to making DAG functionality API-driven would enable a third party to do exactly that. I would not want to see that functionality baked into Airflow's core, though, because it's spanning so many problem domains that involve extra security and operational concerns. As-is, we want to get more of those concerns under our control, rather than Airflow's. (E.g. we'd rather notify Airflow of DAG changes as part of a deployment pipeline, rather than having it constantly poll for updated definitions, which leads to resource utilization and inconsistency.)
> > >
> > > On Mon, Feb 24, 2020 at 5:16 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > >
> > > > I think what would help a lot to solve the problems of env/deps/DAGs is the wheel packaging we started to talk about in another thread.
> > > >
> > > > I personally think what Argo does is too much "cloud native" - you have to build the image, push it to a registry, get it pulled by the engine, execute, etc.
> > > > I've been working with an ML tool we developed internally for NoMagic.ai which did quite the opposite - it packaged all the python code developed locally into a .zip file, stored it in GCS, and submitted it to be executed - it was then unpacked and executed on a remote machine. We actually evaluated Airflow back then (and that's how I learned about the project) but it was too much overhead for our case.
> > > >
> > > > The first approach - Argo - is generally slow and "complex" if you just change one line of code. The second approach - the default Airflow approach - is bad when you have new binary or python dependencies. I think both solutions are valid, but for different cases of code deployment. And none of them handles the in-between case where we have only new python dependencies added/modified. I think with Airflow we can target a "combined" solution which will be able to handle all cases well.
> > > >
> > > > I think the problem is pretty independent of whether we store the changes in a git repo or not. I can imagine that iteration is done based on just (shared) file system changes or through git - it is just a matter of whether you want to "freeze" the state in git or not. As a software engineer I always prefer git - but I perfectly understand that, for a data scientist, git commits and solving potential conflicts might be a burden, especially when you want to iterate quickly and experiment. But I understand that for this quick-iteration case everything should work quickly and transparently, without unnecessary overhead, when you *just* change a python file or *just* bump a python dependency version.
> > > >
> > > > What could work is if we start treating the three types of changes differently and do not try to handle them all in the same way. I think this can all be automated by Airflow's DAG submission mechanism, and it can address several things - the needs of individuals, teams, and companies - at various levels:
> > > >
> > > > - super-fast speed of iteration by individual people (on the DAG-code level)
> > > > - ability to use different python dependencies by different people in the same team using the same deployment (on the dependency level)
> > > > - ability to keep the environment evolving with new binary dependencies and eventually landing in the prod environment (on the environment/container level)
> > > >
> > > > Working it out backwards from the heaviest to the lightest changes:
> > > >
> > > > 1) Whenever we have a binary dependency change (.so, binary, new external dependency to be added) we should change the Dockerfile that installs it. We don't currently handle that case, but I can imagine that with prod image support, and the ability to modify a thin layer of new binaries added on top of the base Airflow image, this should be easy to automate - when you iterate locally you should be able to build the image automatically, send a new version to the registry and restart the whole Airflow deployment to use this new binary. It should be a rare case, and it should impact the whole deployment (for example the whole dev or staging deployment) - i.e. everyone in the Airflow installation should get the new image as a base.
> > > > The drawback here is that everyone who is using the same Airflow deployment is impacted.
> > > >
> > > > 2) Whenever we have a Python dependency change (new version, new library added, etc.) those deps could be pre-packaged in a binary .whl file (incremental) and submitted together with the DAG (and stored in a shared cache). This could be done via a new requirements.txt file and changes to it. We should be able to detect the changes, and the scheduler should be able to install the new requirements in a dynamically created virtualenv (and produce a binary wheel), parse the DAG in the context of that virtualenv, and submit the DAG together with the new wheel package - so that the wheel package can be picked up by the workers to execute the tasks, create a venv for it and run the tasks with this venv. The benefit of this is that you do not have to update the whole deployment for that - i.e. people in the same team, using the same Airflow deployment, can use different dependencies without impacting each other. The whl packages/venvs can be cached on the workers/scheduler and eventually, when you commit and release such dependency changes, they can be embedded in the shared docker image from point 1). This whl file does not have to be stored in a database - just storing the "requirements.txt" per DagRun is enough - we can always rebuild the whl from that when it is needed and not in the cache.
> > > >
> > > > 3) Whenever it's a change to just your own code (or the code of any imported files you used) - there is no need to package and build the whl packages. You can package your own code in a .zip (or another .whl) and submit it for execution (and store it per DagRun). This can also be automated by the scheduler - it can traverse the whole python structure while parsing the file and package all dependent files. This can be done using Airflow's current .zip support + new versioning. We would have to add a feature so that each such DAG is a different version even if the same dag_id is used. We already plan it - as a feature for 2.0 to support different versions of each DAG (and we have a discussion about it tomorrow!). Here the benefit is that even if two people are modifying the same DAG file - they can run and iterate on it independently. It will be fast, and the packaged ".zip" file is the "track" of exactly what was tried. Often you do not have to store it in git, as those will usually be experiments.
> > > >
> > > > I think we cannot have a "one solution" for all the use cases - but by treating those three cases differently, we can do very well for the ML case. And we could automate it all - we could detect what kind of change the user did locally and act appropriately.
> > > >
> > > > J.
> > > >
> > > > On Mon, Feb 24, 2020 at 4:54 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > > > >
> > > > > > DAG state (currently stored directly in the DB)
> > > > >
> > > > > Can you expand on this point James? What is the problem or limitation here? And would those be solved by expanding on the APIs to allow this to be set by some external process?
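
(Jumping back to Jarek's three tiers for a second: the dispatch he describes could be as simple as comparing content hashes of the Dockerfile, requirements.txt and the DAG folder against what was last deployed, and escalating to the heaviest path that actually changed. Everything below is an illustrative sketch with made-up helper names, not an existing Airflow mechanism.)

import hashlib
import subprocess
from pathlib import Path


def _digest(*paths: Path) -> str:
    """Stable content hash over a set of files (sorted for determinism)."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.read_bytes())
    return h.hexdigest()


def classify_change(project: Path, last_seen: dict) -> str:
    """Map a local edit onto one of the three tiers described above."""
    if _digest(project / "Dockerfile") != last_seen.get("dockerfile"):
        return "image"   # tier 1: binary deps -> rebuild and push the shared base image
    if _digest(project / "requirements.txt") != last_seen.get("requirements"):
        return "wheel"   # tier 2: python deps -> pre-build wheels, venv per DagRun
    return "code"        # tier 3: code only -> zip the DAG folder and submit it


def package(project: Path, tier: str) -> None:
    (project / "dist").mkdir(exist_ok=True)
    if tier == "wheel":
        # Pre-build wheels for the changed requirements so workers only have to unpack them.
        subprocess.run(
            ["pip", "wheel", "-r", str(project / "requirements.txt"), "-w", str(project / "dist")],
            check=True,
        )
    elif tier == "code":
        # Ship just the DAG code, the same way Airflow's existing .zip DAG support works.
        subprocess.run(["zip", "-r", str(project / "dist" / "dags.zip"), "dags/"], cwd=project, check=True)
    # tier == "image" would hand off to the normal docker build/push pipeline instead.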
> > > > > On Feb 24 2020, at 3:45 pm, James Meickle <jmeic...@quantopian.com.INVALID> wrote:
> > > > > >
> > > > > > I really agree with most of what was posted above but particularly love what Evgeny wrote about having a DAG API. As an end user, I would love to be able to provide different implementations of core DAG functionality, similar to how the Executor can already be subclassed. Some key behavior points I either have personally had to override/work around, or would like to do if it were easier:
> > > > > >
> > > > > > DAG schedule (currently defined by the DAG after it has been parsed)
> > > > > > DAG discovery (currently always a scheduler subprocess)
> > > > > > DAG state (currently stored directly in the DB)
> > > > > > Loading content for discovered DAGs (currently always on-disk, causes versioning problems)
> > > > > > Parsing content from discovered DAGs (currently always a scheduler subprocess, causes performance problems)
> > > > > > Providing DAG result transfer/persistence (currently only XCom, causes many problems)
> > > > > >
> > > > > > Breaking DAG functionality into a set of related APIs would allow Airflow to still have a good "out of the box" experience, or a simple git-based deployment mode, while unlocking a lot of capability for users with more sophisticated needs.
> > > > > >
> > > > > > For example, we're using Argo Workflows nowadays, which is more Kubernetes-native but also more limited. I could easily envision a DAG implementation where Airflow stores historical executions; stores git commits and retrieves DAG definitions as needed; launches Argo Workflows to perform both DAG parsing _and_ task execution; and stores results in Airflow. This would turn Airflow into the system of record (permanent history), and Argo into the ephemeral execution layer (delete after a few days). End users could submit from Airflow even without Kubernetes access, and wouldn't need Kubernetes access to view results or logs.
> > > > > >
> > > > > > Another example: we have notebook users who often need to schedule pipelines, which they do in ad hoc ways. Instead they could import Airflow and define a DAG in Python, then call a command to remotely execute it. This would run on our "staging" Airflow cluster, with access to staging credentials, but as a DAG named something like "username.git_repo.notebook_name". Each run would freeze the current notebook (e.g. to S3), execute it, and store results via papermill; this would let users launch multiple iterations of a notebook (like looping over a parameter list), run them for a while, and pick the one that does what they need.
> > > > > >
> > > > > > In general, there's been no end to the frustration of DAGs being tightly coupled to a specific on-disk layout, specific Python packaging, etc., and I'd love to be able to cleanly write alternative implementations.
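
The manifest/retrieval API James describes could be as small as two methods. Here's a rough sketch of how I picture it - all names are hypothetical, nothing like this exists in Airflow today:

import abc
import hashlib
import subprocess
from pathlib import Path


class DagSource(abc.ABC):
    """Pluggable answer to: what version is this DAG, and how do I get its code?"""

    @abc.abstractmethod
    def manifest(self) -> str:
        """Opaque version identifier (content hash, git SHA, ...)."""

    @abc.abstractmethod
    def retrieve(self, target: Path) -> None:
        """Materialise exactly that version into target for parsing/execution."""


class LocalPathSource(DagSource):
    """The 'basic' implementation: a hash plus a local filesystem path."""

    def __init__(self, path: Path):
        self.path = path

    def manifest(self) -> str:
        return hashlib.sha256(self.path.read_bytes()).hexdigest()

    def retrieve(self, target: Path) -> None:
        target.write_bytes(self.path.read_bytes())


class GitCommitSource(DagSource):
    """The optional implementation: a repo plus a pinned commit."""

    def __init__(self, repo: str, commit: str):
        self.repo, self.commit = repo, commit

    def manifest(self) -> str:
        return self.commit

    def retrieve(self, target: Path) -> None:
        subprocess.run(["git", "clone", "--quiet", self.repo, str(target)], check=True)
        subprocess.run(["git", "checkout", "--quiet", self.commit], cwd=target, check=True)

A notebook integration would then just be a third implementation of the same interface, which is the appeal of keeping the surface that small.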
> > > > > > On Fri, Feb 21, 2020 at 4:44 PM Massy Bourennani <massy.bourenn...@happn.com> wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > > We are using Airflow to put batch ML models in production (only the prediction is done).
> > > > > > > [image: image.png]
> > > > > > >
> > > > > > > Above is an example of a DAG we are using to execute an ML model written in Python (scikit-learn):
> > > > > > >
> > > > > > > 1. At first, data is extracted from BigQuery. The SQL query is on a DS-owned GitHub repository.
> > > > > > > 2. Then, the trained model, which is serialized with pickle, is taken from GitHub.
> > > > > > > 3. After that, an Apache Beam job is started on Dataflow which is responsible for reading the input files, downloading the model from GCS, deserializing it, and using it to predict the score of each data point. Below you can see the expected interface of every model.
> > > > > > > 4. At the end, the results are saved in BigQuery/GCS.
> > > > > > >
> > > > > > > class WhateverModel:
> > > > > > >     def predict(self, batch: collections.abc.Iterable) -> collections.abc.Iterable:
> > > > > > >         """
> > > > > > >         :param batch: a collection of dicts
> > > > > > >         :type batch: collection(dict)
> > > > > > >         :return: a collection of dicts. there should be a score in one of the fields of every dict (every datapoint)
> > > > > > >         """
> > > > > > >         pass
> > > > > > >
> > > > > > > key points:
> > > > > > > * every input - the SQL query used to extract the dataset, the ML model, and the custom packages used by the model (used to set up Dataflow workers) - is in GitHub. So we can go from one version to another and use it fairly easily; all we need are some qualifiers: GitHub repo, path to file, and tag version.
> > > > > > > * DS are free to use whatever python library they want thanks to Apache Beam, which provides a way to initialize workers with custom packages.
> > > > > > > weak points:
> > > > > > > * Whenever DS update one of the inputs (SQL query, python packages, or model) we, DE, need to update the DAG.
> > > > > > > * It's really specific to batch Python ML models.
> > > > > > >
> > > > > > > Below is a code snippet of a DAG instantiation:
> > > > > > >
> > > > > > > with Py3BatchInferenceDAG(dag_id='mydags.execute_my_ml_model',
> > > > > > >         sql_repo_name='app/data-learning',
> > > > > > >         sql_file_path='data_learning/my_ml_model/sql/predict.sql',
> > > > > > >         sql_tag_name='my_ml_model_0.0.12',
> > > > > > >         model_repo_name='app/data-models',
> > > > > > >         model_file_path='repository/my_ml_model/4/model.pkl',
> > > > > > >         model_tag_name='my_ml_model_0.0.4',
> > > > > > >         python_packages_repo_name='app/data-learning',
> > > > > > >         python_packages_tag_name='my_ml_model_0.0.9',
> > > > > > >         python_packages_paths=['data_learning/my_ml_model/python_packages/package_one/'],
> > > > > > >         params={  ########## setup of Dataflow workers
> > > > > > >             'custom_commands': ['apt-get update',
> > > > > > >                                 'gsutil cp -r gs://bucket/package_one /tmp/',
> > > > > > >                                 'pip install /tmp/package_one/'],
> > > > > > >             'required_packages': ['dill==0.2.9', 'numpy', 'pandas', 'scikit-learn', 'google-cloud-storage']},
> > > > > > >         external_dag_requires=[
> > > > > > >             'mydags.dependency_one$ds',
> > > > > > >             'mydags.dependency_two$ds'],
> > > > > > >         destination_project_dataset_table='mydags.my_ml_model${{ ds_nodash }}',
> > > > > > >         schema_path=Common.get_file_path(__file__, "schema.yaml"),
> > > > > > >         start_date=datetime(2019, 12, 10),
> > > > > > >         schedule_interval=CronPresets.daily()) as dag:
> > > > > > >     pass
> > > > > > >
> > > > > > > I hope it helps,
> > > > > > > Regards
> > > > > > > Massy
> > > > > > >
> > > > > > > On Thu, Feb 20, 2020 at 8:36 AM Evgeny Shulman <evgeny.shul...@databand.ai> wrote:
> > > > > > > >
> > > > > > > > Hey Everybody
> > > > > > > > (Fully agreed on Dan's post. These are the main pain points we see/are trying to fix. Here is our reply on the thread topic.)
> > > > > > > >
> > > > > > > > We have numerous ML engineers that use our open source project (DBND) with Airflow for their everyday work. We help them create and monitor ML/data pipelines of different complexity levels and infra requirements. After 1.5 years doing that as a company now, and a few years doing it as part of a big enterprise organization before we started Databand, these are the main pain points we think about when it comes to Airflow:
> > > > > > > >
> > > > > > > > A. DAG Versioning - ML teams change DAGs constantly. The first limitation they hit is being able to review historical information about previous DAG runs based on the exact version of the DAG that executed. Our plugin 'dbnd-airflow-versioned-dag' is our approach to that: we save and show in the Airflow UI every specific version of the DAG. This is important in ML use cases because of the data science experimentation cycle and the need to trace exactly what code/data went into a model.
> > > > > > > >
> > > > > > > > B. A better version of the backfill command - We had to reimplement the BackfillJob class to be able to run specific DAG versions.
> > > > > > > >
> > > > > > > > C. Running the same DAG in different environments - People want to run the same DAG locally and on GCP/AWS without changing all the code. We have done that by abstracting Spark/Python/Docker code execution so we can easily switch from one infra to another. We did that by wrapping all infra logic in generic gateway "operators" with extensive use of existing Airflow hooks and operators.
> > > > > > > >
> > > > > > > > D. Data passing & versioning - being able to pass data from Operator to Operator and version the data, and being able to do that with easy authoring of DAGs & sub-DAGs - pipelines grow in complexity very quickly. It will be hard to agree on what the "right" SDK to implement here is. Airflow is very much "built by engineers for engineers"; DAGs are created to be executed as scheduled production jobs. It's going to be a long journey to get to a common conclusion on what needs to be done at a higher level around task/data management. Some people from the Airflow community went and started new orchestration companies after they didn't manage to drive a significant change in the data model of Airflow.
> > > > > > > >
> > > > > > > > Our biggest wish-list item in Airflow as advanced users: *a low-level API to generate and run DAGs*. So far there are numerous extensions, and all of them solve this by creating another dag.py file with the DAG generation. But neither the Scheduler nor the UI can support that fully. The moment the scheduler, together with the UI, is open to "versioned DAGs", a lot of nice DSLs and extensions will emerge out of that. Data analysts will get more GUI-driven tools to generate DAGs, ML engineers will be able to run and iterate on their algorithms, and data engineers will be able to implement their DAG DSL/SDK the way they think suits their company.
> > > > > > > >
> > > > > > > > Most users of DBND author their ML pipelines without knowing that Airflow is orchestrating behind the scenes. They submit Python/Spark/notebooks without knowing that the DAG is going to be run through the Airflow subsystem. Only when they see the Airflow webserver do they start to discover that there is Airflow. And this is the way it should be. ML developers don't like new frameworks; they just like to see data flowing from task to task, and ways to push work to production with minimal "external" code involved.
> > > > > > > >
> > > > > > > > Evgeny.
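
Evgeny's "gateway operator" pattern from point C is easy to picture as a small factory that returns a different concrete operator depending on the target environment. This is only an illustrative sketch of the idea, not DBND's actual code; the env var, namespace and entrypoint are made up:

import os

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.python_operator import PythonOperator


def run_step(task_id, python_callable, image, dag):
    """Hypothetical 'gateway': the pipeline author calls one function, the infra differs.

    Locally the callable runs in-process; on the shared cluster the same step
    becomes a pod running the team's image.
    """
    if os.environ.get("PIPELINE_TARGET", "local") == "local":
        return PythonOperator(task_id=task_id, python_callable=python_callable, dag=dag)
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id.replace("_", "-"),      # pod names must be DNS-safe
        namespace="ml-pipelines",            # illustrative namespace
        image=image,
        cmds=["python", "-m", "pipeline.steps", task_id],  # illustrative entrypoint
        dag=dag,
    )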
> > > > > > > > On 2020/02/19 16:46:44, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> > > > > > > > >
> > > > > > > > > Twitter uses Airflow primarily for ML, to create automated pipelines for retraining data, but also for more ad-hoc training jobs.
> > > > > > > > >
> > > > > > > > > The biggest gaps are on the experimentation side. It takes too long for a new user to set up and run a pipeline and then iterate on it. This problem is a bit more unique to ML than other domains because 1) training jobs can take a very long time to run, and 2) users have the need to launch multiple experiments in parallel for the same model pipeline.
> > > > > > > > >
> > > > > > > > > Biggest Gaps:
> > > > > > > > > - Too much boilerplate to write DAGs compared to Dagster/etc, and difficulty in message passing (XCom). There was a proposal recently to improve this in Airflow which should be entering AIP soon.
> > > > > > > > > - Lack of pipeline isolation, which hurts model experimentation (being able to run a DAG, modify it, and run it again without affecting the previous run); lack of isolation of DAGs from Airflow infrastructure (inability to redeploy Airflow infra without also redeploying DAGs) also hurts.
> > > > > > > > > - Lack of multi-tenancy; it's hard for customers to quickly launch an ad-hoc pipeline, and the overhead of setting up a cluster and all of its dependencies is quite high.
> > > > > > > > > - Lack of integration with data visualization plugins (e.g. plugins for rendering data related to a task when you click a task instance in the UI).
> > > > > > > > > - Lack of simpler abstractions for users with limited knowledge of Airflow or even python to build simple pipelines (not really an Airflow problem, but rather the need for a good abstraction that sits on top of Airflow, like a drag-and-drop pipeline builder).
> > > > > > > > >
> > > > > > > > > FWIW my personal feeling is that a fair number of companies in the ML space are moving to alternate solutions like TFX Pipelines due to the focus these platforms have on ML (ML pipelines are first-class citizens), and support from Google. Would be great if we could change that. The ML orchestration/tooling space is definitely evolving very rapidly and there are also promising new entrants.
> > > > > > > > >
> > > > > > > > > On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy <germain.tan...@dailymotion.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello Daniel,
> > > > > > > > > > In my company we use Airflow to update our ML models and to predict. As we use the KubernetesPodOperator to trigger jobs, each ML DAG is similar, and ML/data science engineers can reuse a template and choose which type of machine they need (highcpu, highmem, GPU or not, etc.).
> > > > > > > > > >
> > > > > > > > > > We have a process in place described in the second part of this article (Industrializing machine learning pipeline):
> > > > > > > > > > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> > > > > > > > > >
> > > > > > > > > > Hope this helps.
> > > > > > > > > > Germain.
> > > > > > > > > >
> > > > > > > > > > On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello everyone!
> > > > > > > > > > I'm working on a few proposals to make Apache Airflow more friendly for ML/data science use-cases, and I wanted to reach out in hopes of hearing from people that are using/wish to use Airflow for ML. If you have any opinions on the subject, I'd love to hear what you're all working on!
> > > > > > > > > >
> > > > > > > > > > Current questions I'm looking into:
> > > > > > > > > > 1. How do you use Airflow for your ML? Has it worked out well for you?
> > > > > > > > > > 2. Are there any features that would improve your experience of building models on Airflow?
> > > > > > > > > > 3. Have you built anything on top of/around Airflow to aid you in this process?
> > > > > > > > > >
> > > > > > > > > > Thank you so much for your time!
> > > > > > > > > > via Newton Mail
> > > >
> > > > --
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
>
> --
> Jarek Potiuk
> Polidea | Principal Software Engineer
>
> M: +48 660 796 129
>