Jarek, have you looked at garden.io? We've been experimenting with it for our local, Kubernetes-native development. It's hard to describe, but imagine Docker Compose, except cross-git-repo, with builds running in Kubernetes and deployments managed by Helm charts.
With this workflow, rather than building locally and pushing, you rsync local changes (even uncommitted ones) into the cluster, build there, and push to an in-cluster registry. Since it only sends the diff, and the build runs on arbitrary compute, it can potentially be much faster. It also has a hot-reload feature for updating in-container code without any rebuild, in the case of "code-only" changes. Yet it still uses the same Dockerfile and Helm chart as a production deployment, just without going through a "control-heavy" CI tool (for example, you can deploy without running test suites). Something like that (or Tilt, another similar tool) could be a really nice workflow for iterating quickly on Airflow DAGs, since these tools allow multiple types of change (fast code-only change, slower Docker rebuild, Airflow infra change via Helm chart) even for non-credentialed users.

On Mon, Feb 24, 2020 at 7:13 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:

> Agree, some kind of benchmarking is needed indeed. And I think this discussion is great BTW :)
>
> It would be great if we could achieve the same consistent environment through the whole lifecycle of a DAG. But I am not sure it can be achieved - I got quite disappointed with the iteration speed you can achieve with the docker/image approach in general, and with the actual complexity of it - especially when you want to iterate independently on multiple variations of the same code. Surely it can be improved and optimised - but there are certain limitations and, I think, even fundamentally different characteristics of the different lifecycle stages of a DAG. I believe those different "usage patterns" predetermine different solutions for different parts of a DAG lifecycle.
>
> Again, starting backwards:
>
> In a production environment, all you need is stability and repeatability. You want everything audited and recorded; it should change slooooowly and in a controllable way. You should always be able to roll back a change and possibly perform a staging rollout. You want to deploy DAGs deliberately and have people in operations control it. Everyone should use the same environment. I think this is mostly where James comes from. And here I totally agree we need a fixed and controllably released Docker image for everyone with pre-baked dependencies, and a stable set of DAGs deployed from Git or Nexus or something like that. High uptime is needed. No doubts about this - I would be the last person to mess with that.
>
> However, this "production" is actually only the last stage of the DAG lifecycle. When you iterate on your DAGs as part of your team, you want to iterate rather quickly and test different dependencies and common code that you reuse across the team. You rarely change binary dependencies (but you do, and when you do - you do it for the whole team); you have a relatively stable set of dependencies that your team uses, but you update those deps occasionally - and then you want to deploy a new image for everyone with pre-baked dependencies so that everyone has a very similar environment to start with. Your environment is usually shared via Git so that everyone can update to the latest version. However, you do not need all the tools to roll back and deploy stuff in a controllable way, nor staging. You can afford occasional breakdowns (move fast, break things) on the condition that you can fix things quickly for everyone when they fail.
> Occasionally, your team members can add new dependencies and contribute to the environment. When ready, you want to be able to promote whatever is stable in this environment to the prod environment. You need to track execution of that code and match it with the results via some automated tools.
>
> When you are iterating on individual models as an ML researcher on a team - you want to use a base similar to your team members', but you want to experiment. A lot. Fast. Without being too limited. Add a new library or change its version in minutes rather than hours and see how it works. And you want to do it without impacting others in your team. You want to play and throw away the code you just tried. This happens a lot. You do not want to keep track of this code other than in your local environment. Likely you have 10 versions of the same DAG copied to different files and you iterate on them separately (including dependencies and imported code), as you have to wait some hours for results and you do not want to stop and wait for them - you want to experiment with multiple such runs in parallel. And put them in the trash way more often than not. And you do not need a fully controllable, easily back-and-forth way of tracking your code changes such as git. However, if you find that your experiment succeeded, you want to be able to retrieve the very code that was used for it. But this is super rare, and you do not want to store it permanently after your experiments complete - because it only adds noise. So you need some "temporary traceability".
>
> My approach is only really valid for the last two cases, not at all for the first one. I think it would be great if Airflow supported those cases as well as it supports the first one. One might argue that the "Production" approach is the most important, but I would disagree with that. The more efficiently a team of ML researchers can churn out multiple models - fast and with low frustration - the better the "Production" models there will be to run on a daily basis.
>
> J.
>
> On Tue, Feb 25, 2020 at 12:29 AM Dan Davydov <ddavy...@twitter.com.invalid> wrote:
> >
> > I see things the same way as James. That being said, I have not worked with Docker very much (maybe Daniel Imberman can comment?), so I could have some blindspots. I have heard latency concerns expressed by several people, for example (can't remember in which areas).
> >
> > The main thing that draws me to the Docker solution is that wheels give only partial isolation, and Docker gives "full" isolation. When isolation is partial, users lose trust in the reliability of the system, e.g. for some packages wheel-level isolation is enough, and for others it's not, and to me it doesn't feel reasonable to expect the user to understand and think through whether each change they are making requires one level of isolation or another (especially if the users are less technical). Even for the ad-hoc iteration use-case, it's quite important for things to work the way users expect, and as a user, if my binary dependencies don't get packaged correctly even once, I will lose trust in the workflow I was using to do the right thing for me. Docker also feels cleaner to me since it handles isolation completely, whereas a wheel-based solution still needs to handle binary dependencies using e.g. Docker.
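
For what it's worth, the level of isolation Dan calls "partial" is roughly what the stock PythonVirtualenvOperator already gives you: Python dependencies are pinned per task, but anything native still comes from whatever the worker image happens to provide. A minimal sketch, assuming a recent 1.10.x install (the DAG id, callable and version pin are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator


def score_batch():
    # Imports must live inside the callable: it runs in a freshly built virtualenv.
    import sklearn  # noqa: F401
    print("scoring with an isolated scikit-learn")


with DAG("isolation_demo", start_date=datetime(2020, 2, 1), schedule_interval=None) as dag:
    PythonVirtualenvOperator(
        task_id="score",
        python_callable=score_batch,
        requirements=["scikit-learn==0.22.1"],  # Python-level deps are isolated per task...
        system_site_packages=False,
        # ...but any native/.so dependency still comes from the worker's base image,
        # which is the "partial isolation" gap Dan is pointing at.
    )

That gap between "my pins are isolated" and "my .so files are whatever the image had" is exactly where the Docker-level approach earns its keep.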
> > If it does turn out that Docker is just too slow, or there are technical challenges that are too hard to solve (e.g. modifying one line of code in your DAG causes the whole/the majority of a Docker image to get rebuilt), then we probably will need to do something more like Jarek is talking about, but it definitely feels like a hack to me. Personally, I would love to see prototypes for each solution and some benchmarks.
> >
> > I think in terms of next steps after this discussion completes, probably a design doc/AIP evaluating the Docker vs non-Docker options makes sense.
> >
> > On Mon, Feb 24, 2020 at 5:48 PM James Meickle <jmeic...@quantopian.com.invalid> wrote:
> > >
> > > I appreciate where you're coming from on wanting to enhance productivity for different types of users, but as a cluster administrator, I _really_ don't want to be running software that's managing its own Docker builds, virtualenvs, zip uploads, etc.! It will almost certainly not do so in a way compliant or consistent with our policies. If anything goes wrong, disentangling what happened and how to fix it will be different from all of our other software builds, which already have defined processes in place. Also, "self-building" software that works this way often leaves ops teams like mine looking like the "bad guys" by solving the easy local development/single-instance/single-user case, and then failing to work "the same way" in a more restricted shared cluster.
> > >
> > > I think this is exactly why a structured API is good, though. For example, requiring DAGs to implement a manifest/retrieval API, where a basic implementation is a hash plus a local filesystem path, a provided but optional implementation is a git commit and checkout creds, and a third-party module implements a multi-user Python notebook integration.
> > >
> > > Basically, I am not opposed to "monitor code for a Dockerfile change, rebuild/push the image, and trigger the DAG" being something you could build on top of Airflow. I think that even some basic commitment to making DAG functionality API-driven would enable a third party to do exactly that. I would not want to see that functionality baked into Airflow's core, though, because it's spanning so many problem domains that involve extra security and operational concerns. As-is, we want to get more of those concerns under our control, rather than Airflow's. (E.g. we'd rather notify Airflow of DAG changes as part of a deployment pipeline, rather than having it constantly poll for updated definitions, which leads to resource utilization and inconsistency.)
> > >
> > > On Mon, Feb 24, 2020 at 5:16 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > >
> > > > I think what would help a lot to solve the problems of env/deps/DAGs is the wheel packaging we started to talk about in another thread.
> > > >
> > > > I personally think what Argo does is too much "cloud native" - you have to build the image, push it to a registry, get it pulled by the engine, execute, etc.
> > > > I've been working with an ML tool we developed internally for NoMagic.ai which did quite the opposite - it packaged all the python code developed locally into a .zip file, stored it in GCS, and submitted it to be executed - it was then unpacked and executed on a remote machine. We actually evaluated Airflow back then (and that's how I learned about the project) but it was too much overhead for our case.
> > > >
> > > > The first approach - Argo - is generally slow and "complex" if you just change one line of code. The second approach - the default Airflow approach - is bad when you have new binary or python dependencies. I think both solutions are valid, but for different cases of code deployment. And none of them handles the in-between case where we have only new python dependencies added/modified. I think with Airflow we can target a "combined" solution which will be able to handle all cases well.
> > > >
> > > > I think the problem is pretty independent of whether we store the changes in a git repo or not. I can imagine that iteration is done based on just (shared) file system changes or through git - it is just a matter of whether you want to "freeze" the state in git or not. As a software engineer I always prefer git - but I perfectly understand that, for a data scientist, git commits and solving potential conflicts might be a burden, especially when you want to iterate quickly and experiment. But I understand that for this quick-iteration case everything should work quickly and transparently, without unnecessary overhead, when you *just* change a python file or *just* bump a python dependency version.
> > > >
> > > > What could work is if we start treating the three types of changes differently and do not try to handle them all in the same way. I think this can all be automated by Airflow's DAG submission mechanism, and it can address several things - the needs of individuals, teams, and companies - at various levels:
> > > >
> > > > - super-fast speed of iteration by individual people (on the DAG-code level)
> > > > - ability to use different python dependencies by different people in the same team using the same deployment (on the dependency level)
> > > > - ability to keep the environment evolving with new binary dependencies and eventually landing in the prod environment (on the environment/container level)
> > > >
> > > > Working it out backwards from the heaviest to the lightest changes:
> > > >
> > > > 1) Whenever we have a binary dependency change (.so, binary, new external dependency to be added) we should change the Dockerfile that installs it. We don't currently handle that case, but I can imagine that with prod image support, and the ability to modify a thin layer of new binaries added on top of the base Airflow image, this should be easy to automate - when you iterate locally you should be able to build the image automatically, send a new version to the registry and restart the whole Airflow deployment to use this new binary. It should be a rare case, and it should impact the whole deployment (for example the whole dev or staging deployment) - i.e. everyone in the Airflow installation should get the new image as a base.
> > > > The drawback here is that everyone who is using the same Airflow deployment is impacted.
> > > >
> > > > 2) Whenever we have a Python dependency change (new version, new library added, etc.) those deps could be pre-packaged in a binary .whl file (incremental) and submitted together with the DAG (and stored in a shared cache). This could be done via a new requirements.txt file and changes to it. We should be able to detect the changes, and the scheduler should be able to install the new requirements in a dynamically created virtualenv (and produce a binary wheel), parse the DAG in the context of that virtualenv, and submit the DAG together with the new wheel package - so that the wheel package can be picked up by the workers to execute the tasks, create a venv for it and run the tasks with this venv. The benefit of this is that you do not have to update the whole deployment for that - i.e. people in the same team, using the same Airflow deployment, can use different dependencies without impacting each other. The whl packages/venvs can be cached on the workers/scheduler and eventually, when you commit and release such dependency changes, they can be embedded in the shared docker image from point 1). This whl file does not have to be stored in a database - just storing the "requirements.txt" per DagRun is enough - we can always rebuild the whl from that when it is needed and not in the cache.
> > > >
> > > > 3) Whenever it's a change to just your own code (or the code of any imported files you used) - there is no need to package and build the whl packages. You can package your own code in a .zip (or another .whl) and submit it for execution (and store it per DagRun). This can also be automated by the scheduler - it can traverse the whole python structure while parsing the file and package all dependent files. This can be done using Airflow's current .zip support + new versioning. We would have to add a feature so that each such DAG is a different version even if the same dag_id is used. We already plan it - as a feature for 2.0 to support different versions of each DAG (and we have a discussion about it tomorrow!). Here the benefit is that even if two people are modifying the same DAG file - they can run and iterate on it independently. It will be fast, and the packaged ".zip" file is the "track" of exactly what was tried. Often you do not have to store it in git, as those will usually be experiments.
> > > >
> > > > I think we cannot have a "one solution" for all the use cases - but by treating those three cases differently, we can do very well for the ML case. And we could automate it all - we could detect what kind of change the user did locally and act appropriately.
> > > >
> > > > J.
> > > >
> > > > On Mon, Feb 24, 2020 at 4:54 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > > > >
> > > > > > DAG state (currently stored directly in the DB)
> > > > >
> > > > > Can you expand on this point James? What is the problem or limitation here? And would those be solved by expanding on the APIs to allow this to be set by some external process?
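
(Jumping back to Jarek's three tiers for a second: the dispatch he describes could be as simple as comparing content hashes of the Dockerfile, requirements.txt and the DAG folder against what was last deployed, and escalating to the heaviest path that actually changed. Everything below is an illustrative sketch with made-up helper names, not an existing Airflow mechanism.)

import hashlib
import subprocess
from pathlib import Path


def _digest(*paths: Path) -> str:
    """Stable content hash over a set of files (sorted for determinism)."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(p.read_bytes())
    return h.hexdigest()


def classify_change(project: Path, last_seen: dict) -> str:
    """Map a local edit onto one of the three tiers described above."""
    if _digest(project / "Dockerfile") != last_seen.get("dockerfile"):
        return "image"   # tier 1: binary deps -> rebuild and push the shared base image
    if _digest(project / "requirements.txt") != last_seen.get("requirements"):
        return "wheel"   # tier 2: python deps -> pre-build wheels, venv per DagRun
    return "code"        # tier 3: code only -> zip the DAG folder and submit it


def package(project: Path, tier: str) -> None:
    (project / "dist").mkdir(exist_ok=True)
    if tier == "wheel":
        # Pre-build wheels for the changed requirements so workers only have to unpack them.
        subprocess.run(
            ["pip", "wheel", "-r", str(project / "requirements.txt"), "-w", str(project / "dist")],
            check=True,
        )
    elif tier == "code":
        # Ship just the DAG code, the same way Airflow's existing .zip DAG support works.
        subprocess.run(["zip", "-r", str(project / "dist" / "dags.zip"), "dags/"], cwd=project, check=True)
    # tier == "image" would hand off to the normal docker build/push pipeline instead.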
> > > > > On Feb 24 2020, at 3:45 pm, James Meickle <jmeic...@quantopian.com.INVALID> wrote:
> > > > > >
> > > > > > I really agree with most of what was posted above but particularly love what Evgeny wrote about having a DAG API. As an end user, I would love to be able to provide different implementations of core DAG functionality, similar to how the Executor can already be subclassed. Some key behavior points I either have personally had to override/work around, or would like to do if it were easier:
> > > > > >
> > > > > > DAG schedule (currently defined by the DAG after it has been parsed)
> > > > > > DAG discovery (currently always a scheduler subprocess)
> > > > > > DAG state (currently stored directly in the DB)
> > > > > > Loading content for discovered DAGs (currently always on-disk, causes versioning problems)
> > > > > > Parsing content from discovered DAGs (currently always a scheduler subprocess, causes performance problems)
> > > > > > Providing DAG result transfer/persistence (currently only XCom, causes many problems)
> > > > > >
> > > > > > Breaking DAG functionality into a set of related APIs would allow Airflow to still have a good "out of the box" experience, or a simple git-based deployment mode, while unlocking a lot of capability for users with more sophisticated needs.
> > > > > >
> > > > > > For example, we're using Argo Workflows nowadays, which is more Kubernetes-native but also more limited. I could easily envision a DAG implementation where Airflow stores historical executions; stores git commits and retrieves DAG definitions as needed; launches Argo Workflows to perform both DAG parsing _and_ task execution; and stores results in Airflow. This would turn Airflow into the system of record (permanent history), and Argo into the ephemeral execution layer (delete after a few days). End users could submit from Airflow even without Kubernetes access, and wouldn't need Kubernetes access to view results or logs.
> > > > > >
> > > > > > Another example: we have notebook users who often need to schedule pipelines, which they do in ad hoc ways. Instead they could import Airflow and define a DAG in Python, then call a command to remotely execute it. This would run on our "staging" Airflow cluster, with access to staging credentials, but as a DAG named something like "username.git_repo.notebook_name". Each run would freeze the current notebook (e.g. to S3), execute it, and store results via papermill; this would let users launch multiple iterations of a notebook (like looping over a parameter list), run them for a while, and pick the one that does what they need.
> > > > > >
> > > > > > In general, there's been no end to the frustration of DAGs being tightly coupled to a specific on-disk layout, specific Python packaging, etc., and I'd love to be able to cleanly write alternative implementations.
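
The manifest/retrieval API James describes could be as small as two methods. Here's a rough sketch of how I picture it - all names are hypothetical, nothing like this exists in Airflow today:

import abc
import hashlib
import subprocess
from pathlib import Path


class DagSource(abc.ABC):
    """Pluggable answer to: what version is this DAG, and how do I get its code?"""

    @abc.abstractmethod
    def manifest(self) -> str:
        """Opaque version identifier (content hash, git SHA, ...)."""

    @abc.abstractmethod
    def retrieve(self, target: Path) -> None:
        """Materialise exactly that version into target for parsing/execution."""


class LocalPathSource(DagSource):
    """The 'basic' implementation: a hash plus a local filesystem path."""

    def __init__(self, path: Path):
        self.path = path

    def manifest(self) -> str:
        return hashlib.sha256(self.path.read_bytes()).hexdigest()

    def retrieve(self, target: Path) -> None:
        target.write_bytes(self.path.read_bytes())


class GitCommitSource(DagSource):
    """The optional implementation: a repo plus a pinned commit."""

    def __init__(self, repo: str, commit: str):
        self.repo, self.commit = repo, commit

    def manifest(self) -> str:
        return self.commit

    def retrieve(self, target: Path) -> None:
        subprocess.run(["git", "clone", "--quiet", self.repo, str(target)], check=True)
        subprocess.run(["git", "checkout", "--quiet", self.commit], cwd=target, check=True)

A notebook integration would then just be a third implementation of the same interface, which is the appeal of keeping the surface that small.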
> > > > > > On Fri, Feb 21, 2020 at 4:44 PM Massy Bourennani <massy.bourenn...@happn.com> wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > > We are using Airflow to put batch ML models in production (only the prediction is done).
> > > > > > > [image: image.png]
> > > > > > >
> > > > > > > Above is an example of a DAG we are using to execute an ML model written in Python (scikit-learn):
> > > > > > >
> > > > > > > 1. At first, data is extracted from BigQuery. The SQL query is on a DS-owned GitHub repository.
> > > > > > > 2. Then, the trained model, which is serialized with pickle, is taken from GitHub.
> > > > > > > 3. After that, an Apache Beam job is started on Dataflow which is responsible for reading the input files, downloading the model from GCS, deserializing it, and using it to predict the score of each data point. Below you can see the expected interface of every model.
> > > > > > > 4. At the end, the results are saved in BigQuery/GCS.
> > > > > > >
> > > > > > > class WhateverModel:
> > > > > > >     def predict(self, batch: collections.abc.Iterable) -> collections.abc.Iterable:
> > > > > > >         """
> > > > > > >         :param batch: a collection of dicts
> > > > > > >         :type batch: collection(dict)
> > > > > > >         :return: a collection of dicts. there should be a score in one of the fields of every dict (every datapoint)
> > > > > > >         """
> > > > > > >         pass
> > > > > > >
> > > > > > > key points:
> > > > > > > * every input - the SQL query used to extract the dataset, the ML model, and the custom packages used by the model (used to set up Dataflow workers) - is in GitHub. So we can go from one version to another and use it fairly easily; all we need are some qualifiers: GitHub repo, path to file, and tag version.
> > > > > > > * DS are free to use whatever python library they want thanks to Apache Beam, which provides a way to initialize workers with custom packages.
> > > > > > > weak points:
> > > > > > > * Whenever DS update one of the inputs (SQL query, python packages, or model) we, DE, need to update the DAG.
> > > > > > > * It's really specific to batch Python ML models.
> > > > > > >
> > > > > > > Below is a code snippet of a DAG instantiation:
> > > > > > >
> > > > > > > with Py3BatchInferenceDAG(dag_id='mydags.execute_my_ml_model',
> > > > > > >         sql_repo_name='app/data-learning',
> > > > > > >         sql_file_path='data_learning/my_ml_model/sql/predict.sql',
> > > > > > >         sql_tag_name='my_ml_model_0.0.12',
> > > > > > >         model_repo_name='app/data-models',
> > > > > > >         model_file_path='repository/my_ml_model/4/model.pkl',
> > > > > > >         model_tag_name='my_ml_model_0.0.4',
> > > > > > >         python_packages_repo_name='app/data-learning',
> > > > > > >         python_packages_tag_name='my_ml_model_0.0.9',
> > > > > > >         python_packages_paths=['data_learning/my_ml_model/python_packages/package_one/'],
> > > > > > >         params={  ########## setup of Dataflow workers
> > > > > > >             'custom_commands': ['apt-get update',
> > > > > > >                                 'gsutil cp -r gs://bucket/package_one /tmp/',
> > > > > > >                                 'pip install /tmp/package_one/'],
> > > > > > >             'required_packages': ['dill==0.2.9', 'numpy', 'pandas', 'scikit-learn', 'google-cloud-storage']},
> > > > > > >         external_dag_requires=[
> > > > > > >             'mydags.dependency_one$ds',
> > > > > > >             'mydags.dependency_two$ds'],
> > > > > > >         destination_project_dataset_table='mydags.my_ml_model${{ ds_nodash }}',
> > > > > > >         schema_path=Common.get_file_path(__file__, "schema.yaml"),
> > > > > > >         start_date=datetime(2019, 12, 10),
> > > > > > >         schedule_interval=CronPresets.daily()) as dag:
> > > > > > >     pass
> > > > > > >
> > > > > > > I hope it helps,
> > > > > > > Regards
> > > > > > > Massy
> > > > > > >
> > > > > > > On Thu, Feb 20, 2020 at 8:36 AM Evgeny Shulman <evgeny.shul...@databand.ai> wrote:
> > > > > > > >
> > > > > > > > Hey Everybody
> > > > > > > > (Fully agreed on Dan's post. These are the main pain points we see/are trying to fix. Here is our reply on the thread topic.)
> > > > > > > >
> > > > > > > > We have numerous ML engineers that use our open source project (DBND) with Airflow for their everyday work. We help them create and monitor ML/data pipelines of different complexity levels and infra requirements. After 1.5 years doing that as a company now, and a few years doing it as part of a big enterprise organization before we started Databand, these are the main pain points we think about when it comes to Airflow:
> > > > > > > >
> > > > > > > > A. DAG Versioning - ML teams change DAGs constantly. The first limitation they hit is being able to review historical information about previous DAG runs based on the exact version of the DAG that executed. Our plugin 'dbnd-airflow-versioned-dag' is our approach to that: we save and show in the Airflow UI every specific version of the DAG. This is important in ML use cases because of the data science experimentation cycle and the need to trace exactly what code/data went into a model.
> > > > > > > >
> > > > > > > > B. A better version of the backfill command - We had to reimplement the BackfillJob class to be able to run specific DAG versions.
> > > > > > > >
> > > > > > > > C. Running the same DAG in different environments - People want to run the same DAG locally and on GCP/AWS without changing all the code. We have done that by abstracting Spark/Python/Docker code execution so we can easily switch from one infra to another. We did that by wrapping all infra logic in generic gateway "operators" with extensive use of existing Airflow hooks and operators.
> > > > > > > >
> > > > > > > > D. Data passing & versioning - being able to pass data from Operator to Operator and version the data, and being able to do that with easy authoring of DAGs & sub-DAGs - pipelines grow in complexity very quickly. It will be hard to agree on what the "right" SDK to implement here is. Airflow is very much "built by engineers for engineers"; DAGs are created to be executed as scheduled production jobs. It's going to be a long journey to get to a common conclusion on what needs to be done at a higher level around task/data management. Some people from the Airflow community went and started new orchestration companies after they didn't manage to drive a significant change in the data model of Airflow.
> > > > > > > >
> > > > > > > > Our biggest wish-list item in Airflow as advanced users: *a low-level API to generate and run DAGs*. So far there are numerous extensions, and all of them solve this by creating another dag.py file with the DAG generation. But neither the Scheduler nor the UI can support that fully. The moment the scheduler, together with the UI, is open to "versioned DAGs", a lot of nice DSLs and extensions will emerge out of that. Data analysts will get more GUI-driven tools to generate DAGs, ML engineers will be able to run and iterate on their algorithms, and data engineers will be able to implement their DAG DSL/SDK the way they think suits their company.
> > > > > > > >
> > > > > > > > Most users of DBND author their ML pipelines without knowing that Airflow is orchestrating behind the scenes. They submit Python/Spark/notebooks without knowing that the DAG is going to be run through the Airflow subsystem. Only when they see the Airflow webserver do they start to discover that there is Airflow. And this is the way it should be. ML developers don't like new frameworks; they just like to see data flowing from task to task, and ways to push work to production with minimal "external" code involved.
> > > > > > > >
> > > > > > > > Evgeny.
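
Evgeny's "gateway operator" pattern from point C is easy to picture as a small factory that returns a different concrete operator depending on the target environment. This is only an illustrative sketch of the idea, not DBND's actual code; the env var, namespace and entrypoint are made up:

import os

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.python_operator import PythonOperator


def run_step(task_id, python_callable, image, dag):
    """Hypothetical 'gateway': the pipeline author calls one function, the infra differs.

    Locally the callable runs in-process; on the shared cluster the same step
    becomes a pod running the team's image.
    """
    if os.environ.get("PIPELINE_TARGET", "local") == "local":
        return PythonOperator(task_id=task_id, python_callable=python_callable, dag=dag)
    return KubernetesPodOperator(
        task_id=task_id,
        name=task_id.replace("_", "-"),      # pod names must be DNS-safe
        namespace="ml-pipelines",            # illustrative namespace
        image=image,
        cmds=["python", "-m", "pipeline.steps", task_id],  # illustrative entrypoint
        dag=dag,
    )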
> > > > > > > > On 2020/02/19 16:46:44, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> > > > > > > > >
> > > > > > > > > Twitter uses Airflow primarily for ML, to create automated pipelines for retraining data, but also for more ad-hoc training jobs.
> > > > > > > > >
> > > > > > > > > The biggest gaps are on the experimentation side. It takes too long for a new user to set up and run a pipeline and then iterate on it. This problem is a bit more unique to ML than other domains because 1) training jobs can take a very long time to run, and 2) users have the need to launch multiple experiments in parallel for the same model pipeline.
> > > > > > > > >
> > > > > > > > > Biggest Gaps:
> > > > > > > > > - Too much boilerplate to write DAGs compared to Dagster/etc, and difficulty in message passing (XCom). There was a proposal recently to improve this in Airflow which should be entering AIP soon.
> > > > > > > > > - Lack of pipeline isolation, which hurts model experimentation (being able to run a DAG, modify it, and run it again without affecting the previous run); lack of isolation of DAGs from Airflow infrastructure (inability to redeploy Airflow infra without also redeploying DAGs) also hurts.
> > > > > > > > > - Lack of multi-tenancy; it's hard for customers to quickly launch an ad-hoc pipeline, and the overhead of setting up a cluster and all of its dependencies is quite high.
> > > > > > > > > - Lack of integration with data visualization plugins (e.g. plugins for rendering data related to a task when you click a task instance in the UI).
> > > > > > > > > - Lack of simpler abstractions for users with limited knowledge of Airflow or even python to build simple pipelines (not really an Airflow problem, but rather the need for a good abstraction that sits on top of Airflow, like a drag-and-drop pipeline builder).
> > > > > > > > >
> > > > > > > > > FWIW my personal feeling is that a fair number of companies in the ML space are moving to alternate solutions like TFX Pipelines due to the focus these platforms have on ML (ML pipelines are first-class citizens), and support from Google. Would be great if we could change that. The ML orchestration/tooling space is definitely evolving very rapidly and there are also promising new entrants.
> > > > > > > > >
> > > > > > > > > On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy <germain.tan...@dailymotion.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello Daniel,
> > > > > > > > > > In my company we use Airflow to update our ML models and to predict. As we use the KubernetesPodOperator to trigger jobs, each ML DAG is similar, and ML/data science engineers can reuse a template and choose which type of machine they need (highcpu, highmem, GPU or not, etc.).
> > > > > > > > > >
> > > > > > > > > > We have a process in place described in the second part of this article (Industrializing machine learning pipeline):
> > > > > > > > > > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> > > > > > > > > >
> > > > > > > > > > Hope this helps.
> > > > > > > > > > Germain.
> > > > > > > > > >
> > > > > > > > > > On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello everyone!
> > > > > > > > > > I'm working on a few proposals to make Apache Airflow more friendly for ML/data science use-cases, and I wanted to reach out in hopes of hearing from people that are using/wish to use Airflow for ML. If you have any opinions on the subject, I'd love to hear what you're all working on!
> > > > > > > > > >
> > > > > > > > > > Current questions I'm looking into:
> > > > > > > > > > 1. How do you use Airflow for your ML? Has it worked out well for you?
> > > > > > > > > > 2. Are there any features that would improve your experience of building models on Airflow?
> > > > > > > > > > 3. Have you built anything on top of/around Airflow to aid you in this process?
> > > > > > > > > >
> > > > > > > > > > Thank you so much for your time!
> > > > > > > > > > via Newton Mail
> > > >
> > > > --
> > > > Jarek Potiuk
> > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > >
> > > > M: +48 660 796 129 <+48660796129>
> > > > [image: Polidea] <https://www.polidea.com/>
>
> --
> Jarek Potiuk
> Polidea | Principal Software Engineer
>
> M: +48 660 796 129
>