I see things the same way as James. That being said, I have not worked with Docker very much (maybe Daniel Imberman can comment?), so I could have some blind spots. For example, I have heard latency concerns expressed by several people (though I can't remember in which areas).
The main thing that draws me to the Docker solution is that wheels give only partial isolation, while Docker gives "full" isolation. When isolation is partial, users lose trust in the reliability of the system: for some packages wheel-level isolation is enough, and for others it's not, and it doesn't feel reasonable to expect users to understand and think through whether each change they make requires one level of isolation or the other (especially if the users are less technical). Even for the ad-hoc iteration use case, it's quite important for things to work the way users expect; as a user, if my binary dependencies fail to get packaged correctly even once, I will lose trust in the workflow I was relying on to do the right thing for me. Docker also feels cleaner to me since it handles isolation completely, whereas a wheel-based solution still needs to handle binary dependencies using something like Docker anyway. If it turns out that Docker is just too slow, or that there are technical challenges that are too hard to solve (e.g. modifying one line of code in your DAG causes the whole Docker image, or most of it, to be rebuilt), then we will probably need to do something more like what Jarek is talking about, but that definitely feels like a hack to me. Personally, I would love to see prototypes of each solution along with some benchmarks. In terms of next steps after this discussion completes, a design doc/AIP evaluating the Docker vs. non-Docker options probably makes sense.

On Mon, Feb 24, 2020 at 5:48 PM James Meickle <jmeic...@quantopian.com.invalid> wrote:

> I appreciate where you're coming from on wanting to enhance productivity for different types of users, but as a cluster administrator, I _really_ don't want to be running software that manages its own Docker builds, virtualenvs, zip uploads, etc.! It will almost certainly not do so in a way that is compliant or consistent with our policies. If anything goes wrong, disentangling what happened and how to fix it will be different from all of our other software builds, which already have defined processes in place. Also, "self-building" software that works this way often leaves ops teams like mine looking like the "bad guys": it solves the easy local development/single-instance/single-user case, and then fails to work "the same way" in a more restricted shared cluster.
>
> I think this is exactly why a structured API is good, though. For example, requiring DAGs to implement a manifest/retrieval API, where a basic implementation is a hash plus a local filesystem path, a provided but optional implementation is a git commit and checkout creds, and a third-party module implements a multi-user Python notebook integration (a sketch follows below).
>
> Basically, I am not opposed to "monitor code for a Dockerfile change, rebuild/push the Dockerfile, and trigger the DAG" being something you could build on top of Airflow. I think that even a basic commitment to making DAG functionality API-driven would enable a third party to do exactly that. I would not want to see that functionality baked into Airflow's core, though, because it spans so many problem domains that involve extra security and operational concerns. As-is, we want to get more of those concerns under our control rather than Airflow's (e.g. we'd rather notify Airflow of DAG changes as part of a deployment pipeline than have it constantly poll for updated definitions, which leads to resource utilization and inconsistency problems).
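(A rough sketch of what James's manifest/retrieval API could look like - the interface and all names here are hypothetical, not an existing Airflow API:)

    import hashlib
    import os
    import shutil
    from abc import ABC, abstractmethod

    class DagSource(ABC):
        """Hypothetical contract: identify a DAG version and fetch its code."""

        @abstractmethod
        def manifest(self) -> str:
            """Return a version identifier for the DAG code (e.g. a content hash)."""

        @abstractmethod
        def retrieve(self, target_dir: str) -> None:
            """Materialize the DAG code into target_dir for parsing/execution."""

    class LocalFileSource(DagSource):
        """The basic implementation James describes: a hash plus a local filesystem path."""

        def __init__(self, path: str):
            self.path = path

        def manifest(self) -> str:
            with open(self.path, "rb") as f:
                return hashlib.sha256(f.read()).hexdigest()

        def retrieve(self, target_dir: str) -> None:
            shutil.copy(self.path, os.path.join(target_dir, os.path.basename(self.path)))

A git-backed implementation would return a commit SHA from manifest() and perform a checkout in retrieve(); the multi-user notebook integration James mentions would be a third-party subclass of the same contract.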
> On Mon, Feb 24, 2020 at 5:16 PM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
>
> > I think what would help a lot in solving the env/deps/DAGs problems is the wheel packaging we started to talk about in another thread.
> >
> > I personally think what Argo does is too "cloud native" - you have to build the image, push it to a registry, get it pulled by the engine, execute it, and so on. I've been working with an ML tool we developed internally at NoMagic.ai which did quite the opposite - it packaged all the Python code developed locally in a .zip file, stored it in GCS, and submitted it for execution - it was unpacked and run on a remote machine. We actually evaluated Airflow back then (that's how I learned about the project), but it was too much overhead for our case.
> >
> > The first approach - Argo's - is generally slow and "complex" if you just change one line of code. The second approach - the default Airflow approach - is bad when you have new binary or Python dependencies. I think both solutions are valid, but for different cases of code deployment. And neither of them handles the in-between case where only new Python dependencies are added or modified. I think with Airflow we can target a "combined" solution that handles all of these cases well.
> >
> > I think the problem is pretty independent of whether the changes are stored in a git repo or not. I can imagine iteration driven by just (shared) file system changes or through git - it is just a matter of whether you want to "freeze" the state in git or not. As a software engineer I always prefer git - but I perfectly understand that for a data scientist, git commits and resolving potential conflicts might be a burden, especially when you want to iterate quickly and experiment. For this quick-iteration case, everything should work quickly and transparently, without unnecessary overhead, when you *just* change a Python file or *just* bump a Python dependency version.
> >
> > What could work is to treat the three types of changes differently rather than trying to handle them all the same way. I think this can all be automated by Airflow's DAG submission mechanism, addressing needs at three levels - individuals, teams, and companies:
> >
> > - super-fast iteration by individual people (at the DAG-code level)
> > - the ability for different people on the same team, using the same deployment, to use different Python dependencies (at the dependency level)
> > - the ability to keep the environment evolving with new binary dependencies that eventually land in the prod environment (at the environment/container level)
> >
> > Working it out backwards from the heaviest to the lightest changes:
> >
> > 1) Whenever we have a binary dependency change (a .so, a binary, a new external dependency to be added), we should change the Dockerfile that installs it. We don't currently handle that case, but I can imagine that with prod image support, and the ability to modify a thin layer of new binaries added on top of the base Airflow image, this should be easy to automate - when you iterate locally, you should be able to build the image automatically, push the new version to the registry, and restart the whole Airflow deployment to use the new binary.
> > It should be a rare case, and it impacts the whole deployment (for example a whole dev or staging deployment) - i.e. everyone in the Airflow installation gets the new image as a base. The drawback is that everyone using the same Airflow deployment is affected.
> >
> > 2) Whenever we have a Python dependency change (a new version, a new library added, etc.), those deps could be pre-packaged in a binary .whl file (incrementally) and submitted together with the DAG (and stored in a shared cache). This could be driven by a new requirements.txt file and changes to it. We should be able to detect the changes; the scheduler should install the new requirements in a dynamically created virtualenv (producing a binary wheel), parse the DAG in the context of that virtualenv, and submit the DAG together with the new wheel package - so the wheel can be picked up by the workers, which create a venv for it and run the tasks within that venv. The benefit is that you do not have to update the whole deployment - i.e. people on the same team, using the same Airflow deployment, can use different dependencies without impacting each other. The whl packages/venvs can be cached on the workers/scheduler, and eventually, when you commit and release such dependency changes, they can be embedded in the shared Docker image from point 1). The whl file does not even have to be stored in a database - storing the requirements.txt per DagRun is enough; we can always rebuild the whl from it when the cache is empty. (A sketch of this wheel flow follows below.)
> >
> > 3) Whenever it's a change to just your own code (or the code of any imported files you use), there is no need to build whl packages at all. You can package your own code in a .zip (or another .whl) and submit it for execution (stored per DagRun). This can also be automated by the scheduler - it can traverse the whole Python structure while parsing the file and package all dependent files. This can be done using Airflow's current .zip support plus new versioning. We would have to add a feature so that each such DAG is a different version even if the same dag_id is used. We already plan this - as a 2.0 feature to support different versions of each DAG (and we have a discussion about it tomorrow!). The benefit is that even if two people are modifying the same DAG file, they can run it and iterate on it independently. It will be fast, and the packaged .zip file is the "track" of what exactly was tried. Often you do not have to store it in git, as these will usually be experiments.
> >
> > I think we cannot have one solution for all the use cases - but by treating these three cases differently, we can serve the ML case very well. And we could automate it all - we could detect what kind of change the user made locally and act appropriately.
> >
> > J.
> >
> > On Mon, Feb 24, 2020 at 4:54 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > > > DAG state (currently stored directly in the DB)
> > >
> > > Can you expand on this point, James? What is the problem or limitation here? And would those be solved by expanding the APIs to allow this to be set by some external process?
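(To make Jarek's point 2 above concrete, here is a minimal sketch - a hypothetical helper, not existing Airflow code - of how a scheduler or worker could turn a requirements.txt change into a cached set of wheels and a per-version virtualenv:)

    import hashlib
    import subprocess
    import sys
    from pathlib import Path

    def prepare_venv(requirements: Path, cache_root: Path) -> Path:
        """Build (or reuse) a virtualenv keyed by the hash of requirements.txt."""
        key = hashlib.sha256(requirements.read_bytes()).hexdigest()[:16]
        venv_dir = cache_root / key
        wheel_dir = cache_root / (key + "-wheels")
        if not venv_dir.exists():
            # Pre-package all dependencies as binary wheels (the shareable artifact).
            subprocess.run([sys.executable, "-m", "pip", "wheel",
                            "-r", str(requirements), "-w", str(wheel_dir)], check=True)
            # Create the venv and install strictly from the local wheel cache.
            subprocess.run([sys.executable, "-m", "venv", str(venv_dir)], check=True)
            venv_python = venv_dir / "bin" / "python"  # "Scripts/python.exe" on Windows
            subprocess.run([str(venv_python), "-m", "pip", "install",
                            "--no-index", "--find-links", str(wheel_dir),
                            "-r", str(requirements)], check=True)
        return venv_dir

A worker receiving a DagRun with a given requirements.txt could call prepare_venv() and run the task with the returned environment's interpreter; the wheel directory is the artifact that point 2 proposes caching and shipping alongside the DAG.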
> > > On Feb 24 2020, at 3:45 pm, James Meickle <jmeic...@quantopian.com.INVALID> wrote:
> > >
> > > > I really agree with most of what was posted above, but I particularly love what Evgeny wrote about having a DAG API. As an end user, I would love to be able to provide different implementations of core DAG functionality, similar to how the Executor can already be subclassed. Some key behavior points I have either personally had to override or work around, or would like to if it were easier:
> > > >
> > > > - DAG schedule (currently defined by the DAG after it has been parsed)
> > > > - DAG discovery (currently always a scheduler subprocess)
> > > > - DAG state (currently stored directly in the DB)
> > > > - Loading content for discovered DAGs (currently always on-disk, which causes versioning problems)
> > > > - Parsing content from discovered DAGs (currently always a scheduler subprocess, which causes performance problems)
> > > > - Providing DAG result transfer/persistence (currently only XCom, which causes many problems)
> > > >
> > > > Breaking DAG functionality into a set of related APIs would allow Airflow to keep a good "out of the box" experience, or a simple git-based deployment mode, while unlocking a lot of capability for users with more sophisticated needs.
> > > >
> > > > For example, we're using Argo Workflows nowadays, which is more Kubernetes native but also more limited. I could easily envision a DAG implementation where Airflow stores historical executions; stores git commits and retrieves DAG definitions as needed; launches Argo Workflows to perform both DAG parsing _and_ task execution; and stores results in Airflow. This would turn Airflow into the system of record (permanent history) and Argo into the ephemeral execution layer (deleted after a few days). End users could submit from Airflow even without Kubernetes access, and wouldn't need Kubernetes access to view results or logs.
> > > >
> > > > Another example: we have notebook users who often need to schedule pipelines, which they do in ad hoc ways. Instead they could import Airflow and define a DAG in Python, then call a command to remotely execute it. This would run on our "staging" Airflow cluster, with access to staging credentials, but as a DAG named something like "username.git_repo.notebook_name". Each run would freeze the current notebook (e.g. to S3), execute it, and store results via papermill; this would let users launch multiple iterations of a notebook (e.g. looping over a parameter list), run them for a while, and pick the one that does what they need.
> > > >
> > > > In general, there's been no end to the frustration of DAGs being tightly coupled to a specific on-disk layout, specific Python packaging, etc., and I'd love to be able to cleanly write alternative implementations.
> > > >
> > > > On Fri, Feb 21, 2020 at 4:44 PM Massy Bourennani <massy.bourenn...@happn.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are using Airflow to put batch ML models in production (only the prediction is done).
> > > > > [image: image.png]
> > > > >
> > > > > Above is an example of a DAG we are using to execute an ML model written in Python (scikit-learn):
> > > > >
> > > > > 1. First, data is extracted from BigQuery. The SQL query lives in a DS-owned GitHub repository.
> > > > > 2. Then the trained model, serialized with pickle, is taken from GitHub.
> > > > > 3. After that, an Apache Beam job is started on Dataflow; it is responsible for reading the input files, downloading the model from GCS, deserializing it, and using it to predict the score of each data point. Below you can see the expected interface of every model.
> > > > > 4. At the end, the results are saved to BigQuery/GCS.
> > > > >
> > > > >     class WhateverModel:
> > > > >         def predict(self, batch: collections.abc.Iterable) -> collections.abc.Iterable:
> > > > >             """
> > > > >             :param batch: a collection of dicts
> > > > >             :type batch: collection(dict)
> > > > >             :return: a collection of dicts; there should be a score in
> > > > >                 one of the fields of every dict (every data point)
> > > > >             """
> > > > >             pass
> > > > >
> > > > > Key points:
> > > > > * Every input - the SQL query used to extract the dataset, the ML model, and the custom packages used by the model (used to set up the Dataflow workers) - is in GitHub, so we can go from one version to another fairly easily; all we need are some qualifiers: GitHub repo, path to file, and tag version.
> > > > > * DS are free to use whatever Python library they want, thanks to Apache Beam providing a way to initialize workers with custom packages.
> > > > >
> > > > > Weak points:
> > > > > * Whenever DS update one of the inputs (SQL query, Python packages, or model), we (DE) need to update the DAG.
> > > > > * It's really specific to batch Python ML models.
> > > > >
> > > > > Below is a code snippet of a DAG instantiation:
> > > > >
> > > > >     with Py3BatchInferenceDAG(
> > > > >             dag_id='mydags.execute_my_ml_model',
> > > > >             sql_repo_name='app/data-learning',
> > > > >             sql_file_path='data_learning/my_ml_model/sql/predict.sql',
> > > > >             sql_tag_name='my_ml_model_0.0.12',
> > > > >             model_repo_name='app/data-models',
> > > > >             model_file_path='repository/my_ml_model/4/model.pkl',
> > > > >             model_tag_name='my_ml_model_0.0.4',
> > > > >             python_packages_repo_name='app/data-learning',
> > > > >             python_packages_tag_name='my_ml_model_0.0.9',
> > > > >             python_packages_paths=['data_learning/my_ml_model/python_packages/package_one/'],
> > > > >             params={  # setup of Dataflow workers
> > > > >                 'custom_commands': ['apt-get update',
> > > > >                                     'gsutil cp -r gs://bucket/package_one /tmp/',
> > > > >                                     'pip install /tmp/package_one/'],
> > > > >                 'required_packages': ['dill==0.2.9', 'numpy', 'pandas',
> > > > >                                       'scikit-learn', 'google-cloud-storage']
> > > > >             },
> > > > >             external_dag_requires=[
> > > > >                 'mydags.dependency_one$ds',
> > > > >                 'mydags.dependency_two$ds'
> > > > >             ],
> > > > >             destination_project_dataset_table='mydags.my_ml_model${{ ds_nodash }}',
> > > > >             schema_path=Common.get_file_path(__file__, "schema.yaml"),
> > > > >             start_date=datetime(2019, 12, 10),
> > > > >             schedule_interval=CronPresets.daily()) as dag:
> > > > >         pass
> > > > >
> > > > > I hope it helps.
> > > > > Regards,
> > > > > Massy
> > > > >
> > > > > On Thu, Feb 20, 2020 at 8:36 AM Evgeny Shulman <evgeny.shul...@databand.ai> wrote:
> > > > >
> > > > > > Hey Everybody,
> > > > > > (Fully agreed with Dan's post - these are the main pain points we see and are trying to fix. Here is our reply on the thread topic.)
> > > > > >
> > > > > > We have numerous ML engineers who use our open source project (DBND) with Airflow for their everyday work. We help them create and monitor ML/data pipelines of different complexity levels and infra requirements. After 1.5 years doing that as a company, and a few years doing it as part of a big enterprise organization before we started Databand, these are the main pain points we think about when it comes to Airflow:
> > > > > >
> > > > > > A. DAG versioning - ML teams change DAGs constantly. The first limitation they hit is being unable to review historical information about previous DAG runs based on the exact version of the DAG that executed. Our plugin 'dbnd-airflow-versioned-dag' is our approach to that: we save, and show in the Airflow UI, every specific version of the DAG. This is important in ML use cases because of the data science experimentation cycle and the need to trace exactly what code/data went into a model.
> > > > > >
> > > > > > B. A better version of the backfill command - we had to reimplement the BackfillJob class to be able to run specific DAG versions.
> > > > > >
> > > > > > C. Running the same DAG in different environments - people want to run the same DAG locally and on GCP/AWS without changing all the code. We have done that by abstracting Spark/Python/Docker code execution so we can easily switch from one infra to another, wrapping all infra logic in generic gateway "operators" that make extensive use of existing Airflow hooks and operators.
> > > > > >
> > > > > > D. Data passing & versioning - being able to pass data from operator to operator and to version the data, with easy authoring of DAGs & sub-DAGs. Pipelines grow in complexity very quickly, and it will be hard to agree on the "right" SDK here. Airflow is very much "built by engineers for engineers"; DAGs are created to be executed as scheduled production jobs. It's going to be a long journey to reach a common conclusion on what needs to be done at the higher level of task/data management. Some people from the Airflow community went off and started new orchestration companies after they didn't manage to drive a significant change in Airflow's data model.
> > > > > >
> > > > > > Our biggest wish-list item in Airflow as advanced users: *a low-level API to generate and run DAGs*. So far there are numerous extensions, and all of them solve this by creating another dag.py file with the DAG generation code, but neither the scheduler nor the UI can support that fully. The moment the scheduler and the UI are open to "versioned DAGs", a lot of nice DSLs and extensions will emerge out of that.
> > > > > > Data analysts will get more GUI-driven tools to generate DAGs, ML engineers will be able to run and iterate on their algorithms, and data engineers will be able to implement their DAG DSL/SDK the way that suits their company.
> > > > > >
> > > > > > Most users of DBND author their ML pipelines without knowing that Airflow is orchestrating behind the scenes. They submit Python/Spark/notebooks without knowing that the DAG is going to run through the Airflow subsystem. Only when they see the Airflow webserver do they start to discover that Airflow is there. And this is the way it should be: ML developers don't like new frameworks; they just like to see data flowing from task to task, with ways to push work to production with minimal "external" code involved.
> > > > > >
> > > > > > Evgeny.
> > > > > >
> > > > > > On 2020/02/19 16:46:44, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> > > > > >
> > > > > > > Twitter uses Airflow primarily for ML, to create automated retraining pipelines, but also for more ad-hoc training jobs.
> > > > > > >
> > > > > > > The biggest gaps are on the experimentation side. It takes too long for a new user to set up and run a pipeline and then iterate on it. This problem is a bit more unique to ML than to other domains because 1) training jobs can take a very long time to run, and 2) users need to launch multiple experiments in parallel for the same model pipeline.
> > > > > > >
> > > > > > > Biggest gaps:
> > > > > > > - Too much boilerplate to write DAGs compared to Dagster etc., and difficulty with message passing (XCom). There was a proposal recently to improve this in Airflow, which should be entering AIP soon.
> > > > > > > - Lack of pipeline isolation, which hurts model experimentation (being able to run a DAG, modify it, and run it again without affecting the previous run); the lack of isolation of DAGs from Airflow infrastructure (the inability to redeploy Airflow infra without also redeploying DAGs) also hurts.
> > > > > > > - Lack of multi-tenancy: it's hard for customers to quickly launch an ad-hoc pipeline, and the overhead of setting up a cluster and all of its dependencies is quite high.
> > > > > > > - Lack of integration with data visualization plugins (e.g. plugins for rendering data related to a task when you click a task instance in the UI).
> > > > > > > - Lack of simpler abstractions for users with limited knowledge of Airflow, or even of Python, to build simple pipelines (not really an Airflow problem, but rather the need for a good abstraction that sits on top of Airflow, like a drag-and-drop pipeline builder).
> > > > > > >
> > > > > > > FWIW, my personal feeling is that a fair number of companies in the ML space are moving to alternative solutions like TFX Pipelines because of the focus these platforms have on ML (ML pipelines are first-class citizens) and the support from Google. It would be great if we could change that. The ML orchestration/tooling space is definitely evolving very rapidly, and there are promising new entrants as well.
> > > > > > >
> > > > > > > On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy <germain.tan...@dailymotion.com.invalid> wrote:
> > > > > > >
> > > > > > > > Hello Daniel,
> > > > > > > >
> > > > > > > > In my company we use Airflow to update our ML models and to predict. As we use the KubernetesPodOperator to trigger jobs, all our ML DAGs are similar, and ML/data science engineers can reuse a template and choose which type of machine they need (highcpu, highmem, GPU or not, etc.) - a sketch of this kind of template follows below.
> > > > > > > >
> > > > > > > > We have a process in place, described in the second part of this article (Industrializing the machine learning pipeline):
> > > > > > > > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
> > > > > > > >
> > > > > > > > Hope this helps.
> > > > > > > > Germain.
> > > > > > > >
> > > > > > > > On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hello everyone!
> > > > > > > > >
> > > > > > > > > I'm working on a few proposals to make Apache Airflow more friendly for ML/data science use cases, and I wanted to reach out in hopes of hearing from people who are using, or wish to use, Airflow for ML. If you have any opinions on the subject, I'd love to hear what you're all working on!
> > > > > > > > >
> > > > > > > > > Current questions I'm looking into:
> > > > > > > > > 1. How do you use Airflow for your ML? Has it worked out well for you?
> > > > > > > > > 2. Are there any features that would improve your experience of building models on Airflow?
> > > > > > > > > 3. Have you built anything on top of Airflow, or around it, to aid you in this process?
> > > > > > > > >
> > > > > > > > > Thank you so much for your time!
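(For readers unfamiliar with the pattern Germain describes, here is a minimal sketch of such a template, assuming Airflow 1.10's contrib KubernetesPodOperator; the profile names, resource values, namespace, and image are hypothetical, not Dailymotion's actual setup:)

    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    # Hypothetical machine profiles named after Germain's examples; the values
    # would be tuned per cluster.
    MACHINE_PROFILES = {
        "highcpu": {"request_cpu": "8", "request_memory": "8Gi"},
        "highmem": {"request_cpu": "2", "request_memory": "32Gi"},
        "gpu": {"request_cpu": "4", "request_memory": "16Gi", "limit_gpu": "1"},
    }

    def ml_task(task_id, image, machine="highcpu", **kwargs):
        """Template factory: each ML step runs in its own pod, with the
        machine profile chosen by name."""
        return KubernetesPodOperator(
            task_id=task_id,
            name=task_id,
            namespace="ml",
            image=image,
            resources=MACHINE_PROFILES[machine],
            get_logs=True,
            **kwargs,
        )

A DAG author would then write something like ml_task("train_model", "gcr.io/acme/train:1.2", machine="gpu", dag=dag) without caring where the pod lands.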
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129