I am in "before-Xmas" mood so I thought I will write more of my thoughts about it :).
*TL;DR: I try to reason (mostly from the philosophy/usage point of view) why a container-native approach might not be the best fit for Airflow and why we should go Python-first instead.*

I also used to be in the "docker" camp, as it seemed kind of natural. Layering the DAG code onto the Airflow package at run time seems like a natural thing to do, and it fits perfectly well some sophisticated production deployment models where people already use a docker registry to deploy new software. But in the meantime more and more questions started to bother me:

- Is it really the case for all the deployment models and use cases in which Airflow is used?
- While it is a good model for some frozen-in-time production deployments, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly - does it fit the current philosophy of Airflow, and is it what our users expect?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on. After spending a year with Airflow, getting more embedded in its philosophy, talking to users, and especially looking at the "competition" we have - I changed my mind here. I don't think Airflow lives in a "container-centric" world; it really lives in a "Python-centric" one, and that is a conscious choice we should continue with in the future.

There are a number of advantages of Airflow that make it so popular and genuinely liked by its users. If we move too far into the "Docker/Container/Cloud Native" world, we might get a bit closer to some of our competitors (think Argo, for example), but we might lose quite a bit of the advantage we have - the very advantage that makes us better for our users, different from the competition, and suited to quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai), and I found that they think and act quite differently than DevOps or even traditional software engineers. And I think those people are our primary users. Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper. When I think "Airflow", I immediately think "Python". There are real advantages to Airflow being Python-first and Python-focused. The main one is that the same people who do the data science feel comfortable writing the pipelines themselves, using the pre-existing abstractions (DAGs/Operators/Sensors/...) that make that easier. Those are mainly data scientists who live and breathe Python as their primary tool of choice. Using Jupyter Notebooks and writing data processing and machine learning experiments as Python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do; while they know about them and realise why containers are useful, it's best if they do not have to bother about containerisation. Even if they use it, it should be pretty much transparent to them.
This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerization/containerisation as much as possible and provide a simple, focused interface to manage it. People who know Python don't necessarily need to understand containerisation in order to benefit from it. It's very similar to virtual machines, compilers etc. - you make use of them without really knowing how they work, and that's perfectly OK; they don't have to.

Tying the deployment of Airflow DAGs to container images has the disadvantage that you have to include the whole step of packaging, distributing, sharing, and then using the image on the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate docker container - isolated from the rest, started pretty much from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker-swarm.

Now take the idea of having separate DAGs which can be updated independently, potentially with different dependencies, maybe other Python code etc. It means, pretty much, that for every single DAG you want to update, you need to package it as an extra layer in Docker, push it to a shared registry, switch your executors to use the new image, get it downloaded, and restart the worker somehow (to start a container based on that new image). That's a lot of hassle just to update one line in a DAG. Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in a DAG that they have to go through that process. They would need to understand how to check whether their image was properly built and distributed, whether the executor already picked up the new image, whether the worker already picked up the new image - and in the case of a spelling mistake they would have to repeat the whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing the results as quickly as possible, without too much hassle and without having to know about external tooling. This is the whole point of Jupyter notebooks, for example - you can incrementally change a single step in your process and keep iterating on the rest.

This is one of the reasons we immediately loved the idea from Databand.ai to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and helped make it merge-ready. It lets data scientists iterate on and debug their DAGs using their familiar tools and process (just as if they were debugging a Python script), without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.

This problem is also quite visible with container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! kubernetes!), and it even works perfectly well in a number of practical cases - for example when each step requires complex processing, a number of dependencies, different binaries etc. But when you look at it more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is that it talks to other systems via their APIs and orchestrates their work.
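To make that concrete, here is a minimal, purely illustrative sketch of what such a "glue" task typically looks like - a few lines of Python calling an external service's API (the endpoint, DAG id and task name below are all hypothetical). The `if __name__ == "__main__"` block at the end is roughly the DebugExecutor-style pattern mentioned above: with the executor configured as DebugExecutor (e.g. via AIRFLOW__CORE__EXECUTOR=DebugExecutor), the file can be run and stepped through like a plain Python script.

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def trigger_external_job():
        # The task is pure "glue": ask an external service to do the work
        # and hand back its response.
        response = requests.post(
            "https://example.com/api/jobs",  # hypothetical endpoint
            json={"job": "nightly-report"},
        )
        response.raise_for_status()
        return response.json()


    with DAG(
        dag_id="example_glue_dag",
        start_date=datetime(2019, 12, 1),
        schedule_interval="@daily",
    ) as dag:
        PythonOperator(
            task_id="trigger_external_job",
            python_callable=trigger_external_job,
        )


    if __name__ == "__main__":
        # With the DebugExecutor configured, running this file executes the DAG
        # in-process, so a breakpoint in trigger_external_job just works.
        dag.clear()
        dag.run()

Nothing in a file like that requires its author to know how images are built, tagged or distributed - and that is exactly the property I think we should preserve.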
There is hardly any processing on Airflow worker nodes, and hardly any new requirements/dependencies are needed in most cases. I really love that Airflow focuses on the "glue" layer between those external services. Again - the same people who do the data engineering can talk to the services they use over their Python APIs, put all the steps and logic as Python code in the same DAG, iterate on it, get immediate feedback, and even add a few lines of code if they need an extra parameter or so.

Now imagine the case where every step of your workflow is a Docker container to run. As a data engineer you still have to use Python to put the DAG together, but if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to that container from your host (often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because the ones available in public registries are old or don't expose some feature. It has happened to me many times with such workflows - I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than running 'pip install <x>', adding '<x>' to setup.py, and writing one or two lines of Python code to do what I want. And I am super-familiar with Docker; I live and breathe Docker. I can see how intimidating and difficult it must be for people who don't.

That's why I think our basic and most common deployment model (even the one used in production) should be based on the Python toolset, not containers. Wheels seem like a great tool for Python dependency management. In most cases, when we have just a few dependencies to install per task (for example the Python Google libraries for Google tasks), installing them from a wheel inside an already-running container and creating a virtualenv for the task might be comparable to, or even faster than, starting a whole new container with those packages baked in as a layer - not to mention the much smaller memory and CPU overhead of doing this within a running container rather than restarting a container for every task.

Kubernetes and its deployment models are very well suited to long-running tasks that do a lot of work, but starting a new container - a whole Python interpreter with all dependencies and its own CPU/memory requirements - *JUST* to make an API call that kicks off an external service and waits for it to finish (most Airflow tasks are exactly this) seems like terrible overkill. The Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate, and keep the worker running to serve many subsequent tasks - seems like a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new Pod for every task. We should of course still use containers under the hood and have deployments based on Kubernetes etc., but this should be transparent to the people who write DAGs.
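For what it's worth, the per-task "wheel plus virtualenv" idea does not need much machinery either. Below is a minimal, purely illustrative sketch (standard library only; the wheel name, module and callable are hypothetical, and a POSIX virtualenv layout is assumed) of what a long-running worker could do with a DAG's wheel - this is not the Native Executor design, just an illustration of the cost involved:

    import os
    import subprocess
    import tempfile
    import venv


    def run_task_in_throwaway_venv(wheel_path, task_module, task_callable):
        """Create a short-lived virtualenv, install the DAG's wheel (with its
        extra dependencies) into it, run one task callable, then discard it."""
        with tempfile.TemporaryDirectory() as env_dir:
            # Build the virtualenv with pip available.
            venv.EnvBuilder(with_pip=True).create(env_dir)
            python = os.path.join(env_dir, "bin", "python")  # POSIX layout assumed
            # Install the wheel into the isolated environment.
            subprocess.check_call([python, "-m", "pip", "install", wheel_path])
            # Run the task entrypoint inside the isolated interpreter.
            subprocess.check_call(
                [python, "-c",
                 "import {m}; {m}.{c}()".format(m=task_module, c=task_callable)]
            )


    # Hypothetical usage on a worker that already has the wheel cached locally:
    # run_task_in_throwaway_venv("my_dag-0.1-py3-none-any.whl", "my_dag.tasks", "extract")

Something along these lines runs inside an already-warm worker process, so the marginal cost of a task is one pip install rather than a full container cold start.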
Sorry for such a long mail - I just think this is a super-important decision about the philosophy of Airflow, which use cases it serves, and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. I think it should really be a foundation for how we implement some of the deployment-related features of Airflow 2.0, so that Airflow stays relevant, preferred by our users, and focused on the cases it already serves very well.

Let me know what you think. But in the meantime - have a great Xmas everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

> For the docker example, you'd almost want to inject or "layer" the DAG
> script and airflow package at run time.
>
> Something sort of like Heroku build packs?
>
> -a
>
> On 20 December 2019 23:43:30 GMT, Maxime Beauchemin
> <[email protected]> wrote:
> >This reminds me of the "DagFetcher" idea. Basically a new abstraction that
> >can fetch a DAG object from anywhere and run a task. In theory you could
> >extend it to do "zip on s3", "pex on GFS", "docker on artifactory" or
> >whatever makes sense to your organization. In the proposal I wrote about
> >using a universal uri scheme to identify DAG artifacts, with support for
> >versioning, as in s3://company_dagbag/some_dag@latest
> >
> >One challenge is around *not* serializing Airflow specific code in the
> >artifact/docker, otherwise you end up with a messy heterogenous cluster
> >that runs multiple Airflow versions. For the docker example, you'd almost
> >want to inject or "layer" the DAG script and airflow package at run time.
> >
> >Max
> >
> >On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:
> >
> >> The zip support is a bit of a hack and was a bit controversial when it
> >> was added. I think if we go down the path of supporting more DAG
> >> sources, we should make sure we have the right interface in place so we
> >> avoid the current `if format == zip then: else:` and make sure that we
> >> don't tightly couple to specific DAG sourcing implementations.
> >> Personally I feel that Docker makes more sense than wheels (since they
> >> are fully self-contained even at the binary dependency level), but if we
> >> go down the interface route it might be fine to add support for both
> >> Docker and wheels.
> >>
> >> On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex
> >> <[email protected]> wrote:
> >>
> >> > Hi Jarek,
> >> >
> >> > This sounds great. Is this possibly related to the work started in
> >> > https://github.com/apache/airflow/pull/730?
> >> >
> >> > I'm not sure I’m following your proposal entirely. Initially, what
> >> > would be a great first step would be to support loading DAGs from
> >> > entry_point, as proposed in the closed PR above. This would already
> >> > enable most of the features you’ve mentioned below. Each DAG could be
> >> > a Python package, and it would carry all the information about
> >> > required packages in its package meta-data.
> >> >
> >> > Is that what you’re envisioning? If so, I’d be happy to support you
> >> > with the implementation!
> >> >
> >> > Also, I think while the idea of creating a temporary virtual
> >> > environment for running tasks is very useful, I’d like this to be
> >> > optional, as it can also create a lot of overhead to running tasks.
> >> >
> >> > Cheers,
> >> >
> >> > Björn
> >> >
> >> > > On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:
> >> > >
> >> > > I had a lot of interesting discussions last few days with Apache
> >> > > Airflow users at PyDataWarsaw 2019 (I was actually quite surprised
> >> > > how many people use Airflow in Poland). One discussion brought an
> >> > > interesting subject: Packaging dags in wheel format. The users
> >> > > mentioned that they are super-happy using .zip-packaged DAGs but
> >> > > they think it could be improved with wheel format (which is also
> >> > > .zip BTW). Maybe it was already mentioned in some discussions before
> >> > > but I have not found any.
> >> > >
> >> > > *Context:*
> >> > >
> >> > > We are well on the way of implementing "AIP-21 Changing import
> >> > > paths" and will provide backport packages for Airflow 1.10. As a
> >> > > next step we want to target AIP-8.
> >> > > One of the problems to implement AIP-8 (split hooks/operators into
> >> > > separate packages) is the problem of dependencies. Different
> >> > > operators/hooks might have different dependencies if maintained
> >> > > separately. Currently we have a common set of dependencies as we
> >> > > have only one setup.py, but if we split to separate packages, this
> >> > > might change.
> >> > >
> >> > > *Proposal:*
> >> > >
> >> > > Our users - who love the .zip DAG distribution - proposed that we
> >> > > package the DAGs and all related packages in a wheel package instead
> >> > > of pure .zip. This would allow the users to install extra
> >> > > dependencies needed by the DAG. And it struck me that we could
> >> > > indeed do that for DAGs but also mitigate most of the dependency
> >> > > problems for separately-packaged operators.
> >> > >
> >> > > The proposal from our users was to package the extra dependencies
> >> > > together with the DAG in a wheel file. This is quite cool on it's
> >> > > own, but I thought we might actually use the same approach to solve
> >> > > the dependency problem with AIP-8.
> >> > >
> >> > > I think we could implement "operator group" -> extra -> "pip
> >> > > packages" dependencies (we need them anyway for AIP-21) and then we
> >> > > could have wheel packages with all the "extra" dependencies for each
> >> > > group of operators.
> >> > >
> >> > > Worker executing an operator could have the "core" dependencies
> >> > > installed initially but then when it is supposed to run an operator
> >> > > it could create a virtualenv, install the required "extra" from
> >> > > wheels and run the task for this operator in this virtualenv (and
> >> > > remove virtualenv). We could have such package-wheels prepared (one
> >> > > wheel package per operator group) and distributed either same way as
> >> > > DAGs or using some shared binary repository (and cached in the
> >> > > worker).
> >> > >
> >> > > Having such dynamically created virtualenv has also the advantage
> >> > > that if someone has a DAG with specific dependencies - they could be
> >> > > embedded in the DAG wheel, installed from it to this virtualenv, and
> >> > > the virtualenv would be removed after the task is finished.
> >> > >
> >> > > The advantage of this approach is that each DAG's extra dependencies
> >> > > are isolated and you could have even different versions of the same
> >> > > dependency used by different DAGs.
> >> > > I think that could save a lot of headaches for many users.
> >> > >
> >> > > For me that whole idea sounds pretty cool.
> >> > >
> >> > > Let me know what you think.
> >> > >
> >> > > J.
> >> > >
> >> > > --
> >> > >
> >> > > Jarek Potiuk
> >> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >> > >
> >> > > M: +48 660 796 129 <+48660796129>
> >> > > [image: Polidea] <https://www.polidea.com/>

--

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
