I also love the idea of a DagFetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching DAGs from different sources - local files, then .zip, then wheels - seems like an interesting progression. I think the important parts of whatever approach we come up with are:
- make it easy for development/iteration by the creator
- make it stable/manageable for deployment purposes
- make it manageable for incremental updates.

J.

On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:

I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point for fetching .py files from places like the local file system, S3 or GCS (which is what Composer actually does under the hood). As a next step we can think about wheels, zips and other more demanding packaging.

In my opinion, in the case of such "big" changes we should try to iterate in small steps, especially if we don't have any strong opinions.

Bests,
Tomek

On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:

I am in a "before-Xmas" mood, so I thought I would write up more of my thoughts about it :).

*TL;DR; I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go Python-first instead.*

I also used to be in the "Docker" camp, as it seemed quite natural. Adding a DAG layer at package runtime seems like a natural thing to do, and it fits perfectly well with some sophisticated production deployment models where people use a Docker registry to deploy new software.

But in the meantime many more questions started to bother me:

- Is it really the case for all the deployment models and use cases in which Airflow is used?
- While it is a good model for some frozen-in-time production deployments, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly, does it fit the current philosophy of Airflow, and is it what its users expect?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.

After spending a year with Airflow, getting more embedded in its philosophy, talking to users and especially looking at the "competition" we have, I changed my mind here. I don't think Airflow lives in a "Container-centric" world; it is really a "Python-centric" world, and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular and genuinely liked by its users. If we go a bit too far into the "Docker/Container/Cloud Native" world, we might get closer to some of our competitors (think Argo, for example), but we might also lose quite a bit of the advantage we have - the exact advantage that makes us better for our users, different from the competition, and suited to quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai), and I found that they think and act quite differently from DevOps engineers or even traditional software engineers. And I think those people are our primary users.
Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me, when I think "Airflow" I immediately think "Python". There are certain advantages to Airflow being Python-first and Python-focused. The main one is that the same people who are able to do data science feel comfortable writing the pipelines, using the pre-existing abstractions that make it easier for them (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe Python as their primary tool of choice. Using Jupyter notebooks and writing data processing and machine learning experiments as Python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do, and while they know about containers and realise why they are useful, it's best if they do not have to bother about containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerisation/containerisation as much as possible and provide a simple, focused interface to manage it. People who know Python don't necessarily need to understand containerisation in order to benefit from it. It's very similar to virtual machines or compilers - people make use of them without really knowing how they work, and that's perfectly OK; they don't have to.

Tying the deployment of Airflow DAGs to container images has the disadvantage that you have to include the whole step of packaging, distributing, sharing and using the image on the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate Docker container - isolated from the rest and started pretty much from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker swarm. The whole idea of having separate DAGs which can be updated independently, potentially with different dependencies and maybe other Python code, means that for every single DAG you want to update, you need to package it as an extra layer in a Docker image, put the image in a shared registry, switch your executors to use the new image, get it downloaded by the executor, and somehow restart the worker (to start a container based on the new image). That's a lot of hassle just to update one line in a DAG. Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in a DAG that they have to go through that whole process. They would need to understand how to check whether their image was properly built and distributed, whether the executor they run has already picked up the new image, whether the worker has already picked up the new image - and in the case of a spelling mistake they would have to repeat the whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without having to know about external tooling.
This is the whole point of Jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the Databand.ai idea to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped make it merge-ready. It lets data scientists iterate on and debug their DAGs using their familiar tools and process (just as if they were debugging a Python script), without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.

This problem is also quite visible with container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! Kubernetes!), and it even works perfectly well in a number of practical cases - for example when each step requires complex processing, a number of dependencies, different binaries and so on. But when you look more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is talking to other systems via their APIs and orchestrating their work. There is hardly any processing on Airflow worker nodes, and hardly any new requirements/dependencies are needed in most cases. I really love that Airflow focuses on the "glue" layer between those external services. Again - the same people who do data engineering can interact over a Python API with the services they use, put all the steps and logic as Python code in the same DAG, iterate on it, change it and get immediate feedback - and even add a few lines of code if they need an extra parameter or so. Imagine instead the case where every step of your workflow is a Docker container to run: as a data engineer you have to use Python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to that container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because those available in public registries are old or don't expose some features. It happened to me many times when I tried to use such workflows - I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than running 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of Python code to do what I want. And I am super-familiar with Docker - I live and breathe Docker - but I can see how intimidating and difficult it must be for people who don't.

That's why I think that our basic and most common deployment model (even the one used in production) should be based on a Python toolset, not containers. Wheels seem like a great tool for Python dependency management.
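To make the wheel-based alternative a bit more concrete, here is a minimal sketch - not an Airflow API; the helper function, the wheel directory and the task-module convention are all made up for illustration - of installing a task's extra dependencies from pre-built wheels into a throwaway virtualenv inside an already-running worker:

    import subprocess
    import tempfile
    import venv
    from pathlib import Path
    from typing import List

    def run_task_in_throwaway_venv(task_module: str, wheels: List[str]) -> None:
        """Create a short-lived virtualenv, install the task's extra
        dependencies from local wheel files, run the task in it, then
        throw the environment away. Illustrative sketch only."""
        with tempfile.TemporaryDirectory() as env_dir:
            venv.create(env_dir, with_pip=True)
            python = Path(env_dir) / "bin" / "python"  # POSIX layout
            # Install only what this task needs, from wheels cached on the
            # worker - no network, no image rebuild, core env untouched.
            subprocess.check_call(
                [str(python), "-m", "pip", "install", "--no-index",
                 "--find-links", "/opt/airflow/wheels", *wheels]
            )
            # Run the task code in the isolated interpreter.
            subprocess.check_call([str(python), "-m", task_module])
        # TemporaryDirectory removes the virtualenv afterwards.

The point of the sketch is only that the isolation boundary is a virtualenv plus wheels rather than a container image, so changing one line in a DAG or adding one dependency does not require rebuilding and redistributing an image.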
I think that in most cases, when we have just a few dependencies to install per task (for example the Python Google libraries for Google tasks), installing them from wheels inside a running container and creating a virtualenv for them might be comparable to, or even faster than, starting a whole new container with those packages installed as a layer - not to mention the much smaller memory and CPU overhead of doing this within a running container rather than restarting a whole container for that task. Kubernetes and its deployment models are very well suited to long-running tasks that do a lot of work, but starting a new container that boots a whole Python interpreter with all dependencies and its own CPU/memory requirements *JUST* to make an API call to an external service and wait for it to finish (most Airflow tasks are exactly this) seems like terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate, and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new pod for every task. We should of course still use containers under the hood and have Kubernetes-based deployments, but this should be transparent to the people who write DAGs.

Sorry for such a long mail - I just think this is a super-important decision about the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. I think it should really be a foundation for how we implement some of the deployment-related features of Airflow 2.0, so that it stays relevant, preferred by our users, and focused on the cases it already serves very well.

Let me know what you think. But in the meantime - have a great Xmas everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a

On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

This reminds me of the "DagFetcher" idea: basically a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on S3", "pex on GFS", "docker on Artifactory" or whatever makes sense to your organization. In the proposal I wrote about using a universal URI scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest.

One challenge is around *not* serializing Airflow-specific code in the artifact/docker image, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.
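A very rough sketch of what such a fetcher abstraction could look like, with one fetcher registered per URI scheme so that new artifact types ("zip on S3", "wheel on GCS", ...) become new implementations rather than new if/else branches. All class and function names here are illustrative, not an actual Airflow API:

    from abc import ABC, abstractmethod
    from typing import Dict
    from urllib.parse import urlparse

    class DagFetcher(ABC):
        """Fetches the DAG artifact behind a URI and returns a local path
        that the DagBag can load. Illustrative sketch only."""

        @abstractmethod
        def fetch(self, location: str, version: str) -> str:
            ...

    # One fetcher per URI scheme; supporting a new source means registering
    # a new class instead of adding format-specific branching.
    _FETCHERS: Dict[str, DagFetcher] = {}

    def register_fetcher(scheme: str, fetcher: DagFetcher) -> None:
        _FETCHERS[scheme] = fetcher

    def fetch_dag(uri: str) -> str:
        """Resolve e.g. s3://company_dagbag/some_dag@latest to a local path."""
        parsed = urlparse(uri)
        path, _, version = parsed.path.partition("@")
        return _FETCHERS[parsed.scheme].fetch(
            f"{parsed.netloc}{path}", version or "latest"
        )

A local-filesystem fetcher would just return the path; an S3 or GCS fetcher would download and cache the artifact; a wheel fetcher could additionally install it into a per-task virtualenv as discussed earlier in the thread.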
Max

On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:

The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place, so that we avoid the current `if format == zip then: else:` pattern and don't tightly couple to specific DAG-sourcing implementations. Personally I feel that Docker makes more sense than wheels (since Docker images are fully self-contained, even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:

Hi Jarek,

This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?

I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package metadata.

Is that what you're envisioning? If so, I'd be happy to support you with the implementation!

Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like it to be optional, as it can also add a lot of overhead to running tasks.

Cheers,

Björn

On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:

I had a lot of interesting discussions over the last few days with Apache Airflow users at PyData Warsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in the wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also .zip, by the way). Maybe it was already mentioned in some discussions before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is the problem of dependencies. Different operators/hooks might have different dependencies if maintained separately.
Currently we have a common set of dependencies because we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately packaged operators.

The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem with AIP-8.

I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then have wheel packages with all the "extra" dependencies for each group of operators.

A worker executing an operator could have the "core" dependencies installed initially, but when it is supposed to run an operator it could create a virtualenv, install the required "extra" from wheels, run the task for this operator in that virtualenv, and then remove the virtualenv. We could have such wheel packages prepared (one wheel package per operator group) and distributed either the same way as DAGs or via a shared binary repository (and cached on the worker).

Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, those can be embedded in the DAG wheel, installed from it into the virtualenv, and the virtualenv is removed after the task finishes.

The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me, that whole idea sounds pretty cool.
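To make the packaging side a bit more concrete, a minimal setup.py for such a DAG wheel could look roughly like the sketch below. The project name, the "google" extra and the "airflow.dags" entry-point group are all made up for illustration (the entry-point part follows the idea from PR 730 rather than anything Airflow supports today):

    # setup.py for a hypothetical DAG package distributed as a wheel.
    from setuptools import find_packages, setup

    setup(
        name="my-sales-dag",                     # illustrative name
        version="1.0.0",
        packages=find_packages(),
        # Dependencies of the DAG itself, installed into the per-task
        # virtualenv straight from the wheel's metadata.
        install_requires=["requests>=2.20"],
        # The "operator group" -> extra -> "pip packages" mapping expressed
        # as ordinary setuptools extras.
        extras_require={
            "google": ["google-cloud-storage", "google-cloud-bigquery"],
        },
        # Hypothetical entry point so DAGs could be discovered from package
        # metadata instead of scanning .py files.
        entry_points={
            "airflow.dags": ["sales_dag = my_sales_dag.dag:dag"],
        },
    )

Building it with 'python setup.py bdist_wheel' produces a single .whl that carries both the DAG code and its dependency metadata - which is exactly what the worker needs in order to set up (and later throw away) the virtualenv for a task.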
Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
