Kaxil and I are planning on tackling versioning (for display only right now, as it's the first step on this journey) as part of AIP-24.
However the issue with versioning the _entire_ DAG code/environment is that sometimes you want to run in the old env, but sometimes you want to run in a new/latest env (such as when you re-try a task after pushing a bug fix), so there's some UX to work out there.

The idea of wheels is an interesting one, and it sounds sensible at a high level, and you hit on a very important point about who our key target users are. I would love to see how it plays out when we implement it, how it interacts with AIP-5, and whether we have to deal with things like:

- Support for a custom wheel "repo"?
- Re-packaging modules as wheels when they don't do it themselves.
- What about things that require system libraries? I guess we require them pre-installed.
- Whether we need to add any tooling around this or not.
- Is `pip` drivable enough as a library to let Airflow manage this?

I was initially wondering what packaging DAGs as a wheel gives us over (better) zip support, but the point about dependencies and an on-demand virtualenv is interesting. We should of course still support the mode where people want fixed/static/pre-installed deps (i.e. sometimes for audit reasons it's important to know exactly what environment something ran in).

One of the reasons we started working on the Knative Executor (AIP-25) over here at Astronomer was exactly this "spin up time" - both container start-up time and extra process start-up time.

-ash

> On 23 Dec 2019, at 05:16, Claudio <[email protected]> wrote:
>
> I think it could be cool to add DAG versioning; this way it would be possible
> to fetch a particular version of the DAG. What do you think about it? Claudio
>>
>> On 22 Dec 2019, at 21:34, Chao-Han Tsai <[email protected]> wrote:
>>
>> Probably it is a good time to revisit
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher
>> again?
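[Editor's note on the `pip`-as-a-library question above: pip explicitly does not offer a supported programmatic API, so the usual answer is to drive it as a subprocess of the running interpreter. A minimal sketch of what an Airflow worker could do - the wheel path, target directory, and function names here are hypothetical illustrations, not anything from Airflow:]

```python
import subprocess
import sys


def pip_install_cmd(wheel_path, target_dir):
    """Build the pip command line for installing a wheel into an
    isolated directory. pip has no supported library API, so the
    recommended way to drive it is via `python -m pip` in a subprocess."""
    return [
        sys.executable, "-m", "pip", "install",
        "--target", target_dir,  # keep the deps out of the main env
        "--no-input",            # never prompt on an unattended worker
        wheel_path,
    ]


def install_wheel(wheel_path, target_dir):
    """Run the install, raising CalledProcessError on failure."""
    subprocess.check_call(pip_install_cmd(wheel_path, target_dir))
```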
>>
>> On Sun, Dec 22, 2019 at 12:16 PM Jarek Potiuk <[email protected]> wrote:
>>
>>> I also love the idea of a DAG fetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching it from different sources like local / .zip and then .wheel seems like an interesting approach. I think the important parts of whatever approach we come up with are:
>>>
>>> - make it easy for development/iteration by the creator
>>> - make it stable/manageable for deployment purposes
>>> - make it manageable for incremental updates.
>>>
>>> J.
>>>
>>> On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:
>>>
>>>> I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point to fetch .py files from places like the local file system, S3 or GCS (that's what Composer actually does under the hood). As the next step we can think about wheels, zip and other more demanding packaging.
>>>>
>>>> In my opinion, in the case of such "big" changes we should try to iterate in small steps. Especially if we don't have any strong opinions.
>>>>
>>>> Bests,
>>>> Tomek
>>>>
>>>> On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:
>>>>
>>>>> I am in a "before-Xmas" mood so I thought I would write more of my thoughts about it :).
>>>>>
>>>>> *TL;DR: I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go python-first instead.*
>>>>>
>>>>> I also used to be in the "docker" camp as it seemed kind of natural. Adding a DAG layer at package runtime seems like a natural thing to do. That seems to fit perfectly well some sophisticated production deployment models where people are using a docker registry to deploy new software.
>>>>>
>>>>> But in the meantime many more questions started to bother me:
>>>>>
>>>>> - Is it really the case for all the deployment models and use cases in which Airflow is used?
>>>>> - While it is a good model for some frozen-in-time production deployment model, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging, iterating on it, but also post-deployment maintenance and upgrades.
>>>>> - More importantly - does it fit the current philosophy of Airflow and is it expected by its users?
>>>>>
>>>>> After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.
>>>>>
>>>>> After spending a year with Airflow, getting more embedded in its philosophy, talking to the users and especially looking at the "competition" we have - I changed my mind here. I don't think Airflow is in the "Container-centric" world; it is really in the "Python-centric" world, and that is a conscious choice we should continue with in the future.
>>>>>
>>>>> I think there are a number of advantages of Airflow that make it so popular and really liked by the users. If we go a bit too much into the "Docker/Container/Cloud Native" world - we might get a bit closer to some of our competitors (think Argo for example) but we might lose quite a bit of the advantage we have. The exact advantage that makes us better for our users, different from the competition, and also serving use cases quite different from a "general workflow engine".
>>>>>
>>>>> While I am not a data scientist myself, I interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai) and I found that they think and act quite a bit differently than DevOps or even traditional Software Engineers. And I think those people are our primary users. Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".
>>>>>
>>>>> Let me dive a bit deeper.
>>>>>
>>>>> For me, when I think "Airflow" - I immediately think "Python". There are certain advantages of Airflow being python-first and python-focused. The main advantage is that the same people who are able to do data science feel comfortable writing the pipelines and using pre-existing abstractions that make it easier for them to write the pipelines (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe python as their primary tool of choice. Using Jupyter Notebooks and writing data processing and machine learning experiments as python scripts is part of their daily job. Docker and containers for them are merely an execution engine for whatever they do, and while they know about it and realise why containers are useful - it's best if they do not have to bother about containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerization/containerisation as much as possible and provide a simple, focused interface to manage it.
>>>>> People who know python don't necessarily need to understand containerisation in order to make use of its advantages. It's very similar to virtual machines, compilers etc. - people make use of them without really knowing how they work. And it's perfectly OK - they don't have to.
>>>>>
>>>>> Tying the deployment of Airflow DAGs to container images has the disadvantage that you have to include the whole step of packaging, distributing, sharing, and using the image by the "worker" of Airflow. It also basically means that every task execution of Airflow has to be a separate docker container - isolated from the rest, started pretty much totally from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker-swarm. The whole idea of having separate DAGs which can be updated independently and potentially have different dependencies, maybe other python code etc. - this means pretty much that for every single DAG that you want to update, you need to package it as an extra layer in Docker, put it somewhere in a shared registry, switch your executors to use the new image, get it downloaded by the executor, and restart the worker somehow (to start a container based on that new image). That's a lot of hassle to just update one line in a DAG. Surely we can automate that and have it fast, but it's quite difficult to explain to data scientists who just want to change one line in the DAG that they have to go through that process. They would need to understand how to check if their image is properly built and distributed, if the executor they run has already picked up the new image, if the worker has already picked up the new image - and in the case of a spelling mistake they will have to repeat that whole process again.
>>>>> That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without having to know about some external tooling. This is the whole point of jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the idea of Databand.ai to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped in making it merge-ready. It lets data scientists iterate on and debug their DAGs using their familiar tools and process (just as if they were debugging a python script) without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.
>>>>>
>>>>> This problem is also quite visible with container-native workflow engines such as Argo that force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! kubernetes!). And it even works perfectly well in a number of practical cases - for example when each step requires complex processing, a number of dependencies, different binaries etc. But when you look at it more closely - this is NOT the primary use case for Airflow. The primary use case of Airflow is that it talks to other systems via APIs and orchestrates their work. There is hardly any processing on Airflow worker nodes. There are hardly any new requirements/dependencies needed in most cases.
>>>>> I really love that Airflow is actually focusing on the "glue" layer between those external services. Again - the same people who do data engineering can interact over python APIs with the services they use, put all the steps and logic as python code in the same DAG, iterate and change it and get immediate feedback - and even add a few lines of code if they need to add an extra parameter or so. Imagine the case where every step of your workflow is a Docker container to run - as a data engineer you have to use python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to this container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image because those available in public registries are old or don't have some features exposed. It happened to me many times: when I tried to use such workflows, I was eventually forced to build my own Docker image and deploy it somewhere - even if I was just iterating and trying different things. That's far more complex than 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of python code to do what I want. And I am super-familiar with Docker. I live and breathe Docker. But I can see how intimidating and difficult it must be for people who don't.
>>>>>
>>>>> That's why I think that our basic and most common deployment model (even the one used in production) should be based on the python toolset - not containers. Wheels seem like a great tool for python dependency management.
>>>>> I think in most cases, when we have just a few dependencies to install per task (for example the python google libraries for google tasks), installing them from wheels in a running container and creating a virtualenv for them might be comparable to or even faster than starting a whole new container with those packages installed as a layer. Not to mention the much smaller memory and cpu overhead if this is done within a running container, rather than restarting the whole container for that task. Kubernetes and its deployment models are very well suited for long-running tasks that do a lot of work, but if you want to start a new container that starts a whole python interpreter with all dependencies, with its own CPU/Memory requirements, *JUST* to make an API call to start an external service and wait for it to finish (most Airflow tasks are exactly this) - this seems like terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate, and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new POD for every task. We should still use containers under the hood of course, and have deployments using Kubernetes etc. But this should be transparent to the people who write DAGs.
>>>>>
>>>>> Sorry for such a long mail - I just think it's a super-important decision on the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs - from debugging to maintenance. And I think it should really be a foundation of how we implement some of the deployment-related features of Airflow 2.0 - in order for it to stay relevant, preferred by our users and focused on those cases that it already does very well.
>>>>>
>>>>> Let me know what you think. But in the meantime - have a great Xmas everyone!
>>>>>
>>>>> J.
>>>>>
>>>>> On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:
>>>>>
>>>>>>> For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.
>>>>>>
>>>>>> Something sort of like Heroku build packs?
>>>>>>
>>>>>> -a
>>>>>>
>>>>>> On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:
>>>>>>> This reminds me of the "DagFetcher" idea. Basically a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on s3", "pex on GFS", "docker on artifactory" or whatever makes sense for your organization. In the proposal I wrote about using a universal uri scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest
>>>>>>>
>>>>>>> One challenge is around *not* serializing Airflow-specific code in the artifact/docker, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.
>>>>>>>
>>>>>>> Max
>>>>>>>
>>>>>>> On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:
>>>>>>>
>>>>>>>> The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place so that we avoid the current `if format == zip then: else:` and make sure that we don't tightly couple to specific DAG sourcing implementations. Personally I feel that Docker makes more sense than wheels (since Docker images are fully self-contained even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.
>>>>>>>>
>>>>>>>> On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Jarek,
>>>>>>>>>
>>>>>>>>> This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?
>>>>>>>>>
>>>>>>>>> I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package meta-data.
>>>>>>>>>
>>>>>>>>> Is that what you're envisioning? If so, I'd be happy to support you with the implementation!
>>>>>>>>>
>>>>>>>>> Also, I think that while the idea of creating a temporary virtual environment for running tasks is very useful, I'd like it to be optional, as it can also add a lot of overhead to running tasks.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Björn
>>>>>>>>>
>>>>>>>>>> On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> I had a lot of interesting discussions over the last few days with Apache Airflow users at PyDataWarsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs but they think it could be improved with the wheel format (which is also a .zip BTW). Maybe it was already mentioned in some discussions before, but I have not found any.
>>>>>>>>>>
>>>>>>>>>> *Context:*
>>>>>>>>>>
>>>>>>>>>> We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is the problem of dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies as we have only one setup.py, but if we split into separate packages, this might change.
>>>>>>>>>>
>>>>>>>>>> *Proposal:*
>>>>>>>>>>
>>>>>>>>>> Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow the users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators.
>>>>>>>>>>
>>>>>>>>>> The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem with AIP-8.
>>>>>>>>>>
>>>>>>>>>> I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then we could have wheel packages with all the "extra" dependencies for each group of operators.
>>>>>>>>>>
>>>>>>>>>> A worker executing an operator could have the "core" dependencies installed initially, but when it is supposed to run an operator it could create a virtualenv, install the required "extra" from wheels, and run the task for this operator in this virtualenv (and remove the virtualenv afterwards). We could have such package-wheels prepared (one wheel package per operator group) and distributed either the same way as DAGs or using some shared binary repository (and cached on the worker).
>>>>>>>>>>
>>>>>>>>>> Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies - they could be embedded in the DAG wheel, installed from it into this virtualenv, and the virtualenv would be removed after the task is finished.
>>>>>>>>>>
>>>>>>>>>> The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.
>>>>>>>>>>
>>>>>>>>>> For me the whole idea sounds pretty cool.
>>>>>>>>>>
>>>>>>>>>> Let me know what you think.
>>>>>>>>>>
>>>>>>>>>> J.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jarek Potiuk
>>>>>>>>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>>>>>>> M: +48 660 796 129
>>
>> --
>> Chao-Han Tsai
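[Editor's note: the virtualenv-per-task lifecycle Jarek proposes (create a virtualenv, install the task's wheel into it, run the task, discard the environment) can be sketched with the standard library alone. This is only an illustration of the idea, not Airflow code; the function name and wheel path are hypothetical, and a real worker would cache wheels and reuse environments where possible:]

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def run_task_in_venv(task_cmd, wheel_path=None):
    """Create a throwaway virtualenv, optionally install a DAG/operator
    wheel into it, run the task with the venv's interpreter, and return
    the task's exit code. The whole environment is discarded afterwards."""
    with tempfile.TemporaryDirectory() as env_dir:
        # pip (via ensurepip) is only needed if we have a wheel to install
        venv.create(env_dir, with_pip=wheel_path is not None)
        bindir = "Scripts" if sys.platform == "win32" else "bin"
        py = str(Path(env_dir) / bindir / "python")
        if wheel_path:
            # Install the task's extra dependencies from the wheel.
            subprocess.check_call([py, "-m", "pip", "install", "--quiet", wheel_path])
        # Run the task inside the isolated environment.
        return subprocess.call([py, *task_cmd])
    # TemporaryDirectory removes the virtualenv on exit


# e.g. run_task_in_venv(["-m", "my_dag.task_main"], "/wheels/my_dag-0.1-py3-none-any.whl")
```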
