It is probably a good time to revisit https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher again?
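
For reference, a rough sketch of what such a remote DAG fetcher abstraction could look like - the class names and the set of supported schemes below are purely illustrative, not an existing Airflow API:

# Illustrative sketch only - none of these names exist in Airflow today.
# A fetcher resolves a DAG artifact URI (local path, S3, ...) to a local .py
# file that the scheduler/worker can then load into its DagBag.
import os
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class DagFetcher(ABC):
    """Fetches DAG source files from some location into a local directory."""

    @abstractmethod
    def fetch(self, uri: str, target_dir: str) -> str:
        """Download the artifact behind `uri` and return its local path."""


class LocalDagFetcher(DagFetcher):
    def fetch(self, uri: str, target_dir: str) -> str:
        # file:///path/to/dags/my_dag.py - nothing to download
        return urlparse(uri).path


class S3DagFetcher(DagFetcher):
    def fetch(self, uri: str, target_dir: str) -> str:
        import boto3  # assumed to be available on the worker

        parsed = urlparse(uri)  # e.g. s3://company_dagbag/some_dag.py
        local_path = os.path.join(target_dir, os.path.basename(parsed.path))
        boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_path)
        return local_path


FETCHERS = {"file": LocalDagFetcher(), "s3": S3DagFetcher()}


def fetch_dag(uri: str, target_dir: str) -> str:
    scheme = urlparse(uri).scheme or "file"
    return FETCHERS[scheme].fetch(uri, target_dir)
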
On Sun, Dec 22, 2019 at 12:16 PM Jarek Potiuk <[email protected]> wrote:

I also love the idea of a DAG fetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching it from different sources like local / .zip and then .wheel seems like an interesting approach. I think the important parts of whatever approach we come up with are:

- make it easy for development/iteration by the creator
- make it stable/manageable for deployment purposes
- make it manageable for incremental updates.

J.

On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:

I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point to fetch .py files from places like the local file system, S3 or GCS (that's what Composer actually does under the hood). As the next step we can think about wheels, zips and other more demanding packaging.

In my opinion, in the case of such "big" changes we should try to iterate in small steps, especially if we don't have any strong opinions yet.

Bests,
Tomek

On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:

I am in a "before-Xmas" mood, so I thought I would write more of my thoughts about it :).

*TL;DR: I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go Python-first instead.*

I also used to be in the "docker" camp, as it seemed kind of natural. Adding a DAG layer at package/run time seems like a natural thing to do. That seems to fit perfectly well some sophisticated production deployment models where people use a docker registry to deploy new software.

But in the meantime many more questions started to bother me:

- Is it really the case for all the deployment models and use cases in which Airflow is used?
- While it is a good model for some frozen-in-time production deployments, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly - does it fit the current philosophy of Airflow, and is it what its users expect?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.

After spending a year with Airflow, getting more embedded in its philosophy, talking to the users and especially looking at the "competition" we have, I changed my mind here. I don't think Airflow lives in a "Container-centric" world; it really lives in a "Python-centric" world, and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular and really liked by its users. If we go a bit too far into the "Docker/Container/Cloud Native" world, we might get a bit closer to some of our competitors (think Argo, for example) but we might lose quite a bit of the advantage we have.
That exact advantage is what makes us better for our users, different from the competition, and what lets us serve quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai) and I found that they think and act quite differently than DevOps or even traditional software engineers. And I think those people are our primary users. Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me, when I think "Airflow" I immediately think "Python". There are certain advantages to Airflow being python-first and python-focused. The main advantage is that the same people who are able to do data science feel comfortable writing the pipelines and using the pre-existing abstractions that make it easier for them to write those pipelines (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe python as their primary tool of choice. Using Jupyter Notebooks and writing data processing and machine learning experiments as python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do, and while they know about them and realise why containers are useful, it's best if they do not have to bother with containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerization/containerisation as much as possible and provide a simple, focused interface to manage it. People who know python don't necessarily need to understand containerisation in order to benefit from it. It's very similar to virtual machines or compilers - we make use of them without really knowing how they work internally. And that's perfectly OK - they don't have to.

Tying the deployment of Airflow DAGs to container images has the disadvantage that you have to include the whole step of packaging, distributing, sharing and using the image on the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate docker container - isolated from the rest, started pretty much from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker-swarm. The whole idea of having separate DAGs which can be updated independently and potentially have different dependencies, maybe other python code etc., pretty much means that for every single DAG you want to update, you need to package it as an extra layer in Docker, put it somewhere in a shared registry, switch your executors to use the new image, get it downloaded by the executor and restart the worker somehow (to start a container based on that new image). That's a lot of hassle just to update one line in a DAG.
Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in the DAG that they have to go through that process. They would need to understand how to check whether their image is properly built and distributed, whether the executor they run has already picked up the new image, whether the worker has already picked up the new image - and in the case of a spelling mistake they would have to repeat that whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without having to know about external tooling. This is the whole point of jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the idea from Databand.ai to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped make it merge-ready. It lets data scientists iterate and debug their DAGs using their familiar tools and processes (just as if they were debugging a python script) without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.
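
To give a rough idea of what this looks like in practice, here is a sketch of the pattern described in TESTING.rst (assuming the 1.10-era DAG.clear() / DAG.run() API): you add a small main block at the bottom of the DAG file and run that file as a plain python script.

# Run with the DebugExecutor configured, e.g.:
#   AIRFLOW__CORE__EXECUTOR=DebugExecutor python my_dag.py
# `dag` is the DAG object defined earlier in the same file; breakpoints set
# inside operators are hit because everything runs in this single process.
if __name__ == "__main__":
    dag.clear()  # reset previous task instance state for this DAG
    dag.run()    # execute the tasks sequentially, in-process
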
The deployment problem is also quite visible in container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! kubernetes!). And it even works perfectly well in a number of practical cases - for example, when each step requires complex processing, a number of dependencies, different binaries etc. But when you look at it more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is that it talks to other systems via APIs and orchestrates their work. There is hardly any processing on Airflow worker nodes. There are hardly any new requirements/dependencies needed in most cases. I really love that Airflow focuses on the "glue" layer between those external services. Again - the same people who do data engineering can interact over a python API with the services they use, put all the steps and logic as python code in the same DAG, iterate on it, change it and get immediate feedback - and even add a few lines of code if they need an extra parameter or so. Imagine the case where every step of your workflow is a Docker container to run - as a data engineer you have to use python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to this container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because those available in public registries are old or don't have some features exposed. That has happened to me many times - when I tried to use such workflows, I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than running 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of python code to do what I want. And I am super-familiar with Docker. I live and breathe Docker. But I can see how intimidating and difficult it must be for people who don't.

That's why I think that our basic and most common deployment model (even the one used in production) should be based on the python toolset, not on containers. Wheels seem like a great tool for python dependency management. In most cases, when we have just a few dependencies to install per task (for example, the python google libraries for google tasks) from a wheel in a running container and create a virtualenv for them, it might be comparable to or even faster than starting a whole new container with those packages installed as a layer. Not to mention the much smaller memory and cpu overhead if this is done within a running container rather than restarting the whole container for that task. Kubernetes and its deployment models are very well suited for long-running tasks that do a lot of work, but if you want to start a new container that brings up the whole python interpreter with all dependencies, with its own CPU/memory requirements, *JUST* to make an API call to an external service and wait for it to finish (most Airflow tasks are exactly this), it seems like a terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new POD for every task. We should still use containers under the hood, of course, and have deployments using Kubernetes etc. But this should be transparent to the people who write DAGs.

Sorry for such a long mail - I just think this is a super-important decision about the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. I think it should really be a foundation of how we implement some of the deployment-related features of Airflow 2.0, in order for it to stay relevant, preferred by our users and focused on those cases that it already does very well.

Let me know what you think. But in the meantime - have a great Xmas everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a

On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

This reminds me of the "DagFetcher" idea.
Basically, a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on s3", "pex on GFS", "docker on artifactory" or whatever makes sense for your organization. In the proposal I wrote about using a universal uri scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest

One challenge is around *not* serializing Airflow-specific code in the artifact/docker image, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Max

On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:

The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place so that we avoid the current `if format == zip then: else:` pattern and don't tightly couple to specific DAG sourcing implementations. Personally I feel that Docker makes more sense than wheels (since containers are fully self-contained even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:

Hi Jarek,

This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?

I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package meta-data.

Is that what you're envisioning? If so, I'd be happy to support you with the implementation!

Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like this to be optional, as it can also add a lot of overhead to running tasks.

Cheers,

Björn
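
For illustration, a DAG shipped as a Python package along the lines Björn describes might look roughly like the sketch below - note that the "airflow.dags" entry point group is hypothetical; Airflow has no such mechanism today:

# setup.py of a hypothetical DAG package - dependencies travel in the
# package metadata instead of being baked into the Airflow image.
from setuptools import setup

setup(
    name="my-team-dags",
    version="0.1.0",
    packages=["my_team_dags"],
    install_requires=["requests>=2.22"],  # DAG-specific dependencies
    entry_points={
        "airflow.dags": [  # hypothetical entry point group
            "daily_report = my_team_dags.daily_report:dag",
        ],
    },
)

On the loading side, a scheduler or worker could then discover installed DAG objects with something like pkg_resources.iter_entry_points("airflow.dags") and add them to its DagBag.
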
On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:

I had a lot of interesting discussions over the last few days with Apache Airflow users at PyData Warsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in the wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also a .zip, BTW). Maybe it was already mentioned in some discussions before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems in implementing AIP-8 (splitting hooks/operators into separate packages) is dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies because we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow the users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators.

The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem of AIP-8.

I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then we could have wheel packages with all the "extra" dependencies for each group of operators.

A worker executing an operator could have the "core" dependencies installed initially, but when it is supposed to run an operator, it could create a virtualenv, install the required "extra" from wheels and run the task for this operator in that virtualenv (and remove the virtualenv afterwards). We could have such package-wheels prepared (one wheel package per operator group) and distributed either the same way as DAGs or via some shared binary repository (and cached on the worker).

Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, those could be embedded in the DAG wheel, installed from it into the virtualenv, and the virtualenv would be removed after the task is finished.
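
A rough sketch of the worker-side mechanics (nothing here is existing Airflow code - the function, paths and wheel name are illustrative only):

# Create a throwaway virtualenv, install the DAG/operator wheel into it,
# run the task with the isolated interpreter, then remove the environment.
import os
import shutil
import subprocess
import tempfile
import venv


def run_task_in_isolated_env(wheel_path: str, task_args: list) -> int:
    env_dir = tempfile.mkdtemp(prefix="airflow-task-env-")
    try:
        venv.EnvBuilder(with_pip=True).create(env_dir)
        pip = os.path.join(env_dir, "bin", "pip")
        python = os.path.join(env_dir, "bin", "python")
        # installs the wheel together with its declared dependencies
        subprocess.check_call([pip, "install", wheel_path])
        # run the task using the interpreter from the isolated environment
        return subprocess.call([python] + task_args)
    finally:
        shutil.rmtree(env_dir, ignore_errors=True)


# hypothetical usage:
# run_task_in_isolated_env("my_dag-0.1.0-py3-none-any.whl", ["-m", "my_team_dags.run", "extract"])
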
The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me that whole idea sounds pretty cool.

Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

--
Chao-Han Tsai
