I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138).
I think it's a good and simple starting point to fetch .py files from places
like the local file system, S3 or GCS (that's what Composer actually does
under the hood). As a next step we can think about wheels, zips and other
more demanding packaging formats.
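
For illustration, here is a rough sketch of the kind of interface I have in
mind. The class and method names below are made up (not what the PR
implements), and the GCS variant assumes the google-cloud-storage client:

    import abc
    import shutil
    from pathlib import Path


    class DagFetcher(abc.ABC):
        """Fetches DAG .py files from some source into a local directory."""

        @abc.abstractmethod
        def fetch(self, source_path: str, target_dir: Path) -> Path:
            """Download/copy the DAG file and return its local path."""


    class LocalDagFetcher(DagFetcher):
        def fetch(self, source_path: str, target_dir: Path) -> Path:
            target = target_dir / Path(source_path).name
            shutil.copy(source_path, target)
            return target


    class GCSDagFetcher(DagFetcher):
        """Same idea for a GCS bucket - roughly what Composer's sync does."""

        def __init__(self, bucket_name: str):
            from google.cloud import storage
            self._bucket = storage.Client().bucket(bucket_name)

        def fetch(self, source_path: str, target_dir: Path) -> Path:
            target = target_dir / Path(source_path).name
            self._bucket.blob(source_path).download_to_filename(str(target))
            return target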

In my opinion, in the case of such "big" changes, we should try to iterate in
small steps - especially if we don't have any strong opinions.

Bests,
Tomek

On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]>
wrote:

> I am in "before-Xmas" mood so I thought I will write more of my thoughts
> about it :).
>
> *TL;DR: I try to reason (mostly looking at it from a philosophy/usage point
> of view) why a container-native approach might not be the best fit for
> Airflow and why we should go Python-first instead.*
>
> I also used to be in the "Docker" camp, as it seemed quite natural. Adding a
> DAG layer to the package at run time seems like a natural thing to do. That
> seems to fit perfectly well some sophisticated production deployment models
> where people use a Docker registry to deploy new software.
>
> But in the meantime many more questions started to bother me:
>
>    - Is that really the case for all the deployment models and ways in
>    which Airflow is used?
>    - While it is a good model for some frozen-in-time production
>    deployments, is it a good model to support the whole DAG lifecycle?
>    Think about initial development, debugging and iteration, but also
>    post-deployment maintenance and upgrades.
>    - More importantly - does it fit the current philosophy of Airflow, and
>    is it what its users expect?
>
> After asking those questions (and formulating some answers) I am not so
> sure any more that containerisation should be something Airflow bases its
> deployment model on.
>
> After spending a year with Airflow, getting more embedded in its
> philosophy, talking to users and especially looking at the "competition" we
> have - I changed my mind here. I don't think Airflow lives in a
> "container-centric" world; it really lives in a "Python-centric" world, and
> that is a conscious choice we should continue with in the future.
>
> I think there are a number of advantages of Airflow that make it so popular
> and really liked by its users. If we go a bit too far into the
> "Docker/Container/Cloud Native" world, we might get a bit closer to some of
> our competitors (think Argo, for example), but we might lose quite a bit of
> the advantage we have - the very advantage that makes us better for our
> users, different from the competition, and suited to quite different use
> cases than a "general workflow engine".
>
> While I am not a data scientist myself, I have interacted with data
> scientists and data engineers a lot (mostly while working as a robotics
> engineer at NoMagic.ai), and I found that they think and act quite
> differently from DevOps engineers or even traditional software engineers.
> And I think those people are our primary users. Looking at the results of
> our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around
> 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".
>
> Let me dive a bit deeper.
>
> For me when I think "Airflow" - I immediately think "Python". There are
> certain advantages of Airflow being python-first and python-focused. The
> main advantage is that the same people who are able to do data science feel
> comfortable with writing the pipelines and use pre-existing abstractions
> that make it easier for them to write the pipelines
> (DAGs/Operators/Sensors/...) . Those are mainly data scientist who live and
> breathe python as their primary tool of choice. Using Jupyter Notebooks,
> writing data processing and machine learning experiments as python scripts
> is part of their daily job. Docker and containers for them are merely an
> execution engine for whatever they do and while they know about it and
> realise why containers are useful - it's best if they do not have to bother
> about containerisation. Even if they use it, it should be pretty much
> transparent to them. This is in parts the reasoning behind developing
> Breeze - while it uses containers to take advantage of isolation and
> consistent environment for everyone it tries to hide the
> dockerization/containerisation as much as possible and provide a simple,
> focused interface to manage it. People who know python don't necessarily
> need to understand containerisation in order to make use of it's advantage.
> It's very similar to virtual machines, compilers etc. make use of them
> without really knowing how they work. And it's perfectly OK - they don't
> have to.
>
> Tying the deployment of Airflow DAGs to container images has the
> disadvantage that you have to include the whole step of packaging,
> distributing, sharing, and using the image on the Airflow "worker". It also
> basically means that every task execution in Airflow has to be a separate
> Docker container - isolated from the rest, started pretty much from scratch
> - either as part of a new Pod in Kubernetes or spun up as a new container
> via docker-compose or docker-swarm. With the whole idea of having separate
> DAGs which can be updated independently and potentially have different
> dependencies, maybe other Python code etc., this pretty much means that for
> every single DAG you want to update, you need to package it as an extra
> layer in Docker, put it somewhere in a shared registry, switch your
> executors to use the new image, get it downloaded by the executor, and
> restart the worker somehow (to start a container based on that new image).
> That's a lot of hassle just to update one line in a DAG. Surely we can
> automate that and make it fast, but it's quite difficult to explain to data
> scientists who just want to change one line in a DAG that they have to go
> through that process. They would need to understand how to check whether
> their image was properly built and distributed, whether the executor they
> run has already picked up the new image, whether the worker has already
> pulled it - and in the case of a spelling mistake they have to repeat the
> whole process again. That's hardly what data scientists are used to. They
> are used to trying something and seeing results as quickly as possible,
> without too much hassle or having to know about external tooling. This is
> the whole point of Jupyter notebooks, for example - you can incrementally
> change a single step in your whole process and continue iterating on the
> rest. This is one of the reasons we immediately loved Databand.ai's idea to
> develop the DebugExecutor
> <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing>
> and we helped make it merge-ready. It lets data scientists iterate on and
> debug their DAGs using their familiar tools and process (just as if they
> were debugging a Python script), without the hassle of learning new tools
> and changing the way they work. Tomek will soon write a blog post about it,
> but I think it's one of the best productivity improvements we could give
> our DAG-writing users in a long time.
>
> This problem is also quite visible with container-native workflow engines
> such as Argo, which force every single step of your workflow to be a Docker
> container. That sounds great in theory (containers! isolation!
> kubernetes!). And it even works perfectly well in a number of practical
> cases - for example when each step requires complex processing, has a
> number of dependencies, needs different binaries, etc. But when you look at
> it more closely - this is NOT the primary use case for Airflow. The primary
> use case of Airflow is to talk to other systems via their APIs and
> orchestrate their work. There is hardly any processing on Airflow worker
> nodes. There are hardly any new requirements/dependencies needed in most
> cases. I really love that Airflow focuses on the "glue" layer between those
> external services. Again - the same people who do the data engineering can
> interact over a Python API with the services they use, put all the steps
> and logic as Python code in the same DAG, iterate and change it and get
> immediate feedback - and even add a few lines of code if they need to add
> an extra parameter or so. Now imagine the case where every step of your
> workflow is a Docker container to run - as a data engineer you have to use
> Python to put the DAG together, then if you want to interact with an
> external service, you have to find an existing container that does it,
> figure out how to pass credentials to this container from your host (this
> is often non-trivial), and in many cases you find that in order to achieve
> what you want you have to build your own image, because those available in
> public registries are old or don't have some features exposed. It happened
> to me many times: when I tried to use such workflows, I was eventually
> forced to build my own Docker image and deploy it somewhere - even if I was
> just iterating and trying different things. That's far more complex than
> 'pip install <x>', adding '<x>' to setup.py, and adding one or two lines of
> Python code to do what I want. And I am super-familiar with Docker. I live
> and breathe Docker. But I can see how intimidating and difficult it must be
> for people who don't.
>
> That's why I think that our basic and most common deployment model (even
> the one used in production) should be based on the Python toolset - not
> containers. Wheels seem like a great tool for Python dependency management.
> I think that in most cases - when we have just a few dependencies to
> install per task (for example the Python Google libraries for Google tasks)
> - installing them from wheels in a running container and creating a
> virtualenv for them might be comparable to, or even faster than, restarting
> a whole new container with those packages installed as a layer. Not to
> mention the much smaller memory and CPU overhead if this is done within a
> running container rather than restarting the whole container for that task.
> Kubernetes and its deployment models are very well suited to long-running
> tasks that do a lot of work, but if you want to start a new container that
> starts a whole Python interpreter with all dependencies, with its own
> CPU/memory requirements, *JUST* to make an API call that starts an external
> service and waits for it to finish (most Airflow tasks are exactly this) -
> this seems like terrible overkill. It seems that the Native Executor
> <https://github.com/apache/airflow/pull/6750> idea discussed in the
> sig-scalability group - where we abstract away from the deployment model,
> use queues to communicate and keep the worker running to serve many
> subsequent tasks - is a much better idea than dedicated executors such as
> the KubernetesExecutor, which starts a new Pod for every task. We should
> still use containers under the hood of course, and have deployments using
> Kubernetes etc. But this should be transparent to the people who write
> DAGs.
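>
> (To make the wheel-plus-virtualenv idea concrete, here is a minimal,
> hypothetical sketch - the function name, the wheel path and the task
> entrypoint are made up, and a real implementation would need caching, error
> handling and cleanup policies:)
>
>     import subprocess
>     import tempfile
>     import venv
>     from pathlib import Path
>     from typing import List
>
>     def run_task_in_ephemeral_venv(task_module: str,
>                                    extra_wheels: List[str]) -> None:
>         """Create a throwaway virtualenv, install the task's extra wheels
>         into it, run the task there, then let the virtualenv be removed."""
>         with tempfile.TemporaryDirectory() as tmp:
>             venv_dir = Path(tmp) / "venv"
>             # Reuse the "core" packages already installed on the running
>             # worker instead of reinstalling everything from scratch.
>             venv.create(venv_dir, with_pip=True, system_site_packages=True)
>             python = str(venv_dir / "bin" / "python")
>             if extra_wheels:
>                 subprocess.run(
>                     [python, "-m", "pip", "install", *extra_wheels],
>                     check=True)
>             # Run the task entrypoint inside the isolated interpreter.
>             subprocess.run([python, "-m", task_module], check=True)
>
>     # Hypothetical usage - the wheel path is only an example:
>     # run_task_in_ephemeral_venv(
>     #     "my_dag_tasks.load_to_gcs",
>     #     ["/shared/wheels/my_dag_extras-1.0-py3-none-any.whl"])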
>
> Sorry for such a long mail - I just think this is a super-important
> decision about the philosophy of Airflow, which use cases it serves and how
> well it serves the whole lifecycle of DAGs - from debugging to maintenance -
> and I think it should really be a foundation for how we implement some of
> the deployment-related features of Airflow 2.0, in order for it to stay
> relevant, preferred by our users and focused on those cases that it already
> handles very well.
>
> Let me know what you think. But in the meantime - have a great Xmas
> Everyone!
>
> J.
>
>
> On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:
>
> > > For the docker example, you'd almost want to inject or "layer" the DAG
> > > script and airflow package at run time.
> >
> > Something sort of like Heroku buildpacks?
> >
> > -a
> >
> > On 20 December 2019 23:43:30 GMT, Maxime Beauchemin
> > <[email protected]> wrote:
> > >This reminds me of the "DagFetcher" idea. Basically a new abstraction
> > >that can fetch a DAG object from anywhere and run a task. In theory you
> > >could extend it to do "zip on S3", "pex on GFS", "docker on Artifactory"
> > >or whatever makes sense for your organization. In the proposal I wrote
> > >about using a universal URI scheme to identify DAG artifacts, with
> > >support for versioning, as in s3://company_dagbag/some_dag@latest
> > >
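> > >(A rough sketch of how such a reference could be parsed - the class and
> > >function names here are made up, just to illustrate the URI idea:)
> > >
> > >    from dataclasses import dataclass
> > >    from urllib.parse import urlparse
> > >
> > >    @dataclass
> > >    class DagArtifactRef:
> > >        scheme: str    # e.g. "s3", "gcs", "docker"
> > >        location: str  # bucket / registry host
> > >        dag_id: str
> > >        version: str   # e.g. "latest" or a pinned version
> > >
> > >    def parse_dag_uri(uri: str) -> DagArtifactRef:
> > >        """Parse a DAG URI like s3://company_dagbag/some_dag@latest."""
> > >        parsed = urlparse(uri)
> > >        dag_id, _, version = parsed.path.lstrip("/").partition("@")
> > >        return DagArtifactRef(parsed.scheme, parsed.netloc, dag_id,
> > >                              version or "latest")
> > >
> > >    # parse_dag_uri("s3://company_dagbag/some_dag@latest")
> > >    # -> DagArtifactRef(scheme='s3', location='company_dagbag',
> > >    #                   dag_id='some_dag', version='latest')
> > >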
> > >One challenge is around *not* serializing Airflow-specific code in the
> > >artifact/docker, otherwise you end up with a messy heterogeneous cluster
> > >that runs multiple Airflow versions. For the docker example, you'd almost
> > >want to inject or "layer" the DAG script and airflow package at run time.
> > >
> > >Max
> > >
> > >On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov
> > ><[email protected]>
> > >wrote:
> > >
> > >> The zip support is a bit of a hack and was a bit controversial when it
> > >> was added. I think if we go down the path of supporting more DAG
> > >> sources, we should make sure we have the right interface in place, so
> > >> we avoid the current `if format == zip then: else:` and make sure that
> > >> we don't tightly couple to specific DAG sourcing implementations.
> > >> Personally I feel that Docker makes more sense than wheels (since they
> > >> are fully self-contained even at the binary dependency level), but if
> > >> we go down the interface route it might be fine to add support for both
> > >> Docker and wheels.
> > >>
> > >> On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex
> > >> <[email protected]> wrote:
> > >>
> > >> > Hi Jarek,
> > >> >
> > >> > This sounds great. Is this possibly related to the work started in
> > >> > https://github.com/apache/airflow/pull/730?
> > >> >
> > >> > I'm not sure I’m following your proposal entirely. Initially, a great
> > >> > first step would be to support loading DAGs from an entry_point, as
> > >> > proposed in the closed PR above. This would already enable most of
> > >> > the features you’ve mentioned below. Each DAG could be a Python
> > >> > package, and it would carry all the information about required
> > >> > packages in its package meta-data.
> > >> >
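> > >> > Concretely, I imagine something roughly like this (just a sketch - the
> > >> > "airflow.dags" entry point group name is made up, and the real name
> > >> > would come from whatever interface we agree on):
> > >> >
> > >> >     # setup.py of a hypothetical per-DAG package:
> > >> >     from setuptools import setup
> > >> >
> > >> >     setup(
> > >> >         name="my-sales-dag",
> > >> >         version="1.0.0",
> > >> >         py_modules=["my_sales_dag"],
> > >> >         install_requires=["pandas>=0.25"],  # the DAG's own dependencies
> > >> >         entry_points={
> > >> >             "airflow.dags": ["my_sales_dag = my_sales_dag:dag"],
> > >> >         },
> > >> >     )
> > >> >
> > >> >     # The scheduler/worker could then discover DAGs roughly like this:
> > >> >     import pkg_resources
> > >> >
> > >> >     for entry_point in pkg_resources.iter_entry_points("airflow.dags"):
> > >> >         dag = entry_point.load()  # the DAG object exported by the package
> > >> >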
> > >> > Is that what you’re envisioning? If so, I’d be happy to support you
> > >> > with the implementation!
> > >> >
> > >> > Also, I think that while the idea of creating a temporary virtual
> > >> > environment for running tasks is very useful, I’d like this to be
> > >> > optional, as it can also add a lot of overhead to running tasks.
> > >> >
> > >> > Cheers,
> > >> >
> > >> >         Björn
> > >> >
> > >> > > On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:
> > >> > >
> > >> > > I had a lot of interesting discussions over the last few days with
> > >> > > Apache Airflow users at PyData Warsaw 2019 (I was actually quite
> > >> > > surprised how many people use Airflow in Poland). One discussion
> > >> > > brought up an interesting subject: packaging DAGs in the wheel
> > >> > > format. The users mentioned that they are super-happy using
> > >> > > .zip-packaged DAGs, but they think it could be improved with the
> > >> > > wheel format (which is also a .zip, BTW). Maybe it was already
> > >> > > mentioned in some discussions before, but I have not found any.
> > >> > >
> > >> > > *Context:*
> > >> > >
> > >> > > We are well on the way to implementing "AIP-21 Changing import
> > >> > > paths" and will provide backport packages for Airflow 1.10. As a
> > >> > > next step we want to target AIP-8. One of the problems with
> > >> > > implementing AIP-8 (splitting hooks/operators into separate
> > >> > > packages) is the problem of dependencies. Different operators/hooks
> > >> > > might have different dependencies if maintained separately.
> > >> > > Currently we have a common set of dependencies as we have only one
> > >> > > setup.py, but if we split into separate packages, this might change.
> > >> > >
> > >> > > *Proposal:*
> > >> > >
> > >> > > Our users - who love the .zip DAG distribution - proposed that we
> > >> > > package the DAGs and all related packages in a wheel package
> > >> > > instead of a pure .zip. This would allow the users to install extra
> > >> > > dependencies needed by the DAG. And it struck me that we could
> > >> > > indeed do that for DAGs, but also mitigate most of the dependency
> > >> > > problems for separately-packaged operators.
> > >> > >
> > >> > > The proposal from our users was to package the extra dependencies
> > >> > > together with the DAG in a wheel file. This is quite cool on its
> > >> > > own, but I thought we might actually use the same approach to solve
> > >> > > the dependency problem of AIP-8.
> > >> > >
> > >> > > I think we could implement "operator group" -> extra -> "pip
> > >> > > packages" dependencies (we need them anyway for AIP-21) and then we
> > >> > > could have wheel packages with all the "extra" dependencies for
> > >> > > each group of operators.
> > >> > >
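> > >> > > Just to illustrate what I mean (a rough sketch - the group name and
> > >> > > the package list below are purely examples, not a definitive
> > >> > > mapping):
> > >> > >
> > >> > >     # setup.py of a hypothetical "google" operator-group package,
> > >> > >     # declaring that group's "extra" pip dependencies:
> > >> > >     from setuptools import setup
> > >> > >
> > >> > >     setup(
> > >> > >         name="airflow-operators-google",
> > >> > >         version="1.0.0",
> > >> > >         packages=["airflow_operators_google"],
> > >> > >         install_requires=[
> > >> > >             "google-cloud-storage>=1.20",
> > >> > >             "google-cloud-bigquery>=1.21",
> > >> > >         ],
> > >> > >     )
> > >> > >
> > >> > >     # Building the wheel that a worker could later install into a
> > >> > >     # per-task virtualenv:
> > >> > >     #
> > >> > >     #     pip wheel . -w /shared/wheels/
> > >> > >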
> > >> > > A worker executing an operator could have the "core" dependencies
> > >> > > installed initially, but when it is supposed to run an operator, it
> > >> > > could create a virtualenv, install the required "extra" from
> > >> > > wheels, run the task for this operator in that virtualenv (and then
> > >> > > remove the virtualenv). We could have such wheel packages prepared
> > >> > > (one wheel package per operator group) and distributed either the
> > >> > > same way as DAGs or via some shared binary repository (and cached
> > >> > > on the worker).
> > >> > >
> > >> > > Having such a dynamically created virtualenv also has the
> > >> > > advantage that if someone has a DAG with specific dependencies,
> > >> > > these could be embedded in the DAG wheel, installed from it into
> > >> > > this virtualenv, and the virtualenv would be removed after the task
> > >> > > finishes.
> > >> > >
> > >> > > The advantage of this approach is that each DAG's extra
> > >> > > dependencies are isolated, and you could even have different
> > >> > > versions of the same dependency used by different DAGs. I think
> > >> > > that could save a lot of headaches for many users.
> > >> > >
> > >> > > For me that whole idea sounds pretty cool.
> > >> > >
> > >> > > Let me know what you think.
> > >> > >
> > >> > > J.
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > > Jarek Potiuk
> > >> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >> > >
> > >> > > M: +48 660 796 129 <+48660796129>
> > >> >
> > >> >
> > >>
> >
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>
