I am in "before-Xmas" mood so I thought I will write more of my thoughts
about it :).

*TL;DR; I try to reason (mostly looking at it from the philosophy/usage
point of view) why a container-native approach might not be the best fit
for Airflow and why we should go python-first instead.*

I also used to be in the "docker" camp as it seemed kinda natural. Adding
a DAG layer to the image at package/run time seems like a natural thing to
do. That seems to fit perfectly well some sophisticated production
deployment models where people use a docker registry to deploy new
software.

But in the meantime many more questions started to bother me:

   - Is it really the case for all the deployment models and use cases in
   which Airflow is used?
   - While it is a good model for some frozen-in-time production
   deployments, is it a good model to support the whole DAG lifecycle? Think
   about initial development, debugging and iteration, but also about
   post-deployment maintenance and upgrades.
   - More importantly - does it fit the current philosophy of Airflow and
   is it expected by its users?

After asking those questions (and formulating some answers) I am not so
sure any more that containerisation should be something Airflow bases its
deployment model on.

After spending a year with Airflow, getting more embedded in its
philosophy, talking to the users and especially looking at the
"competition" we have - I changed my mind here. I don't think Airflow is in
the "Container-centric" world - it is really in the "Python-centric" world,
and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular
and really liked by the users. If we go a bit too far into the
"Docker/Container/Cloud Native" world - we might get a bit closer to some
of our competitors (think Argo for example) but we might lose quite a bit
of the advantage we have. The exact advantage that makes us better for our
users, different from the competition, and that lets us serve quite
different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data
scientists and data engineers a lot (mostly while working as a robotics
engineer at NoMagic.ai) and I found that they think and act quite
differently than DevOps engineers or even traditional Software Engineers.
And I think those people are our primary users. Looking at the results of
our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around
70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me when I think "Airflow" - I immediately think "Python". There are
certain advantages to Airflow being python-first and python-focused. The
main advantage is that the same people who are able to do data science feel
comfortable writing the pipelines and using pre-existing abstractions that
make it easier for them to write those pipelines
(DAGs/Operators/Sensors/...). Those are mainly data scientists who live and
breathe python as their primary tool of choice. Using Jupyter Notebooks and
writing data processing and machine learning experiments as python scripts
is part of their daily job. Docker and containers for them are merely an
execution engine for whatever they do, and while they know about them and
realise why containers are useful - it's best if they do not have to bother
about containerisation. Even if they use it, it should be pretty much
transparent to them. This is in part the reasoning behind developing
Breeze - while it uses containers to take advantage of isolation and a
consistent environment for everyone, it tries to hide the
dockerization/containerisation as much as possible and provide a simple,
focused interface to manage it. People who know python don't necessarily
need to understand containerisation in order to make use of its advantages.
It's very similar to virtual machines or compilers - people make use of
them without really knowing how they work. And it's perfectly OK - they
don't have to.

Tying the deployment of Airflow DAGs to container images has the
disadvantage that you have to include the whole step of packaging,
distributing, sharing, and using the image by the Airflow "worker". It also
basically means that every task execution of Airflow has to be a separate
docker container - isolated from the rest, started pretty much totally from
scratch - either as part of a new Pod in Kubernetes or spun off as a new
container via docker-compose or docker-swarm. Combined with the whole idea
of having separate DAGs which can be updated independently and potentially
have different dependencies, maybe other python code etc. - this means
pretty much that for every single DAG that you want to update, you need to
package it as an extra layer in Docker, put it somewhere in a shared
registry, switch your executors to use the new image, get it downloaded by
the executor, and restart the worker somehow (to start a container based on
that new image). That's a lot of hassle just to update one line in a DAG.
Surely we can automate that and make it fast, but it's quite difficult to
explain to data scientists who just want to change one line in the DAG that
they have to go through that process. They would need to understand how to
check if their image is properly built and distributed, if the executor
they run has already picked up the new image, if the worker has already
picked up the new image - and in the case of a spelling mistake they would
have to repeat the whole process again. That's hardly what data scientists
are used to. They are used to trying something and seeing results as
quickly as possible, without too much hassle and without having to know
about external tooling. This is the whole point of jupyter notebooks for
example - you can incrementally change a single step in your whole process
and continue iterating on the rest. This is one of the reasons we
immediately loved the Databand.ai idea to develop the DebugExecutor
<https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and
we helped in making it merge-ready. It lets data scientists iterate on and
debug their DAGs using their familiar tools and process (just as if they
were debugging a python script) without the hassle of learning new tools
and changing the way they work. Tomek will soon write a blog post about it,
but I think it's one of the best productivity improvements we could give
our DAG-writing users in a long time.
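
To give a flavour of what that looks like, here is a rough sketch based on
the DAG-testing docs linked above (the exact imports and the way the
executor gets selected may differ between versions, and the DAG itself is
just a made-up example) - you add a small __main__ block to the DAG file
and the whole DAG runs in a single local python process that you can step
through with pdb or your IDE:

    # example_debug_dag.py - a regular DAG file with a debug entrypoint.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    with DAG(dag_id="example_debug",
             start_date=datetime(2019, 12, 1),
             schedule_interval="@once") as dag:
        hello = PythonOperator(task_id="hello",
                               python_callable=lambda: print("hello"))

    if __name__ == "__main__":
        # Run the whole DAG in-process with the DebugExecutor so that
        # breakpoints set inside the tasks are hit directly.
        from airflow.executors.debug_executor import DebugExecutor

        dag.clear()
        dag.run(executor=DebugExecutor())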

This problem is also quite visible with container-native workflow engines
such as Argo that force every single step of your workflow to be a Docker
container. That sounds great in theory (containers! isolation!
kubernetes!). And it even works perfectly well in a number of practical
cases - for example when each step requires complex processing, a number of
dependencies, different binaries etc. But when you look at it more
closely - this is NOT the primary use case for Airflow. The primary use
case of Airflow is that it talks to other systems via APIs and orchestrates
their work. There is hardly any processing on Airflow worker nodes. There
are hardly any new requirements/dependencies needed in most cases. I really
love that Airflow actually focuses on the "glue" layer between those
external services. Again - the same people who do data engineering can
interact over python APIs with the services they use, put all the steps and
logic as python code in the same DAG, iterate and change it and get
immediate feedback - and even add a few lines of code if they need to add
an extra parameter or so. Imagine the case where every step of your
workflow is a Docker container to run - as a data engineer you have to use
python to put the DAG together, then if you want to interact with an
external service, you have to find an existing container that does it,
figure out how to pass credentials to this container from your host (this
is often non-trivial), and in many cases you find that in order to achieve
what you want you have to build your own image because those available in
public registries are old or don't have some features exposed. It happened
to me many times that when I tried to use such workflows, I was eventually
forced to build my own Docker image and deploy it somewhere - even if I was
just iterating and trying different things. That's far more complex than
'pip install <x>', adding '<x>' to setup.py and adding one or two lines of
python code to do what I want. And I am super-familiar with Docker. I live
and breathe Docker. But I can see how intimidating and difficult it must be
for people who don't.
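
To make that concrete, here is a hypothetical sketch of such a "glue" step
in plain Airflow python (the connection id, endpoint and payload are all
made up) - adding an extra parameter is a one-line change in the DAG file,
with nothing to rebuild or push anywhere:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.http_operator import SimpleHttpOperator

    with DAG(dag_id="glue_example",
             start_date=datetime(2019, 12, 1),
             schedule_interval="@daily") as dag:
        # Kick off a job in an external service over its HTTP API.
        # "my_service" is a hypothetical Airflow connection stored in the
        # metadata DB - credentials live there, not in a container image.
        start_job = SimpleHttpOperator(
            task_id="start_external_job",
            http_conn_id="my_service",
            endpoint="api/v1/jobs",
            method="POST",
            data='{"dataset": "sales", "date": "{{ ds }}"}',
        )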

That's why I think that our basic and most common deployment model (even
the one used in production) should be based on the python toolset - not
containers. Wheels seem like a great tool for python dependency management.
I think in most cases - when we have just a few dependencies to install per
task (for example the python google libraries for google tasks) -
installing them from wheels inside a running container and creating a
virtualenv for them might be comparably fast or even faster than starting a
whole new container with those packages installed as a layer. Not to
mention the much smaller memory and cpu overhead if this is done within a
running container, rather than starting a whole container for that task.
Kubernetes and its deployment models are very well suited for long-running
tasks that do a lot of work, but if you want to start a new container that
starts the whole python interpreter with all dependencies, with its own
CPU/Memory requirements, *JUST* to make an API call to start an external
service and wait for it to finish (most Airflow tasks are exactly this) -
this seems like terrible overkill. It seems that the Native Executor
<https://github.com/apache/airflow/pull/6750> idea discussed in the
sig-scalability group - where we abstract away from the deployment model,
use queues to communicate and keep the worker running to serve many
subsequent tasks - is a much better idea than dedicated executors such as
the KubernetesExecutor, which starts a new Pod for every task. We should
still use containers under the hood of course, and have deployments using
Kubernetes etc. But this should be transparent to the people who write DAGs.
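
As a very rough sketch of that mechanism (just an illustration, not a
proposed implementation - the wheel file name and task module are made up),
a long-running worker could do something like this per task, using only the
standard library:

    # Install per-task extras into a throw-away virtualenv inside an
    # already-running worker - no new container, no image pull.
    import subprocess
    import sys
    import tempfile
    from pathlib import Path


    def run_task_in_virtualenv(task_module, extra_wheels):
        with tempfile.TemporaryDirectory() as venv_dir:
            # Create the virtualenv with the interpreter we already run.
            subprocess.check_call([sys.executable, "-m", "venv", venv_dir])
            pip = str(Path(venv_dir) / "bin" / "pip")
            python = str(Path(venv_dir) / "bin" / "python")
            if extra_wheels:
                # Pre-built wheels install in seconds - no compilation.
                subprocess.check_call([pip, "install", *extra_wheels])
            # Execute the task entrypoint with its isolated dependencies.
            subprocess.check_call([python, "-m", task_module])
        # Leaving the context removes the virtualenv again.


    # Hypothetical usage: a Google-related task needing one extra wheel.
    run_task_in_virtualenv(
        task_module="my_dags.tasks.load_to_bigquery",
        extra_wheels=["google_cloud_bigquery-1.22.0-py3-none-any.whl"],
    )

The point is only that a virtualenv created inside a warm worker avoids the
cold-start cost of a whole new container for a task that mostly waits on an
external API.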

Sorry for such a long mail - I just think this is a super-important
decision about the philosophy of Airflow, which use cases it serves and how
well it serves the whole lifecycle of DAGs - from debugging to
maintenance - and I think it should really be a foundation of how we
implement some of the deployment-related features of Airflow 2.0, in order
for it to stay relevant, preferred by our users and focused on those cases
that it already serves very well.

Let me know what you think. But in the meantime - have a great Xmas,
everyone!

J.


On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

> > For the docker example, you'd almost
> want to inject or "layer" the DAG script and airflow package at run time.
>
> Something sort of like Heroku build packs?
>
> -a
>
> On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <
> [email protected]> wrote:
> >This reminds me of the "DagFetcher" idea. Basically a new abstraction
> >that
> >can fetch a DAG object from anywhere and run a task. In theory you
> >could
> >extend it to do "zip on s3", "pex on GFS", "docker on artifactory" or
> >whatever makes sense to your organization. In the proposal I wrote
> >about
> >using a universal uri scheme to identify DAG artifacts, with support
> >for
> >versioning, as in s3://company_dagbag/some_dag@latest
> >
> >One challenge is around *not* serializing Airflow specific code in the
> >artifact/docker, otherwise you end up with a messy heterogenous cluster
> >that runs multiple Airflow versions. For the docker example, you'd
> >almost
> >want to inject or "layer" the DAG script and airflow package at run
> >time.
> >
> >Max
> >
> >On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov
> ><[email protected]>
> >wrote:
> >
> >> The zip support is a bit of a hack and was a bit controversial when
> >it was
> >> added. I think if we go down the path of supporting more DAG sources,
> >we
> >> should make sure we have the right interface in place so we avoid the
> >> current `if format == zip then: else:` and make sure that we don't
> >tightly
> >> couple to specific DAG sourcing implementations. Personally I feel
> >that
> >> Docker makes more sense than wheels (since they are fully
> >self-contained
> >> even at the binary dependency level), but if we go down the interface
> >route
> >> it might be fine to add support for both Docker and wheels.
> >>
> >> On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex
> >> <[email protected]> wrote:
> >>
> >> > Hi Jarek,
> >> >
> >> > This sounds great. Is this possibly related to the work started in
> >> > https://github.com/apache/airflow/pull/730? <
> >> > https://github.com/apache/airflow/pull/730?>
> >> >
> >> > I'm not sure I’m following your proposal entirely. Initially, what
> >would
> >> > be a great first step would be to support loading DAGs from
> >entry_point,
> >> as
> >> > proposed in the closed PR above. This would already enable most of
> >the
> >> > features you’ve mentioned below. Each DAG could be a Python
> >package, and
> >> it
> >> > would carry all the information about required packages in its
> >package
> >> > meta-data.
> >> >
> >> > Is that what you’re envisioning? If so, I’d be happy to support you
> >with
> >> > the implementation!
> >> >
> >> > Also, I think while the idea of creating a temporary virtual
> >environment
> >> > for running tasks is very useful, I’d like this to be optional, as
> >it can
> >> > also create a lot of overhead to running tasks.
> >> >
> >> > Cheers,
> >> >
> >> >         Björn
> >> >
> >> > > On 14. Dec 2019, at 11:10, Jarek Potiuk
> ><[email protected]>
> >> > wrote:
> >> > >
> >> > > I had a lot of interesting discussions last few days with Apache
> >> Airflow
> >> > > users at PyDataWarsaw 2019 (I was actually quite surprised how
> >many
> >> > people
> >> > > use Airflow in Poland). One discussion brought an interesting
> >subject:
> >> > > Packaging dags in wheel format. The users mentioned that they are
> >> > > super-happy using .zip-packaged DAGs but they think it could be
> >> improved
> >> > > with wheel format (which is also .zip BTW). Maybe it was already
> >> > mentioned
> >> > > in some discussions before but I have not found any.
> >> > >
> >> > > *Context:*
> >> > >
> >> > > We are well on the way of implementing "AIP-21 Changing import
> >paths"
> >> and
> >> > > will provide backport packages for Airflow 1.10. As a next step
> >we want
> >> > to
> >> > > target AIP-8.
> >> > > One of the problems to implement AIP-8 (split hooks/operators
> >into
> >> > separate
> >> > > packages) is the problem of dependencies. Different
> >operators/hooks
> >> might
> >> > > have different dependencies if maintained separately. Currently
> >we
> >> have a
> >> > > common set of dependencies as we have only one setup.py, but if
> >we
> >> split
> >> > to
> >> > > separate packages, this might change.
> >> > >
> >> > > *Proposal:*
> >> > >
> >> > > Our users - who love the .zip DAG distribution - proposed that we
> >> package
> >> > > the DAGs and all related packages in a wheel package instead of
> >pure
> >> > .zip.
> >> > > This would allow the users to install extra dependencies needed
> >by the
> >> > DAG.
> >> > > And it struck me that we could indeed do that for DAGs but also
> >> mitigate
> >> > > most of the dependency problems for separately-packaged
> >operators.
> >> > >
> >> > > The proposal from our users was to package the extra dependencies
> >> > together
> >> > > with the DAG in a wheel file. This is quite cool on it's own, but
> >I
> >> > thought
> >> > > we might actually use the same approach to solve dependency
> >problem
> >> with
> >> > > AIP-8.
> >> > >
> >> > > I think we could implement "operator group" -> extra -> "pip
> >packages"
> >> > > dependencies (we need them anyway for AIP-21) and then we could
> >have
> >> > wheel
> >> > > packages with all the "extra" dependencies for each group of
> >operators.
> >> > >
> >> > > Worker executing an operator could have the "core" dependencies
> >> installed
> >> > > initially but then when it is supposed to run an operator it
> >could
> >> > create a
> >> > > virtualenv, install the required "extra" from wheels and run the
> >task
> >> for
> >> > > this operator in this virtualenv (and remove virtualenv). We
> >could have
> >> > > such package-wheels prepared (one wheel package per operator
> >group) and
> >> > > distributed either same way as DAGs or using some shared binary
> >> > repository
> >> > > (and cached in the worker).
> >> > >
> >> > > Having such dynamically created virtualenv has also the advantage
> >that
> >> if
> >> > > someone has a DAG with specific dependencies - they could be
> >embedded
> >> in
> >> > > the DAG wheel, installed from it to this virtualenv, and the
> >virtualenv
> >> > > would be removed after the task is finished.
> >> > >
> >> > > The advantage of this approach is that each DAG's extra
> >dependencies
> >> are
> >> > > isolated and you could have even different versions of the same
> >> > dependency
> >> > > used by different DAGs. I think that could save a lot of
> >headaches for
> >> > many
> >> > > users.
> >> > >
> >> > > For me that whole idea sounds pretty cool.
> >> > >
> >> > > Let me know what you think.
> >> > >
> >> > > J.
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > Jarek Potiuk
> >> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >> > >
> >> > > M: +48 660 796 129 <+48660796129>
> >> > > [image: Polidea] <https://www.polidea.com/>
> >> >
> >> >
> >>
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
