I also love the idea of a DagFetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching DAGs from different sources - local files, then .zip, then wheels - seems like an interesting progression. I think the important parts of whatever approach we come up with are:
- make it easy for development/iteration by the creator
- make it stable/manageable for deployment purposes
- make it manageable for incremental updates.

J.

On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:

I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point for fetching .py files from places like the local file system, S3 or GCS (which is what Composer actually does under the hood). As a next step we can think about wheels, zips and other more demanding packaging.

In my opinion, in the case of such "big" changes we should try to iterate in small steps, especially if we don't have any strong opinions.

Bests,
Tomek

On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:

I am in a "before-Xmas" mood, so I thought I would write up more of my thoughts about it :).

*TL;DR; I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go Python-first instead.*

I also used to be in the "Docker" camp, as it seemed quite natural. Adding a DAG layer at package runtime seems like a natural thing to do, and it fits perfectly well with some sophisticated production deployment models where people use a Docker registry to deploy new software.

But in the meantime many more questions started to bother me:

- Is it really the case for all the deployment models and use cases in which Airflow is used?
- While it is a good model for some frozen-in-time production deployments, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly, does it fit the current philosophy of Airflow, and is it what its users expect?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.

After spending a year with Airflow, getting more embedded in its philosophy, talking to users and especially looking at the "competition" we have, I changed my mind here. I don't think Airflow lives in a "Container-centric" world; it is really a "Python-centric" world, and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular and genuinely liked by its users. If we go a bit too far into the "Docker/Container/Cloud Native" world, we might get closer to some of our competitors (think Argo, for example), but we might also lose quite a bit of the advantage we have - the exact advantage that makes us better for our users, different from the competition, and suited to quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai), and I found that they think and act quite differently from DevOps engineers or even traditional software engineers. And I think those people are our primary users.
Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me, when I think "Airflow" I immediately think "Python". There are certain advantages to Airflow being Python-first and Python-focused. The main one is that the same people who are able to do data science feel comfortable writing the pipelines, using the pre-existing abstractions that make it easier for them (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe Python as their primary tool of choice. Using Jupyter notebooks and writing data processing and machine learning experiments as Python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do, and while they know about containers and realise why they are useful, it's best if they do not have to bother about containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerisation/containerisation as much as possible and provide a simple, focused interface to manage it. People who know Python don't necessarily need to understand containerisation in order to benefit from it. It's very similar to virtual machines or compilers - people make use of them without really knowing how they work, and that's perfectly OK; they don't have to.

Tying the deployment of Airflow DAGs to container images has the disadvantage that you have to include the whole step of packaging, distributing, sharing and using the image on the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate Docker container - isolated from the rest and started pretty much from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker swarm. The whole idea of having separate DAGs which can be updated independently, potentially with different dependencies and maybe other Python code, means that for every single DAG you want to update, you need to package it as an extra layer in a Docker image, put the image in a shared registry, switch your executors to use the new image, get it downloaded by the executor, and somehow restart the worker (to start a container based on the new image). That's a lot of hassle just to update one line in a DAG. Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in a DAG that they have to go through that whole process. They would need to understand how to check whether their image was properly built and distributed, whether the executor they run has already picked up the new image, whether the worker has already picked up the new image - and in the case of a spelling mistake they would have to repeat the whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without having to know about external tooling.
This is the whole point of Jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the Databand.ai idea to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped make it merge-ready. It lets data scientists iterate on and debug their DAGs using their familiar tools and process (just as if they were debugging a Python script), without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.

This problem is also quite visible with container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! Kubernetes!), and it even works perfectly well in a number of practical cases - for example when each step requires complex processing, a number of dependencies, different binaries and so on. But when you look more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is talking to other systems via their APIs and orchestrating their work. There is hardly any processing on Airflow worker nodes, and hardly any new requirements/dependencies are needed in most cases. I really love that Airflow focuses on the "glue" layer between those external services. Again - the same people who do data engineering can interact over a Python API with the services they use, put all the steps and logic as Python code in the same DAG, iterate on it, change it and get immediate feedback - and even add a few lines of code if they need an extra parameter or so. Imagine instead the case where every step of your workflow is a Docker container to run: as a data engineer you have to use Python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to that container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because those available in public registries are old or don't expose some features. It happened to me many times when I tried to use such workflows - I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than running 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of Python code to do what I want. And I am super-familiar with Docker - I live and breathe Docker - but I can see how intimidating and difficult it must be for people who don't.

That's why I think that our basic and most common deployment model (even the one used in production) should be based on a Python toolset, not containers. Wheels seem like a great tool for Python dependency management.
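To make the wheel-based alternative a bit more concrete, here is a minimal sketch - not an Airflow API; the helper function, the wheel directory and the task-module convention are all made up for illustration - of installing a task's extra dependencies from pre-built wheels into a throwaway virtualenv inside an already-running worker:

    import subprocess
    import tempfile
    import venv
    from pathlib import Path
    from typing import List

    def run_task_in_throwaway_venv(task_module: str, wheels: List[str]) -> None:
        """Create a short-lived virtualenv, install the task's extra
        dependencies from local wheel files, run the task in it, then
        throw the environment away. Illustrative sketch only."""
        with tempfile.TemporaryDirectory() as env_dir:
            venv.create(env_dir, with_pip=True)
            python = Path(env_dir) / "bin" / "python"  # POSIX layout
            # Install only what this task needs, from wheels cached on the
            # worker - no network, no image rebuild, core env untouched.
            subprocess.check_call(
                [str(python), "-m", "pip", "install", "--no-index",
                 "--find-links", "/opt/airflow/wheels", *wheels]
            )
            # Run the task code in the isolated interpreter.
            subprocess.check_call([str(python), "-m", task_module])
        # TemporaryDirectory removes the virtualenv afterwards.

The point of the sketch is only that the isolation boundary is a virtualenv plus wheels rather than a container image, so changing one line in a DAG or adding one dependency does not require rebuilding and redistributing an image.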
I think that in most cases, when we have just a few dependencies to install per task (for example the Python Google libraries for Google tasks), installing them from wheels inside a running container and creating a virtualenv for them might be comparable to, or even faster than, starting a whole new container with those packages installed as a layer - not to mention the much smaller memory and CPU overhead of doing this within a running container rather than restarting a whole container for that task. Kubernetes and its deployment models are very well suited to long-running tasks that do a lot of work, but starting a new container that boots a whole Python interpreter with all dependencies and its own CPU/memory requirements *JUST* to make an API call to an external service and wait for it to finish (most Airflow tasks are exactly this) seems like terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate, and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new pod for every task. We should of course still use containers under the hood and have Kubernetes-based deployments, but this should be transparent to the people who write DAGs.

Sorry for such a long mail - I just think this is a super-important decision about the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. I think it should really be a foundation for how we implement some of the deployment-related features of Airflow 2.0, so that it stays relevant, preferred by our users, and focused on the cases it already serves very well.

Let me know what you think. But in the meantime - have a great Xmas everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a

On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

This reminds me of the "DagFetcher" idea: basically a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on S3", "pex on GFS", "docker on Artifactory" or whatever makes sense to your organization. In the proposal I wrote about using a universal URI scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest.

One challenge is around *not* serializing Airflow-specific code in the artifact/docker image, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.
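A very rough sketch of what such a fetcher abstraction could look like, with one fetcher registered per URI scheme so that new artifact types ("zip on S3", "wheel on GCS", ...) become new implementations rather than new if/else branches. All class and function names here are illustrative, not an actual Airflow API:

    from abc import ABC, abstractmethod
    from typing import Dict
    from urllib.parse import urlparse

    class DagFetcher(ABC):
        """Fetches the DAG artifact behind a URI and returns a local path
        that the DagBag can load. Illustrative sketch only."""

        @abstractmethod
        def fetch(self, location: str, version: str) -> str:
            ...

    # One fetcher per URI scheme; supporting a new source means registering
    # a new class instead of adding format-specific branching.
    _FETCHERS: Dict[str, DagFetcher] = {}

    def register_fetcher(scheme: str, fetcher: DagFetcher) -> None:
        _FETCHERS[scheme] = fetcher

    def fetch_dag(uri: str) -> str:
        """Resolve e.g. s3://company_dagbag/some_dag@latest to a local path."""
        parsed = urlparse(uri)
        path, _, version = parsed.path.partition("@")
        return _FETCHERS[parsed.scheme].fetch(
            f"{parsed.netloc}{path}", version or "latest"
        )

A local-filesystem fetcher would just return the path; an S3 or GCS fetcher would download and cache the artifact; a wheel fetcher could additionally install it into a per-task virtualenv as discussed earlier in the thread.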
Max

On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:

The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place, so that we avoid the current `if format == zip then: else:` pattern and don't tightly couple to specific DAG-sourcing implementations. Personally I feel that Docker makes more sense than wheels (since Docker images are fully self-contained, even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:

Hi Jarek,

This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?

I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package metadata.

Is that what you're envisioning? If so, I'd be happy to support you with the implementation!

Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like it to be optional, as it can also add a lot of overhead to running tasks.

Cheers,

Björn

On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:

I had a lot of interesting discussions over the last few days with Apache Airflow users at PyData Warsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in the wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also .zip, by the way). Maybe it was already mentioned in some discussions before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is the problem of dependencies. Different operators/hooks might have different dependencies if maintained separately.
Currently we have a common set of dependencies because we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately packaged operators.

The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem with AIP-8.

I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then have wheel packages with all the "extra" dependencies for each group of operators.

A worker executing an operator could have the "core" dependencies installed initially, but when it is supposed to run an operator it could create a virtualenv, install the required "extra" from wheels, run the task for this operator in that virtualenv, and then remove the virtualenv. We could have such wheel packages prepared (one wheel package per operator group) and distributed either the same way as DAGs or via a shared binary repository (and cached on the worker).

Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, those can be embedded in the DAG wheel, installed from it into the virtualenv, and the virtualenv is removed after the task finishes.

The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me, that whole idea sounds pretty cool.
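To make the packaging side a bit more concrete, a minimal setup.py for such a DAG wheel could look roughly like the sketch below. The project name, the "google" extra and the "airflow.dags" entry-point group are all made up for illustration (the entry-point part follows the idea from PR 730 rather than anything Airflow supports today):

    # setup.py for a hypothetical DAG package distributed as a wheel.
    from setuptools import find_packages, setup

    setup(
        name="my-sales-dag",                     # illustrative name
        version="1.0.0",
        packages=find_packages(),
        # Dependencies of the DAG itself, installed into the per-task
        # virtualenv straight from the wheel's metadata.
        install_requires=["requests>=2.20"],
        # The "operator group" -> extra -> "pip packages" mapping expressed
        # as ordinary setuptools extras.
        extras_require={
            "google": ["google-cloud-storage", "google-cloud-bigquery"],
        },
        # Hypothetical entry point so DAGs could be discovered from package
        # metadata instead of scanning .py files.
        entry_points={
            "airflow.dags": ["sales_dag = my_sales_dag.dag:dag"],
        },
    )

Building it with 'python setup.py bdist_wheel' produces a single .whl that carries both the DAG code and its dependency metadata - which is exactly what the worker needs in order to set up (and later throw away) the virtualenv for a task.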
Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
