It is probably a good time to revisit https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher again?
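
For reference, a rough sketch of what such a remote DAG fetcher abstraction could look like - the class names and the set of supported schemes below are purely illustrative, not an existing Airflow API:

# Illustrative sketch only - none of these names exist in Airflow today.
# A fetcher resolves a DAG artifact URI (local path, S3, ...) to a local .py
# file that the scheduler/worker can then load into its DagBag.
import os
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class DagFetcher(ABC):
    """Fetches DAG source files from some location into a local directory."""

    @abstractmethod
    def fetch(self, uri: str, target_dir: str) -> str:
        """Download the artifact behind `uri` and return its local path."""


class LocalDagFetcher(DagFetcher):
    def fetch(self, uri: str, target_dir: str) -> str:
        # file:///path/to/dags/my_dag.py - nothing to download
        return urlparse(uri).path


class S3DagFetcher(DagFetcher):
    def fetch(self, uri: str, target_dir: str) -> str:
        import boto3  # assumed to be available on the worker

        parsed = urlparse(uri)  # e.g. s3://company_dagbag/some_dag.py
        local_path = os.path.join(target_dir, os.path.basename(parsed.path))
        boto3.client("s3").download_file(parsed.netloc, parsed.path.lstrip("/"), local_path)
        return local_path


FETCHERS = {"file": LocalDagFetcher(), "s3": S3DagFetcher()}


def fetch_dag(uri: str, target_dir: str) -> str:
    scheme = urlparse(uri).scheme or "file"
    return FETCHERS[scheme].fetch(uri, target_dir)
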
On Sun, Dec 22, 2019 at 12:16 PM Jarek Potiuk <[email protected]> wrote:

I also love the idea of a DAG fetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching it from different sources like local / .zip and then .wheel seems like an interesting approach. I think the important parts of whatever approach we come up with are:

- make it easy for development/iteration by the creator
- make it stable/manageable for deployment purposes
- make it manageable for incremental updates.

J.

On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:

I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point to fetch .py files from places like the local file system, S3 or GCS (that's what Composer actually does under the hood). As the next step we can think about wheels, zips and other more demanding packaging.

In my opinion, in the case of such "big" changes we should try to iterate in small steps, especially if we don't have any strong opinions yet.

Bests,
Tomek

On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:

I am in a "before-Xmas" mood, so I thought I would write more of my thoughts about it :).

*TL;DR: I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go Python-first instead.*

I also used to be in the "docker" camp, as it seemed kind of natural. Adding a DAG layer at package/run time seems like a natural thing to do. That seems to fit perfectly well some sophisticated production deployment models where people use a docker registry to deploy new software.

But in the meantime many more questions started to bother me:

- Is it really the case for all the deployment models and use cases in which Airflow is used?
- While it is a good model for some frozen-in-time production deployments, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly - does it fit the current philosophy of Airflow, and is it what its users expect?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.

After spending a year with Airflow, getting more embedded in its philosophy, talking to the users and especially looking at the "competition" we have, I changed my mind here. I don't think Airflow lives in a "Container-centric" world; it really lives in a "Python-centric" world, and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular and really liked by its users. If we go a bit too far into the "Docker/Container/Cloud Native" world, we might get a bit closer to some of our competitors (think Argo, for example) but we might lose quite a bit of the advantage we have.
That exact advantage is what makes us better for our users, different from the competition, and what lets us serve quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai) and I found that they think and act quite differently than DevOps or even traditional software engineers. And I think those people are our primary users. Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me, when I think "Airflow" I immediately think "Python". There are certain advantages to Airflow being python-first and python-focused. The main advantage is that the same people who are able to do data science feel comfortable writing the pipelines and using the pre-existing abstractions that make it easier for them to write those pipelines (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe python as their primary tool of choice. Using Jupyter Notebooks and writing data processing and machine learning experiments as python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do, and while they know about them and realise why containers are useful, it's best if they do not have to bother with containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerization/containerisation as much as possible and provide a simple, focused interface to manage it. People who know python don't necessarily need to understand containerisation in order to benefit from it. It's very similar to virtual machines or compilers - we make use of them without really knowing how they work internally. And that's perfectly OK - they don't have to.

Tying the deployment of Airflow DAGs to container images has the disadvantage that you have to include the whole step of packaging, distributing, sharing and using the image on the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate docker container - isolated from the rest, started pretty much from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker-swarm. The whole idea of having separate DAGs which can be updated independently and potentially have different dependencies, maybe other python code etc., pretty much means that for every single DAG you want to update, you need to package it as an extra layer in Docker, put it somewhere in a shared registry, switch your executors to use the new image, get it downloaded by the executor and restart the worker somehow (to start a container based on that new image). That's a lot of hassle just to update one line in a DAG.
Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in the DAG that they have to go through that process. They would need to understand how to check whether their image is properly built and distributed, whether the executor they run has already picked up the new image, whether the worker has already picked up the new image - and in the case of a spelling mistake they would have to repeat that whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without having to know about external tooling. This is the whole point of jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the idea from Databand.ai to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped make it merge-ready. It lets data scientists iterate and debug their DAGs using their familiar tools and processes (just as if they were debugging a python script) without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.
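
To give a rough idea of what this looks like in practice, here is a sketch of the pattern described in TESTING.rst (assuming the 1.10-era DAG.clear() / DAG.run() API): you add a small main block at the bottom of the DAG file and run that file as a plain python script.

# Run with the DebugExecutor configured, e.g.:
#   AIRFLOW__CORE__EXECUTOR=DebugExecutor python my_dag.py
# `dag` is the DAG object defined earlier in the same file; breakpoints set
# inside operators are hit because everything runs in this single process.
if __name__ == "__main__":
    dag.clear()  # reset previous task instance state for this DAG
    dag.run()    # execute the tasks sequentially, in-process
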
The deployment problem is also quite visible in container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! kubernetes!). And it even works perfectly well in a number of practical cases - for example, when each step requires complex processing, a number of dependencies, different binaries etc. But when you look at it more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is that it talks to other systems via APIs and orchestrates their work. There is hardly any processing on Airflow worker nodes. There are hardly any new requirements/dependencies needed in most cases. I really love that Airflow focuses on the "glue" layer between those external services. Again - the same people who do data engineering can interact over a python API with the services they use, put all the steps and logic as python code in the same DAG, iterate on it, change it and get immediate feedback - and even add a few lines of code if they need an extra parameter or so. Imagine the case where every step of your workflow is a Docker container to run - as a data engineer you have to use python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to this container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because those available in public registries are old or don't have some features exposed. That has happened to me many times - when I tried to use such workflows, I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than running 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of python code to do what I want. And I am super-familiar with Docker. I live and breathe Docker. But I can see how intimidating and difficult it must be for people who don't.

That's why I think that our basic and most common deployment model (even the one used in production) should be based on the python toolset, not on containers. Wheels seem like a great tool for python dependency management. In most cases, when we have just a few dependencies to install per task (for example, the python google libraries for google tasks) from a wheel in a running container and create a virtualenv for them, it might be comparable to or even faster than starting a whole new container with those packages installed as a layer. Not to mention the much smaller memory and cpu overhead if this is done within a running container rather than restarting the whole container for that task. Kubernetes and its deployment models are very well suited for long-running tasks that do a lot of work, but if you want to start a new container that brings up the whole python interpreter with all dependencies, with its own CPU/memory requirements, *JUST* to make an API call to an external service and wait for it to finish (most Airflow tasks are exactly this), it seems like a terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new POD for every task. We should still use containers under the hood, of course, and have deployments using Kubernetes etc. But this should be transparent to the people who write DAGs.

Sorry for such a long mail - I just think this is a super-important decision about the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. I think it should really be a foundation of how we implement some of the deployment-related features of Airflow 2.0, in order for it to stay relevant, preferred by our users and focused on those cases that it already does very well.

Let me know what you think. But in the meantime - have a great Xmas everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a

On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

This reminds me of the "DagFetcher" idea.
Basically, a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on s3", "pex on GFS", "docker on artifactory" or whatever makes sense for your organization. In the proposal I wrote about using a universal uri scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest

One challenge is around *not* serializing Airflow-specific code in the artifact/docker image, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Max

On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:

The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place so that we avoid the current `if format == zip then: else:` pattern and don't tightly couple to specific DAG sourcing implementations. Personally I feel that Docker makes more sense than wheels (since containers are fully self-contained even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:

Hi Jarek,

This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?

I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package meta-data.

Is that what you're envisioning? If so, I'd be happy to support you with the implementation!

Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like this to be optional, as it can also add a lot of overhead to running tasks.

Cheers,

Björn
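
For illustration, a DAG shipped as a Python package along the lines Björn describes might look roughly like the sketch below - note that the "airflow.dags" entry point group is hypothetical; Airflow has no such mechanism today:

# setup.py of a hypothetical DAG package - dependencies travel in the
# package metadata instead of being baked into the Airflow image.
from setuptools import setup

setup(
    name="my-team-dags",
    version="0.1.0",
    packages=["my_team_dags"],
    install_requires=["requests>=2.22"],  # DAG-specific dependencies
    entry_points={
        "airflow.dags": [  # hypothetical entry point group
            "daily_report = my_team_dags.daily_report:dag",
        ],
    },
)

On the loading side, a scheduler or worker could then discover installed DAG objects with something like pkg_resources.iter_entry_points("airflow.dags") and add them to its DagBag.
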
On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:

I had a lot of interesting discussions over the last few days with Apache Airflow users at PyData Warsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in the wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also a .zip, BTW). Maybe it was already mentioned in some discussions before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems in implementing AIP-8 (splitting hooks/operators into separate packages) is dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies because we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow the users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators.

The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem of AIP-8.

I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then we could have wheel packages with all the "extra" dependencies for each group of operators.

A worker executing an operator could have the "core" dependencies installed initially, but when it is supposed to run an operator, it could create a virtualenv, install the required "extra" from wheels and run the task for this operator in that virtualenv (and remove the virtualenv afterwards). We could have such package-wheels prepared (one wheel package per operator group) and distributed either the same way as DAGs or via some shared binary repository (and cached on the worker).

Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, those could be embedded in the DAG wheel, installed from it into the virtualenv, and the virtualenv would be removed after the task is finished.
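
A rough sketch of the worker-side mechanics (nothing here is existing Airflow code - the function, paths and wheel name are illustrative only):

# Create a throwaway virtualenv, install the DAG/operator wheel into it,
# run the task with the isolated interpreter, then remove the environment.
import os
import shutil
import subprocess
import tempfile
import venv


def run_task_in_isolated_env(wheel_path: str, task_args: list) -> int:
    env_dir = tempfile.mkdtemp(prefix="airflow-task-env-")
    try:
        venv.EnvBuilder(with_pip=True).create(env_dir)
        pip = os.path.join(env_dir, "bin", "pip")
        python = os.path.join(env_dir, "bin", "python")
        # installs the wheel together with its declared dependencies
        subprocess.check_call([pip, "install", wheel_path])
        # run the task using the interpreter from the isolated environment
        return subprocess.call([python] + task_args)
    finally:
        shutil.rmtree(env_dir, ignore_errors=True)


# hypothetical usage:
# run_task_in_isolated_env("my_dag-0.1.0-py3-none-any.whl", ["-m", "my_team_dags.run", "extract"])
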
The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me that whole idea sounds pretty cool.

Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

--
Chao-Han Tsai
