This reminds me of the "DagFetcher" idea. Basically a new abstraction that
can fetch a DAG object from anywhere and run a task. In theory you could
extend it to do "zip on s3", "pex on GFS", "docker on artifactory" or
whatever makes sense to your organization. In the proposal I wrote about
using a universal URI scheme to identify DAG artifacts, with support for
versioning, as in s3://company_dagbag/some_dag@latest
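
Something like this rough sketch is what I had in mind (the class names,
URI layout and scheme registry below are made up for illustration, not a
concrete design):

    # Rough sketch only: names, the URI layout and the registry are made up.
    from abc import ABC, abstractmethod
    from urllib.parse import urlparse

    class DagFetcher(ABC):
        """Fetch a DAG artifact identified by a URI; return a local path to load."""

        @abstractmethod
        def fetch(self, uri: str) -> str:
            ...

    class S3ZipDagFetcher(DagFetcher):
        def fetch(self, uri: str) -> str:
            # e.g. s3://company_dagbag/some_dag@latest
            parsed = urlparse(uri)
            name, _, version = parsed.path.lstrip("/").partition("@")
            # download <bucket>/<name>/<version or latest>.zip, unpack it into a
            # local dags folder, and return that folder for the DagBag to load
            ...

    FETCHERS = {"s3": S3ZipDagFetcher()}  # "gfs", "docker", ... plug in here

    def fetch_dag(uri: str) -> str:
        return FETCHERS[urlparse(uri).scheme].fetch(uri)

The scheduler and workers would then only ever call fetch_dag() with
whatever URI the DAG is registered under, without caring where the
artifact actually lives or how it is packaged.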

One challenge is around *not* serializing Airflow-specific code in the
artifact/docker image, otherwise you end up with a messy heterogeneous
cluster that runs multiple Airflow versions. For the docker example, you'd
almost want to inject or "layer" the DAG script and the airflow package at
run time.
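
For example (extremely hand-wavy, using docker-py; the image name, paths
and task identifiers below are made up, and how airflow's console script
and its own dependencies end up inside the container is glossed over):

    # Hand-wavy sketch: image name, paths and task identifiers are made up.
    import docker

    client = docker.from_env()
    client.containers.run(
        image="artifactory.example.com/data-eng/my-dag-deps:1.0",  # no Airflow inside
        command=["airflow", "run", "my_dag", "my_task", "2019-12-16"],
        volumes={
            # inject the worker's airflow package and the DAG script at run time
            "/usr/local/lib/python3.7/site-packages/airflow": {
                "bind": "/usr/local/lib/python3.7/site-packages/airflow",
                "mode": "ro",
            },
            "/shared/dags/my_dag.py": {
                "bind": "/opt/airflow/dags/my_dag.py",
                "mode": "ro",
            },
        },
    )

That way the image itself stays agnostic of the Airflow version running on
the cluster.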

Max

On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <ddavy...@twitter.com.invalid>
wrote:

> The zip support is a bit of a hack and was a bit controversial when it was
> added. I think if we go down the path of supporting more DAG sources, we
> should make sure we have the right interface in place so we avoid the
> current `if format == zip then: else:` and make sure that we don't tightly
> couple to specific DAG sourcing implementations. Personally I feel that
> Docker makes more sense than wheels (since they are fully self-contained
> even at the binary dependency level), but if we go down the interface route
> it might be fine to add support for both Docker and wheels.
>
> On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex
> <bjoern.pol...@soundcloud.com.invalid> wrote:
>
> > Hi Jarek,
> >
> > This sounds great. Is this possibly related to the work started in
> > https://github.com/apache/airflow/pull/730?
> >
> > I'm not sure I'm following your proposal entirely. Initially, a great
> > first step would be to support loading DAGs from entry_points, as
> > proposed in the closed PR above. This would already enable most of the
> > features you've mentioned below. Each DAG could be a Python package, and
> > it would carry all the information about its required packages in its
> > package metadata.
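> >
> > A rough sketch of what I mean (the "airflow.dags" entry point group and
> > the package/module names are just placeholders):
> >
> >     # In the DAG package's setup.py: dependencies and the DAG entry point
> >     # live in the package metadata (the "airflow.dags" group name is made up).
> >     from setuptools import setup
> >
> >     setup(
> >         name="my_dag_package",
> >         version="1.0.0",
> >         py_modules=["my_dag"],
> >         install_requires=["pandas"],  # the DAG's own dependencies
> >         entry_points={"airflow.dags": ["my_dag = my_dag:dag"]},
> >     )
> >
> >     # On the Airflow side, DAG discovery could then iterate the entry points:
> >     import pkg_resources
> >
> >     def load_dags_from_entry_points():
> >         dags = {}
> >         for ep in pkg_resources.iter_entry_points("airflow.dags"):
> >             dag = ep.load()  # imports the module and returns the DAG object
> >             dags[dag.dag_id] = dag
> >         return dags
> >
> > Pip-installing the wheel would then both pull in the DAG's dependencies
> > and register the DAG for discovery.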
> >
> > Is that what you’re envisioning? If so, I’d be happy to support you with
> > the implementation!
> >
> > Also, while I think the idea of creating a temporary virtual environment
> > for running tasks is very useful, I'd like this to be optional, as it can
> > also add a lot of overhead to running tasks.
> >
> > Cheers,
> >
> >         Björn
> >
> > > On 14. Dec 2019, at 11:10, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > >
> > > I had a lot of interesting discussions over the last few days with
> > > Apache Airflow users at PyData Warsaw 2019 (I was actually quite
> > > surprised by how many people use Airflow in Poland). One discussion
> > > brought up an interesting subject: packaging DAGs in the wheel format.
> > > The users mentioned that they are super-happy using .zip-packaged DAGs,
> > > but they think it could be improved with the wheel format (which is
> > > also a .zip, BTW). Maybe it was already mentioned in some discussions
> > > before, but I have not found any.
> > >
> > > *Context:*
> > >
> > > We are well on the way to implementing "AIP-21 Changing import paths"
> > > and will provide backport packages for Airflow 1.10. As a next step we
> > > want to target AIP-8.
> > > One of the problems in implementing AIP-8 (splitting hooks/operators
> > > into separate packages) is dependencies. Different operators/hooks
> > > might have different dependencies if maintained separately. Currently
> > > we have a common set of dependencies, as we have only one setup.py, but
> > > if we split into separate packages this might change.
> > >
> > > *Proposal:*
> > >
> > > Our users - who love the .zip DAG distribution - proposed that we
> > > package the DAGs and all related packages in a wheel package instead
> > > of a pure .zip. This would allow the users to install extra
> > > dependencies needed by the DAG. And it struck me that we could indeed
> > > do that for DAGs, but also mitigate most of the dependency problems
> > > for separately-packaged operators.
> > >
> > > The proposal from our users was to package the extra dependencies
> > > together with the DAG in a wheel file. This is quite cool on its own,
> > > but I thought we might actually use the same approach to solve the
> > > dependency problem in AIP-8.
> > >
> > > I think we could implement "operator group" -> extra -> "pip packages"
> > > dependencies (we need them anyway for AIP-21), and then we could have
> > > wheel packages with all the "extra" dependencies for each group of
> > > operators.
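> > >
> > > Very roughly, something like this in setup.py (the group names and pip
> > > packages below are purely illustrative):
> > >
> > >     # Purely illustrative: map operator groups to the pip packages they
> > >     # need; each group could then be built into its own "extras" wheel.
> > >     from setuptools import setup
> > >
> > >     EXTRAS_REQUIRE = {
> > >         "gcp": ["google-cloud-storage", "google-cloud-bigquery"],
> > >         "aws": ["boto3"],
> > >         "postgres": ["psycopg2-binary"],
> > >     }
> > >
> > >     setup(
> > >         name="apache-airflow",
> > >         extras_require=EXTRAS_REQUIRE,
> > >         # ... rest of the existing setup.py arguments ...
> > >     )
> > >
> > > Building one wheel per extra from that mapping could then be a purely
> > > mechanical step in the release process.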
> > >
> > > A worker executing an operator could have the "core" dependencies
> > > installed initially, but when it is supposed to run an operator it
> > > could create a virtualenv, install the required "extra" from wheels,
> > > and run the task for this operator in that virtualenv (removing the
> > > virtualenv afterwards). We could have such package wheels prepared
> > > (one wheel package per operator group) and distributed either the same
> > > way as DAGs or via some shared binary repository (and cached on the
> > > worker).
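> > >
> > > A rough sketch of the per-task flow on a worker (paths and names are
> > > made up, and how the task runner itself gets invoked with the
> > > virtualenv's interpreter is glossed over):
> > >
> > >     # Rough sketch of the per-task flow on a worker; all paths are made up.
> > >     import shutil
> > >     import subprocess
> > >     import tempfile
> > >     import venv
> > >
> > >     def run_in_ephemeral_virtualenv(extra_wheels, task_args):
> > >         env_dir = tempfile.mkdtemp(prefix="airflow-task-env-")
> > >         try:
> > >             # "core" stays visible through the system site-packages
> > >             venv.create(env_dir, system_site_packages=True, with_pip=True)
> > >             subprocess.check_call(
> > >                 [env_dir + "/bin/pip", "install", "--no-index",
> > >                  "--find-links", "/var/cache/airflow-wheels", *extra_wheels])
> > >             # run the task with the virtualenv's interpreter
> > >             subprocess.check_call([env_dir + "/bin/python", *task_args])
> > >         finally:
> > >             shutil.rmtree(env_dir)  # the virtualenv lives for one task only
> > >
> > > The wheels (one per operator group, plus optionally the DAG's own
> > > wheel) would sit in that local cache directory, so the install step
> > > stays a local, offline operation.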
> > >
> > > Having such a dynamically created virtualenv also has the advantage
> > > that if someone has a DAG with specific dependencies, those could be
> > > embedded in the DAG wheel, installed from it into this virtualenv, and
> > > the virtualenv would be removed after the task is finished.
> > >
> > > The advantage of this approach is that each DAG's extra dependencies
> > > are isolated, and you could even have different versions of the same
> > > dependency used by different DAGs. I think that could save a lot of
> > > headaches for many users.
> > >
> > > For me that whole idea sounds pretty cool.
> > >
> > > Let me know what you think.
> > >
> > > J.
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129
> >
> >
>
