Hi Jarek,
This sounds great. Is this possibly related to the work started in
https://github.com/apache/airflow/pull/730?
I’m not sure I’m following your proposal entirely. A great first step would be
to support loading DAGs from an entry_point, as proposed in the closed PR
above. This would already enable most of the features you’ve mentioned below.
Each DAG could be a Python package, and it would carry all the information
about its required packages in its package metadata.
Is that what you’re envisioning? If so, I’d be happy to support you with the
implementation!
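
To make the entry_point idea a bit more concrete, here is a rough sketch of
how discovery could work. It is only an illustration: the group name
"airflow.dags" and the function load_dags are made up for this example and
are not an existing Airflow API.

```python
from importlib.metadata import entry_points

# Hypothetical: a DAG package would declare something like this in its
# setup.py, next to its install_requires:
#   entry_points={"airflow.dags": ["my_dag = my_package.dags:dag"]}


def load_dags(group="airflow.dags"):
    """Return {entry point name: loaded object} for every entry point in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

Since the package metadata already lists the requirements, pip would pull in
the DAG's dependencies when the package is installed.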
Also, while the idea of creating a temporary virtual environment for running
tasks is very useful, I’d like this to be optional, as it can also add a lot
of overhead to running tasks.
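
Roughly what I have in mind for "optional" is sketched below, using only the
standard library. The function name run_task and the isolate flag are
hypothetical, and passing --no-index to pip assumes the wheels carry all of
their dependencies with them, as in your proposal.

```python
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def run_task(code, wheels=(), isolate=False):
    """Run `code` in a Python subprocess and return its stdout.

    isolate=False reuses the current interpreter (no extra setup cost).
    isolate=True creates a throwaway virtualenv, installs the given wheel
    files into it, runs the code there, and deletes the virtualenv.
    """
    if not isolate:
        done = subprocess.run([sys.executable, "-c", code],
                              check=True, capture_output=True, text=True)
        return done.stdout

    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "venv"
        venv.create(env_dir, with_pip=True)  # this is the expensive step
        windows = sys.platform == "win32"
        python = env_dir / ("Scripts" if windows else "bin") / (
            "python.exe" if windows else "python")
        if wheels:
            # --no-index: install only from the wheel files passed in
            subprocess.run([str(python), "-m", "pip", "install", "--no-index",
                            *[str(w) for w in wheels]], check=True)
        done = subprocess.run([str(python), "-c", code],
                              check=True, capture_output=True, text=True)
        return done.stdout  # the temporary directory is removed on exit
```

With isolate left at its default, a task pays none of the virtualenv setup
cost, and only DAGs that really need isolated dependencies would opt in.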
Cheers,
Björn
> On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:
>
> I had a lot of interesting discussions over the last few days with Apache
> Airflow users at PyDataWarsaw 2019 (I was actually quite surprised how many
> people use Airflow in Poland). One discussion brought up an interesting
> subject: packaging DAGs in wheel format. The users mentioned that they are
> super-happy using .zip-packaged DAGs but they think it could be improved
> with wheel format (which is also .zip BTW). Maybe it was already mentioned
> in some discussions before but I have not found any.
>
> *Context:*
>
> We are well on the way to implementing "AIP-21 Changing import paths" and
> will provide backport packages for Airflow 1.10. As a next step we want to
> target AIP-8.
> One of the problems with implementing AIP-8 (split hooks/operators into
> separate packages) is dependencies. Different operators/hooks might
> have different dependencies if maintained separately. Currently we have a
> common set of dependencies as we have only one setup.py, but if we split to
> separate packages, this might change.
>
> *Proposal:*
>
> Our users - who love the .zip DAG distribution - proposed that we package
> the DAGs and all related packages in a wheel package instead of pure .zip.
> This would allow the users to install extra dependencies needed by the DAG.
> And it struck me that we could indeed do that for DAGs but also mitigate
> most of the dependency problems for separately-packaged operators.
>
> The proposal from our users was to package the extra dependencies together
> with the DAG in a wheel file. This is quite cool on its own, but I thought
> we might actually use the same approach to solve the dependency problem with
> AIP-8.
>
> I think we could implement "operator group" -> extra -> "pip packages"
> dependencies (we need them anyway for AIP-21) and then we could have wheel
> packages with all the "extra" dependencies for each group of operators.
>
> A worker executing an operator could have the "core" dependencies installed
> initially, but then when it is supposed to run an operator it could create a
> virtualenv, install the required "extra" from wheels, and run the task for
> this operator in this virtualenv (and then remove the virtualenv). We could
> have such package-wheels prepared (one wheel package per operator group) and
> distributed either the same way as DAGs or using some shared binary
> repository (and cached in the worker).
>
> Having such a dynamically created virtualenv also has the advantage that if
> someone has a DAG with specific dependencies - they could be embedded in
> the DAG wheel, installed from it to this virtualenv, and the virtualenv
> would be removed after the task is finished.
>
> The advantage of this approach is that each DAG's extra dependencies are
> isolated and you could even have different versions of the same dependency
> used by different DAGs. I think that could save a lot of headaches for many
> users.
>
> For me that whole idea sounds pretty cool.
>
> Let me know what you think.
>
> J.
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129