I had a lot of interesting discussions over the last few days with Apache Airflow users at PyData Warsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in the wheel format. The users mentioned that they are super happy using .zip-packaged DAGs, but they think the experience could be improved with the wheel format (which is also a .zip under the hood, BTW). Maybe this was already mentioned in some discussions before, but I have not found any.
*Context:* We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies because we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:* Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow users to install the extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators. The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem in AIP-8. I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then have wheel packages with all the "extra" dependencies for each group of operators. A worker executing an operator could have the "core" dependencies installed initially; when it is supposed to run an operator, it could create a virtualenv, install the required "extra" from wheels, run the task for this operator in that virtualenv, and remove the virtualenv afterwards. We could have such wheel packages prepared (one wheel per operator group) and distributed either the same way as DAGs or via some shared binary repository (and cached on the worker).
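To make the "operator group" -> extra -> "pip packages" idea concrete, the mapping could look roughly like the `extras_require` section of a setup.py. This is just a sketch - the group names and package lists below are illustrative, not Airflow's actual dependency sets:

```python
# Hypothetical "operator group" -> extra -> pip packages mapping, roughly as
# it could appear in setup.py's extras_require. Group names and package
# lists are for illustration only.
EXTRAS_REQUIRE = {
    "google": ["google-cloud-bigquery", "google-cloud-storage"],
    "amazon": ["boto3"],
    "postgres": ["psycopg2-binary"],
}
```

Building one wheel bundle per group from such a mapping (e.g. with `pip wheel`) would give workers a local cache of wheels to install extras from, without hitting an index at task time.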
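The worker-side lifecycle - create a virtualenv, install the "extra" wheels, run the task, remove the virtualenv - can be sketched with the standard library alone. This is not actual Airflow code; the function name and its arguments are hypothetical placeholders for how a worker might do it:

```python
# Minimal sketch (NOT Airflow code) of running a task in a throwaway
# virtualenv with the operator group's "extra" dependencies installed from
# pre-built wheels. All names here are hypothetical.
import shutil
import subprocess
import sys
import tempfile
import venv
from pathlib import Path


def run_task_in_venv(wheel_paths, task_argv):
    """Run `python <task_argv>` in a fresh virtualenv with the wheels installed."""
    workdir = Path(tempfile.mkdtemp(prefix="airflow-task-env-"))
    try:
        # pip is only needed when there are wheels to install.
        venv.create(workdir, with_pip=bool(wheel_paths))
        bindir = "Scripts" if sys.platform == "win32" else "bin"
        python = workdir / bindir / "python"
        if wheel_paths:
            # Install only from the local wheel cache - no index lookups.
            subprocess.run(
                [str(python), "-m", "pip", "install", "--no-index",
                 *map(str, wheel_paths)],
                check=True,
            )
        # Run the task itself with the virtualenv's interpreter.
        return subprocess.run([str(python), *task_argv], check=True)
    finally:
        # The virtualenv lives only for the duration of this one task.
        shutil.rmtree(workdir, ignore_errors=True)
```

The point of the sketch is that the virtualenv is created and destroyed per task, so nothing installed for one operator group can leak into another task's environment.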
Such a dynamically created virtualenv has another advantage: if someone has a DAG with specific dependencies, those could be embedded in the DAG wheel, installed from it into the virtualenv, and the virtualenv would be removed after the task finishes. Each DAG's extra dependencies are then isolated, so you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me the whole idea sounds pretty cool. Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
