Hello,

I heard that one team wants to separate DagProcessor from the Scheduler in
the future. If this split is done together with extracting an
abstraction/plugin mechanism for some of the classes, one plugin could be
based on separate Python environments and another on containers. I recently
did a refactoring which significantly simplifies this task: I created a new
DagFileProcessor class with only one public method, process_file:
https://github.com/apache/airflow/blob/d4a8afb5aef5b00f7c52c7ff1cc4088c8a00f4a7/airflow/jobs/scheduler_job.py#L297
Other methods are not called by other classes. If you call these methods in
separate environments, you will be able to change the interpreter and
dependencies. This class is already running in a separate process, so it
would not be difficult to separate it completely.
https://github.com/apache/airflow/blob/d4a8afb5aef5b00f7c52c7ff1cc4088c8a00f4a7/airflow/jobs/scheduler_job.py#L173-L185
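
To illustrate what I mean by changing the interpreter, here is a minimal
sketch. It is not Airflow code: process_file_in_interpreter and the child
script are hypothetical names, and the child only lists the top-level names
a file defines, as a stand-in for real DAG parsing. The point is only that
a single entry point can be started under any interpreter and report back
over stdout.

```python
import json
import subprocess

# Hypothetical child-side script, a stand-in for the real parsing step.
# It runs the given file and reports the top-level names it defines.
_CHILD_SCRIPT = """
import contextlib, json, runpy, sys
# Send anything the file prints to stderr so stdout carries only the JSON.
with contextlib.redirect_stdout(sys.stderr):
    namespace = runpy.run_path(sys.argv[1])
names = sorted(name for name in namespace if not name.startswith("_"))
json.dump({"interpreter": sys.executable, "names": names}, sys.stdout)
"""


def process_file_in_interpreter(interpreter, file_path, timeout=60):
    """Run the child script under `interpreter` and return its JSON result.

    `interpreter` is the path to any Python executable, for example one
    that lives inside a different virtualenv (assumption for illustration).
    """
    result = subprocess.run(
        [interpreter, "-c", _CHILD_SCRIPT, file_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the child interpreter exits with an error
    )
    return json.loads(result.stdout)
```

Calling it with sys.executable keeps the current interpreter; passing the
python binary of another environment changes the interpreter and its
dependencies without touching the caller.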

Together with TaskRunner, which specifies the runtime environment for the
task execution phase, we will have full freedom in how the entire work
environment is created. New Task Runners could run code in new virtual
environments or in containers.
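
As a rough sketch of the virtual environment variant, assuming only the
standard library: run_in_temporary_venv is a hypothetical helper, not an
existing Task Runner, and the wheel paths are whatever the caller supplies.
It creates a throwaway virtualenv, optionally installs wheels into it, runs
a command there, and removes the environment afterwards.

```python
import os
import subprocess
import tempfile
import venv


def run_in_temporary_venv(python_args, wheels=()):
    """Run `python <python_args>` inside a throwaway virtualenv.

    `wheels` is an optional sequence of local wheel file paths to install
    first (assumption: they are already available on the worker).
    """
    with tempfile.TemporaryDirectory() as env_dir:
        # pip is only bootstrapped when there is something to install.
        venv.EnvBuilder(with_pip=bool(wheels)).create(env_dir)
        if os.name == "nt":
            python = os.path.join(env_dir, "Scripts", "python.exe")
        else:
            python = os.path.join(env_dir, "bin", "python")
        if wheels:
            subprocess.run(
                [python, "-m", "pip", "install", *wheels], check=True
            )
        result = subprocess.run(
            [python, *python_args],
            capture_output=True,
            text=True,
            check=True,
        )
        # Leaving the with-block deletes the virtualenv directory.
        return result.stdout
```

Creating an environment per task costs time, so a real implementation
would probably cache environments or wheels on the worker; this sketch
leaves that out.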

I agree that it is worth thinking about the problem with dependencies and
taking up work in this area.

Best regards,
Kamil

On Sat, Dec 14, 2019 at 11:10 AM Jarek Potiuk <[email protected]>
wrote:

> I had a lot of interesting discussions over the last few days with Apache Airflow
> users at PyDataWarsaw 2019 (I was actually quite surprised how many people
> use Airflow in Poland). One discussion brought an interesting subject:
> Packaging dags in wheel format. The users mentioned that they are
> super-happy using .zip-packaged DAGs but they think it could be improved
> with wheel format (which is also .zip BTW). Maybe it was already mentioned
> in some discussions before but I have not found any.
>
> *Context:*
>
> We are well on the way of implementing "AIP-21 Changing import paths" and
> will provide backport packages for Airflow 1.10. As a next step we want to
> target AIP-8.
> One of the problems in implementing AIP-8 (split hooks/operators into separate
> packages) is the problem of dependencies. Different operators/hooks might
> have different dependencies if maintained separately. Currently we have a
> common set of dependencies as we have only one setup.py, but if we split to
> separate packages, this might change.
>
> *Proposal:*
>
> Our users - who love the .zip DAG distribution - proposed that we package
> the DAGs and all related packages in a wheel package instead of pure .zip.
> This would allow the users to install extra dependencies needed by the DAG.
> And it struck me that we could indeed do that for DAGs but also mitigate
> most of the dependency problems for separately-packaged operators.
>
> The proposal from our users was to package the extra dependencies together
> with the DAG in a wheel file. This is quite cool on its own, but I thought
> we might actually use the same approach to solve the dependency problem with
> AIP-8.
>
> I think we could implement "operator group" -> extra -> "pip packages"
> dependencies (we need them anyway for AIP-21) and then we could have wheel
> packages with all the "extra" dependencies for each group of operators.
>
> A worker executing an operator could have the "core" dependencies installed
> initially, but when it is supposed to run an operator it could create a
> virtualenv, install the required "extra" from wheels, run the task for
> this operator in this virtualenv, and then remove the virtualenv. We could
> have such package-wheels prepared (one wheel package per operator group) and
> distributed either the same way as DAGs or using some shared binary repository
> (and cached in the worker).
>
> Having such a dynamically created virtualenv also has the advantage that if
> someone has a DAG with specific dependencies, they could be embedded in
> the DAG wheel, installed from it into this virtualenv, and the virtualenv
> would be removed after the task is finished.
>
> The advantage of this approach is that each DAG's extra dependencies are
> isolated and you could even have different versions of the same dependency
> used by different DAGs. I think that could save a lot of headaches for many
> users.
>
> For me that whole idea sounds pretty cool.
>
> Let me know what you think.
>
> J.
>
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>
