*TL;DR; Discuss whether shared memory data sharing for some tasks is an
interesting feature for future Airflow.*

I had a few discussions recently with several Airflow users (including at
Slack [1] and in person at Warsaw Airflow meetup) about using shared memory
for inter-task communication.

Airflow is not currently good for such case. It sounds doable, but fairly
complex to implement (and modifies Airflow paradigm a bit). I am not 100%
sure if it's a good idea to have such feature in the future.

I see the need for it and I like it, however I would love to ask you for
opinions.

*Context*

The case is to have several independent tasks using a lot of temporary data
in memory. They either run in parallel and share loaded data, or use shared
memory to pass results between tasks. Examples: machine learning (like
audio processing). It makes sense to only load the audio files once (to
memory) and run several tasks on those loaded data.

Best way to achieve it now is to combine such sharing-memory tasks into
single operator (Docker-compose for example ?) and run them as a single
Airflow Task. But maybe those tasks could still be modelled as separate
tasks in Airflow DAG. One benefit is that there might be different
dependencies for different tasks, processing results from some tasks could
be sent independently using different - existing - operators.

As a workaround - we can play with queues and have one dedicated machine to
run all such tasks, but it has multiple limitations.

*High-level idea*

High level  it would require defining some affinity between tasks to make
sure that:

1) they are all executed on the same worker machine
2) the processes should remain in-memory until all tasks finish for data
sharing (even if there is a dependency between the tasks)
3) back-filling should act on the whole group of such tasks as "single
unit".

I would love to hear your feedback.

J


[1] Slack discussion on shared memory:
https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200

J.
-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Reply via email to