when only one POD finishes its work for example -> when only one *container* finishes its work for example
On Tue, Nov 26, 2019 at 12:41 PM Jarek Potiuk <[email protected]> wrote:

> Another thought.
>
> It looks like a "Sub-task operator" - a kind of special "Operator" to
> handle such a case (for example Docker-Compose or Kubernetes-POD driven).
>
> Currently we can trigger other tasks at the same time (in parallel) or
> sequentially (with a dependency on the task finishing or failing). But we
> could also have dependencies that are triggered DURING the task execution
> (when some of the data has already been produced but some other processes
> are still running inside the task - and another dependency is triggered
> after it finishes).
>
> We could have the operator trigger other tasks during the run (when only
> one POD finishes its work, for example). This could be far less
> paradigm-changing and could work with all the executors, I think. It's
> still a bit of a niche case, though, and maybe an unnecessary
> optimization.
>
> J.
>
> On Tue, Nov 26, 2019 at 12:25 PM Ash Berlin-Taylor <[email protected]> wrote:
>
>> My gut reaction to this is no, not as a general-purpose thing. It would
>> only work reliably with LocalExecutor - it's never going to work with the
>> Kube executor, and will almost never work in Celery.
>>
>> Also, the cases where the data is small enough to fit in memory but not
>> so large that you need to put it in S3/HDFS etc. don't seem worth
>> special-casing?
>>
>> > On 26 Nov 2019, at 11:21, Jarek Potiuk <[email protected]>
>> wrote:
>> >
>> > *TL;DR: Discuss whether shared-memory data sharing for some tasks is an
>> > interesting feature for future Airflow.*
>> >
>> > I had a few discussions recently with several Airflow users (including
>> > on Slack [1] and in person at the Warsaw Airflow meetup) about using
>> > shared memory for inter-task communication.
>> >
>> > Airflow is not currently good at such a case. It sounds doable, but
>> > fairly complex to implement (and it modifies the Airflow paradigm a
>> > bit). I am not 100% sure it's a good idea to have such a feature in the
>> > future.
>> >
>> > I see the need for it and I like it; however, I would love to ask for
>> > your opinions.
>> >
>> > *Context*
>> >
>> > The case is to have several independent tasks using a lot of temporary
>> > data in memory. They either run in parallel and share loaded data, or
>> > use shared memory to pass results between tasks. An example is machine
>> > learning (audio processing, say), where it makes sense to load the
>> > audio files into memory only once and run several tasks on that loaded
>> > data.
>> >
>> > The best way to achieve this now is to combine such memory-sharing
>> > tasks into a single operator (Docker-compose, for example?) and run
>> > them as a single Airflow task. But maybe those tasks could still be
>> > modelled as separate tasks in the Airflow DAG. One benefit is that
>> > different tasks might have different dependencies; the results of some
>> > tasks could be sent onward independently using different - existing -
>> > operators.
>> >
>> > As a workaround, we can play with queues and have one dedicated machine
>> > run all such tasks, but that has multiple limitations.
>> >
>> > *High-level idea*
>> >
>> > At a high level, it would require defining some affinity between tasks
>> > to make sure that:
>> >
>> > 1) they are all executed on the same worker machine
>> > 2) the processes remain in memory until all tasks finish, for data
>> > sharing (even if there is a dependency between the tasks)
>> > 3) back-filling acts on the whole group of such tasks as a "single
>> > unit".
>> >
>> > I would love to hear your feedback.
>> >
>> > J.
>> >
>> > [1] Slack discussion on shared memory:
>> > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200

--

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
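
A minimal sketch of the "combine into a single Airflow task" workaround from
the original mail: all memory-sharing steps run inside one PythonOperator, so
the loaded audio is shared through ordinary process memory instead of XCom or
external storage. The helpers (load_audio, extract_features, classify) are
hypothetical stand-ins for real work, and the import path assumes Airflow
1.10.x, the version current when this thread was written.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator  # 1.10.x path


    def load_audio(path):
        """Stand-in for expensive audio loading into memory."""
        return bytes(1024)  # pretend this is a large decoded waveform


    def extract_features(audio):
        """Stand-in for feature extraction over the in-memory audio."""
        return {"length": len(audio)}


    def classify(features):
        """Stand-in for classification; the result is small enough for XCom."""
        return "speech" if features["length"] > 0 else "silence"


    def process_audio_in_one_process():
        # All steps run in one process, so `audio` is shared via plain
        # memory rather than XCom, S3/HDFS or any shared-memory mechanism.
        audio = load_audio("/data/input.wav")
        return classify(extract_features(audio))


    with DAG(
        dag_id="shared_memory_as_single_task",
        start_date=datetime(2019, 11, 1),
        schedule_interval=None,
    ) as dag:
        # One Airflow task wrapping what could otherwise be three DAG tasks.
        PythonOperator(
            task_id="process_audio",
            python_callable=process_audio_in_one_process,
        )

The trade-off is exactly the one raised in the thread: the sub-steps are
invisible to the scheduler, so they cannot have separate dependencies,
retries or back-fills.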

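A sketch of the queue workaround is below. It relies on the existing `queue`
argument on operators: with CeleryExecutor, a task is routed only to workers
subscribed to its queue, so starting a single dedicated worker with
`airflow worker --queues shared_mem` (1.10.x CLI) on the machine holding the
data sends all such tasks there. The queue name and placeholder callables are
illustrative assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    with DAG(
        dag_id="shared_memory_via_queue",
        start_date=datetime(2019, 11, 1),
        schedule_interval=None,
    ) as dag:
        load = PythonOperator(
            task_id="load_audio",
            python_callable=lambda: None,  # placeholder: populate shared memory
            queue="shared_mem",            # routed only to workers on this queue
        )
        analyze = PythonOperator(
            task_id="analyze_audio",
            python_callable=lambda: None,  # placeholder: read the shared memory
            queue="shared_mem",            # same queue, same pool of workers
        )
        load >> analyze

Note the limitations alluded to in the thread: the queue only pins tasks to
the set of subscribed workers (it acts as "one dedicated machine" only while
a single worker listens on shared_mem), and nothing in Airflow keeps the
loaded data alive between the separate task processes.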