My 2 cents:

I don’t think this makes sense at all, as it goes against the core of
Airflow: Airflow does not do data processing by itself. So the only thing
you should share between tasks is metadata, and that you do through XCom.
We can redesign XCom if you want, but it is also the only viable option
for a distributed environment. Going semi-distributed will create a whole
lot of problems in itself.
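
For reference, the XCom round trip looks roughly like this (a minimal
sketch against the 1.10-era API; the DAG, task and key names are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def produce(**context):
    # Share only metadata (here: a location), never the data itself.
    context["ti"].xcom_push(key="audio_path", value="s3://bucket/batch-1/")

def consume(**context):
    path = context["ti"].xcom_pull(task_ids="produce", key="audio_path")
    print("processing files under", path)

with DAG("xcom_example", start_date=datetime(2019, 11, 1),
         schedule_interval=None) as dag:
    t1 = PythonOperator(task_id="produce", python_callable=produce,
                        provide_context=True)
    t2 = PythonOperator(task_id="consume", python_callable=consume,
                        provide_context=True)
    t1 >> t2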

B.


On 27 November 2019 at 09:13:58, Alex Guziel (alex.guz...@airbnb.com.invalid)
wrote:

Agreed on running before we can crawl. The logical way to do this now is to
group it as one big task with more resources. With respect to affinity on
the same machine, that's basically what it is. I guess this hinges on how
well your solution can handle workloads with different resource requirements.
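
The resource side is already expressible per task today. For example, with
the KubernetesExecutor something like this works (a sketch; the callable is
made up and the exact config keys may differ by version):

big_task = PythonOperator(
    task_id="process_audio_batch",
    python_callable=run_all_stages,  # hypothetical: loads audio once, runs every stage
    executor_config={"KubernetesExecutor": {"request_memory": "16Gi",
                                            "request_cpu": "8"}},
)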

With respect to differing dependencies between tasks, the advantage of
multiple tasks seems minuscule, since they have to wait for the others
before ending, so it's pretty much the union of all dependencies, with
some caveats.
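
Concretely, merging tasks a, b and c into one task means the merged task
just takes the union of their upstreams (a sketch; all names made up):

merged = PythonOperator(task_id="merged_abc", python_callable=run_abc)
# Union of everything a, b and c depended on individually:
merged.set_upstream([prep_audio, fetch_labels, build_model])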

On Tue, Nov 26, 2019 at 8:07 AM James Meickle
<jmeic...@quantopian.com.invalid> wrote:

> I think this idea is running before we can even crawl. Before it makes any
> sense to implement this in Airflow, I think it needs three other things:
>
> - A reliable, well-designed component for passing data between tasks first
> (not XCom!); where shared memory is an _implementation_ of data passing
> (see the sketch after this list)
> - An understanding of temporary resources (not encoded as explicit DAG
> steps but stood up/torn down implicitly); where the shared memory _is_ a
> temporary resource
> - An understanding of cooperative scheduling and retrying (what if one half
> fails but the other half is still running?); where this is required to use
> shared memory safely without subtle race conditions
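>
> Illustration only: such a data-passing component might expose an interface
> roughly like this (every name here is hypothetical; shared memory would be
> just one backend):
>
> class DataChannel:
>     # Hypothetical API for passing large data between tasks.
>     def put(self, key, obj): ...   # store an object for downstream tasks
>     def get(self, key): ...       # retrieve it in a later task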
>
> And as stated, this is easy-ish on local executor and crushingly hard with
> anything else. Yet in the cases where you need this, you... probably don't
> want to be running on local executor.
>
> On Tue, Nov 26, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>
> > *TL;DR: Discuss whether shared-memory data sharing for some tasks is an
> > interesting feature for future Airflow.*
> >
> > I had a few discussions recently with several Airflow users (including on
> > Slack [1] and in person at the Warsaw Airflow meetup) about using shared
> > memory for inter-task communication.
> >
> > Airflow is not currently good at such a case. It sounds doable, but
> > fairly complex to implement (and it modifies the Airflow paradigm a bit).
> > I am not 100% sure if it's a good idea to have such a feature in the
> > future.
> >
> > I see the need for it and I like it; however, I would love to ask you for
> > your opinions.
> >
> > *Context*
> >
> > The case is to have several independent tasks using a lot of temporary
> > data in memory. They either run in parallel and share loaded data, or use
> > shared memory to pass results between tasks. An example is machine
> > learning, such as audio processing: it makes sense to load the audio
> > files into memory only once and run several tasks on the loaded data.
> >
> > The best way to achieve it now is to combine such memory-sharing tasks
> > into a single operator (with Docker Compose, for example?) and run them
> > as a single Airflow task. But maybe those tasks could still be modelled
> > as separate tasks in the Airflow DAG. One benefit is that different tasks
> > might have different dependencies, and processing results from some tasks
> > could be sent independently using different (existing) operators.
> >
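> > To make that concrete, the single-big-task version could look roughly
> > like this (a minimal sketch; load_audio, extract_features, classify and
> > publish_results are made-up helpers):
> >
> > def process_batch():
> >     # Load the audio files into memory once...
> >     clips = load_audio("s3://bucket/batch-1/")
> >     # ...then run every stage that needs the in-memory data.
> >     features = extract_features(clips)
> >     publish_results(classify(features))
> >
> > process = PythonOperator(task_id="process_batch",
> >                          python_callable=process_batch)
> >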
> > As a workaround, we can play with queues and have one dedicated machine
> > run all such tasks, but that has multiple limitations.
> >
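> > For the record, the queue workaround looks like this (a sketch assuming
> > the CeleryExecutor; the queue name and callables are made up):
> >
> > # In the DAG file: pin every memory-sharing task to one queue.
> > load = PythonOperator(task_id="load_audio", python_callable=load_fn,
> >                       queue="shared_mem")
> > train = PythonOperator(task_id="train", python_callable=train_fn,
> >                        queue="shared_mem")
> >
> > # Then run the only worker subscribed to that queue on the one
> > # dedicated machine:  airflow worker -q shared_mem
> >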
> > *High-level idea*
> >
> > At a high level, it would require defining some affinity between tasks
> > (sketched below) to make sure that:
> >
> > 1) they are all executed on the same worker machine,
> > 2) the processes remain in memory until all tasks finish, for data
> > sharing (even if there is a dependency between the tasks), and
> > 3) back-filling acts on the whole group of such tasks as a "single
> > unit".
> >
> > I would love to hear your feedback.
> >
> > J
> >
> >
> > [1] Slack discussion on shared memory:
> > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> >
> > J.
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129
> >
>
