Yeah. I was thinking about the new landing page/website, where we specifically have a "Use cases" section and can describe some actual examples (and counter-examples specifically) :).
J.

On Wed, Nov 27, 2019 at 11:43 AM Bolke de Bruin <bdbr...@gmail.com> wrote:

> From our website :-)
>
> "Airflow *is not* a data streaming solution. Tasks do not move data from
> one to the other (though tasks can exchange metadata!). Airflow is not in
> the Spark Streaming <http://spark.apache.org/streaming/> or Storm
> <https://storm.apache.org/> space, it is more comparable to Oozie
> <http://oozie.apache.org/> or Azkaban <https://azkaban.github.io/>."
>
> B.
>
> On 27 November 2019 at 11:41:40, Jarek Potiuk (jarek.pot...@polidea.com) wrote:
>
> Listening to all those comments, that reaffirms the gut feeling I had.
> Even if I like the idea of optimisations, I think it makes sense to say
> "it's not really an Airflow-domain problem". I think now that XCom is good
> at what it is for, and introducing a "generic" data-passing mechanism goes
> way beyond what Airflow is designed for. In the end, Airflow is merely an
> orchestrator, not a data processor.
>
> The last time I spoke to someone mentioning this case, my answer was: "Use
> your own operator and run it on a big machine via Docker / docker-compose /
> the Kubernetes operator." I also think even enabling triggers from within
> an operator is really premature optimisation (and indeed the benefit would
> be minuscule).
>
> So from my point of view, I do not see it as something we should focus on
> at all as a community. But it's good we are discussing it.
>
> Learning from the whole discussion: I think we should - on our new website
> - mention not only the use cases that Airflow is good at, but also explain
> the cases it is not designed for. That might help people understand where
> Airflow's limits are. It would be great if people could answer those
> questions based on the official website.
>
> I made a mental note to propose a PR to the website when it's open for PRs.
>
> J.
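[Editor's note: the "tasks can exchange metadata via XCom" point quoted above can be sketched in plain Python. The `TaskContext` class below is a toy in-memory stand-in for the XCom parts of Airflow's `TaskInstance`, used only so the example runs without Airflow installed; real operators would call `ti.xcom_push(...)` / `ti.xcom_pull(...)` with the same shape. All task names and paths are illustrative.]

```python
class TaskContext:
    """Toy stand-in for the XCom-related parts of an Airflow TaskInstance."""
    _store = {}  # (task_id, key) -> value

    def __init__(self, task_id):
        self.task_id = task_id

    def xcom_push(self, key, value):
        TaskContext._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key="return_value"):
        return TaskContext._store.get((task_ids, key))


def extract(ti):
    # The "extract" task writes its output somewhere durable and publishes
    # only a reference to it -- metadata, never the data itself.
    ti.xcom_push("output_path", "/tmp/audio_batch_0001")
    ti.xcom_push("num_files", 128)


def process(ti):
    # The downstream task pulls the reference and fetches the data from the
    # storage layer itself; nothing large travels through XCom.
    path = ti.xcom_pull(task_ids="extract", key="output_path")
    count = ti.xcom_pull(task_ids="extract", key="num_files")
    return f"processing {count} files from {path}"


extract(TaskContext("extract"))
result = process(TaskContext("process"))
print(result)
```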
>
> On Wed, Nov 27, 2019 at 11:40 AM Tomasz Urbaszek <tomasz.urbas...@polidea.com> wrote:
>
> > I agree with Bolke, Airflow is not a data processing tool. Also, it
> > should not become one, as we already have some awesome solutions like
> > Apache Storm, Flink or Beam.
> >
> > Tomek
> >
> > On Wed, Nov 27, 2019 at 10:24 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > > My 2 cents:
> > >
> > > I don't think this makes sense at all, as it goes against the core of
> > > Airflow: Airflow does not do data processing by itself. So the only
> > > thing you should share between tasks is metadata, and that you do
> > > through XCom. We can redesign XCom if you want, but it is also the
> > > only viable option for a distributed environment. Going
> > > semi-distributed will create a whole lot of problems in itself.
> > >
> > > B.
> > >
> > > On 27 November 2019 at 09:13:58, Alex Guziel (alex.guz...@airbnb.com.invalid) wrote:
> > >
> > > Agreed on running before we can crawl. The logical way to do this now
> > > is to group it as one big task with more resources. With respect to
> > > affinity on the same machine, that's basically what it is. I guess
> > > this hinges on how well your solution can handle workloads with
> > > different resource requirements.
> > >
> > > With respect to differing dependencies between tasks, the advantage of
> > > multiple tasks seems minuscule since they have to wait for the others
> > > before ending, so it's pretty much the union of all dependencies, with
> > > some caveats.
> > >
> > > On Tue, Nov 26, 2019 at 8:07 AM James Meickle <jmeic...@quantopian.com.invalid> wrote:
> > >
> > > > I think this idea is running before we can even crawl.
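[Editor's note: Alex's "group it as one big task with more resources" suggestion can be sketched in plain Python. A single callable loads the data once and runs every processing step over the same in-memory object; in a DAG, this would be the `python_callable` of one operator scheduled on a worker with enough memory. All function names and the sample data are illustrative.]

```python
def load_audio_batch():
    # Stand-in for an expensive load (e.g. decoding a batch of audio files).
    return [0.1, 0.2, 0.3, 0.4]


def peak(samples):
    return max(samples)


def mean(samples):
    return sum(samples) / len(samples)


def one_big_task():
    samples = load_audio_batch()   # loaded exactly once
    return {
        "peak": peak(samples),     # each "sub-task" operates on the
        "mean": mean(samples),     # same in-memory data, no reload
    }


print(one_big_task())
```

The trade-off the thread discusses follows directly: the sub-steps are invisible to the scheduler, so they cannot have separate dependencies, retries, or backfills.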
> > > > Before it makes any sense to implement this in Airflow, I think it
> > > > needs three other things:
> > > >
> > > > - A reliable, well-designed component for passing data between tasks
> > > >   first (not XCom!); where shared memory is an _implementation_ of
> > > >   data passing
> > > > - An understanding of temporary resources (not encoded as explicit
> > > >   DAG steps but stood up/torn down implicitly); where the shared
> > > >   memory _is_ a temporary resource
> > > > - An understanding of cooperative scheduling and retrying (what if
> > > >   one half fails but the other half is still running?); where this
> > > >   is required to use shared memory safely without subtle race
> > > >   conditions
> > > >
> > > > And as stated, this is easy-ish on the local executor and crushingly
> > > > hard with anything else. Yet in the cases where you need this,
> > > > you... probably don't want to be running on the local executor.
> > > >
> > > > On Tue, Nov 26, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > >
> > > > > *TL;DR; Discuss whether shared-memory data sharing for some tasks
> > > > > is an interesting feature for future Airflow.*
> > > > >
> > > > > I had a few discussions recently with several Airflow users
> > > > > (including on Slack [1] and in person at the Warsaw Airflow
> > > > > meetup) about using shared memory for inter-task communication.
> > > > >
> > > > > Airflow is currently not good for such a case. It sounds doable,
> > > > > but fairly complex to implement (and it modifies the Airflow
> > > > > paradigm a bit). I am not 100% sure it's a good idea to have such
> > > > > a feature in the future.
> > > > >
> > > > > I see the need for it and I like it; however, I would love to ask
> > > > > you for opinions.
> > > > >
> > > > > *Context*
> > > > >
> > > > > The case is to have several independent tasks using a lot of
> > > > > temporary data in memory. They either run in parallel and share
> > > > > loaded data, or use shared memory to pass results between tasks.
> > > > > Example: machine learning (like audio processing). It makes sense
> > > > > to load the audio files only once (to memory) and run several
> > > > > tasks on that loaded data.
> > > > >
> > > > > The best way to achieve this now is to combine such
> > > > > memory-sharing tasks into a single operator (docker-compose, for
> > > > > example?) and run them as a single Airflow task. But maybe those
> > > > > tasks could still be modelled as separate tasks in the Airflow
> > > > > DAG. One benefit is that there might be different dependencies
> > > > > for different tasks; processing results from some tasks could be
> > > > > sent independently using different - existing - operators.
> > > > >
> > > > > As a workaround, we can play with queues and have one dedicated
> > > > > machine to run all such tasks, but that has multiple limitations.
> > > > >
> > > > > *High-level idea*
> > > > >
> > > > > At a high level, it would require defining some affinity between
> > > > > tasks to make sure that:
> > > > >
> > > > > 1) they are all executed on the same worker machine
> > > > > 2) the processes remain in memory until all tasks finish, for
> > > > >    data sharing (even if there is a dependency between the tasks)
> > > > > 3) back-filling acts on the whole group of such tasks as a
> > > > >    "single unit"
> > > > >
> > > > > I would love to hear your feedback.
> > > > >
> > > > > J.
> > > > >
> > > > > [1] Slack discussion on shared memory:
> > > > > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> > > > > --
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > M: +48 660 796 129
> >
> > --
> > Tomasz Urbaszek
> > Polidea <https://www.polidea.com/> | Junior Software Engineer
> > M: +48 505 628 493
> > E: tomasz.urbas...@polidea.com