If memory is shared across tasks, they are by definition not idempotent,
which can be troublesome. What if you have a chain of 3 tasks and the last
one fails while operating on memory that came from task number 2? The
whole chain may have to be re-executed, which to me sounds like it really
should be just one task.

I'd say just write a new operator that does more (most of the logic should
be in hooks anyway; operators should generally just be successions of simple
hook calls), or write operators that can "checkpoint" and make assumptions
about pushing/pulling data (say, Parquet on S3).
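
Roughly what I have in mind -- a sketch only; the bucket, keys and the
_process/_publish helpers are placeholders, and I'm quoting the S3Hook calls
from memory:

from airflow.hooks.S3_hook import S3Hook
from airflow.models import BaseOperator


class CheckpointedProcessOperator(BaseOperator):
    """Composes simple hook calls and checkpoints the intermediate result
    as Parquet on S3, so a retried task can skip the expensive step."""

    def __init__(self, src_key, checkpoint_key, bucket, *args, **kwargs):
        super(CheckpointedProcessOperator, self).__init__(*args, **kwargs)
        self.src_key = src_key
        self.checkpoint_key = checkpoint_key
        self.bucket = bucket

    def execute(self, context):
        s3 = S3Hook(aws_conn_id="aws_default")

        # If a previous (failed) attempt already wrote the checkpoint,
        # skip straight to the cheap final step.
        if not s3.check_for_key(self.checkpoint_key, bucket_name=self.bucket):
            local_parquet = self._process(s3, self.src_key)  # heavy, in-memory step
            s3.load_file(filename=local_parquet, key=self.checkpoint_key,
                         bucket_name=self.bucket, replace=True)

        self._publish(s3, self.checkpoint_key)  # cheap, idempotent step

    def _process(self, s3, src_key):
        # Placeholder: pull the input, do the expensive in-memory work,
        # write Parquet locally and return the local file path.
        raise NotImplementedError

    def _publish(self, s3, checkpoint_key):
        # Placeholder: push the checkpointed result to its destination.
        raise NotImplementedError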

Max

On Wed, Nov 27, 2019 at 2:48 AM Jarek Potiuk <jarek.pot...@polidea.com>
wrote:

> Yeah. I was thinking about the new landing page/website, where we
> specifically have a "Use cases" section and can describe some actual
> examples (and counter-examples specifically) :).
>
> J.
>
> On Wed, Nov 27, 2019 at 11:43 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > From our website :-)
> >
> > "Airflow *is not* a data streaming solution. Tasks do not move data from
> > one to the other (though tasks can exchange metadata!). Airflow is not in
> > the Spark Streaming <http://spark.apache.org/streaming/> or Storm
> > <https://storm.apache.org/> space, it is more comparable to Oozie
> > <http://oozie.apache.org/> or Azkaban <https://azkaban.github.io/>.”
> >
> > B.
> >
> >
> > On 27 November 2019 at 11:41:40, Jarek Potiuk (jarek.pot...@polidea.com)
> > wrote:
> >
> > Listening to all those comments reaffirms the gut feeling I had. Even if I
> > like the idea of optimisations, I think it makes sense to say "it's not
> > really an Airflow-domain problem". I now think that XCom is good for what
> > it is for, and introducing a "generic" data-passing mechanism goes way
> > beyond what Airflow is designed for. In the end, Airflow is merely an
> > orchestrator, not a data processor.
> >
> > The last time I spoke to someone mentioning this case, my answer was "use
> > your own operator and run it on a big machine via Docker / Docker Compose /
> > the Kubernetes operator". I also think that even enabling triggers from
> > within an operator is really premature optimisation (and indeed the benefit
> > would be minuscule).
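> >
> > For the "big machine" route, roughly what I mean (a sketch only -- the
> > resource values, image and entrypoint are made up, and I am assuming the
> > Airflow 1.10 contrib import path and the dict form of resources):
> >
> > from airflow.contrib.operators.kubernetes_pod_operator import (
> >     KubernetesPodOperator,
> > )
> >
> > # One big task that does all of the shared-memory work inside a single
> > # pod, scheduled onto a node that can actually fit it.
> > process_audio = KubernetesPodOperator(
> >     task_id="process_audio_batch",
> >     name="process-audio-batch",
> >     namespace="default",
> >     image="example.registry/audio-pipeline:latest",  # hypothetical image
> >     cmds=["python", "-m", "pipeline.run_all"],       # hypothetical entrypoint
> >     resources={
> >         "request_memory": "64Gi",
> >         "request_cpu": "8",
> >         "limit_memory": "64Gi",
> >         "limit_cpu": "8",
> >     },
> >     is_delete_operator_pod=True,
> >     dag=dag,  # assumes an existing DAG object in the same file
> > )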
> >
> > So from my point of view, I do not see it as something that we should
> > focus on at all as a community. But it's good that we are discussing it.
> >
> > Learning from the whole discussion: I think we should - on our new website
> > - mention not only the use cases that Airflow is good at, but also explain
> > which cases it is not designed for. That might help people understand where
> > Airflow's limits are. It would be great if people could answer those
> > questions based on the official website.
> >
> > I made a mental note to propose a PR to the website when it's open for
> > PRs.
> >
> > J.
> >
> > On Wed, Nov 27, 2019 at 11:40 AM Tomasz Urbaszek <
> > tomasz.urbas...@polidea.com> wrote:
> >
> > > I agree with Bolke: Airflow is not a data processing tool. Also, it
> > > should not become one, as we already have some awesome solutions like
> > > Apache Storm, Flink or Beam.
> > >
> > > Tomek
> > >
> > >
> > > On Wed, Nov 27, 2019 at 10:24 AM Bolke de Bruin <bdbr...@gmail.com>
> > wrote:
> > >
> > > > My 2 cents:
> > > >
> > > > I don't think this makes sense at all, as it goes against the core of
> > > > Airflow: Airflow does not do data processing by itself. So the only
> > > > thing you should share between tasks is metadata, and that you do
> > > > through XCom. We can redesign XCom if you want, but it is also the only
> > > > viable option for a distributed environment. Going semi-distributed
> > > > will create a whole lot of problems in itself.
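> > > >
> > > > For reference, the kind of metadata exchange I mean (just a sketch; the
> > > > task names, the S3 path and the surrounding `dag` object are made up,
> > > > and only a *pointer* to the data is passed, never the data itself):
> > > >
> > > > from airflow.operators.python_operator import PythonOperator
> > > >
> > > > def produce(**context):
> > > >     # Push a reference to where the data lives, not the data.
> > > >     context["ti"].xcom_push(key="output_path",
> > > >                             value="s3://some-bucket/run/output.parquet")
> > > >
> > > > def consume(**context):
> > > >     path = context["ti"].xcom_pull(task_ids="produce_task",
> > > >                                    key="output_path")
> > > >     # ...load the data from `path` with whatever processing tool you use
> > > >
> > > > produce_task = PythonOperator(task_id="produce_task",
> > > >                               python_callable=produce,
> > > >                               provide_context=True, dag=dag)
> > > > consume_task = PythonOperator(task_id="consume_task",
> > > >                               python_callable=consume,
> > > >                               provide_context=True, dag=dag)
> > > > produce_task >> consume_task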
> > > >
> > > > B.
> > > >
> > > >
> > > > On 27 November 2019 at 09:13:58, Alex Guziel
> > > > (alex.guz...@airbnb.com.invalid) wrote:
> > > >
> > > > Agreed on running before we can crawl. The logical way to do this now
> > > > is to group it as one big task with more resources. With respect to
> > > > affinity on the same machine, that's basically what it is. I guess this
> > > > hinges on how well your solution can handle workloads with different
> > > > resource requirements.
> > > >
> > > > With respect to differing dependencies between tasks, the advantage of
> > > > multiple tasks seems minuscule, since they have to wait for the others
> > > > before ending, so it's pretty much the union of all dependencies, with
> > > > some caveats.
> > > >
> > > > On Tue, Nov 26, 2019 at 8:07 AM James Meickle
> > > > <jmeic...@quantopian.com.invalid> wrote:
> > > >
> > > > > I think this idea is running before we can even crawl. Before it
> > > > > makes any sense to implement this in Airflow, I think it needs three
> > > > > other things:
> > > > >
> > > > > - A reliable, well-designed component for passing data between tasks
> > > > > first (not XCom!); where shared memory is an _implementation_ of data
> > > > > passing
> > > > > - An understanding of temporary resources (not encoded as explicit
> > > > > DAG steps but stood up/torn down implicitly); where the shared memory
> > > > > _is_ a temporary resource
> > > > > - An understanding of cooperative scheduling and retrying (what if
> > > > > one half fails but the other half is still running?); where this is
> > > > > required to use shared memory safely without subtle race conditions
> > > > >
> > > > > And as stated, this is easy-ish on the local executor and crushingly
> > > > > hard with anything else. Yet in the cases where you need this, you...
> > > > > probably don't want to be running on the local executor.
> > > > >
> > > > > On Tue, Nov 26, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > > > > wrote:
> > > > >
> > > > > > *TL;DR: Discuss whether shared-memory data sharing for some tasks
> > > > > > is an interesting feature for future Airflow.*
> > > > > >
> > > > > > I had a few discussions recently with several Airflow users
> > > > > > (including on Slack [1] and in person at the Warsaw Airflow meetup)
> > > > > > about using shared memory for inter-task communication.
> > > > > >
> > > > > > Airflow is not currently good for such a case. It sounds doable,
> > > > > > but fairly complex to implement (and it modifies the Airflow
> > > > > > paradigm a bit). I am not 100% sure it's a good idea to have such a
> > > > > > feature in the future.
> > > > > >
> > > > > > I see the need for it and I like it; however, I would love to ask
> > > > > > you for your opinions.
> > > > > >
> > > > > > *Context*
> > > > > >
> > > > > > The case is to have several independent tasks using a lot of
> > > > > > temporary data in memory. They either run in parallel and share
> > > > > > loaded data, or use shared memory to pass results between tasks.
> > > > > > Example: machine learning workloads such as audio processing, where
> > > > > > it makes sense to load the audio files into memory only once and
> > > > > > run several tasks on that loaded data.
> > > > > >
> > > > > > The best way to achieve it now is to combine such memory-sharing
> > > > > > tasks into a single operator (Docker Compose, for example?) and run
> > > > > > them as a single Airflow task. But maybe those tasks could still be
> > > > > > modelled as separate tasks in the Airflow DAG. One benefit is that
> > > > > > there might be different dependencies for different tasks;
> > > > > > processing results from some tasks could be sent independently
> > > > > > using different, existing operators.
> > > > > >
> > > > > > As a workaround, we can play with queues and have one dedicated
> > > > > > machine run all such tasks, but that has multiple limitations.
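> > > > > >
> > > > > > (To illustrate the queue workaround -- a sketch only; the queue
> > > > > > name, callables and `dag` object are made up. All memory-sharing
> > > > > > tasks are pinned to a dedicated Celery queue, and a single worker
> > > > > > for that queue runs on the big machine.)
> > > > > >
> > > > > > from airflow.operators.python_operator import PythonOperator
> > > > > >
> > > > > > load_audio = PythonOperator(task_id="load_audio",
> > > > > >                             python_callable=load_audio_fn,
> > > > > >                             queue="big_memory_machine", dag=dag)
> > > > > > extract = PythonOperator(task_id="extract_features",
> > > > > >                          python_callable=extract_features_fn,
> > > > > >                          queue="big_memory_machine", dag=dag)
> > > > > > load_audio >> extract
> > > > > >
> > > > > > # and on the dedicated machine (CeleryExecutor):
> > > > > > #   airflow worker -q big_memory_machine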
> > > > > >
> > > > > > *High-level idea*
> > > > > >
> > > > > > At a high level, it would require defining some affinity between
> > > > > > tasks to make sure that:
> > > > > >
> > > > > > 1) they are all executed on the same worker machine,
> > > > > > 2) the processes remain in memory until all tasks finish, for data
> > > > > > sharing (even if there is a dependency between the tasks),
> > > > > > 3) back-filling acts on the whole group of such tasks as a "single
> > > > > > unit".
> > > > > >
> > > > > > I would love to hear your feedback.
> > > > > >
> > > > > > J
> > > > > >
> > > > > >
> > > > > > [1] Slack discussion on shared memory:
> > > > > > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> > > > > >
> > > > > > J.
> > > > > > --
> > > > > >
> > > > > > Jarek Potiuk
> > > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > >
> > > > > > M: +48 660 796 129 <+48660796129>
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Tomasz Urbaszek
> > > Polidea <https://www.polidea.com/> | Junior Software Engineer
> > >
> > > M: +48 505 628 493 <+48505628493>
> > > E: tomasz.urbas...@polidea.com <tomasz.urbasz...@polidea.com>
> > >
> > > Unique Tech
> > > Check out our projects! <https://www.polidea.com/our-work>
> > >
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> >
> >
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129 <+48660796129>
>
