I agree with Bolke: Airflow is not a data processing tool. It also should not
become one, as we already have awesome solutions for that, like Apache Storm,
Flink or Beam.

Tomek


On Wed, Nov 27, 2019 at 10:24 AM Bolke de Bruin <bdbr...@gmail.com> wrote:

> My 2 cents:
>
> I don’t think this makes sense at all, as it goes against the core of
> Airflow: Airflow does not do data processing by itself. So the only thing
> you should share between tasks is metadata, and that you do through XCom.
> We can redesign XCom if you want, but it is also the only viable option for
> a distributed environment. Going semi-distributed will create a whole lot
> of problems in itself.
>
> B.
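
For reference, a minimal sketch of the XCom pattern described above, passing
only a small piece of metadata (here, a file path) between two tasks. It
assumes Airflow 1.10-style PythonOperators; the DAG id, path and callables
are illustrative only:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def produce(**context):
        # Push only lightweight metadata (a path), never the data itself.
        context['ti'].xcom_push(key='audio_path', value='/tmp/batch_0001.wav')

    def consume(**context):
        # Pull the reference and load the heavy data locally inside this task.
        path = context['ti'].xcom_pull(task_ids='produce', key='audio_path')
        print('would load audio from %s' % path)

    with DAG('xcom_metadata_example', start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        produce_task = PythonOperator(task_id='produce', python_callable=produce,
                                      provide_context=True)
        consume_task = PythonOperator(task_id='consume', python_callable=consume,
                                      provide_context=True)
        produce_task >> consume_task
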
>
>
> On 27 November 2019 at 09:13:58, Alex Guziel (alex.guz...@airbnb.com.invalid)
> wrote:
>
> Agreed on running before we can crawl. The logical way to do this now is to
> group it as one big task with more resources. With respect to affinity on
> the same machine, that's basically what it is. I guess this hinges on how
> well your solution can handle workloads with different resource requirements.
>
> With respect to differing dependencies between tasks, the advantage of
> multiple tasks seems minuscule, since they have to wait for the others
> before ending, so it's pretty much the union of all dependencies, with some
> caveats.
>
> On Tue, Nov 26, 2019 at 8:07 AM James Meickle
> <jmeic...@quantopian.com.invalid> wrote:
>
> > I think this idea is running before we can even crawl. Before it makes
> > any sense to implement this in Airflow, I think it needs three other
> > things:
> >
> > - A reliable, well-designed component for passing data between tasks
> > first (not XCom!); where shared memory is an _implementation_ of data
> > passing
> > - An understanding of temporary resources (not encoded as explicit DAG
> > steps but stood up/torn down implicitly); where the shared memory _is_ a
> > temporary resource
> > - An understanding of cooperative scheduling and retrying (what if one
> > half fails but the other half is still running?); where this is required
> > to use shared memory safely without subtle race conditions
> >
> > And as stated, this is easy-ish on local executor and crushingly hard
> > with anything else. Yet in the cases where you need this, you... probably
> > don't want to be running on local executor.
> >
> > On Tue, Nov 26, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > > *TL;DR: Discuss whether shared-memory data sharing for some tasks is an
> > > interesting feature for future Airflow.*
> > >
> > > I had a few discussions recently with several Airflow users (including
> > > on Slack [1] and in person at the Warsaw Airflow meetup) about using
> > > shared memory for inter-task communication.
> > >
> > > Airflow is currently not good for such a case. It sounds doable, but
> > > fairly complex to implement (and it modifies the Airflow paradigm a
> > > bit). I am not 100% sure it's a good idea to have such a feature in
> > > the future.
> > >
> > > I see the need for it and I like it; however, I would love to ask for
> > > your opinions.
> > >
> > > *Context*
> > >
> > > The case is to have several independent tasks using a lot of temporary
> > > data in memory. They either run in parallel and share loaded data, or
> > > use shared memory to pass results between tasks. An example is machine
> > > learning (like audio processing): it makes sense to load the audio
> > > files into memory only once and run several tasks on that loaded data.
> > >
> > > The best way to achieve this now is to combine such memory-sharing
> > > tasks into a single operator (Docker Compose, for example?) and run
> > > them as a single Airflow task. But maybe those tasks could still be
> > > modelled as separate tasks in the Airflow DAG. One benefit is that
> > > there might be different dependencies for different tasks; processing
> > > results from some tasks could be sent independently using different,
> > > existing operators.
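
A minimal sketch of the "single task" approach mentioned above, assuming a
plain PythonOperator; load_audio, transcribe and classify are hypothetical
helpers standing in for the real processing steps:

    from airflow.operators.python_operator import PythonOperator

    def process_batch(**context):
        # load_audio/transcribe/classify are hypothetical helpers.
        audio = load_audio('/data/batch')  # loaded once, kept in this process
        transcript = transcribe(audio)     # sub-step 1 reuses the in-memory data
        labels = classify(audio)           # sub-step 2 reuses the same data
        return {'n_segments': len(transcript), 'labels': labels}

    process_audio = PythonOperator(task_id='process_audio_batch',
                                   python_callable=process_batch,
                                   provide_context=True,
                                   dag=dag)  # assumes an existing DAG object

The trade-off discussed in the thread is that the sub-steps are then no
longer separate Airflow tasks with their own dependencies and retries.
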
> > >
> > > As a workaround, we can play with queues and have one dedicated
> > > machine run all such tasks, but this has multiple limitations.
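
For illustration, the queue workaround usually amounts to pinning such tasks
to a dedicated queue and running a single worker on the big machine. This
assumes CeleryExecutor; the queue name is made up and process_batch is the
hypothetical callable from the sketch above:

    # Only workers subscribed to the 'shared_mem' queue will pick this up.
    heavy_task = PythonOperator(task_id='shared_memory_step',
                                python_callable=process_batch,
                                queue='shared_mem',
                                dag=dag)  # assumes an existing DAG object

and, on the dedicated machine (Airflow 1.10 CLI):

    airflow worker --queues shared_mem
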
> > >
> > > *High-level idea*
> > >
> > > At a high level, it would require defining some affinity between tasks
> > > to make sure that:
> > >
> > > 1) they are all executed on the same worker machine,
> > > 2) the processes remain in memory until all tasks finish, for data
> > > sharing (even if there is a dependency between the tasks),
> > > 3) back-filling acts on the whole group of such tasks as a "single
> > > unit".
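
Purely as an illustration of the affinity idea above, and not any existing
Airflow API, such a grouping might look roughly like this; every name and
parameter below is invented:

    # Hypothetical API sketch: none of this exists in Airflow today.
    with SharedMemoryGroup(group_id='audio_shm',
                           same_worker=True,            # 1) co-locate on one worker
                           keep_alive_until_done=True,  # 2) keep processes/memory up
                           backfill_as_unit=True):      # 3) backfill the group atomically
        load = PythonOperator(task_id='load_audio', python_callable=load_fn)
        stt = PythonOperator(task_id='transcribe', python_callable=transcribe_fn)
        cls = PythonOperator(task_id='classify', python_callable=classify_fn)
        load >> [stt, cls]
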
> > >
> > > I would love to hear your feedback.
> > >
> > > J
> > >
> > >
> > > [1] Slack discussion on shared memory:
> > > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> > >
> > > J.
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> >
>


-- 

Tomasz Urbaszek
Polidea <https://www.polidea.com/> | Junior Software Engineer

M: +48 505 628 493 <+48505628493>
E: tomasz.urbas...@polidea.com <tomasz.urbasz...@polidea.com>

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>
