I agree with Bolke: Airflow is not a data processing tool. It also should not become one, as we already have some awesome solutions for that, like Apache Storm, Flink or Beam.
Tomek

On Wed, Nov 27, 2019 at 10:24 AM Bolke de Bruin <bdbr...@gmail.com> wrote:

> My 2 cents:
>
> I don't think this makes sense at all, as it goes against the core of
> Airflow: Airflow does not do data processing by itself. So the only thing
> you should share between tasks is metadata, and that you do through XCom.
> We can redesign XCom if you want, but it is also the only viable option
> for a distributed environment. Going semi-distributed will create a whole
> lot of problems in itself.
>
> B.
>
> On 27 November 2019 at 09:13:58, Alex Guziel (alex.guz...@airbnb.com.invalid)
> wrote:
>
> Agreed on running before we can crawl. The logical way to do this now is
> to group it as one big task with more resources. With respect to affinity
> on the same machine, that's basically what it is. I guess this hinges on
> how well your solution can handle workloads with different resource
> requirements.
>
> With respect to differing dependencies between tasks, the advantage of
> multiple tasks seems minuscule, since they have to wait for the others
> before ending, so it's pretty much the union of all dependencies, with
> some caveats.
>
> On Tue, Nov 26, 2019 at 8:07 AM James Meickle
> <jmeic...@quantopian.com.invalid> wrote:
>
> > I think this idea is running before we can even crawl. Before it makes
> > any sense to implement this in Airflow, I think it needs three other
> > things:
> >
> > - A reliable, well-designed component for passing data between tasks
> > first (not XCom!), where shared memory is an _implementation_ of data
> > passing
> > - An understanding of temporary resources (not encoded as explicit DAG
> > steps but stood up/torn down implicitly), where the shared memory _is_
> > a temporary resource
> > - An understanding of cooperative scheduling and retrying (what if one
> > half fails but the other half is still running?), where this is
> > required to use shared memory safely without subtle race conditions
> >
> > And as stated, this is easy-ish on the local executor and crushingly
> > hard with anything else. Yet in the cases where you need this, you...
> > probably don't want to be running on the local executor.
> >
> > On Tue, Nov 26, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> > > *TL;DR: Discuss whether shared-memory data sharing for some tasks is
> > > an interesting feature for future Airflow.*
> > >
> > > I had a few discussions recently with several Airflow users
> > > (including on Slack [1] and in person at the Warsaw Airflow meetup)
> > > about using shared memory for inter-task communication.
> > >
> > > Airflow is currently not well suited to such a case. It sounds
> > > doable, but fairly complex to implement (and it modifies the Airflow
> > > paradigm a bit). I am not 100% sure it's a good idea to have such a
> > > feature in the future.
> > >
> > > I see the need for it and I like it; however, I would love to ask you
> > > for opinions.
> > >
> > > *Context*
> > >
> > > The case is to have several independent tasks using a lot of
> > > temporary data in memory. They either run in parallel and share
> > > loaded data, or use shared memory to pass results between tasks.
> > > Example: machine learning (such as audio processing). It makes sense
> > > to load the audio files into memory only once and run several tasks
> > > on the loaded data.
> > >
> > > The best way to achieve this now is to combine such memory-sharing
> > > tasks into a single operator (Docker Compose, for example?) and run
> > > them as a single Airflow task.
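For illustration, a minimal sketch of the "one big task" / single-operator workaround described above, using Airflow 1.10-era imports; the audio-processing functions are hypothetical stand-ins, and only small metadata is returned so nothing large ends up in XCom:

    # Sketch only: all memory-sharing steps run inside one PythonOperator,
    # so the loaded data never has to leave the worker process.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def load_audio():
        # Hypothetical stand-in for loading large audio files into memory.
        return [b"\x00" * 1024] * 100


    def extract_features(audio):
        # Hypothetical stand-in for a feature-extraction step.
        return [len(clip) for clip in audio]


    def classify(features):
        # Hypothetical stand-in for a model-scoring step.
        return sum(features) / len(features)


    def process_audio_in_one_task():
        audio = load_audio()                # load the big data exactly once
        features = extract_features(audio)  # "tasks" become plain function calls
        score = classify(features)          # sharing the in-memory objects
        # Only small metadata is returned (and pushed to XCom), never the data.
        return {"num_clips": len(audio), "score": score}


    with DAG(
        dag_id="audio_processing_single_task",
        start_date=datetime(2019, 11, 1),
        schedule_interval=None,
    ) as dag:
        PythonOperator(
            task_id="process_audio",
            python_callable=process_audio_in_one_task,
        )

The obvious trade-off, as noted in the thread, is that the individual steps lose their own retries, dependencies and operators.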
> > > But maybe those tasks could still be modelled as separate tasks in
> > > the Airflow DAG. One benefit is that different tasks might have
> > > different dependencies; processing results from some tasks could be
> > > sent on independently using different, existing operators.
> > >
> > > As a workaround, we can play with queues and have one dedicated
> > > machine run all such tasks, but that has multiple limitations.
> > >
> > > *High-level idea*
> > >
> > > At a high level, it would require defining some affinity between
> > > tasks to make sure that:
> > >
> > > 1) they are all executed on the same worker machine,
> > > 2) the processes remain in memory until all tasks finish, for data
> > > sharing (even if there is a dependency between the tasks),
> > > 3) back-filling acts on the whole group of such tasks as a "single
> > > unit".
> > >
> > > I would love to hear your feedback.
> > >
> > > J
> > >
> > > [1] Slack discussion on shared memory:
> > > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> > >
> > > J.
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>

--
Tomasz Urbaszek
Polidea <https://www.polidea.com/> | Junior Software Engineer

M: +48 505 628 493 <+48505628493>
E: tomasz.urbas...@polidea.com <tomasz.urbasz...@polidea.com>

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>
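For completeness, a minimal sketch of the queue workaround Jarek mentions (pinning the memory-sharing tasks to one dedicated machine), assuming a CeleryExecutor deployment; the queue name, DAG id and callables are hypothetical, and a dedicated worker would have to be started on the target machine with something like `airflow worker -q shared_memory` (Airflow 1.10 CLI). Even then, data only survives between task processes if it is parked somewhere on that machine (e.g. a ramdisk), which is exactly the limitation the proposal is trying to address:

    # Sketch only: route the memory-sharing tasks to a dedicated queue that a
    # single worker machine consumes, so they at least run on the same host.
    # Queue name, DAG id and callables are hypothetical.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def load_audio_to_local_cache():
        # e.g. decode audio into a local ramdisk / shared memory segment
        pass


    def run_model_on_cache():
        # reads whatever load_audio_to_local_cache left on this machine
        pass


    with DAG(
        dag_id="audio_processing_shared_queue",
        start_date=datetime(2019, 11, 1),
        schedule_interval=None,
    ) as dag:
        load = PythonOperator(
            task_id="load_audio",
            python_callable=load_audio_to_local_cache,
            queue="shared_memory",  # BaseOperator arg, honored by CeleryExecutor
        )
        process = PythonOperator(
            task_id="run_model",
            python_callable=run_model_on_cache,
            queue="shared_memory",
        )
        load >> process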