Yeah. I was thinking about the new landing page/website, where we specifically have a "Use cases" section and can describe some actual examples (and counter-examples specifically) :).
J.

On Wed, Nov 27, 2019 at 11:43 AM Bolke de Bruin <bdbr...@gmail.com> wrote:

> From our website :-)
>
> "Airflow *is not* a data streaming solution. Tasks do not move data from
> one to the other (though tasks can exchange metadata!). Airflow is not in
> the Spark Streaming <http://spark.apache.org/streaming/> or Storm
> <https://storm.apache.org/> space, it is more comparable to Oozie
> <http://oozie.apache.org/> or Azkaban <https://azkaban.github.io/>."
>
> B.
>
> On 27 November 2019 at 11:41:40, Jarek Potiuk (jarek.pot...@polidea.com) wrote:
>
> Listening to all those comments, that reaffirms the gut feeling I had.
> Even if I like the idea of optimisations, I think it makes sense to say
> "it's not really an Airflow-domain problem". I think now that XCom is good
> at what it is for, and introducing a "generic" data-passing mechanism goes
> way beyond what Airflow is designed for. In the end, Airflow is merely an
> orchestrator, not a data processor.
>
> The last time I spoke to someone mentioning this case, my answer was: "Use
> your own operator and run it on a big machine via Docker / docker-compose /
> the Kubernetes operator." I also think even enabling triggers from within
> an operator is really premature optimisation (and indeed the benefit would
> be minuscule).
>
> So from my point of view, I do not see it as something we should focus on
> at all as a community. But it's good we are discussing it.
>
> Learning from the whole discussion: I think we should - on our new website
> - mention not only the use cases that Airflow is good at, but also explain
> the cases it is not designed for. That might help people understand where
> Airflow's limits are. It would be great if people could answer those
> questions based on the official website.
>
> I made a mental note to propose a PR to the website when it's open for PRs.
>
> J.
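[Editor's note: the "tasks can exchange metadata via XCom" point quoted above can be sketched in plain Python. The `TaskContext` class below is a toy in-memory stand-in for the XCom parts of Airflow's `TaskInstance`, used only so the example runs without Airflow installed; real operators would call `ti.xcom_push(...)` / `ti.xcom_pull(...)` with the same shape. All task names and paths are illustrative.]

```python
class TaskContext:
    """Toy stand-in for the XCom-related parts of an Airflow TaskInstance."""
    _store = {}  # (task_id, key) -> value

    def __init__(self, task_id):
        self.task_id = task_id

    def xcom_push(self, key, value):
        TaskContext._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key="return_value"):
        return TaskContext._store.get((task_ids, key))


def extract(ti):
    # The "extract" task writes its output somewhere durable and publishes
    # only a reference to it -- metadata, never the data itself.
    ti.xcom_push("output_path", "/tmp/audio_batch_0001")
    ti.xcom_push("num_files", 128)


def process(ti):
    # The downstream task pulls the reference and fetches the data from the
    # storage layer itself; nothing large travels through XCom.
    path = ti.xcom_pull(task_ids="extract", key="output_path")
    count = ti.xcom_pull(task_ids="extract", key="num_files")
    return f"processing {count} files from {path}"


extract(TaskContext("extract"))
result = process(TaskContext("process"))
print(result)
```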
>
> On Wed, Nov 27, 2019 at 11:40 AM Tomasz Urbaszek <tomasz.urbas...@polidea.com> wrote:
>
> > I agree with Bolke, Airflow is not a data processing tool. Also, it
> > should not become one, as we already have some awesome solutions like
> > Apache Storm, Flink or Beam.
> >
> > Tomek
> >
> > On Wed, Nov 27, 2019 at 10:24 AM Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > > My 2 cents:
> > >
> > > I don't think this makes sense at all, as it goes against the core of
> > > Airflow: Airflow does not do data processing by itself. So the only
> > > thing you should share between tasks is metadata, and that you do
> > > through XCom. We can redesign XCom if you want, but it is also the
> > > only viable option for a distributed environment. Going
> > > semi-distributed will create a whole lot of problems in itself.
> > >
> > > B.
> > >
> > > On 27 November 2019 at 09:13:58, Alex Guziel (alex.guz...@airbnb.com.invalid) wrote:
> > >
> > > Agreed on running before we can crawl. The logical way to do this now
> > > is to group it as one big task with more resources. With respect to
> > > affinity on the same machine, that's basically what it is. I guess
> > > this hinges on how well your solution can handle workloads with
> > > different resource requirements.
> > >
> > > With respect to differing dependencies between tasks, the advantage of
> > > multiple tasks seems minuscule since they have to wait for the others
> > > before ending, so it's pretty much the union of all dependencies, with
> > > some caveats.
> > >
> > > On Tue, Nov 26, 2019 at 8:07 AM James Meickle <jmeic...@quantopian.com.invalid> wrote:
> > >
> > > > I think this idea is running before we can even crawl.
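[Editor's note: Alex's "group it as one big task with more resources" suggestion can be sketched in plain Python. A single callable loads the data once and runs every processing step over the same in-memory object; in a DAG, this would be the `python_callable` of one operator scheduled on a worker with enough memory. All function names and the sample data are illustrative.]

```python
def load_audio_batch():
    # Stand-in for an expensive load (e.g. decoding a batch of audio files).
    return [0.1, 0.2, 0.3, 0.4]


def peak(samples):
    return max(samples)


def mean(samples):
    return sum(samples) / len(samples)


def one_big_task():
    samples = load_audio_batch()   # loaded exactly once
    return {
        "peak": peak(samples),     # each "sub-task" operates on the
        "mean": mean(samples),     # same in-memory data, no reload
    }


print(one_big_task())
```

The trade-off the thread discusses follows directly: the sub-steps are invisible to the scheduler, so they cannot have separate dependencies, retries, or backfills.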
> > > > Before it makes any sense to implement this in Airflow, I think it
> > > > needs three other things:
> > > >
> > > > - A reliable, well-designed component for passing data between tasks
> > > >   first (not XCom!); where shared memory is an _implementation_ of
> > > >   data passing
> > > > - An understanding of temporary resources (not encoded as explicit
> > > >   DAG steps but stood up/torn down implicitly); where the shared
> > > >   memory _is_ a temporary resource
> > > > - An understanding of cooperative scheduling and retrying (what if
> > > >   one half fails but the other half is still running?); where this
> > > >   is required to use shared memory safely without subtle race
> > > >   conditions
> > > >
> > > > And as stated, this is easy-ish on the local executor and crushingly
> > > > hard with anything else. Yet in the cases where you need this,
> > > > you... probably don't want to be running on the local executor.
> > > >
> > > > On Tue, Nov 26, 2019 at 6:22 AM Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > > >
> > > > > *TL;DR; Discuss whether shared-memory data sharing for some tasks
> > > > > is an interesting feature for future Airflow.*
> > > > >
> > > > > I had a few discussions recently with several Airflow users
> > > > > (including on Slack [1] and in person at the Warsaw Airflow
> > > > > meetup) about using shared memory for inter-task communication.
> > > > >
> > > > > Airflow is currently not good for such a case. It sounds doable,
> > > > > but fairly complex to implement (and it modifies the Airflow
> > > > > paradigm a bit). I am not 100% sure it's a good idea to have such
> > > > > a feature in the future.
> > > > >
> > > > > I see the need for it and I like it; however, I would love to ask
> > > > > you for opinions.
> > > > >
> > > > > *Context*
> > > > >
> > > > > The case is to have several independent tasks using a lot of
> > > > > temporary data in memory. They either run in parallel and share
> > > > > loaded data, or use shared memory to pass results between tasks.
> > > > > Example: machine learning (like audio processing). It makes sense
> > > > > to load the audio files only once (to memory) and run several
> > > > > tasks on that loaded data.
> > > > >
> > > > > The best way to achieve this now is to combine such
> > > > > memory-sharing tasks into a single operator (docker-compose, for
> > > > > example?) and run them as a single Airflow task. But maybe those
> > > > > tasks could still be modelled as separate tasks in the Airflow
> > > > > DAG. One benefit is that there might be different dependencies
> > > > > for different tasks; processing results from some tasks could be
> > > > > sent independently using different - existing - operators.
> > > > >
> > > > > As a workaround, we can play with queues and have one dedicated
> > > > > machine to run all such tasks, but that has multiple limitations.
> > > > >
> > > > > *High-level idea*
> > > > >
> > > > > At a high level, it would require defining some affinity between
> > > > > tasks to make sure that:
> > > > >
> > > > > 1) they are all executed on the same worker machine
> > > > > 2) the processes remain in memory until all tasks finish, for
> > > > >    data sharing (even if there is a dependency between the tasks)
> > > > > 3) back-filling acts on the whole group of such tasks as a
> > > > >    "single unit"
> > > > >
> > > > > I would love to hear your feedback.
> > > > >
> > > > > J.
> > > > >
> > > > > [1] Slack discussion on shared memory:
> > > > > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> > > > > --
> > > > > Jarek Potiuk
> > > > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > > > > M: +48 660 796 129
> >
> > --
> > Tomasz Urbaszek
> > Polidea <https://www.polidea.com/> | Junior Software Engineer
> > M: +48 505 628 493
> > E: tomasz.urbas...@polidea.com