Splittable DoFN in Spark discussion

2018-03-14 Thread Holden Karau
So we had a quick chat about what it would take to add something like SplittableDoFns to Spark. I'd done some sketchy thinking about this last year but didn't get very far. My back-of-the-envelope design was as follows: For input type T Output type V Implement a mapper which outputs type (T, V) a

Re: Splittable DoFN in Spark discussion

2018-03-14 Thread Reuven Lax
Could we alternatively use a state mapping function to keep track of the computation so far instead of outputting V each time? (also the progress so far is probably of a different type R rather than V). On Wed, Mar 14, 2018 at 4:28 PM Holden Karau wrote: > So we had a quick chat about what it w

Re: Splittable DoFN in Spark discussion

2018-03-14 Thread Eugene Kirpichov
Does Spark have support for timers? (I know it has support for state) On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax wrote: > Could we alternatively use a state mapping function to keep track of the > computation so far instead of outputting V each time? (also the progress so > far is probably of a

Re: Splittable DoFN in Spark discussion

2018-03-14 Thread Holden Karau
So the timers would have to be in our own code. On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov wrote: > Does Spark have support for timers? (I know it has support for state) > > On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax wrote: > >> Could we alternatively use a state mapping function to keep

Re: Splittable DoFN in Spark discussion

2018-03-14 Thread Reuven Lax
How would timers be implemented? By outputing and reprocessing, the same way you proposed for SDF? On Wed, Mar 14, 2018 at 7:25 PM Holden Karau wrote: > So the timers would have to be in our own code. > > On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov > wrote: > >> Does Spark have support fo

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Eugene Kirpichov
Reviving this thread. I think SDF is a pretty big risk for Spark runner streaming. Holden, is it correct that Spark appears to have no way at all to produce an infinite DStream from a finite RDD? Maybe we can somehow dynamically create a new DStream for every initial restriction, said DStream being

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Holden Karau
On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov wrote: > Reviving this thread. I think SDF is a pretty big risk for Spark runner > streaming. Holden, is it correct that Spark appears to have no way at all > to produce an infinite DStream from a finite RDD? Maybe we can somehow > dynamically crea

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Eugene Kirpichov
On Fri, Mar 23, 2018 at 6:12 PM Holden Karau wrote: > On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov > wrote: > >> Reviving this thread. I think SDF is a pretty big risk for Spark runner >> streaming. Holden, is it correct that Spark appears to have no way at all >> to produce an infinite DStr

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Holden Karau
On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov wrote: > On Fri, Mar 23, 2018 at 6:12 PM Holden Karau wrote: > >> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov >> wrote: >> >>> Reviving this thread. I think SDF is a pretty big risk for Spark runner >>> streaming. Holden, is it correct that

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Eugene Kirpichov
On Fri, Mar 23, 2018 at 6:49 PM Holden Karau wrote: > On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov > wrote: > >> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau >> wrote: >> >>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov >>> wrote: >>> Reviving this thread. I think SDF is a pretty

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Holden Karau
On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov wrote: > On Fri, Mar 23, 2018 at 6:49 PM Holden Karau wrote: > >> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov >> wrote: >> >>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau >>> wrote: >>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov

Re: Splittable DoFN in Spark discussion

2018-03-24 Thread Eugene Kirpichov
On Fri, Mar 23, 2018, 11:17 PM Holden Karau wrote: > On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov > wrote: > >> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau >> wrote: >> >>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov >>> wrote: >>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau >

Re: Splittable DoFN in Spark discussion

2018-03-24 Thread Holden Karau
On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov wrote: > > > On Fri, Mar 23, 2018, 11:17 PM Holden Karau wrote: > >> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov >> wrote: >> >>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau >>> wrote: >>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpic

Re: Splittable DoFN in Spark discussion

2018-03-25 Thread Thomas Weise
Hopefully the new "continuous processing mode" in Spark will enable SDF implementation (and real streaming)? Thanks, Thomas On Sat, Mar 24, 2018 at 3:22 PM, Holden Karau wrote: > > On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov > wrote: > >> >> >> On Fri, Mar 23, 2018, 11:17 PM Holden Karau

Re: Splittable DoFN in Spark discussion

2018-03-25 Thread Holden Karau
That would certainly be good. On Sun, Mar 25, 2018 at 9:01 PM, Thomas Weise wrote: > Hopefully the new "continuous processing mode" in Spark will enable SDF > implementation (and real streaming)? > > Thanks, > Thomas > > > On Sat, Mar 24, 2018 at 3:22 PM, Holden Karau > wrote: > >> >> On Sat, M

Re: Splittable DoFN in Spark discussion

2018-03-25 Thread Reuven Lax
But this new mode isn't a semantic change, right? It's moving away from micro batches into something that looks a lot like what Flink does - continuous processing with asynchronous snapshot boundaries. On Sun, Mar 25, 2018 at 9:01 PM Thomas Weise wrote: > Hopefully the new "continuous processing

Re: Splittable DoFN in Spark discussion

2018-03-25 Thread Holden Karau
I mean the new mode is very much in the Dataset not the DStream API (although you can use the Dataset API with the old modes too). On Sun, Mar 25, 2018 at 9:11 PM, Reuven Lax wrote: > But this new mode isn't a semantic change, right? It's moving away from > micro batches into something that look

Re: Splittable DoFN in Spark discussion

2018-04-12 Thread Eugene Kirpichov
(resurrecting thread as I'm back from leave) I looked at this mode, and indeed as Reuven points out it seems that it affects execution details, but doesn't offer any new APIs. Holden - your suggestions of piggybacking an unbounded-per-element SDF on top of an infinite stream would work if 1) there

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
Would like to revive this thread one more time. At this point I'm pretty certain that Spark can't support this out of the box and we're gonna have to make changes to Spark. Holden, could you advise who would be some Spark experts (yourself included :) ) who could advise what kind of Spark change

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Kenneth Knowles
I don't think I understand what the limitations of timers are that you are referring to. FWIW I would say implementing other primitives like SDF is an explicit non-goal for Beam state & timers. I got lost at some point in this thread, but is it actually necessary that a bounded PCollection maps to

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
Kenn - I'm arguing that in Spark SDF style computation can not be expressed at all, and neither can Beam's timers. Spark, unlike Flink, does not have a timer facility (only state), and as far as I can tell its programming model has no other primitive that can map a finite RDD into an infinite DStr

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Reuven Lax
Could we do this behind the scenes by writing a Receiver that publishes periodic pings? On Tue, Apr 24, 2018 at 10:09 PM Eugene Kirpichov wrote: > Kenn - I'm arguing that in Spark SDF style computation can not be > expressed at all, and neither can Beam's timers. > > Spark, unlike Flink, does no

Re: Splittable DoFN in Spark discussion

2018-04-26 Thread Holden Karau
Yeah that's been the implied source of being able to be continuous, you union with a receiver which produce an infinite number of batches (the "never ending queue stream" but not actually a queuestream since they have some limitations but our own implementation there of). On Tue, Apr 24, 2018 at 1

Re: Splittable DoFN in Spark discussion

2018-04-27 Thread Kenneth Knowles
I'm still pretty shallow on this topic & this thread, so forgive if I'm restating or missing things. My understanding is that the Spark runner does support Beam's triggering semantics for unbounded aggregations, using the same support code from runners/core that all runners use. Relevant code in S

Re: Splittable DoFN in Spark discussion

2018-04-27 Thread Robert Bradshaw
On Fri, Apr 27, 2018 at 11:56 AM Kenneth Knowles wrote: > I'm still pretty shallow on this topic & this thread, so forgive if I'm restating or missing things. > My understanding is that the Spark runner does support Beam's triggering semantics for unbounded aggregations, using the same support c

Re: Splittable DoFN in Spark discussion

2018-04-27 Thread Kenneth Knowles
On Fri, Apr 27, 2018 at 12:06 PM Robert Bradshaw wrote: > On Fri, Apr 27, 2018 at 11:56 AM Kenneth Knowles wrote: > > > I'm still pretty shallow on this topic & this thread, so forgive if I'm > restating or missing things. > > > My understanding is that the Spark runner does support Beam's trigg

Re: Splittable DoFN in Spark discussion

2018-04-30 Thread Eugene Kirpichov
I think this stuff is happening in SparkGroupAlsoByWindowViaWindowSet: https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java#L610 As far as I can tell, there is no infinite stream of pings involved. How