On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com> wrote:
> Reviving this thread. I think SDF is a pretty big risk for Spark runner > streaming. Holden, is it correct that Spark appears to have no way at all > to produce an infinite DStream from a finite RDD? Maybe we can somehow > dynamically create a new DStream for every initial restriction, said > DStream being obtained using a Receiver that under the hood actually runs > the SDF? (this is of course less efficient than a timer-capable runner > would do, and I have doubts about the fault tolerance) > So on the streaming side we could simply do it with a fixed number of levels on DStreams. It’s not great but it would work. More generally this does raise an important question if we want to target datasets instead of rdds/DStreams in which case i would need to do some more poking. > On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote: > >> How would timers be implemented? By outputing and reprocessing, the same >> way you proposed for SDF? >> > i mean the timers could be inside the mappers within the system. Could use a singleton so if a partition is re-executed it doesn’t end up as a straggler. > >> >> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca> >> wrote: >> >>> So the timers would have to be in our own code. >>> >>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com> >>> wrote: >>> >>>> Does Spark have support for timers? (I know it has support for state) >>>> >>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote: >>>> >>>>> Could we alternatively use a state mapping function to keep track of >>>>> the computation so far instead of outputting V each time? (also the >>>>> progress so far is probably of a different type R rather than V). >>>>> >>>>> >>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca> >>>>> wrote: >>>>> >>>>>> So we had a quick chat about what it would take to add something like >>>>>> SplittableDoFns to Spark. I'd done some sketchy thinking about this last >>>>>> year but didn't get very far. >>>>>> >>>>>> My back-of-the-envelope design was as follows: >>>>>> For input type T >>>>>> Output type V >>>>>> >>>>>> Implement a mapper which outputs type (T, V) >>>>>> and if the computation finishes T will be populated otherwise V will >>>>>> be >>>>>> >>>>>> For determining how long to run we'd up to either K seconds or listen >>>>>> for a signal on a port >>>>>> >>>>>> Once we're done running we take the result and filter for the ones >>>>>> with T and V into seperate collections re-run until finished >>>>>> and then union the results >>>>>> >>>>>> >>>>>> This is maybe not a great design but it was minimally complicated and >>>>>> I figured terrible was a good place to start and improve from. >>>>>> >>>>>> >>>>>> Let me know your thoughts, especially the parts where this is worse >>>>>> than I remember because its been awhile since I thought about this. >>>>>> >>>>>> >>>>>> -- >>>>>> Twitter: https://twitter.com/holdenkarau >>>>>> >>>>> -- >>> Twitter: https://twitter.com/holdenkarau >>> >> -- Twitter: https://twitter.com/holdenkarau