On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>
>> Reviving this thread. I think SDF is a pretty big risk for Spark runner
>> streaming. Holden, is it correct that Spark appears to have no way at all
>> to produce an infinite DStream from a finite RDD? Maybe we can somehow
>> dynamically create a new DStream for every initial restriction, said
>> DStream being obtained using a Receiver that under the hood actually runs
>> the SDF? (This is of course less efficient than what a timer-capable
>> runner would do, and I have doubts about the fault tolerance.)
>
> So on the streaming side we could simply do it with a fixed number of
> levels of DStreams. It's not great but it would work.

Not sure I understand this. Let me try to clarify what SDF demands of the
runner. Imagine the following case: a file contains a list of "master" Kafka
topics, on which the names of additional Kafka topics to read are published.

  PCollection<String> masterTopics = TextIO.read().from(masterTopicsFile);
  PCollection<String> nestedTopics = masterTopics.apply(ParDo.of(new ReadFromKafkaFn()));
  PCollection<String> records = nestedTopics.apply(ParDo.of(new ReadFromKafkaFn()));

This exemplifies both use cases of a streaming SDF that emits infinite output
for every input:
- Applying it to a finite set of inputs (in this case, the result of reading
  a text file)
- Applying it to an infinite set of inputs (i.e. an unbounded number of
  streams being read concurrently, where each stream is itself unbounded)

Does the multi-level solution you have in mind work for this case? I suppose
the second case is harder, so we can focus on that.

> More generally, this does raise an important question: if we want to target
> Datasets instead of RDDs/DStreams, I would need to do some more poking.
>
>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>>
>>> How would timers be implemented? By outputting and reprocessing, the
>>> same way you proposed for SDF?
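To make the runner contract concrete, here is a minimal toy model in Python of what an unbounded SDF demands (all names are made up for illustration; this uses no real Beam or Kafka APIs). Each input element carries a restriction (here, just an offset); a single invocation emits a bounded batch and returns a residual restriction, and the runner must keep rescheduling residuals forever -- which is exactly what a finite-RDD-to-infinite-DStream mapping has to express somehow.

```python
from collections import deque

def read_chunk(topic, offset, max_records=3):
    """One SDF invocation: emit up to max_records, return the residual offset.

    Stands in for reading a bounded chunk of an unbounded Kafka topic.
    """
    records = [f"{topic}:{i}" for i in range(offset, offset + max_records)]
    return records, offset + max_records  # residual restriction: next offset

def run(initial_topics, rounds=4):
    """Toy 'runner' holding a work queue of (topic, offset) restrictions.

    A real streaming runner would never drain this queue: every residual is
    rescheduled, because each source is unbounded. We cap iterations here
    only so the sketch terminates.
    """
    work = deque((t, 0) for t in initial_topics)
    out = []
    for _ in range(rounds * len(initial_topics)):
        topic, offset = work.popleft()
        records, residual = read_chunk(topic, offset)
        out.extend(records)
        work.append((topic, residual))  # never finishes: unbounded source
    return out

records = run(["master-a", "master-b"], rounds=2)
```

The second use case above corresponds to `initial_topics` itself growing without bound while each entry also never finishes.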
> I mean the timers could be inside the mappers within the system. Could use
> a singleton so if a partition is re-executed it doesn't end up as a
> straggler.
>
>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>>> So the timers would have to be in our own code.
>>>>
>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>>
>>>>> Does Spark have support for timers? (I know it has support for state.)
>>>>>
>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> Could we alternatively use a state mapping function to keep track of
>>>>>> the computation so far instead of outputting V each time? (Also, the
>>>>>> progress so far is probably of a different type R rather than V.)
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>
>>>>>>> So we had a quick chat about what it would take to add something
>>>>>>> like SplittableDoFns to Spark. I'd done some sketchy thinking about
>>>>>>> this last year but didn't get very far.
>>>>>>>
>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>> For input type T, output type V:
>>>>>>>
>>>>>>> Implement a mapper which outputs type (T, V); if the computation
>>>>>>> finishes, V will be populated, otherwise T will be.
>>>>>>>
>>>>>>> To determine how long to run, we'd go up to either K seconds or
>>>>>>> listen for a signal on a port.
>>>>>>>
>>>>>>> Once we're done running, we take the result, filter the elements
>>>>>>> with T and with V into separate collections, re-run on the
>>>>>>> unfinished ones until finished, and then union the results.
>>>>>>>
>>>>>>> This is maybe not a great design, but it was minimally complicated,
>>>>>>> and I figured terrible was a good place to start and improve from.
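My reading of that design, as a runnable Python sketch (names invented; the "K seconds or port signal" budget is simplified to a per-pass step budget, which is an assumption on my part). Each element is either unfinished work (T populated) or a finished result (V populated); a pass advances everything by the budget, then we split, re-run the leftovers, and union the finished pieces:

```python
def step(state, budget=2):
    """Advance a toy computation (counting up to a target) by up to `budget`."""
    remaining, acc = state
    work = min(remaining, budget)
    return (remaining - work, acc + work)

def run_pass(elements, budget=2):
    """One bounded execution: split output into done (V) and todo (T)."""
    done, todo = [], []
    for state in elements:
        remaining, acc = step(state, budget)
        if remaining == 0:
            done.append(("done", acc))     # V populated: computation finished
        else:
            todo.append((remaining, acc))  # T populated: re-run next pass
    return done, todo

def run_to_completion(inputs):
    """Re-run the unfinished collection until empty, unioning results."""
    results, todo = [], [(n, 0) for n in inputs]
    while todo:
        done, todo = run_pass(todo)
        results.extend(done)               # union of each pass's finished work
    return results

out = run_to_completion([1, 3, 5])
```

Note this loop only terminates for bounded work, which is exactly the gap the earlier part of the thread is about: for an unbounded SDF the todo collection never empties.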
>>>>>>> Let me know your thoughts, especially the parts where this is worse
>>>>>>> than I remember, because it's been a while since I thought about
>>>>>>> this.
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau