On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>
>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>
>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>> runner streaming. Holden, is it correct that Spark appears to have
>>>> no way at all to produce an infinite DStream from a finite RDD?
>>>> Maybe we can somehow dynamically create a new DStream for every
>>>> initial restriction, said DStream being obtained using a Receiver
>>>> that under the hood actually runs the SDF? (This is of course less
>>>> efficient than what a timer-capable runner would do, and I have
>>>> doubts about the fault tolerance.)
>>>
>>> So on the streaming side we could simply do it with a fixed number
>>> of levels of DStreams. It's not great, but it would work.
>>
>> Not sure I understand this. Let me try to clarify what SDF demands of
>> the runner. Imagine the following case: a file contains a list of
>> "master" Kafka topics, on which the names of additional Kafka topics
>> to read are published:
>>
>> PCollection<String> masterTopics = p.apply(TextIO.read().from(masterTopicsFile));
>> PCollection<String> nestedTopics = masterTopics.apply(ParDo.of(new ReadFromKafkaFn()));
>> PCollection<String> records = nestedTopics.apply(ParDo.of(new ReadFromKafkaFn()));
>>
>> This exemplifies both use cases of a streaming SDF that emits
>> infinite output for every input:
>> - Applying it to a finite set of inputs (in this case, to the result
>>   of reading a text file)
>> - Applying it to an infinite set of inputs (i.e. having an unbounded
>>   number of streams being read concurrently, each of which is itself
>>   unbounded too)
>>
>> Does the multi-level solution you have in mind work for this case? I
>> suppose the second case is harder, so we can focus on that.
>
> So none of those are a SplittableDoFn, right?

Not sure what you mean? ReadFromKafkaFn in these examples is a
splittable DoFn, and we're trying to figure out how to make Spark run
it.

> Assuming that we have a given DStream, though, in Spark we can get
> the underlying RDD implementation for each microbatch and do our work
> inside of that.
>
>>> More generally, this does raise an important question of whether we
>>> want to target Datasets instead of RDDs/DStreams, in which case I
>>> would need to do some more poking.
>>>
>>>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> How would timers be implemented? By outputting and reprocessing,
>>>>> the same way you proposed for SDF?
>>>
>>> I mean the timers could be inside the mappers within the system.
>>> Could use a singleton so that if a partition is re-executed it
>>> doesn't end up as a straggler.
>>>
>>>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>
>>>>>> So the timers would have to be in our own code.
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>>>>
>>>>>>> Does Spark have support for timers? (I know it has support for
>>>>>>> state.)
>>>>>>>
>>>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> Could we alternatively use a state mapping function to keep
>>>>>>>> track of the computation so far, instead of outputting V each
>>>>>>>> time? (Also, the progress so far is probably of a different
>>>>>>>> type R rather than V.)
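To make the state-mapping idea concrete: below is a minimal sketch of
what it could look like with Spark Streaming's mapWithState in the
Java API. Heavy assumptions baked in: the progress R is simplified to
a Long offset per restriction, the input DStream is assumed to carry
one "tick" per restriction per micro-batch, and pollOnce() is a
hypothetical stand-in for "advance the SDF a bit from its last
checkpoint" -- this is not how any runner actually does it.

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class SdfProgressStateSketch {
  // Hypothetical stand-in for "run the SDF a bit further from
  // resumeFrom" and return the new progress R (here a Long offset).
  static long pollOnce(String restrictionId, long resumeFrom) {
    return resumeFrom; // placeholder
  }

  // Keep the progress R in Spark state across micro-batches instead
  // of re-emitting the full V each time.
  static JavaPairDStream<String, Long> trackProgress(
      JavaPairDStream<String, Long> ticks) {
    Function3<String, Optional<Long>, State<Long>, Long> fn =
        (restrictionId, tick, state) -> {
          long resumeFrom = state.exists() ? state.get() : 0L;
          long next = pollOnce(restrictionId, resumeFrom);
          state.update(next); // persist R, not the emitted output
          return next;
        };
    return ticks.mapWithState(StateSpec.function(fn)).stateSnapshots();
  }
}

The gap is still timers: mapWithState gives us state plus an idle
timeout (StateSpec.timeout), but nothing like Beam's timer firings, so
as discussed above the "when to resume" logic would have to live in
our own code.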
>>>>>>>>
>>>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>>>
>>>>>>>>> So we had a quick chat about what it would take to add
>>>>>>>>> something like SplittableDoFns to Spark. I'd done some sketchy
>>>>>>>>> thinking about this last year but didn't get very far.
>>>>>>>>>
>>>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>>>> For input type T
>>>>>>>>> Output type V
>>>>>>>>>
>>>>>>>>> Implement a mapper which outputs type (T, V): if the
>>>>>>>>> computation finishes, V will be populated; otherwise, T will
>>>>>>>>> be.
>>>>>>>>>
>>>>>>>>> For determining how long to run, we'd run up to either K
>>>>>>>>> seconds or until we hear a signal on a port.
>>>>>>>>>
>>>>>>>>> Once we're done running, we take the result and filter the
>>>>>>>>> ones with T and the ones with V into separate collections,
>>>>>>>>> re-run the T collection until everything is finished, and then
>>>>>>>>> union the results.
>>>>>>>>>
>>>>>>>>> This is maybe not a great design, but it was minimally
>>>>>>>>> complicated, and I figured terrible was a good place to start
>>>>>>>>> and improve from.
>>>>>>>>>
>>>>>>>>> Let me know your thoughts, especially the parts where this is
>>>>>>>>> worse than I remember, because it's been a while since I
>>>>>>>>> thought about this.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
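For what it's worth, here is a minimal batch-side sketch of that
filter/re-run/union loop in Spark's Java API, again under heavy
assumptions: T and V are both String for simplicity, runForAWhile()
is a hypothetical stand-in for "run the element for up to K seconds"
(the listen-on-a-port variant is omitted), and a null slot in the
(T, V) pair marks which side is populated.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SdfBatchLoopSketch {
  // Hypothetical: process `input` for up to maxMillis. Returns
  // (leftoverT, null) if time ran out, or (null, outputV) if done.
  static Tuple2<String, String> runForAWhile(String input, long maxMillis) {
    return new Tuple2<>(null, input); // placeholder: pretend we finished
  }

  static JavaRDD<String> runToCompletion(JavaSparkContext sc,
                                         JavaRDD<String> input) {
    List<JavaRDD<String>> finished = new ArrayList<>();
    JavaRDD<String> remaining = input;
    while (!remaining.isEmpty()) {
      // One bounded round; K = 10 seconds here. Cache so the two
      // filters below don't recompute the work.
      JavaRDD<Tuple2<String, String>> step =
          remaining.map(t -> runForAWhile(t, 10_000L)).cache();
      // Finished elements carry V; unfinished ones carry leftover T.
      finished.add(step.filter(p -> p._2() != null).map(p -> p._2()));
      remaining = step.filter(p -> p._1() != null).map(p -> p._1());
    }
    // Union all the per-round outputs into the final result.
    JavaRDD<String> result = sc.emptyRDD();
    for (JavaRDD<String> part : finished) {
      result = result.union(part);
    }
    return result;
  }
}

An obvious wart: isEmpty() forces a job per round and the lineage
grows with each iteration, so a real version would want to checkpoint
periodically -- but it matches the "terrible but simple" starting
point described above.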