On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>
>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>
>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>> runner streaming. Holden, is it correct that Spark appears to have
>>>> no way at all to produce an infinite DStream from a finite RDD?
>>>> Maybe we can somehow dynamically create a new DStream for every
>>>> initial restriction, said DStream being obtained using a Receiver
>>>> that under the hood actually runs the SDF? (This is of course less
>>>> efficient than what a timer-capable runner would do, and I have
>>>> doubts about the fault tolerance.)
>>>
>>> So on the streaming side we could simply do it with a fixed number
>>> of levels of DStreams. It's not great, but it would work.
>>
>> Not sure I understand this. Let me try to clarify what SDF demands of
>> the runner. Imagine the following case: a file contains a list of
>> "master" Kafka topics, on which the names of additional Kafka topics
>> to read are published:
>>
>> PCollection<String> masterTopics = p.apply(TextIO.read().from(masterTopicsFile));
>> PCollection<String> nestedTopics = masterTopics.apply(ParDo.of(new ReadFromKafkaFn()));
>> PCollection<String> records = nestedTopics.apply(ParDo.of(new ReadFromKafkaFn()));
>>
>> This exemplifies both use cases of a streaming SDF that emits
>> infinite output for every input:
>> - Applying it to a finite set of inputs (in this case, to the result
>>   of reading a text file)
>> - Applying it to an infinite set of inputs (i.e. having an unbounded
>>   number of streams being read concurrently, each of which is itself
>>   unbounded too)
>>
>> Does the multi-level solution you have in mind work for this case? I
>> suppose the second case is harder, so we can focus on that.
>
> So none of those are a SplittableDoFn, right?

Not sure what you mean? ReadFromKafkaFn in these examples is a
splittable DoFn, and we're trying to figure out how to make Spark run
it.

> Assuming that we have a given DStream, though, in Spark we can get
> the underlying RDD implementation for each microbatch and do our work
> inside of that.
>
>>> More generally, this does raise an important question of whether we
>>> want to target Datasets instead of RDDs/DStreams, in which case I
>>> would need to do some more poking.
>>>
>>>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> How would timers be implemented? By outputting and reprocessing,
>>>>> the same way you proposed for SDF?
>>>
>>> I mean the timers could be inside the mappers within the system.
>>> Could use a singleton so that if a partition is re-executed it
>>> doesn't end up as a straggler.
>>>
>>>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>
>>>>>> So the timers would have to be in our own code.
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>>>>
>>>>>>> Does Spark have support for timers? (I know it has support for
>>>>>>> state.)
>>>>>>>
>>>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>>>>>>>
>>>>>>>> Could we alternatively use a state mapping function to keep
>>>>>>>> track of the computation so far, instead of outputting V each
>>>>>>>> time? (Also, the progress so far is probably of a different
>>>>>>>> type R rather than V.)
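To make the state-mapping idea concrete: below is a minimal sketch of
what it could look like with Spark Streaming's mapWithState in the
Java API. Heavy assumptions baked in: the progress R is simplified to
a Long offset per restriction, the input DStream is assumed to carry
one "tick" per restriction per micro-batch, and pollOnce() is a
hypothetical stand-in for "advance the SDF a bit from its last
checkpoint" -- this is not how any runner actually does it.

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class SdfProgressStateSketch {
  // Hypothetical stand-in for "run the SDF a bit further from
  // resumeFrom" and return the new progress R (here a Long offset).
  static long pollOnce(String restrictionId, long resumeFrom) {
    return resumeFrom; // placeholder
  }

  // Keep the progress R in Spark state across micro-batches instead
  // of re-emitting the full V each time.
  static JavaPairDStream<String, Long> trackProgress(
      JavaPairDStream<String, Long> ticks) {
    Function3<String, Optional<Long>, State<Long>, Long> fn =
        (restrictionId, tick, state) -> {
          long resumeFrom = state.exists() ? state.get() : 0L;
          long next = pollOnce(restrictionId, resumeFrom);
          state.update(next); // persist R, not the emitted output
          return next;
        };
    return ticks.mapWithState(StateSpec.function(fn)).stateSnapshots();
  }
}

The gap is still timers: mapWithState gives us state plus an idle
timeout (StateSpec.timeout), but nothing like Beam's timer firings, so
as discussed above the "when to resume" logic would have to live in
our own code.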
>>>>>>>>
>>>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>>>
>>>>>>>>> So we had a quick chat about what it would take to add
>>>>>>>>> something like SplittableDoFns to Spark. I'd done some sketchy
>>>>>>>>> thinking about this last year but didn't get very far.
>>>>>>>>>
>>>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>>>> For input type T
>>>>>>>>> Output type V
>>>>>>>>>
>>>>>>>>> Implement a mapper which outputs type (T, V): if the
>>>>>>>>> computation finishes, V will be populated; otherwise, T will
>>>>>>>>> be.
>>>>>>>>>
>>>>>>>>> For determining how long to run, we'd run up to either K
>>>>>>>>> seconds or until we hear a signal on a port.
>>>>>>>>>
>>>>>>>>> Once we're done running, we take the result and filter the
>>>>>>>>> ones with T and the ones with V into separate collections,
>>>>>>>>> re-run the T collection until everything is finished, and then
>>>>>>>>> union the results.
>>>>>>>>>
>>>>>>>>> This is maybe not a great design, but it was minimally
>>>>>>>>> complicated, and I figured terrible was a good place to start
>>>>>>>>> and improve from.
>>>>>>>>>
>>>>>>>>> Let me know your thoughts, especially the parts where this is
>>>>>>>>> worse than I remember, because it's been a while since I
>>>>>>>>> thought about this.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
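For what it's worth, here is a minimal batch-side sketch of that
filter/re-run/union loop in Spark's Java API, again under heavy
assumptions: T and V are both String for simplicity, runForAWhile()
is a hypothetical stand-in for "run the element for up to K seconds"
(the listen-on-a-port variant is omitted), and a null slot in the
(T, V) pair marks which side is populated.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SdfBatchLoopSketch {
  // Hypothetical: process `input` for up to maxMillis. Returns
  // (leftoverT, null) if time ran out, or (null, outputV) if done.
  static Tuple2<String, String> runForAWhile(String input, long maxMillis) {
    return new Tuple2<>(null, input); // placeholder: pretend we finished
  }

  static JavaRDD<String> runToCompletion(JavaSparkContext sc,
                                         JavaRDD<String> input) {
    List<JavaRDD<String>> finished = new ArrayList<>();
    JavaRDD<String> remaining = input;
    while (!remaining.isEmpty()) {
      // One bounded round; K = 10 seconds here. Cache so the two
      // filters below don't recompute the work.
      JavaRDD<Tuple2<String, String>> step =
          remaining.map(t -> runForAWhile(t, 10_000L)).cache();
      // Finished elements carry V; unfinished ones carry leftover T.
      finished.add(step.filter(p -> p._2() != null).map(p -> p._2()));
      remaining = step.filter(p -> p._1() != null).map(p -> p._1());
    }
    // Union all the per-round outputs into the final result.
    JavaRDD<String> result = sc.emptyRDD();
    for (JavaRDD<String> part : finished) {
      result = result.union(part);
    }
    return result;
  }
}

An obvious wart: isEmpty() forces a job per round and the lineage
grows with each iteration, so a real version would want to checkpoint
periodically -- but it matches the "terrible but simple" starting
point described above.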