Re: Splittable DoFN in Spark discussion

Holden Karau Fri, 23 Mar 2018 15:13:13 -0700

On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com>
wrote:


> Reviving this thread. I think SDF is a pretty big risk for Spark runner
> streaming. Holden, is it correct that Spark appears to have no way at all
> to produce an infinite DStream from a finite RDD? Maybe we can somehow
> dynamically create a new DStream for every initial restriction, said
> DStream being obtained using a Receiver that under the hood actually runs
> the SDF? (this is of course less efficient than a timer-capable runner
> would do, and I have doubts about the fault tolerance)
>
So on the streaming side we could simply do it with a fixed number of
levels on DStreams. It’s not great but it would work.

More generally this does raise an important question if we want to target
datasets instead of rdds/DStreams in which case i would need to do some
more poking.


> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>
>> How would timers be implemented? By outputing and reprocessing, the same
>> way you proposed for SDF?
>>
> i mean the timers could be inside the mappers within the system. Could use
a singleton so if a partition is re-executed it doesn’t end up as a
straggler.

>
>>
>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> So the timers would have to be in our own code.
>>>
>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Does Spark have support for timers? (I know it has support for state)
>>>>
>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> Could we alternatively use a state mapping function to keep track of
>>>>> the computation so far instead of outputting V each time? (also the
>>>>> progress so far is probably of a different type R rather than V).
>>>>>
>>>>>
>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> So we had a quick chat about what it would take to add something like
>>>>>> SplittableDoFns to Spark. I'd done some sketchy thinking about this last
>>>>>> year but didn't get very far.
>>>>>>
>>>>>> My back-of-the-envelope design was as follows:
>>>>>> For input type T
>>>>>> Output type V
>>>>>>
>>>>>> Implement a mapper which outputs type (T, V)
>>>>>> and if the computation finishes T will be populated otherwise V will
>>>>>> be
>>>>>>
>>>>>> For determining how long to run we'd up to either K seconds or listen
>>>>>> for a signal on a port
>>>>>>
>>>>>> Once we're done running we take the result and filter for the ones
>>>>>> with T and V into seperate collections re-run until finished
>>>>>> and then union the results
>>>>>>
>>>>>>
>>>>>> This is maybe not a great design but it was minimally complicated and
>>>>>> I figured terrible was a good place to start and improve from.
>>>>>>
>>>>>>
>>>>>> Let me know your thoughts, especially the parts where this is worse
>>>>>> than I remember because its been awhile since I thought about this.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
Twitter: https://twitter.com/holdenkarau

Re: Splittable DoFN in Spark discussion

Reply via email to