On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>
>> Reviving this thread. I think SDF is a pretty big risk for Spark runner
>> streaming. Holden, is it correct that Spark appears to have no way at all
>> to produce an infinite DStream from a finite RDD? Maybe we can somehow
>> dynamically create a new DStream for every initial restriction, said
>> DStream being obtained using a Receiver that under the hood actually runs
>> the SDF? (This is of course less efficient than what a timer-capable
>> runner would do, and I have doubts about the fault tolerance.)
>
> So on the streaming side we could simply do it with a fixed number of
> levels of DStreams. It's not great but it would work.

Not sure I understand this. Let me try to clarify what SDF demands of the
runner. Imagine the following case: a file contains a list of "master" Kafka
topics, on which the names of additional Kafka topics to read are published.

  PCollection<String> masterTopics = TextIO.read().from(masterTopicsFile);
  PCollection<String> nestedTopics = masterTopics.apply(ParDo.of(new ReadFromKafkaFn()));
  PCollection<String> records = nestedTopics.apply(ParDo.of(new ReadFromKafkaFn()));

This exemplifies both use cases of a streaming SDF that emits infinite output
for every input:
- Applying it to a finite set of inputs (in this case, the result of reading
  a text file)
- Applying it to an infinite set of inputs (i.e. an unbounded number of
  streams being read concurrently, where each stream is itself unbounded)

Does the multi-level solution you have in mind work for this case? I suppose
the second case is harder, so we can focus on that.

> More generally, this does raise an important question: if we want to target
> Datasets instead of RDDs/DStreams, I would need to do some more poking.
>
>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>>
>>> How would timers be implemented? By outputting and reprocessing, the
>>> same way you proposed for SDF?
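To make the runner contract concrete, here is a minimal toy model in Python of what an unbounded SDF demands (all names are made up for illustration; this uses no real Beam or Kafka APIs). Each input element carries a restriction (here, just an offset); a single invocation emits a bounded batch and returns a residual restriction, and the runner must keep rescheduling residuals forever -- which is exactly what a finite-RDD-to-infinite-DStream mapping has to express somehow.

```python
from collections import deque

def read_chunk(topic, offset, max_records=3):
    """One SDF invocation: emit up to max_records, return the residual offset.

    Stands in for reading a bounded chunk of an unbounded Kafka topic.
    """
    records = [f"{topic}:{i}" for i in range(offset, offset + max_records)]
    return records, offset + max_records  # residual restriction: next offset

def run(initial_topics, rounds=4):
    """Toy 'runner' holding a work queue of (topic, offset) restrictions.

    A real streaming runner would never drain this queue: every residual is
    rescheduled, because each source is unbounded. We cap iterations here
    only so the sketch terminates.
    """
    work = deque((t, 0) for t in initial_topics)
    out = []
    for _ in range(rounds * len(initial_topics)):
        topic, offset = work.popleft()
        records, residual = read_chunk(topic, offset)
        out.extend(records)
        work.append((topic, residual))  # never finishes: unbounded source
    return out

records = run(["master-a", "master-b"], rounds=2)
```

The second use case above corresponds to `initial_topics` itself growing without bound while each entry also never finishes.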
> I mean the timers could be inside the mappers within the system. Could use
> a singleton so if a partition is re-executed it doesn't end up as a
> straggler.
>
>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>
>>>> So the timers would have to be in our own code.
>>>>
>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com> wrote:
>>>>
>>>>> Does Spark have support for timers? (I know it has support for state.)
>>>>>
>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> Could we alternatively use a state mapping function to keep track of
>>>>>> the computation so far instead of outputting V each time? (Also, the
>>>>>> progress so far is probably of a different type R rather than V.)
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>>>>
>>>>>>> So we had a quick chat about what it would take to add something
>>>>>>> like SplittableDoFns to Spark. I'd done some sketchy thinking about
>>>>>>> this last year but didn't get very far.
>>>>>>>
>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>> For input type T, output type V:
>>>>>>>
>>>>>>> Implement a mapper which outputs type (T, V); if the computation
>>>>>>> finishes, V will be populated, otherwise T will be.
>>>>>>>
>>>>>>> To determine how long to run, we'd go up to either K seconds or
>>>>>>> listen for a signal on a port.
>>>>>>>
>>>>>>> Once we're done running, we take the result, filter the elements
>>>>>>> with T and with V into separate collections, re-run on the
>>>>>>> unfinished ones until finished, and then union the results.
>>>>>>>
>>>>>>> This is maybe not a great design, but it was minimally complicated,
>>>>>>> and I figured terrible was a good place to start and improve from.
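My reading of that design, as a runnable Python sketch (names invented; the "K seconds or port signal" budget is simplified to a per-pass step budget, which is an assumption on my part). Each element is either unfinished work (T populated) or a finished result (V populated); a pass advances everything by the budget, then we split, re-run the leftovers, and union the finished pieces:

```python
def step(state, budget=2):
    """Advance a toy computation (counting up to a target) by up to `budget`."""
    remaining, acc = state
    work = min(remaining, budget)
    return (remaining - work, acc + work)

def run_pass(elements, budget=2):
    """One bounded execution: split output into done (V) and todo (T)."""
    done, todo = [], []
    for state in elements:
        remaining, acc = step(state, budget)
        if remaining == 0:
            done.append(("done", acc))     # V populated: computation finished
        else:
            todo.append((remaining, acc))  # T populated: re-run next pass
    return done, todo

def run_to_completion(inputs):
    """Re-run the unfinished collection until empty, unioning results."""
    results, todo = [], [(n, 0) for n in inputs]
    while todo:
        done, todo = run_pass(todo)
        results.extend(done)               # union of each pass's finished work
    return results

out = run_to_completion([1, 3, 5])
```

Note this loop only terminates for bounded work, which is exactly the gap the earlier part of the thread is about: for an unbounded SDF the todo collection never empties.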
>>>>>>> Let me know your thoughts, especially the parts where this is worse
>>>>>>> than I remember, because it's been a while since I thought about
>>>>>>> this.
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau