Hi everyone,

Is there any ongoing discussion or documentation on the redesign of sinks?
I think it could be a good thing to abstract away the underlying
streaming model; however, that isn't directly related to Holden's
first point. The way I understand that point, the idea is to slightly
change the DataStreamWriter API (the thing returned when you call
"df.writeStream") so that it accepts a custom sink provider instance
instead of only strings. This would allow users to write their own
providers and sinks, and give them a strongly typed, possibly generic
way to do so. The Sink API is already reachable by users, just
indirectly (you can create your own sink provider and load it through
the built-in DataSource reflection machinery), so I don't quite
understand why exposing it directly through a typed interface should
be delayed until the API is finalized (rough sketch below).
On a side note, I saw that sources have a similar limitation in that
they are currently only available through a stringly-typed interface.
Could a similar solution be applied to sources? Maybe the writer and
reader APIs could even be unified to a certain degree.
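
To make this concrete, here is a rough sketch in Scala (assuming "df" is a
streaming DataFrame; the provider-instance overload of format() and
MyFormattingSinkProvider are hypothetical, only the string-based form
exists today):

  // Today: the provider is named by a string, instantiated via reflection,
  // and all configuration goes through string options.
  df.writeStream
    .format("com.example.MyFormattingSinkProvider")
    .option("path", "/tmp/out")
    .start()

  // Hypothetical typed variant: pass a provider instance directly, so it can
  // be configured with non-string values such as a formatting lambda.
  val provider = new MyFormattingSinkProvider((row: Row) => row.mkString(","))
  df.writeStream
    .format(provider)   // assumed overload, not in the current API
    .start()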

Shivaram, I like your ideas on the proposed redesign! Can we discuss
this further?

cheers,
--Jakob


On Mon, Sep 26, 2016 at 5:12 PM, Shivaram Venkataraman
<shiva...@eecs.berkeley.edu> wrote:
> Disclaimer - I am not very closely involved with Structured Streaming
> design / development, so this is just my two cents from looking at the
> discussion in the linked JIRAs and PRs.
>
> It seems to me there are a couple of issues being conflated here: (a)
> the question of how to specify or add more functionality to the
> Sink API, such as the ability to get model updates back to the driver
> [a design issue IMHO], and (b) the question of how to pass parameters
> to DataFrameWriter, esp. strings vs. typed objects and whether the API
> is stable vs. experimental.
>
> TL;DR: I think we should first focus on refactoring the Sink API and
> add new functionality after that. Detailed comments below.
>
> Sink design / functionality: Looking at SPARK-10815, a JIRA linked
> from SPARK-16407, it looks like the existing Sink API is limited
> because it is tied to the RDD/DataFrame definitions. It also has
> surprising limitations like not being able to run operators on `data`
> and only using `collect/foreach`.  Given these limitations, I think it
> makes sense to redesign the Sink API first *before* adding new
> functionality to the existing Sink. I understand that we have not
> marked this experimental in 2.0.0 -- but since Structured Streaming
> is new as a whole, we can probably break the Sink API in an upcoming
> 2.1.0 release.
>
> As a part of the redesign, I think we need to do two things: (i) come
> up with a new data handle that separates the RDD from what is passed
> to the Sink, and (ii) have some way to specify code that can run on
> the driver. The second might not be an issue if the data handle
> already has a clean abstraction for this.
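>
> To illustrate, a rough sketch (the first trait is approximately what ships
> today, from memory; the rest is made-up purely to show points (i) and (ii)):
>
>   // Existing Sink API, approximately (Spark 2.0):
>   trait Sink {
>     def addBatch(batchId: Long, data: DataFrame): Unit
>   }
>
>   // Hypothetical redesign: a data handle that is not an RDD/DataFrame,
>   // plus an explicit place for code that must run on the driver.
>   trait StreamOutput {                        // made-up name
>     def foreach(f: Row => Unit): Unit
>     def collectToDriver(): Seq[Row]
>   }
>   trait RedesignedSink {                      // made-up name
>     def addData(output: StreamOutput): Unit   // may run as part of a job
>     def onDriver(output: StreamOutput): Unit  // guaranteed to run on the driver
>   }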
>
> Micro-batching: Ideally it would be good to not expose the micro-batch
> processing model in the Sink API as this might change going forward.
> Given the consistency model we are presenting, I think there will be
> some notion of a batch / time-range identifier in the API. But if we
> can avoid hard constraints on where functions will get run (i.e. on
> the driver vs. as part of a job, etc.) and when they will get run
> (i.e. strictly after every micro-batch), it might give us more
> freedom to improve performance going forward [1].
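>
> For example (again made-up names): keying the output by a time range rather
> than a batch id, and documenting no guarantees about where or exactly when
> the call happens, would leave room for changing the execution model later:
>
>   // Hypothetical: the sink sees an interval, not a micro-batch id.
>   def addData(interval: (Long, Long), output: StreamOutput): Unit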
>
> Parameter passing: I think your point that typed is better than
> untyped is a good one, and supporting both APIs isn't necessarily bad
> either. My understanding of the discussion around this is that we
> should do this after the Sink is refactored, to avoid exposing the
> old APIs?
>
> Thanks
> Shivaram
>
> [1] FWIW this is something I am looking at and
> https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/
> has some details about this.
>
>
> On Mon, Sep 26, 2016 at 1:38 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>> Hi Spark Developers,
>>
>>
>> After some discussion on SPARK-16407 (and on the PR) we’ve decided to jump
>> back to the developer list (SPARK-16407 itself comes from our early work on
>> SPARK-16424 to enable ML with the new Structured Streaming API). SPARK-16407
>> is proposing to extend the current DataStreamWriter API to allow users to
>> specify a specific instance of a StreamSinkProvider - this makes it easier
>> for users to create sinks that are configured with things besides strings
>> (for example, lambdas). An example of something like this already
>> inside Spark is the ForeachSink.
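>>
>> For reference, ForeachSink is already configured with an object rather than
>> strings, via foreach(...) on DataStreamWriter (signatures from memory, so
>> treat as approximate):
>>
>>   import org.apache.spark.sql.{ForeachWriter, Row}
>>
>>   df.writeStream.foreach(new ForeachWriter[Row] {
>>     def open(partitionId: Long, version: Long): Boolean = true
>>     def process(value: Row): Unit = println(value)  // arbitrary user code
>>     def close(errorCause: Throwable): Unit = ()
>>   }).start()
>>
>> SPARK-16407 would let code outside of Spark do the same kind of thing with
>> its own StreamSinkProvider instance.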
>>
>>
>> We have been working on adding support for online learning in Structured
>> Streaming, similar to what Spark Streaming and MLlib provide today. Details
>> are available in SPARK-16424. Along the way, we noticed that there is
>> currently no way for code running in the driver to access the streaming
>> output of a Structured Streaming query (in our case ideally as a Dataset or
>> RDD - but regardless of the underlying data structure). In our specific
>> case, we wanted to update a model in the driver using aggregates computed by
>> a Structured Streaming query.
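>>
>> As a sketch of the kind of code we would like to be able to write (the
>> sink(...) method and the driverSideModel object are hypothetical, and note
>> that only collect/foreach on `data` work today):
>>
>>   // Compute streaming aggregates, then fold each batch of results into a
>>   // model object that lives on the driver.
>>   val aggregates = inputStream.groupBy("label").agg(avg("feature"))
>>   aggregates.writeStream
>>     .sink(new Sink {                          // hypothetical: pass an instance
>>       override def addBatch(batchId: Long, data: DataFrame): Unit = {
>>         driverSideModel.update(data.collect())
>>       }
>>     })
>>     .outputMode("complete")
>>     .start()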
>>
>>
>> A lot of other applications are going to have similar requirements. For
>> example, there is no way (outside of using private Spark internals)* to
>> implement a console sink with a user-supplied formatting function, or
>> configure a templated or generic sink at runtime, trigger a custom Python
>> callback, or even implement the ForeachSink outside of Spark. For work
>> inside of Spark to enable Structured Streaming with ML we clearly don’t need
>> SPARK-16407, as we can directly access the internals (although it would be
>> cleaner not to have to), but if we want to empower people working outside of
>> the Spark codebase itself with Structured Streaming, I think we need to
>> provide some mechanism for this, and it would be great to see what
>> options/ideas the community can come up with.
>>
>>
>> One of the arguments against SPARK-16407 seems to be mostly that it exposes
>> the Sink API, which is implemented using micro-batching, but the
>> counter-argument to this is that the Sink API is already exposed (instead of
>> passing in an instance, the user passes in a class name, which is then
>> instantiated through reflection and receives its configuration parameters as
>> a map of strings).
>>
>>
>> Personally I think we should expose a more nicely typed API instead of
>> depending on Strings for all configuration, and that even if the Sink API
>> itself needs to change if/when Structured Streaming moves away from
>> micro-batching, we would still likely want to offer the typed interface as
>> well, to give Sink creators more flexibility with configuration.
>>
>>
>> Now obviously this is based on my understanding of the lay of the land, which
>> could be a little off since the Spark Structured Streaming design docs and
>> JIRAs don’t seem to be actively updated - so I’d love to know what
>> assumptions I’ve made that don’t match the current plans for Structured
>> Streaming.
>>
>>
>> Cheers,
>>
>>
>> Holden :)
>>
>>
>> Related Links:
>>
>> The JIRA for this proposal https://issues.apache.org/jira/browse/SPARK-16407
>>
>> The Structured Streaming ML JIRA
>> https://issues.apache.org/jira/browse/SPARK-16424
>>
>> https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing
>>
>> https://github.com/apache/spark/pull/14691
>>
>> https://github.com/holdenk/spark-structured-streaming-ml
>>
>>
>> *Strictly speaking one _could_ pass in a string of Java code and then
>> compile it inside the Sink with Janino - but that clearly isn’t reasonable.
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>