Hi everyone,

Is there any ongoing discussion or documentation on the redesign of sinks? I think it would be a good thing to abstract away the underlying streaming model; however, that isn't directly related to Holden's first point. The way I understand it, the proposal is to slightly change the DataStreamWriter API (the thing that's returned when you call "df.writeStream") so that it accepts a custom sink provider instance instead of only strings. This would allow users to write their own providers and sinks and give them a strongly typed, possibly generic way to do so (rough sketch below).

The Sink API is already available to users indirectly (you can create your own sink provider and load it through the built-in DataSource reflection machinery), so I don't quite understand why exposing it directly through a typed interface should be delayed until the API is finalized.

On a side note, I noticed that sources have a similar limitation: they are currently only available through a stringly-typed interface. Could a similar solution be applied to sources? Maybe the writer and reader APIs could even be unified to a certain degree.
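To make the difference concrete, here is a rough sketch of what I mean. The provider and sink signatures are paraphrased from the Spark 2.0 sources (so details may be slightly off), the MyConsoleSinkProvider class and the checkpoint path are purely illustrative, and the typed format(provider) overload at the end is hypothetical - it is roughly what I understand SPARK-16407 to be asking for, not something in the current API:

    import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}
    import org.apache.spark.sql.execution.streaming.Sink
    import org.apache.spark.sql.sources.StreamSinkProvider
    import org.apache.spark.sql.streaming.OutputMode

    // Illustrative provider: a console sink with a configurable prefix.
    // Today every option has to be squeezed through Map[String, String];
    // a formatting lambda (Row => String) simply cannot be passed this way.
    class MyConsoleSinkProvider extends StreamSinkProvider {
      override def createSink(
          sqlContext: SQLContext,
          parameters: Map[String, String],
          partitionColumns: Seq[String],
          outputMode: OutputMode): Sink = new Sink {
        private val prefix = parameters.getOrElse("prefix", "")
        override def addBatch(batchId: Long, data: DataFrame): Unit =
          // per SPARK-10815, `data` is only safe to consume via collect/foreach
          data.collect().foreach(row => println(s"$prefix$row"))
      }
    }

    val spark = SparkSession.builder().master("local[2]").appName("sink-sketch").getOrCreate()
    val df = spark.readStream.format("socket").option("host", "localhost").option("port", "9999").load()

    // Today: stringly-typed wiring, the provider class is resolved via reflection.
    val query = df.writeStream
      .format(classOf[MyConsoleSinkProvider].getName)
      .option("prefix", "batch: ")
      .option("checkpointLocation", "/tmp/sink-sketch-checkpoint")  // illustrative path
      .start()
    query.awaitTermination()

    // Roughly what SPARK-16407 proposes (hypothetical overload, does NOT exist today):
    // df.writeStream
    //   .format(new MyConsoleSinkProvider())  // typed configuration, could carry lambdas
    //   .start()

With a typed entry point the provider's constructor could take a Row => String formatter, a callback, or a whole case class of options - none of which fit into a Map[String, String].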
Shivaram, I like your ideas on the proposed redesign! Can we discuss this further?

cheers,
--Jakob

On Mon, Sep 26, 2016 at 5:12 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
> Disclaimer - I am not very closely involved with Structured Streaming design / development, so this is just my two cents from looking at the discussion in the linked JIRAs and PRs.
>
> It seems to me there are a couple of issues being conflated here: (a) the question of how to specify or add more functionality to the Sink API, such as the ability to get model updates back to the driver [a design issue IMHO], and (b) the question of how to pass parameters to DataFrameWriter, esp. strings vs. typed objects, and whether the API is stable vs. experimental.
>
> TL;DR is that I think we should first focus on refactoring the Sink and add new functionality after that. Detailed comments below.
>
> Sink design / functionality: Looking at SPARK-10815, a JIRA linked from SPARK-16407, it looks like the existing Sink API is limited because it is tied to the RDD/DataFrame definitions. It also has surprising limitations, like not being able to run operators on `data` and only using `collect/foreach`. Given these limitations, I think it makes sense to redesign the Sink API first *before* adding new functionality to the existing Sink. I understand that we have not marked this experimental in 2.0.0 -- but since Structured Streaming is new as a whole, we can probably break the Sink API in an upcoming 2.1.0 release.
>
> As a part of the redesign, I think we need to do two things: (i) come up with a new data handle that separates the RDD from what is passed to the Sink, and (ii) have some way to specify code that can run on the driver. The latter might not be an issue if the data handle already has a clean abstraction for this.
>
> Micro-batching: Ideally it would be good not to expose the micro-batch processing model in the Sink API, as this might change going forward. Given the consistency model we are presenting, I think there will be some notion of a batch / time-range identifier in the API. But if we can avoid hard constraints on where functions will get run (i.e. on the driver vs. as part of a job, etc.) and when they will get run (i.e. strictly after every micro-batch), it might give us more freedom to improve performance going forward [1].
>
> Parameter passing: I think your point that typed is better than untyped is a good one, and supporting both APIs isn't necessarily bad either. My understanding of the discussion around this is that we should do this after the Sink is refactored, to avoid exposing the old APIs?
>
> Thanks
> Shivaram
>
> [1] FWIW this is something I am looking at, and https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/ has some details about this.
>
> On Mon, Sep 26, 2016 at 1:38 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>> Hi Spark Developers,
>>
>> After some discussion on SPARK-16407 (and on the PR) we’ve decided to jump back to the developer list (SPARK-16407 itself comes from our early work on SPARK-16424 to enable ML with the new Structured Streaming API). SPARK-16407 proposes to extend the current DataStreamWriter API to allow users to specify a specific instance of a StreamSinkProvider - this makes it easier for users to create sinks that are configured with things besides strings (for example, things like lambdas).
>> An example of something like this already inside Spark is the ForeachSink.
>>
>> We have been working on adding support for online learning in Structured Streaming, similar to what Spark Streaming and MLlib provide today. Details are available in SPARK-16424. Along the way, we noticed that there is currently no way for code running on the driver to access the streaming output of a Structured Streaming query (in our case ideally as a Dataset or RDD - but regardless of the underlying data structure). In our specific case, we wanted to update a model on the driver using aggregates computed by a Structured Streaming query.
>>
>> A lot of other applications are going to have similar requirements. For example, there is no way (outside of using private Spark internals)* to implement a console sink with a user-supplied formatting function, configure a templated or generic sink at runtime, trigger a custom Python callback, or even implement the ForeachSink outside of Spark. For the work inside of Spark to enable Structured Streaming with ML we clearly don’t need SPARK-16407, as we can directly access the internals (although it would be cleaner not to have to), but if we want to empower people working outside of the Spark codebase itself with Structured Streaming, I think we need to provide some mechanism for this, and it would be great to see what options/ideas the community can come up with.
>>
>> One of the arguments against SPARK-16407 seems to be mostly that it exposes the Sink API, which is implemented using micro-batching. The counter-argument is that the Sink API is already exposed: instead of passing in an instance, the user passes in a class name which is then instantiated through reflection, with configuration parameters passed in as a map of strings.
>>
>> Personally I think we should expose a more nicely typed API instead of depending on strings for all configuration, and even if the Sink API itself needs to change if/when Spark Streaming moves away from micro-batching, we would still likely want to let users provide the typed interface as well, to give Sink creators more flexibility with configuration.
>>
>> Now obviously this is based on my understanding of the lay of the land, which could be a little off since the Spark Structured Streaming design docs and JIRAs don’t seem to be actively updated - so I’d love to know what assumptions I’ve made that don’t match the current plans for Structured Streaming.
>>
>> Cheers,
>>
>> Holden :)
>>
>> Related Links:
>> The JIRA for this proposal: https://issues.apache.org/jira/browse/SPARK-16407
>> The Structured Streaming ML JIRA: https://issues.apache.org/jira/browse/SPARK-16424
>> https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing
>> https://github.com/apache/spark/pull/14691
>> https://github.com/holdenk/spark-structured-streaming-ml
>>
>> *Strictly speaking one _could_ pass in a string of Java code and then compile it inside the Sink with Janino - but that clearly isn’t reasonable.
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
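P.S. For anyone following the thread who hasn't looked at the internals: the ForeachSink that Holden mentions is already surfaced through DataStreamWriter.foreach, which as far as I can tell is the one place where an instance (rather than a class name) can be passed in today. A rough sketch of its use, paraphrased from the 2.0 API, so details may be slightly off:

    import org.apache.spark.sql.{ForeachWriter, Row}

    // Runs on the executors (one writer per partition per trigger), so it cannot
    // hand results back to the driver - which is why it does not cover the
    // "update a model on the driver" use case from SPARK-16424.
    val writer = new ForeachWriter[Row] {
      def open(partitionId: Long, version: Long): Boolean = true
      def process(value: Row): Unit = println(value)
      def close(errorOrNull: Throwable): Unit = ()
    }

    // `df` is the same streaming DataFrame as in the sketch further up.
    df.writeStream
      .foreach(writer)          // an instance, not a class name
      .outputMode("append")
      .start()

This is essentially the kind of typed, instance-based entry point SPARK-16407 asks for, just generalized from ForeachWriter to arbitrary sink providers.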