Both Spark Streaming and Structured Streaming actually preserve locality for operator state. They only reshuffle state if a cluster node fails, or if the load becomes so heavily imbalanced that it is better to launch a task on another node and load the state remotely.
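The placement policy described above can be sketched in plain Python. This is a hypothetical illustration of the idea (prefer the node holding the state; move it only on failure or heavy imbalance), not Spark's actual scheduler code; the names and the imbalance threshold are assumptions.

```python
IMBALANCE_FACTOR = 2.0  # assumed threshold: a node 2x above average load counts as imbalanced

def place_task(partition, state_location, alive_nodes, load):
    """Return the node that should run the task for this state partition."""
    preferred = state_location.get(partition)
    avg_load = sum(load.values()) / len(load)
    if (preferred in alive_nodes
            and load[preferred] <= IMBALANCE_FACTOR * avg_load):
        return preferred  # locality preserved: state stays where it is
    # Node failed or overloaded: pick the least-loaded node and load state remotely.
    target = min(alive_nodes, key=lambda n: load[n])
    state_location[partition] = target  # state is reshuffled to the new node
    return target

# Partition 0's state lives on node "a"; "a" is alive and not overloaded, so the
# task stays there. If "a" later fails, the task moves and the state follows it.
locations = {0: "a"}
print(place_task(0, locations, {"a", "b", "c"}, {"a": 3, "b": 2, "c": 1}))  # -> a
print(place_task(0, locations, {"b", "c"}, {"b": 2, "c": 1}))              # -> c
```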
Matei

> On Oct 19, 2016, at 9:38 PM, Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote:
>
> It's not so much about latency, actually. The bigger rub for me is that the
> state has to be reshuffled every micro/mini-batch (unless I am not
> understanding it right, i.e. the Spark 2.0 state model).
>
> The operator model avoids this by preserving state locality. Event-time
> processing and state purging are the other essentials (which are thankfully
> getting addressed).
>
> Any guidance on (timelines for) the expected exit from alpha state would also
> be greatly appreciated.
>
> -Abhishek-
>
>> On Oct 19, 2016, at 5:36 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> I'm also curious whether there are concerns other than latency with the way
>> stuff executes in Structured Streaming (now that the time steps don't have
>> to act as triggers), as well as what latency people want for various apps.
>>
>> The stateful operator designs for streaming systems aren't inherently
>> "better" than micro-batching -- they lose a lot of stuff that is possible in
>> Spark, such as load balancing work dynamically across nodes, speculative
>> execution for stragglers, scaling clusters up and down elastically, etc.
>> Moreover, Spark itself could execute the current model with much lower
>> latency. The question is just what combinations of latency, throughput,
>> fault recovery, etc. to target.
>>
>> Matei
>>
>>> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>
>>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>> At the AMPLab we've been working on a research project that looks at
>>> just the scheduling latencies and on techniques to get lower
>>> scheduling latency. It moves away from the micro-batch model, but
>>> reuses the fault tolerance etc. in Spark.
>>> However, we haven't yet
>>> figured out all the parts of integrating this with the rest of
>>> structured streaming. I'll try to post a design doc / SIP about this
>>> soon.
>>>
>>> On a related note - are there other problems users face with
>>> micro-batches other than latency?
>>> I think that the fact that they serve as an output trigger is a problem,
>>> but Structured Streaming seems to resolve this now.
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>> <mich...@databricks.com> wrote:
>>> > I know people are seriously thinking about latency. So far that has not
>>> > been the limiting factor for the users I've been working with.
>>> >
>>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> >>
>>> >> Is anyone seriously thinking about alternatives to micro-batches?
>>> >>
>>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>> >> <mich...@databricks.com> wrote:
>>> >> > Anything that is actively being designed should be in JIRA, and it
>>> >> > seems like you found most of it. In general, release windows can be
>>> >> > found on the wiki.
>>> >> >
>>> >> > 2.1 has a lot of stability fixes as well as the Kafka support you
>>> >> > mentioned. It may also include some of the following.
>>> >> >
>>> >> > The items I'd like to start thinking about next are:
>>> >> > - Evicting state from the store based on event-time watermarks
>>> >> > - Sessionization (grouping together related events by key / event time)
>>> >> > - Improvements to the query planner (removing some of the restrictions
>>> >> >   on what queries can be run)
>>> >> >
>>> >> > This is roughly in order based on what I've been hearing users hit the
>>> >> > most. Would love more feedback on what is blocking real use cases.
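The first roadmap item above (evicting state based on event-time watermarks) can be sketched in plain Python. This is a hypothetical illustration of the concept, not Spark's actual state-store API; the function names, the `(count, last_event_time)` state shape, and the eviction rule are assumptions.

```python
def advance_watermark(events, max_event_time, delay):
    """Watermark = max event time seen so far, minus an allowed lateness."""
    if events:
        max_event_time = max(max_event_time, max(t for _, t in events))
    return max_event_time, max_event_time - delay

def update_state(state, events, watermark):
    """Count events per key; drop keys whose newest event is older than the
    watermark, since they can no longer receive non-late data."""
    for key, event_time in events:
        if event_time >= watermark:  # ignore data that arrives too late
            count, last = state.get(key, (0, 0))
            state[key] = (count + 1, max(event_time, last))
    # Evict finalized keys instead of keeping their state forever.
    return {k: v for k, v in state.items() if v[1] >= watermark}

# Two micro-batches of (key, event_time) pairs with 10 units of allowed lateness:
# once the watermark passes 102, key "b" is finalized and its state is evicted.
state, max_t = {}, 0
for batch in [[("a", 100), ("b", 102)], [("a", 130)]]:
    max_t, wm = advance_watermark(batch, max_t, delay=10)
    state = update_state(state, batch, wm)
print(state)  # -> {'a': (2, 130)}
```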
>>> >> >
>>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >> I hope this is the right forum.
>>> >> >> I am looking for some information on what to expect from
>>> >> >> Structured Streaming in its next releases, to help me choose when /
>>> >> >> where to start using it more seriously (or where to invest in
>>> >> >> workarounds and where to wait). I couldn't find a good place where
>>> >> >> such planning is discussed for 2.1 (like, for example, ML and
>>> >> >> SPARK-15581).
>>> >> >> I'm aware of the 2.0 documented limits
>>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>>> >> >> like no support for multiple aggregation levels, joins strictly to a
>>> >> >> static dataset (no SCD or stream-stream joins), limited sources /
>>> >> >> sinks (like no sink for interactive queries), etc.
>>> >> >> I'm also aware of some changes that have landed in master, like the
>>> >> >> new Kafka 0.10 source (and its on-going improvements) in SPARK-15406,
>>> >> >> the metrics in SPARK-17731, and some improvements for the file source.
>>> >> >> If I remember correctly, the discussion on the Spark release cadence
>>> >> >> concluded with a preference for four-month cycles, with a likely code
>>> >> >> freeze pretty soon (end of October). So I believe the scope for 2.1
>>> >> >> should already be quite clear to some, and 2.2 planning should likely
>>> >> >> be starting about now.
>>> >> >> Any visibility / sharing will be highly appreciated!
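The "joins strictly to a static dataset" limit mentioned above means each micro-batch of the stream can be joined against a fixed reference table, but two streams cannot be joined with each other in Spark 2.0. A minimal plain-Python sketch of the supported case, with hypothetical data and names (not Spark's API):

```python
# Static reference dataset, loaded once and reused for every micro-batch.
static_users = {
    1: "alice",
    2: "bob",
}

def join_batch_with_static(batch, static):
    """Inner-join one micro-batch of (user_id, action) rows with the static table."""
    return [(uid, static[uid], action) for uid, action in batch if uid in static]

# Each micro-batch is joined independently; user_id 3 has no match and is dropped.
batch = [(1, "click"), (3, "view"), (2, "click")]
print(join_batch_with_static(batch, static_users))
# -> [(1, 'alice', 'click'), (2, 'bob', 'click')]
```

A stream-stream join is harder precisely because neither side is fixed: matching rows may arrive in different micro-batches, so both sides' state must be buffered and eventually purged, which is where the watermark work comes in.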
>>> >> >> Thanks in advance,
>>> >> >>
>>> >> >> Ofir Manor
>>> >> >>
>>> >> >> Co-Founder & CTO | Equalum
>>> >> >>
>>> >> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io