Both Spark Streaming and Structured Streaming actually preserve locality for operator state. They only reshuffle state if a cluster node fails, or if the load becomes so heavily imbalanced that it is better to launch a task on another node and load the state remotely.
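The placement policy described above can be sketched in plain Python. This is a hypothetical illustration of the idea (prefer the node holding the state; move it only on failure or heavy imbalance), not Spark's actual scheduler code; the names and the imbalance threshold are assumptions.

```python
IMBALANCE_FACTOR = 2.0  # assumed threshold: a node 2x above average load counts as imbalanced

def place_task(partition, state_location, alive_nodes, load):
    """Return the node that should run the task for this state partition."""
    preferred = state_location.get(partition)
    avg_load = sum(load.values()) / len(load)
    if (preferred in alive_nodes
            and load[preferred] <= IMBALANCE_FACTOR * avg_load):
        return preferred  # locality preserved: state stays where it is
    # Node failed or overloaded: pick the least-loaded node and load state remotely.
    target = min(alive_nodes, key=lambda n: load[n])
    state_location[partition] = target  # state is reshuffled to the new node
    return target

# Partition 0's state lives on node "a"; "a" is alive and not overloaded, so the
# task stays there. If "a" later fails, the task moves and the state follows it.
locations = {0: "a"}
print(place_task(0, locations, {"a", "b", "c"}, {"a": 3, "b": 2, "c": 1}))  # -> a
print(place_task(0, locations, {"b", "c"}, {"b": 2, "c": 1}))              # -> c
```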
Matei

> On Oct 19, 2016, at 9:38 PM, Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote:
>
> It's not so much about latency, actually. The bigger rub for me is that the
> state has to be reshuffled every micro/mini-batch (unless I am not
> understanding it right, i.e. the Spark 2.0 state model).
>
> The operator model avoids this by preserving state locality. Event-time
> processing and state purging are the other essentials (which are thankfully
> getting addressed).
>
> Any guidance on (timelines for) the expected exit from alpha state would also
> be greatly appreciated.
>
> -Abhishek-
>
>> On Oct 19, 2016, at 5:36 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>
>> I'm also curious whether there are concerns other than latency with the way
>> stuff executes in Structured Streaming (now that the time steps don't have
>> to act as triggers), as well as what latency people want for various apps.
>>
>> The stateful operator designs for streaming systems aren't inherently
>> "better" than micro-batching -- they lose a lot of stuff that is possible in
>> Spark, such as load balancing work dynamically across nodes, speculative
>> execution for stragglers, scaling clusters up and down elastically, etc.
>> Moreover, Spark itself could execute the current model with much lower
>> latency. The question is just what combinations of latency, throughput,
>> fault recovery, etc. to target.
>>
>> Matei
>>
>>> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>>
>>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
>>> <shiva...@eecs.berkeley.edu> wrote:
>>> At the AMPLab we've been working on a research project that looks at
>>> just the scheduling latencies and on techniques to get lower
>>> scheduling latency. It moves away from the micro-batch model, but
>>> reuses the fault tolerance etc. in Spark.
>>> However, we haven't yet
>>> figured out all the parts of integrating this with the rest of
>>> structured streaming. I'll try to post a design doc / SIP about this
>>> soon.
>>>
>>> On a related note - are there other problems users face with
>>> micro-batches other than latency?
>>> I think that the fact that they serve as an output trigger is a problem,
>>> but Structured Streaming seems to resolve this now.
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>>> <mich...@databricks.com> wrote:
>>> > I know people are seriously thinking about latency. So far that has not
>>> > been the limiting factor for the users I've been working with.
>>> >
>>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> >>
>>> >> Is anyone seriously thinking about alternatives to micro-batches?
>>> >>
>>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>>> >> <mich...@databricks.com> wrote:
>>> >> > Anything that is actively being designed should be in JIRA, and it
>>> >> > seems like you found most of it. In general, release windows can be
>>> >> > found on the wiki.
>>> >> >
>>> >> > 2.1 has a lot of stability fixes as well as the Kafka support you
>>> >> > mentioned. It may also include some of the following.
>>> >> >
>>> >> > The items I'd like to start thinking about next are:
>>> >> > - Evicting state from the store based on event-time watermarks
>>> >> > - Sessionization (grouping together related events by key / event time)
>>> >> > - Improvements to the query planner (removing some of the restrictions
>>> >> >   on what queries can be run)
>>> >> >
>>> >> > This is roughly in order based on what I've been hearing users hit the
>>> >> > most. Would love more feedback on what is blocking real use cases.
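The first roadmap item above (evicting state based on event-time watermarks) can be sketched in plain Python. This is a hypothetical illustration of the concept, not Spark's actual state-store API; the function names, the `(count, last_event_time)` state shape, and the eviction rule are assumptions.

```python
def advance_watermark(events, max_event_time, delay):
    """Watermark = max event time seen so far, minus an allowed lateness."""
    if events:
        max_event_time = max(max_event_time, max(t for _, t in events))
    return max_event_time, max_event_time - delay

def update_state(state, events, watermark):
    """Count events per key; drop keys whose newest event is older than the
    watermark, since they can no longer receive non-late data."""
    for key, event_time in events:
        if event_time >= watermark:  # ignore data that arrives too late
            count, last = state.get(key, (0, 0))
            state[key] = (count + 1, max(event_time, last))
    # Evict finalized keys instead of keeping their state forever.
    return {k: v for k, v in state.items() if v[1] >= watermark}

# Two micro-batches of (key, event_time) pairs with 10 units of allowed lateness:
# once the watermark passes 102, key "b" is finalized and its state is evicted.
state, max_t = {}, 0
for batch in [[("a", 100), ("b", 102)], [("a", 130)]]:
    max_t, wm = advance_watermark(batch, max_t, delay=10)
    state = update_state(state, batch, wm)
print(state)  # -> {'a': (2, 130)}
```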
>>> >> >
>>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>>> >> >>
>>> >> >> Hi,
>>> >> >> I hope this is the right forum.
>>> >> >> I am looking for some information on what to expect from
>>> >> >> Structured Streaming in its next releases, to help me choose when /
>>> >> >> where to start using it more seriously (or where to invest in
>>> >> >> workarounds and where to wait). I couldn't find a good place where
>>> >> >> such planning is discussed for 2.1 (like, for example, ML and
>>> >> >> SPARK-15581).
>>> >> >> I'm aware of the 2.0 documented limits
>>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>>> >> >> like no support for multiple aggregation levels, joins strictly to a
>>> >> >> static dataset (no SCD or stream-stream joins), limited sources /
>>> >> >> sinks (like no sink for interactive queries), etc.
>>> >> >> I'm also aware of some changes that have landed in master, like the
>>> >> >> new Kafka 0.10 source (and its on-going improvements) in SPARK-15406,
>>> >> >> the metrics in SPARK-17731, and some improvements for the file source.
>>> >> >> If I remember correctly, the discussion on the Spark release cadence
>>> >> >> concluded with a preference for four-month cycles, with a likely code
>>> >> >> freeze pretty soon (end of October). So I believe the scope for 2.1
>>> >> >> should already be quite clear to some, and 2.2 planning should likely
>>> >> >> be starting about now.
>>> >> >> Any visibility / sharing will be highly appreciated!
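The "joins strictly to a static dataset" limit mentioned above means each micro-batch of the stream can be joined against a fixed reference table, but two streams cannot be joined with each other in Spark 2.0. A minimal plain-Python sketch of the supported case, with hypothetical data and names (not Spark's API):

```python
# Static reference dataset, loaded once and reused for every micro-batch.
static_users = {
    1: "alice",
    2: "bob",
}

def join_batch_with_static(batch, static):
    """Inner-join one micro-batch of (user_id, action) rows with the static table."""
    return [(uid, static[uid], action) for uid, action in batch if uid in static]

# Each micro-batch is joined independently; user_id 3 has no match and is dropped.
batch = [(1, "click"), (3, "view"), (2, "click")]
print(join_batch_with_static(batch, static_users))
# -> [(1, 'alice', 'click'), (2, 'bob', 'click')]
```

A stream-stream join is harder precisely because neither side is fixed: matching rows may arrive in different micro-batches, so both sides' state must be buffered and eventually purged, which is where the watermark work comes in.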
>>> >> >> Thanks in advance,
>>> >> >>
>>> >> >> Ofir Manor
>>> >> >>
>>> >> >> Co-Founder & CTO | Equalum
>>> >> >>
>>> >> >> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io