Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
> On a personal note, I'm quite surprised that this is all the progress in Structured Streaming over the last three months since 2.0 was released. I was under the impression that this was one of the biggest things that the Spark community actively works on, but that is clearly not the

Re: StructuredStreaming status

2016-10-20 Thread Amit Sela
On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia wrote:
> Yeah, as Shivaram pointed out, there have been research projects that looked at it. Also, Structured Streaming was explicitly designed to not make microbatching part of the API or part of the output behavior

RE: StructuredStreaming status

2016-10-20 Thread assaf.mendelson
, it is just an idea for optimization for specific use cases.
From: Michael Armbrust [via Apache Spark Developers List]
Sent: Thursday, October 20, 2016 11:16 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status
let’s say we would have

Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
> let’s say we would have implemented distinct count by saving a map with the key being the distinct value and the value being the last time we saw this value. This would mean that we wouldn’t really need to save all the steps in the middle and copy the data, we could only save the last
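
A minimal sketch of the map-based distinct count described in the quoted text, written in plain Python rather than against any Spark API. The class name, method names, and the eviction step are assumptions made for illustration; the message only describes a map from each distinct value to the last time it was seen.

```python
from typing import Dict, Hashable


class LastSeenDistinctCounter:
    """Distinct count backed by a map of value -> last time the value was seen.

    Hypothetical illustration only; this is not how Spark stores state.
    """

    def __init__(self) -> None:
        self._last_seen: Dict[Hashable, float] = {}

    def observe(self, value: Hashable, timestamp: float) -> None:
        # Keep only the most recent timestamp per value, so an
        # out-of-order event never moves the stored time backwards.
        previous = self._last_seen.get(value)
        if previous is None or timestamp > previous:
            self._last_seen[value] = timestamp

    def evict_older_than(self, cutoff: float) -> None:
        # Assumed purge step: drop values last seen before the cutoff.
        self._last_seen = {
            value: seen
            for value, seen in self._last_seen.items()
            if seen >= cutoff
        }

    def count(self) -> int:
        # The distinct count is simply the number of keys still held.
        return len(self._last_seen)
```

Under these assumptions the state is a single map that is updated in place, so no intermediate per-batch copies are needed to answer the distinct-count query.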

RE: StructuredStreaming status

2016-10-19 Thread assaf.mendelson
Sent: Thursday, October 20, 2016 3:42 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status
I'm also curious whether there are concerns other than latency with the way stuff executes in Structured Streaming (now that the time steps don't have to act as triggers), as well as what

Re: StructuredStreaming status

2016-10-19 Thread Abhishek R. Singh
It's not so much about latency actually. The bigger rub for me is that the state has to be reshuffled every micro/mini-batch (unless I am misunderstanding the Spark 2.0 state model). The operator model avoids it by preserving state locality. Event time processing and state purging are

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Both Spark Streaming and Structured Streaming preserve locality for operator state actually. They only reshuffle state if a cluster node fails or if the load becomes heavily imbalanced and it's better to launch a task on another node and load the state remotely.
Matei
> On Oct 19, 2016, at

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
Yeah, as Shivaram pointed out, there have been research projects that looked at it. Also, Structured Streaming was explicitly designed to not make microbatching part of the API or part of the output behavior (tying triggers to it). However, when people begin working on that is a function of

Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
I don't think it's just about what to target - if you could target 1 ms batches without harming 1-second or 1-minute batches, why wouldn't you? I think it's about having a clear strategy and dedicating resources to it. If scheduling batches at an order of magnitude or two lower latency is the

Re: StructuredStreaming status

2016-10-19 Thread Matei Zaharia
I'm also curious whether there are concerns other than latency with the way stuff executes in Structured Streaming (now that the time steps don't have to act as triggers), as well as what latency people want for various apps. The stateful operator designs for streaming systems aren't inherently

Re: StructuredStreaming status

2016-10-19 Thread Ofir Manor
Thanks a lot Michael! I really appreciate your sharing. Logistically, I suggest finding a way to tag all Structured Streaming JIRAs, so it wouldn't be so hard to look for them, for anyone wanting to participate, and also having something like the ML roadmap JIRA. Regarding your list, evicting space

Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
> At the AMPLab we've been working on a research project that looks at just the scheduling latencies and on techniques to get lower scheduling latency. It moves away from the micro-batch model, but

Re: StructuredStreaming status

2016-10-19 Thread Shivaram Venkataraman
At the AMPLab we've been working on a research project that looks at just the scheduling latencies and on techniques to get lower scheduling latency. It moves away from the micro-batch model, but reuses the fault tolerance etc. in Spark. However we haven't yet figured out all the parts in

Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
I've been working on the Apache Beam Spark runner which is (in this context) basically running a streaming model that focuses on event-time and correctness with Spark, and as I see it (even in Spark 1.6.x) the micro-batches are really just added latency, which will work out for some users, and not

Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
I know people are seriously thinking about latency. So far that has not been the limiting factor in the users I've been working with.
On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
> Is anyone seriously thinking about alternatives to microbatches?
>
> On Wed, Oct

Re: StructuredStreaming status

2016-10-19 Thread Cody Koeninger
Is anyone seriously thinking about alternatives to microbatches?
On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust wrote:
> Anything that is actively being designed should be in JIRA, and it seems like you found most of it. In general, release windows can be found on

Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
Anything that is actively being designed should be in JIRA, and it seems like you found most of it. In general, release windows can be found on the wiki. 2.1 has a lot of stability fixes as well as the Kafka support you mentioned.

StructuredStreaming status

2016-10-18 Thread Ofir Manor
Hi, I hope this is the right forum. I am looking for some information on what to expect from StructuredStreaming in its next releases to help me choose when / where to start using it more seriously (or where to invest in workarounds and where to wait). I couldn't find a good place where such