Re: StructuredStreaming status

2016-10-20 Thread Amit Sela
On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia wrote:

> Yeah, as Shivaram pointed out, there have been research projects that
> looked at it. Also, Structured Streaming was explicitly designed to not
> make microbatching part of the API or part of the output behavior (tying
> triggers to it).
>
But Streaming Query sources
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala#L41>
are still designed with microbatches in mind. Can this be removed, leaving
offset tracking to the executors?
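
To make the question concrete: as of Spark 2.0 the Source trait linked above
exposes getOffset and getBatch(start, end), so every read is a bounded offset
range planned by the driver. Below is a minimal Python sketch of that shape
next to a reader that tracks its own position. It is an illustration only;
ToySource, run_one_micro_batch and SelfTrackingReader are hypothetical names,
not Spark APIs.

```python
from typing import List, Optional


class ToySource:
    """Hypothetical in-memory log standing in for a streaming source."""

    def __init__(self) -> None:
        self._log: List[str] = []

    def append(self, record: str) -> None:
        self._log.append(record)

    def get_offset(self) -> Optional[int]:
        # Highest available offset, or None if nothing has arrived yet.
        return len(self._log) - 1 if self._log else None

    def get_batch(self, start: Optional[int], end: int) -> List[str]:
        # Records in (start, end]; start=None means "from the beginning".
        first = 0 if start is None else start + 1
        return self._log[first:end + 1]


def run_one_micro_batch(source: ToySource, committed: Optional[int]) -> Optional[int]:
    """Driver-side step: plan one bounded batch, process it, return the new offset."""
    end = source.get_offset()
    if end is None or end == committed:
        return committed  # nothing new to read
    for record in source.get_batch(committed, end):
        print("batch record:", record)
    return end  # the driver would persist this as the committed offset


class SelfTrackingReader:
    """Alternative shape: a long-lived reader that keeps its own position."""

    def __init__(self, source: ToySource) -> None:
        self._source = source
        self._next = 0  # next offset this reader has not yet consumed

    def poll(self) -> List[str]:
        end = self._source.get_offset()
        if end is None or end < self._next:
            return []
        start = self._next - 1 if self._next > 0 else None
        records = self._source.get_batch(start, end)
        self._next = end + 1
        return records
```

In the first shape the committed offset lives with whoever plans the batches;
in the second it lives with the reader, which is roughly what "leave offset
tracking to the executors" would imply. Fault recovery for the second shape is
not addressed by this sketch.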

> However, when people begin working on that is a function of demand
> relative to other features. I don't think we can commit to one plan before
> exploring more options, but basically there is Shivaram's project, which
> adds a few new concepts to the scheduler, and there's the option to reduce
> control plane latency in the current system, which hasn't been heavily
> optimized yet but should be doable (lots of systems can handle 10,000s of
> RPCs per second).
>
> Matei
>
> On Oct 19, 2016, at 9:20 PM, Cody Koeninger  wrote:
>
> I don't think it's just about what to target - if you could target 1ms
> batches, without harming 1 second or 1 minute batches why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it.
> If  scheduling batches at an order of magnitude or two lower latency is the
> strategy, and that's actually feasible, that's great. But I haven't seen
> that clear direction, and this is by no means a recent issue.
>
> On Oct 19, 2016 7:36 PM, "Matei Zaharia"  wrote:
>
> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela  wrote:
>
>
>
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figured out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
>
> On a related note - are there other problems users face with
> micro-batch other than latency ?
>
> I think that the fact that they serve as an output trigger is a problem,
> but Structured Streaming seems to resolve this now.
>
>
> Thanks
> Shivaram
>
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>  wrote:
> > I know people are seriously thinking about latency.  So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >>  wrote:
> >> > Anything that is actively being designed should be in JIRA, and it
> >> > seems like you found most of it.  In general, release windows can be
> >> > found on the wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the kafka support you
> >> > mentioned. It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key /
> >> >    eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions
> >> >    on what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.

Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figured out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
>
> On a related note - are there other problems users face with
> micro-batch other than latency ?
>
I think that the fact that they serve as an output trigger is a problem,
but Structured Streaming seems to resolve this now.

>
> Thanks
> Shivaram
>
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>  wrote:
> > I know people are seriously thinking about latency.  So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> >>  wrote:
> >> > Anything that is actively being designed should be in JIRA, and it
> >> > seems like you found most of it.  In general, release windows can be
> >> > found on the wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the kafka support you
> >> > mentioned. It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key /
> >> >    eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions
> >> >    on what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information on what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when /
> >> >> where to start using it more seriously (or where to invest in
> >> >> workarounds and where to wait). I couldn't find a good place where
> >> >> such planning is discussed for 2.1 (like, for example, ML and
> >> >> SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> >> >> like no support for multiple aggregation levels, joins are strictly
> >> >> to a static dataset (no SCD or stream-stream), limited sources /
> >> >> sinks (like no sink for interactive queries), etc.
> >> >> I'm also aware of some changes that have landed in master, like the
> >> >> new Kafka 0.10 source (and its ongoing improvements) in SPARK-15406,
> >> >> the metrics in SPARK-17731, and some improvements for the file source.
> >> >> If I remember correctly, the discussion on Spark release cadence
> >> >> concluded with a preference for a four-month cycle, with a likely
> >> >> code freeze pretty soon (end of October). So I believe the scope for
> >> >> 2.1 should likely be quite clear to some, and 2.2 planning should
> >> >> likely be starting about now.
> >> >> Any visibility / sharing will be highly appreciated!
> >> >> Thanks in advance,
> >> >>
> >> >> Ofir Manor
> >> >>
> >> >> Co-Founder & CTO | Equalum
> >> >>
> >> >> Mobile: +972-54-7801286 <054-780-1286> | Email: ofir.ma...@equalum.io
> >> >
> >> >
> >
> >
>
>
>


Re: StructuredStreaming status

2016-10-19 Thread Amit Sela
I've been working on the Apache Beam Spark runner, which (in this context)
basically runs a streaming model that focuses on event time and
correctness on Spark. As I see it (even in Spark 1.6.x), the micro-batches
are really just added latency, which will work out for some users and not
for others, and that's OK. Structured Streaming triggers make it even
better by computing on trigger (other systems compute constantly but
output only on trigger, so there's not much difference there).

I'm actually curious about a couple of things:

   - State store API - having a state API available is extremely useful for
   streaming on many fronts:
      - Available for sources (and sinks?) to avoid immortalizing
      micro-batch reads (simply let tasks pick up where they left off in the
      previous micro-batch).
      - Can help get rid of resuming from checkpoint: you can simply restart.
      That goes for upgrading Spark jobs as well as for wrapping accumulators
      and broadcasts in getOrCreate methods (and if you're not resuming from
      checkpoint, you can avoid wrapping your DAG construction in getOrCreate
      as well).
      - The fact that it aims to be pluggable enables building platforms
      (not just frameworks) with Spark.
      - Finally, it is basically the basis for any stateful computation
      Spark will support.
   - Evicting state, as Michael pointed out. Currently, when using for
   example overlapping windows, the Dataset grows really quickly.
   - Encoders API - where does it stand? Will developers/users be able to
   define a custom schema for, say, a generically typed class? Will it get
   along with inner classes, static classes, etc.?
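
To illustrate the state-eviction point: with overlapping (sliding) windows
every event lands in several windows, so per-window state keeps growing unless
old windows are dropped. Below is a minimal Python sketch of eviction driven
by an event-time watermark. It is not Spark's state store; the WINDOW, SLIDE
and MAX_DELAY values and the WindowedCounts class are assumptions made up for
this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW = 10     # window length in seconds (assumed)
SLIDE = 5       # slide interval in seconds (assumed)
MAX_DELAY = 5   # how far the watermark trails the max event time (assumed)

Window = Tuple[int, int]  # [start, end) in seconds


def windows_for(event_time: int) -> List[Window]:
    """All sliding windows [start, end) that contain event_time."""
    last_start = (event_time // SLIDE) * SLIDE
    starts = range(last_start, event_time - WINDOW, -SLIDE)
    return [(s, s + WINDOW) for s in starts if s >= 0]


class WindowedCounts:
    """Per-window event counts with watermark-based eviction."""

    def __init__(self) -> None:
        self.state: Dict[Window, int] = defaultdict(int)
        self.max_event_time = 0

    def watermark(self) -> int:
        return self.max_event_time - MAX_DELAY

    def add(self, event_time: int) -> None:
        wm = self.watermark()
        for window in windows_for(event_time):
            if window[1] <= wm:
                continue  # too late: this window is already closed
            self.state[window] += 1
        self.max_event_time = max(self.max_event_time, event_time)

    def evict(self) -> Dict[Window, int]:
        """Remove and return every window that ends at or before the watermark."""
        wm = self.watermark()
        done = {w: c for w, c in self.state.items() if w[1] <= wm}
        for w in done:
            del self.state[w]
        return done
```

Without the evict step the state dictionary holds one entry per window ever
seen; with it, only windows that can still receive data stay in memory.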

Thanks,
Amit

On Wed, Oct 19, 2016 at 11:30 PM Michael Armbrust wrote:

> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger 
> wrote:
>
> Is anyone seriously thinking about alternatives to microbatches?
>
> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>  wrote:
> > Anything that is actively being designed should be in JIRA, and it seems
> > like you found most of it.  In general, release windows can be found on
> > the wiki.
> >
> > 2.1 has a lot of stability fixes as well as the kafka support you
> > mentioned. It may also include some of the following.
> >
> > The items I'd like to start thinking about next are:
> >  - Evicting state from the store based on event time watermarks
> >  - Sessionization (grouping together related events by key / eventTime)
> >  - Improvements to the query planner (remove some of the restrictions on
> >    what queries can be run).
> >
> > This is roughly in order based on what I've been hearing users hit the
> > most.
> > Would love more feedback on what is blocking real use cases.
> >
> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor wrote:
> >>
> >> Hi,
> >> I hope it is the right forum.
> >> I am looking for some information on what to expect from
> >> StructuredStreaming in its next releases to help me choose when / where
> >> to start using it more seriously (or where to invest in workarounds and
> >> where to wait). I couldn't find a good place where such planning is
> >> discussed for 2.1 (like, for example, ML and SPARK-15581).
> >> I'm aware of the 2.0 documented limits
> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> >> like no support for multiple aggregation levels, joins are strictly to
> >> a static dataset (no SCD or stream-stream), limited sources / sinks
> >> (like no sink for interactive queries), etc.
> >> I'm also aware of some changes that have landed in master, like the new
> >> Kafka 0.10 source (and its ongoing improvements) in SPARK-15406, the
> >> metrics in SPARK-17731, and some improvements for the file source.
> >> If I remember correctly, the discussion on Spark release cadence
> >> concluded with a preference for a four-month cycle, with a likely code
> >> freeze pretty soon (end of October). So I believe the scope for 2.1
> >> should likely be quite clear to some, and 2.2 planning should likely be
> >> starting about now.
> >> Any visibility / sharing will be highly appreciated!
> >> Thanks in advance,
> >>
> >> Ofir Manor
> >>
> >> Co-Founder & CTO | Equalum
> >>
> >> Mobile: +972-54-7801286 <054-780-1286> | Email: ofir.ma...@equalum.io
> >
> >
>
>
>


Re: Spark SQL and Kryo registration

2016-08-04 Thread Amit Sela
It should. Codegen uses the SparkConf in SparkEnv when instantiating a new
Serializer.

On Thu, Aug 4, 2016 at 6:14 PM Jacek Laskowski  wrote:

> Hi Olivier,
>
> I don't know either, but am curious what you've tried already.
>
> Jacek
>
> On 3 Aug 2016 10:50 a.m., "Olivier Girardot" <
> o.girar...@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> I'm currently trying to use Spark 2.0.0 and make DataFrames work
>> with kryo.registrationRequired=true.
>> Is it even possible at all, considering the codegen?
>>
>> Regards,
>>
>> *Olivier Girardot* | Associé
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94
>>
>


Re: cutting 1.6.2 rc and 2.0.0 rc this week?

2016-06-15 Thread Amit Sela
Should we backport https://github.com/apache/spark/pull/13424 to 1.6.2 ?

On Thu, Jun 16, 2016 at 9:02 AM andy petrella wrote:

> +1 both too
> (for tomorrow lunchtime? ^^)
>
> On Thu, Jun 16, 2016 at 5:06 AM Raymond Honderdors <
> raymond.honderd...@sizmek.com> wrote:
>
>> +1 for both
>>
>>
>> On Wed, Jun 15, 2016 at 10:23 PM +0300, "Michael Armbrust" <
>> mich...@databricks.com> wrote:
>>
>> +1 to both of these!
>>
>> On Wed, Jun 15, 2016 at 12:21 PM, Sean Owen  wrote:
>>
>>> 1.6.2 RC seems fine to me; I don't know of outstanding issues. Clearly
>>> we need to keep the 1.x line going for a bit, so a bug fix release
>>> sounds good.
>>>
>>> Although we've got some work to do before 2.0.0 it does look like it's
>>> within reach. Especially if declaring an RC creates more focus on
>>> resolving the most important blocker issues -- and if we do burn those
>>> down before 2.0.0 -- this sounds like a good step IMHO.
>>>
>>> On Wed, Jun 15, 2016 at 8:01 PM, Reynold Xin 
>>> wrote:
>>> > It's been a while and we have accumulated quite a few bug fixes in
>>> > branch-1.6. I'm thinking about cutting 1.6.2 rc this week. Any patches
>>> > somebody wants to get in last minute?
>>> >
>>> > On a related note, I'm thinking about cutting 2.0.0 rc this week too. I
>>> > looked at the 60 unresolved tickets and almost all of them look like
>>> > they can be retargeted or are just some doc updates. I'm going to be
>>> > more aggressive about pushing individual people to resolve those, in
>>> > case this drags on forever.
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>> --
> andy
>