Re: StructuredStreaming status

Ofir Manor Wed, 19 Oct 2016 16:47:45 -0700

Thanks a lot Michael! I really appreciate your sharing.
Logistically, I suggest finding a way to tag all structured streaming
JIRAs, so it wouldn't be so hard to look for them, for anyone wanting to
participate, and also having something like the ML roadmap JIRA.
Regarding your list, evicting state seems very important. If I understand
correctly, currently state grows forever (when using windows), so it is
impractical to run a long-running streaming job with decent state. It would
be great if users could bound the state by event time (it is also very
natural).
I personally see sessionization as lower priority (seems like a niche
requirement). To me, supporting only a single stream of events that can
only be joined to static datasets makes building anything but the simplest
of short-running streaming jobs problematic (all interesting datasets
change over time). Also, the promise of interactive queries on top of a
computed, live dataset likely has a wider appeal (it has been presented
since early this year as one of the goals of structured streaming). The
same goes for making the sources and sinks API nicer to third-party
developers to encourage adoption and plugins, or beefing up the list of
built-in exactly-once sources and sinks (maybe also having a pluggable
state store, as I've seen some wanting, which may better enable
interactive queries).
In addition, I think you should really identify what needs to be done to
make this API stable and focus on that. I think that for adoption, you'll
need to be clear on the full list of gaps / gotchas, and clearly
communicate the project priorities / target timeline (again, just like ML
does it), hopefully after some community discussion...


On a personal note, I'm quite surprised that this is all the progress in
Structured Streaming over the last three months since 2.0 was released. I
was under the impression that this was one of the biggest things that the
Spark community actively works on, but that is clearly not the case, given
that most of the activity is a couple of (very important) JIRAs from the
last several weeks. Not really sure how to parse that yet...
I think having some clearer, prioritized roadmap going forward will be a
good first step to recalibrate expectations for 2.2 and for graduating from
an alpha state. But especially, I think you guys seriously need to figure
out what the bottleneck is here (lack of a dedicated owner? lack of
committers focusing on it?) and just fix it (recruit new committers to work
on it?) to have a competitive streaming offering in a few quarters.

Just my two cents,

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Wed, Oct 19, 2016 at 10:45 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Anything that is actively being designed should be in JIRA, and it seems
> like you found most of it.  In general, release windows can be found on the
> wiki <https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage>.
>
> 2.1 has a lot of stability fixes as well as the Kafka support you
> mentioned.  It may also include some of the following.
>
> The items I'd like to start thinking about next are:
>  - Evicting state from the store based on event time watermarks
>  - Sessionization (grouping together related events by key / eventTime)
>  - Improvements to the query planner (remove some of the restrictions on
> what queries can be run).
>
> This is roughly in order based on what I've been hearing users hit the
> most.  Would love more feedback on what is blocking real use cases.
>
> On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
>
>> Hi,
>> I hope it is the right forum.
>> I am looking for some information on what to expect from
>> StructuredStreaming in its next releases to help me choose when / where to
>> start using it more seriously (or where to invest in workarounds and where
>> to wait). I couldn't find a good place where such planning is discussed
>> for 2.1 (like, for example, ML and SPARK-15581).
>> I'm aware of the 2.0 documented limits
>> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> like no support for multiple aggregation levels, joins that are strictly
>> to a static dataset (no SCD or stream-stream), limited sources / sinks
>> (like no sink for interactive queries), etc.
>> I'm also aware of some changes that have landed in master, like the new
>> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> metrics in SPARK-17731, and some improvements for the file source.
>> If I remember correctly, the discussion on Spark release cadence
>> concluded with a preference for a four-month cycle, with a likely code
>> freeze pretty soon (end of October). So I believe the scope for 2.1 should
>> likely be quite clear to some, and that 2.2 planning should likely be
>> starting about now.
>> Any visibility / sharing will be highly appreciated!
>> Thanks in advance,
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>
>
