Re: Plan on Structured Streaming in next major/minor release?

Michael Armbrust Tue, 30 Oct 2018 13:44:42 -0700

Thanks for bringing up some possible future directions for streaming. Here
are some thoughts:
 - I personally view all of the activity on Spark SQL also as activity on
Structured Streaming. The great thing about building streaming on catalyst
/ tungsten is that continued improvement to these components improves
streaming use cases as well.
 - I think the biggest on-going project is DataSourceV2, whose goal is to
provide a stable / performant API for streaming and batch data sources to
plug in.  I think connectivity to many different systems is one of the most
powerful aspects of Spark and right now there is no stable public API for
streaming. A lot of committer / PMC time is being spent here at the moment.
 - As you mention, 2.4.0 significantly improves the built-in connectivity
for Kafka, giving us the ability to read exactly once from a topic being
written to by transactional producers. I think projects to extend this
guarantee to the Kafka Sink and also to improve authentication with Kafka
are a great idea (and it seems like there is a lot of review activity on
the latter).


You bring up some other possible projects like session window support.
This is an interesting project, but as far as I can tell there is still
a lot of work that would need to be done before this feature could be
merged.  We'd need to understand how it works with update mode, among
other things. Additionally, a 3000+ line patch is really time consuming to
review. This, coupled with the fact that all the users that I have
interacted with need "session windows + some custom business logic"
(usually implemented with flatMapGroupsWithState), means that I'm more
inclined to direct limited review bandwidth to incremental improvements in
that feature than to something large/new. This is not to say that this
feature isn't useful / shouldn't be merged, just a bit of explanation as to
why there might be less activity here than you would hope.
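
To make the "session windows + some custom business logic" pattern
concrete, here is a minimal plain-Python sketch of the kind of per-key
state update that users typically write. It is not Spark code and does not
use the flatMapGroupsWithState API; the names (SessionState,
update_sessions, GAP_SECONDS) and the "sum of amounts" business logic are
hypothetical, and it assumes events for a key never arrive older than the
state already stored (no late data).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

GAP_SECONDS = 30  # hypothetical inactivity gap that closes a session


@dataclass(frozen=True)
class SessionState:
    start: int    # event time of the first event in the open session
    last: int     # event time of the most recent event seen so far
    total: float  # custom business logic: running sum of amounts


def update_sessions(
    state: Optional[SessionState],
    events: Iterable[Tuple[int, float]],
) -> Tuple[Optional[SessionState], List[Tuple[int, int, float]]]:
    """Fold one key's new (timestamp, amount) events into its state.

    Returns the new open-session state and the sessions closed by this
    batch, each as (start, end, total).
    """
    closed: List[Tuple[int, int, float]] = []
    for ts, amount in sorted(events):
        # A gap longer than GAP_SECONDS ends the open session.
        if state is not None and ts - state.last > GAP_SECONDS:
            closed.append((state.start, state.last, state.total))
            state = None
        if state is None:
            state = SessionState(start=ts, last=ts, total=amount)
        else:
            state = SessionState(state.start, ts, state.total + amount)
    return state, closed
```

Calling update_sessions once per key per batch, and carrying the returned
state into the next call, mirrors the shape of a stateful per-group
function: the session boundary rule and the business logic live in the same
user-written function, which is why a generic session window operator alone
often isn't enough.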

Similarly, multiple aggregations are an often requested feature.  However,
fundamentally, this is going to be a fairly large investment (I think we'd
need to combine the unsupported operation checker and the query planner and
also create a high performance (i.e. whole stage code-gened) aggregation
operator that understands negation).
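
As a rough illustration of why the second aggregation has to understand
negation, here is a small plain-Python sketch. It assumes "negation" means
retracting a previously emitted result when the upstream aggregate changes;
the function names and the (value, sign) change format are hypothetical and
are not Spark APIs.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Change = Tuple[int, int]  # (count value, sign): -1 retracts, +1 adds


def first_level(counts: Dict[str, int], keys: Iterable[str]) -> List[Change]:
    """Update per-key counts and emit the changes a downstream operator needs."""
    changes: List[Change] = []
    for key in keys:
        old = counts.get(key, 0)
        if old:
            changes.append((old, -1))  # retract the now-stale count
        counts[key] = old + 1
        changes.append((old + 1, +1))  # emit the updated count
    return changes


def second_level(histogram: Counter, changes: Iterable[Change]) -> None:
    """Maintain 'how many keys have each count' from signed changes."""
    for value, sign in changes:
        histogram[value] += sign
        if histogram[value] == 0:
            del histogram[value]


counts: Dict[str, int] = {}
histogram: Counter = Counter()
second_level(histogram, first_level(counts, ["a", "b", "a"]))
```

After this batch one key has count 1 and one key has count 2. Without the
retraction of the stale count for "a", the downstream histogram would
still claim that two keys have count 1.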

Thanks again for starting the discussion, and looking forward to hearing
about what features are most requested!

On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <kabh...@gmail.com> wrote:

> Adding more: again, it doesn't mean they're feasible to do. Just a kind of
> brainstorming.
>
> * SPARK-20568: Delete files after processing in structured streaming
>   * There hasn't been consensus regarding supporting this: there were
> voices for both YES and NO.
> * Support multiple levels of aggregations in structured streaming
>   * There are plenty of questions on Stack Overflow regarding this. While
> I don't think it makes sense in structured streaming if it requires an
> additional shuffle, there might be another case: group by keys, apply an
> aggregation, then apply another aggregation on the aggregated result
> (the grouping keys don't change)
>
> On Mon, Oct 22, 2018 at 12:25 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Yeah, the main intention of this thread is to collect interest in a
>> possible feature list for structured streaming. From what I can see in
>> the Spark community, most of the discussions as well as contributions are
>> for SQL, and I'd wish to see similar activity / effort on structured
>> streaming.
>> (Unfortunately there's less effort going into reviewing others' work,
>> design docs as well as pull requests; most of the effort looks like it is
>> being spent on people's own work.)
>>
>> I respect the role of PMC members, so the final decision would be up to
>> PMC members, but contributors as well as end users could show interest
>> and discuss requirements on an SPIP, which could be good background to
>> persuade PMC members.
>>
>> Before going deep, I guess we could use this thread to discuss possible
>> use cases, and if we would like to move forward to an individual thread
>> we could initiate (or resurrect) its discussion thread.
>>
>> For queryable state, at least there seems to be no workaround in Spark to
>> provide a similar thing, especially as state gets bigger. I may have some
>> concerns about the details, but I'll add my thoughts on the discussion
>> thread.
>>
>> - Jungtaek Lim (HeartSaVioR)
>>
>> On Mon, Oct 22, 2018 at 1:15 AM, Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> I just tried to start the discussion on the dev list a long time ago.
>>> I enumerated some use cases as Michael proposed here
>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4r9vod+ebty1fdgtqsxzgjgubox-k8araur...@mail.gmail.com%3E>.
>>> The discussion didn't go further.
>>>
>>> If people find it useful we should start discussing it in detail again.
>>>
>>> Stavros
>>>
>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>>> Stavros, if my memory is right, you were trying to drive queryable
>>>> state, right?
>>>>
>>>> Could you summarize the progress and the reason why the progress
>>>> stopped?
>>>>
>>>> On Sun, Oct 21, 2018 at 10:27 PM, Stavros Kontopoulos <
>>>> stavros.kontopou...@lightbend.com> wrote:
>>>>
>>>>> That is a very interesting list, thanks. I could create a design doc
>>>>> as a starting point for discussion if this is a feature we would like
>>>>> to have.
>>>>>
>>>>> Regards,
>>>>> Stavros
>>>>>
>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qcsd2...@163.com> wrote:
>>>>>
>>>>>> Thanks for raising them.
>>>>>>
>>>>>> FYI, I believe this open issue could also be considered:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>>
>>>>>> A new ability to express Structured Streaming in pure SQL.
