Re: Plan on Structured Streaming in next major/minor release?

Stavros Kontopoulos Tue, 30 Oct 2018 13:59:26 -0700

@Michael any update about queryable state?

Stavros


On Tue, Oct 30, 2018 at 10:43 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Thanks for bringing up some possible future directions for streaming. Here
> are some thoughts:
>  - I personally view all of the activity on Spark SQL also as activity on
> Structured Streaming. The great thing about building streaming on catalyst
> / tungsten is that continued improvement to these components improves
> streaming use cases as well.
>  - I think the biggest on-going project is DataSourceV2, whose goal is to
> provide a stable / performant API for streaming and batch data sources to
> plug in.  I think connectivity to many different systems is one of the most
> powerful aspects of Spark and right now there is no stable public API for
> streaming. A lot of committer / PMC time is being spent here at the moment.
>  - As you mention, 2.4.0 significantly improves the built-in connectivity
> for Kafka, giving us the ability to read exactly once from a topic being
> written to by transactional producers. I think projects to extend this
> guarantee to the Kafka Sink and also to improve authentication with Kafka
> are a great idea (and it seems like there is a lot of review activity on
> the latter).
>
> You bring up some other possible projects like session window support.
> This is an interesting project, but as far as I can tell there is still a
> lot of work that would need to be done before this feature could be
> merged.  We'd need to understand how it works with update mode, amongst
> other things. Additionally, a 3000+ line patch is really time consuming to
> review. This, coupled with the fact that all the users that I have
> interacted with need "session windows + some custom business logic"
> (usually implemented with flatMapGroupsWithState), means that I'm more
> inclined to direct limited review bandwidth to incremental improvements in
> that feature than to something large/new. This is not to say that this
> feature isn't useful / shouldn't be merged, just a bit of explanation as to
> why there might be less activity here than you would hope.
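
To make the "session windows + some custom business logic" pattern concrete,
here is a minimal sketch in plain Python. It is not Spark code; it only mimics
the kind of per-key state update that a flatMapGroupsWithState function
typically performs. The names Session and update_session and the gap
parameter are hypothetical, and the sketch assumes that events for a key
never arrive older than the session already held in state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Session:
    start: int   # timestamp of the first event in the session
    end: int     # timestamp of the latest event in the session
    events: int  # custom business logic would add its own fields here


def update_session(
    state: Optional[Session], timestamps: List[int], gap: int
) -> Tuple[Optional[Session], List[Session]]:
    """Fold one batch of timestamps for a single key into its session state.

    Returns the new open session and any sessions closed by this batch.
    """
    closed: List[Session] = []
    for ts in sorted(timestamps):
        if state is not None and ts - state.end <= gap:
            # Event falls within the inactivity gap: extend the open session.
            state = Session(state.start, max(state.end, ts), state.events + 1)
        else:
            # Gap exceeded (or no session yet): close the old one, open a new one.
            if state is not None:
                closed.append(state)
            state = Session(ts, ts, 1)
    return state, closed


# First micro-batch for one key, with a gap of 5 time units.
state, closed = update_session(None, [1, 2, 10], gap=5)
print(closed)  # [Session(start=1, end=2, events=2)]
print(state)   # Session(start=10, end=10, events=1)
```

In a real query the open session would live in the operator's state store,
and closing a session would usually be driven by a timeout rather than by the
arrival of the next event.
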
>
> Similarly, multiple aggregations are an often requested feature.  However,
> fundamentally, this is going to be a fairly large investment (I think we'd
> need to combine the unsupported operation checker and the query planner and
> also create a high performance (i.e. whole stage code-gened) aggregation
> operator that understands negation).
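
One way to read "understands negation" is retraction: when an upstream
aggregate changes, a downstream aggregate has to subtract the old value
before adding the new one. The plain-Python sketch below illustrates that
reading only; it is not Spark code, and count_then_histogram is a
hypothetical name. It chains a per-key count with a second aggregation that
counts how many keys have each count value.

```python
from collections import Counter
from typing import Dict, Iterable


def count_then_histogram(keys: Iterable[str]) -> Counter:
    """Two chained aggregations: count per key, then number of keys per count."""
    counts: Dict[str, int] = {}
    histogram: Counter = Counter()
    for key in keys:
        old = counts.get(key)
        new = (old or 0) + 1
        counts[key] = new
        if old is not None:
            # Retract ("negate") the key's previous contribution downstream.
            histogram[old] -= 1
            if histogram[old] == 0:
                del histogram[old]
        histogram[new] += 1
    return histogram


print(dict(count_then_histogram(["a", "b", "a"])))  # {1: 1, 2: 1}
```

Without the retraction step, key "a" would be counted under both 1 and 2,
which is why the second aggregation cannot simply consume the first one's
updates as if they were new rows.
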
>
> Thanks again for starting the discussion, and looking forward to hearing
> about what features are most requested!
>
> On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Adding more: again, it doesn't mean they're feasible to do. Just a kind
>> of brainstorming.
>>
>> * SPARK-20568: Delete files after processing in structured streaming
>>   * There hasn't been consensus regarding supporting this: there were
>> voices for both YES and NO.
>> * Support multiple levels of aggregations in structured streaming
>>   * There are plenty of questions on Stack Overflow regarding this. While
>> I don't think it makes sense in structured streaming if it requires an
>> additional shuffle, there might be another case: group by keys, apply an
>> aggregation, then apply another aggregation on the aggregated result (the
>> grouping keys don't change).
>>
>> On Mon, Oct 22, 2018 at 12:25 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Yeah, the main intention of this thread is to gauge interest in a list
>>> of possible features for structured streaming. From what I can see in
>>> the Spark community, most of the discussions as well as contributions are
>>> for SQL, and I'd wish to see similar activity / effort on structured
>>> streaming.
>>> (Unfortunately there's less effort to review others' work, design docs
>>> as well as pull requests; most effort appears to be spent on people's
>>> own work.)
>>>
>>> I respect the role of PMC members, so the final decision would be up to
>>> them, but contributors as well as end users could show their interest
>>> and discuss requirements in an SPIP, which could be good background for
>>> persuading PMC members.
>>>
>>> Before going deep, I guess we could use this thread to discuss possible
>>> use cases, and if we would like to move a topic forward to its own
>>> thread, we could initiate (or resurrect) its discussion thread.
>>>
>>> For queryable state, there seems to be no workaround in Spark that
>>> provides something similar, especially as state grows larger. I may have
>>> some concerns about the details, but I'll add my thoughts on the
>>> discussion thread.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Mon, Oct 22, 2018 at 1:15 AM, Stavros Kontopoulos
>>> <stavros.kontopoulos@lightbend.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I tried to start the discussion on the dev list a long time ago.
>>>> I enumerated some use cases as Michael proposed here
>>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4r9vod+ebty1fdgtqsxzgjgubox-k8araur...@mail.gmail.com%3E>.
>>>> The discussion didn't go further.
>>>>
>>>> If people find it useful we should start discussing it in detail again.
>>>>
>>>> Stavros
>>>>
>>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <kabh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Stavros, if my memory is right, you were trying to drive queryable
>>>>> state, right?
>>>>>
>>>>> Could you summarize the progress and the reason why it stopped?
>>>>>
>>>>> On Sun, Oct 21, 2018 at 10:27 PM, Stavros Kontopoulos
>>>>> <stavros.kontopoulos@lightbend.com> wrote:
>>>>>
>>>>>> That is a very interesting list, thanks. I could create a design doc
>>>>>> as a starting point for discussion if this is a feature we would like
>>>>>> to have.
>>>>>>
>>>>>> Regards,
>>>>>> Stavros
>>>>>>
>>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qcsd2...@163.com> wrote:
>>>>>>
>>>>>>> Thanks for raising them.
>>>>>>>
>>>>>>> FYI, I believe this open issue could also be considered:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>>>
>>>>>>> A new ability to express Structured Streaming in pure SQL.
