@Michael, any update on queryable state?

Stavros

On Tue, Oct 30, 2018 at 10:43 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Thanks for bringing up some possible future directions for streaming. Here are some thoughts:
>
> - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on Catalyst / Tungsten is that continued improvement to these components improves streaming use cases as well.
> - I think the biggest on-going project is DataSourceV2, whose goal is to provide a stable / performant API for streaming and batch data sources to plug in. I think connectivity to many different systems is one of the most powerful aspects of Spark and right now there is no stable public API for streaming. A lot of committer / PMC time is being spent here at the moment.
> - As you mention, 2.4.0 significantly improves the built-in connectivity for Kafka, giving us the ability to read exactly once from a topic being written to by transactional producers. I think projects to extend this guarantee to the Kafka Sink and also to improve authentication with Kafka are a great idea (and it seems like there is a lot of review activity on the latter).
>
> You bring up some other possible projects like session window support. This is an interesting project, but as far as I can tell there is still a lot of work that would need to be done before this feature could be merged. We'd need to understand how it works with update mode amongst other things. Additionally, a 3000+ line patch is really time-consuming to review. This, coupled with the fact that all the users that I have interacted with need "session windows + some custom business logic" (usually implemented with flatMapGroupsWithState), means that I'm more inclined to direct limited review bandwidth to incremental improvements in that feature than to something large/new.
> This is not to say that this feature isn't useful / shouldn't be merged, just a bit of explanation as to why there might be less activity here than you would hope.
>
> Similarly, multiple aggregations are an often-requested feature. However, fundamentally, this is going to be a fairly large investment (I think we'd need to combine the unsupported operation checker and the query planner and also create a high-performance (i.e. whole stage code-gened) aggregation operator that understands negation).
>
> Thanks again for starting the discussion, and looking forward to hearing about what features are most requested!
>
> On Tue, Oct 30, 2018 at 12:23 AM Jungtaek Lim <kabh...@gmail.com> wrote:
>
>> Adding more: again, it doesn't mean they're feasible to do. Just a kind of brainstorming.
>>
>> * SPARK-20568: Delete files after processing in structured streaming
>>   * There hasn't been consensus regarding supporting this: there were voices for both YES and NO.
>> * Support multiple levels of aggregations in structured streaming
>>   * There are plenty of questions on SO regarding this. While I don't think it makes sense on structured streaming if it requires additional shuffle, there might be another case: group by keys, apply aggregation, apply aggregation on aggregated result (grouped keys don't change).
>>
>> On Mon, Oct 22, 2018 at 12:25 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>
>>> Yeah, the main intention of this thread is to collect interest in a possible feature list for structured streaming. From what I can see in the Spark community, most of the discussions as well as contributions are for SQL, and I'd like to see similar activity / effort on structured streaming.
>>> (Unfortunately, there's less effort put into reviewing others' work, design docs as well as pull requests; most of the effort appears to be spent on people's own work.)
>>>
>>> I respect the role of PMC members, so the final decision would be up to them, but contributors as well as end users could show their interest and discuss requirements in an SPIP, which could be good background for persuading PMC members.
>>>
>>> Before going deep, I guess we could use this thread to discuss possible use cases, and if we would like to move forward to an individual thread, we could initiate (or resurrect) its discussion thread.
>>>
>>> For queryable state, there seems to be no workaround in Spark that provides something similar, especially as state gets bigger. I may have some concerns about the details, but I'll add my thoughts on the discussion thread.
>>>
>>> - Jungtaek Lim (HeartSaVioR)
>>>
>>> On Mon, Oct 22, 2018 at 1:15 AM, Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> I tried to start the discussion on the dev list a long time ago. I enumerated some use cases as Michael proposed here
>>>> <http://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCACTd3c_snT=y4r9vod+ebty1fdgtqsxzgjgubox-k8araur...@mail.gmail.com%3E>.
>>>> The discussion didn't go further.
>>>>
>>>> If people find it useful, we should start discussing it in detail again.
>>>>
>>>> Stavros
>>>>
>>>> On Sun, Oct 21, 2018 at 4:54 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>>
>>>>> Stavros, if my memory is right, you were trying to drive queryable state, right?
>>>>>
>>>>> Could you summarize the progress and the reason why it stopped?
>>>>>
>>>>> On Sun, Oct 21, 2018 at 10:27 PM, Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> wrote:
>>>>>
>>>>>> That is a very interesting list, thanks. I could create a design doc as a starting point for discussion if this is a feature we would like to have.
>>>>>>
>>>>>> Regards,
>>>>>> Stavros
>>>>>>
>>>>>> On Sun, Oct 21, 2018 at 3:04 PM, JackyLee <qcsd2...@163.com> wrote:
>>>>>>
>>>>>>> Thanks for raising them.
>>>>>>>
>>>>>>> FYI, I believe this open issue could also be considered:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-24630
>>>>>>>
>>>>>>> A new ability to express Structured Streaming in pure SQL.