+1 This is exciting. I agree with Jerry that this SPIP and continuous processing are orthogonal. This SPIP itself would be a great improvement and impact most Structured Streaming users.

Best Regards,
Shixiong

On Wed, Nov 30, 2022 at 6:57 AM Mridul Muralidharan <mri...@gmail.com> wrote:

>
> Thanks for all the clarifications and details Jerry, Jungtaek :-)
> This looks like an exciting improvement to Structured Streaming - looking forward to it becoming part of Apache Spark !
>
> Regards,
> Mridul
>
> On Mon, Nov 28, 2022 at 8:40 PM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>
>> Hi all,
>>
>> I will add my two cents. Improving the Microbatch execution engine does not prevent us from working/improving on the continuous execution engine in the future. These are orthogonal issues. This new mode I am proposing in the microbatch execution engine intends to lower latency of this execution engine that most people use today. We can view it as an incremental improvement on the existing engine. I see the continuous execution engine as a partially completed re-write of spark streaming and may serve as the "future" engine powering Spark Streaming. Improving the "current" engine does not mean we cannot work on a "future" engine. These two are not mutually exclusive. I would like to focus the discussion on the merits of this feature in regards to the current micro-batch execution engine and not a discussion on the future of continuous execution engine.
>>
>> Best,
>>
>> Jerry
>>
>> On Wed, Nov 23, 2022 at 3:17 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> Hi Mridul,
>>>
>>> I'd like to make clear to avoid any misunderstanding - the decision was not led by me. (I'm just one of the engineers in the team. Not even TL.) As you see the direction, there was an internal consensus to not revisit the continuous mode. There are various reasons, which I think we know already. You seem to remember I have raised concerns about continuous mode, but have you indicated that it was even over 2 years ago? I still see no traction around the project.
>>> The main reason I abandoned the discussion was due to promising effort on integrating push based shuffle into continuous mode to achieve shuffle, but no effort has been made so far.
>>>
>>> The goal of this SPIP is to have an alternative approach dealing with the same workload, given that we no longer have confidence of success of continuous mode. But I also want to make clear that deprecating and eventually retiring continuous mode is not a goal of this project. If that happens eventually, that would be a side-effect. Someone may have concerns that we have two different projects aiming for similar thing, but I'd rather see both projects having competition. Anyone willing to improve continuous mode can start making the effort right now. This SPIP does not block it.
>>>
>>> On Wed, Nov 23, 2022 at 5:29 PM Mridul Muralidharan <mri...@gmail.com> wrote:
>>>
>>>> Hi Jungtaek,
>>>>
>>>> Given the goal of the SPIP is reducing latency for stateless apps, and should reasonably fit continuous mode design goals, it feels odd to not support it in the proposal.
>>>>
>>>> I know you have raised concerns about continuous mode in past as well in dev@ list, and we are further ignoring it in this proposal (and possibly other enhancements in past few releases).
>>>>
>>>> Do you want to revisit the discussion to support it and propose a vote on that ? And move it to deprecated ?
>>>>
>>>> I am much more comfortable not supporting this SPIP for CM if it was deprecated.
>>>>
>>>> Thoughts ?
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> On Wed, Nov 23, 2022 at 1:16 AM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>
>>>>> Jungtaek,
>>>>>
>>>>> Thanks for taking up the role to shepherd this SPIP! Thank you for also chiming in on your thoughts concerning the continuous mode!
>>>>>
>>>>> Best,
>>>>>
>>>>> Jerry
>>>>>
>>>>> On Tue, Nov 22, 2022 at 5:57 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>>
>>>>>> Just FYI, I'm shepherding this SPIP project.
>>>>>>
>>>>>> I think the major meta question would be, "why don't we spend effort on continuous mode rather than initiating another feature aiming for the same workload?". Jerry already updated the doc to answer the question, but I can also share my thoughts about it.
>>>>>>
>>>>>> I feel like the current "continuous mode" is a niche solution. (It's not to blame. If you have to deal with such workload but can't rewrite the underlying engine from scratch, then there are really few options.) Since the implementation went with a workaround to implement which the architecture does not support natively e.g. distributed snapshot, it gets quite tricky on maintaining and expanding the project. It also requires 3rd parties to implement a separate source and sink implementation, which I'm not sure how many 3rd parties actually followed so far.
>>>>>>
>>>>>> Eventually, "continuous mode" becomes an area no one in the active community knows the details and has willingness to maintain. I wouldn't say we are confident to remove the tag on "experimental", although the feature has been shipped for years. It was introduced in Spark 2.3, surprising enough?
>>>>>>
>>>>>> We went back and thought about the approach from scratch. Jerry came up with the idea which leverages existing microbatch execution, hence relatively stable and no need to require 3rd parties to support another mode. It adds complexity against microbatch execution but it's a lot less complicated compared to the existing continuous mode. Definitely quite less than creating a new record-to-record engine from scratch.
>>>>>>
>>>>>> That said, we want to propose and move forward with the new approach.
>>>>>>
>>>>>> ps. Eventually we could probably discuss retiring continuous mode if the new approach gets accepted and eventually considered as a stable one after several minor releases. That's just me.
>>>>>>
>>>>>> On Wed, Nov 23, 2022 at 5:16 AM Jerry Peng <jerry.boyang.p...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I would like to start the discussion for a SPIP, Asynchronous Offset Management in Structured Streaming. The high level summary of the SPIP is that currently in Structured Streaming we perform a couple of offset management operations for progress tracking purposes synchronously on the critical path which can contribute significantly to processing latency. If we were to make these operations asynchronous and less frequent we can dramatically improve latency for certain types of workloads.
>>>>>>>
>>>>>>> I have put together a SPIP to implement such a mechanism. Please take a look!
>>>>>>>
>>>>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-39591
>>>>>>>
>>>>>>> SPIP doc:
>>>>>>> https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jerry