+1 On 2025/05/29 16:25:19 Xiao Li wrote: > +1 > > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22: > > > +1. > > > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote: > > > >> +1 > >> Sent from my iPhone > >> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote: > >> > >> > >> +1 Nice feature > >> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> > >> wrote: > >> > >>> +1 > >>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31: > >>> > >>>> +1, LGTM. > >>>> > >>>> Kent > >>>> > >>>> On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote: > >>>> > >>>>> +1. Super excited by this initiative! > >>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> > >>>>> wrote: > >>>>> > >>>>>> +1 > >>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> +1 > >>>>>>> By unifying batch and low-latency streaming in Spark, we can > >>>>>>> eliminate the need for separate streaming engines, reducing system > >>>>>>> complexity and operational cost. Excited to see this direction! > >>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh < > >>>>>>> mich.talebza...@gmail.com> wrote: > >>>>>>> > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> My point about "in real time application or data, there is nothing > >>>>>>>> as an answer which is supposed to be late and correct. The timeliness is > >>>>>>>> part of the application. If I get the right answer too slowly it becomes > >>>>>>>> useless or wrong" is actually fundamental to *why* we need this > >>>>>>>> Spark Structured Streaming proposal. > >>>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power > >>>>>>>> applications where, as I define it, the *timeliness* of the answer > >>>>>>>> is as critical as its *correctness*.
Spark's current streaming > >>>>>>>> engine, primarily operating on micro-batches, often delivers results that > >>>>>>>> are technically "correct" but arrive too late to be truly useful for > >>>>>>>> certain high-stakes, real-time scenarios. This makes them "useless or > >>>>>>>> wrong" in a practical, business-critical sense. > >>>>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency > >>>>>>>> trading,* market data or trade execution commands must be > >>>>>>>> delivered with minimal latency. Even a slight delay can mean missed > >>>>>>>> opportunities or significant financial losses, making a "correct" price > >>>>>>>> update useless if it's not instantaneous. This proposal is about making Spark suitable for these demanding > >>>>>>>> use cases, where a "late but correct" answer is simply not good enough. As > >>>>>>>> a corollary, it is a fundamental concept, so it has to be treated as such, not > >>>>>>>> as a comment in the SPIP. > >>>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms. > >>>>>>>> Dr Mich Talebzadeh, > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >>>>>>>> > >>>>>>>> view my Linkedin profile > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hey Mich, > >>>>>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your > >>>>>>>>> definition have to do with the SPIP? Perhaps add comments directly > >>>>>>>>> to the SPIP to provide context, as the code snippet below is a direct copy > >>>>>>>>> from the SPIP itself.
> >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Denny > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh < > >>>>>>>>> mich.talebza...@gmail.com> wrote: > >>>>>>>>> > >>>>>>>>>> just to add > >>>>>>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering definition of > >>>>>>>>>> real time is roughly "fast enough to be interactive". > >>>>>>>>>> > >>>>>>>>>> However, I put a stronger definition. In real time application or > >>>>>>>>>> data, there is nothing as an answer which is supposed to be late and > >>>>>>>>>> correct. The timeliness is part of the application. If I get the right > >>>>>>>>>> answer too slowly, it becomes useless or wrong. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh, > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >>>>>>>>>> > >>>>>>>>>> view my Linkedin profile > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh < > >>>>>>>>>> mich.talebza...@gmail.com> wrote: > >>>>>>>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you > >>>>>>>>>>> are going to reduce micro-batching, this reduction must be balanced against > >>>>>>>>>>> the available processing capacity of the cluster to prevent backpressure > >>>>>>>>>>> and instability.
In the case of Continuous Processing mode, the choice of a > >>>>>>>>>>> specific continuous trigger with a desired checkpoint interval, quote: > >>>>>>>>>>> > >>>>>>>>>>> " > >>>>>>>>>>> df.writeStream > >>>>>>>>>>> .format("...") > >>>>>>>>>>> .option("...") > >>>>>>>>>>> .trigger(Trigger.RealTime("300 Seconds")) // new trigger > >>>>>>>>>>> type to enable real-time Mode > >>>>>>>>>>> .start() > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the > >>>>>>>>>>> new ultra low-latency execution mode. A time interval can also be > >>>>>>>>>>> specified, e.g. "300 Seconds", to indicate how long each micro-batch should > >>>>>>>>>>> run for. > >>>>>>>>>>> " > >>>>>>>>>>> > >>>>>>>>>>> will inevitably depend on many factors. Not that simple. > >>>>>>>>>>> HTH > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh, > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >>>>>>>>>>> > >>>>>>>>>>> view my Linkedin profile > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng < > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi all, > >>>>>>>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured Streaming” that I've been > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and > >>>>>>>>>>>> Michael Armbrust: [JIRA > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc > >>>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing> > >>>>>>>>>>>> ].
> >>>>>>>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode” > >>>>>>>>>>>> in Spark Structured Streaming that significantly lowers > >>>>>>>>>>>> end-to-end latency > >>>>>>>>>>>> for processing streams of data. > >>>>>>>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is > >>>>>>>>>>>> to make Spark capable of handling streaming jobs that need > >>>>>>>>>>>> results almost > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want to achieve > >>>>>>>>>>>> this without > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that users already > >>>>>>>>>>>> use – so > >>>>>>>>>>>> existing streaming queries can run in this new ultra-low-latency > >>>>>>>>>>>> mode by > >>>>>>>>>>>> simply turning it on, without rewriting their logic. > >>>>>>>>>>>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power real-time > >>>>>>>>>>>> applications (like instant anomaly alerts or live > >>>>>>>>>>>> personalization) that > >>>>>>>>>>>> today cannot meet their latency requirements with Spark’s > >>>>>>>>>>>> current streaming > >>>>>>>>>>>> engine. > >>>>>>>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and > >>>>>>>>>>>> suggestions on this approach! > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>> > >>>>>> -- > >>>>>> Best, > >>>>>> Yanbo > >>>>>> > >>>>> > >> > >> -- > >> John Zhuge > >> > >> >
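[Editor's note] As a rough illustration of the latency gap discussed above: a toy model, plain Python with no Spark dependency and entirely hypothetical numbers, of why a micro-batch boundary puts a floor on end-to-end latency that per-record execution avoids. This is a sketch of the general argument, not of the SPIP's actual implementation.

```python
import math

# Toy latency model (no Spark required; all numbers are hypothetical).
# In micro-batch mode, a record arriving at time t is only emitted when
# its batch closes; in a per-record ("real-time") mode it is emitted
# after a small fixed processing overhead.

BATCH_INTERVAL_MS = 1000   # hypothetical micro-batch trigger interval
PER_RECORD_COST_MS = 5     # hypothetical per-record processing overhead

def microbatch_latency(arrival_ms: float) -> float:
    """Latency = wait until the next batch boundary after arrival."""
    batch_end = math.ceil((arrival_ms + 1e-9) / BATCH_INTERVAL_MS) * BATCH_INTERVAL_MS
    return batch_end - arrival_ms

def per_record_latency(arrival_ms: float) -> float:
    """Latency = fixed processing overhead, independent of batch boundaries."""
    return PER_RECORD_COST_MS

arrivals = list(range(0, 10_000, 100))   # one event every 100 ms for 10 s
avg_mb = sum(microbatch_latency(t) for t in arrivals) / len(arrivals)
avg_rt = sum(per_record_latency(t) for t in arrivals) / len(arrivals)
print(f"avg micro-batch latency: {avg_mb:.0f} ms")   # dominated by the batch interval
print(f"avg per-record latency:  {avg_rt:.0f} ms")
```

Under these toy numbers the trigger interval, not the processing work itself, dominates latency; that is the floor the proposed real-time mode aims to remove in order to reach O(100) ms results.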
--------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org