+1 On 2025/05/29 16:25:19 Xiao Li wrote: > +1 > > Yuming Wang <yumw...@apache.org> wrote on Thu, May 29, 2025 at 02:22: > > > +1. > > > > On Thu, May 29, 2025 at 3:36 PM DB Tsai <dbt...@dbtsai.com> wrote: > > > >> +1 > >> Sent from my iPhone > >> > >> On May 29, 2025, at 12:15 AM, John Zhuge <jzh...@apache.org> wrote: > >> > >> > >> +1 Nice feature > >> > >> On Wed, May 28, 2025 at 9:53 PM Yuanjian Li <xyliyuanj...@gmail.com> > >> wrote: > >> > >>> +1 > >>> > >>> Kent Yao <y...@apache.org> wrote on Wed, May 28, 2025 at 19:31: > >>> > >>>> +1, LGTM. > >>>> > >>>> Kent > >>>> > >>>> On Thu, May 29, 2025, Chao Sun <sunc...@apache.org> wrote: > >>>> > >>>>> +1. Super excited by this initiative! > >>>>> > >>>>> On Wed, May 28, 2025 at 1:54 PM Yanbo Liang <yblia...@gmail.com> > >>>>> wrote: > >>>>> > >>>>>> +1 > >>>>>> > >>>>>> On Wed, May 28, 2025 at 12:34 PM huaxin gao <huaxin.ga...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> +1 > >>>>>>> By unifying batch and low-latency streaming in Spark, we can > >>>>>>> eliminate the need for separate streaming engines, reducing system > >>>>>>> complexity and operational cost. Excited to see this direction! > >>>>>>> > >>>>>>> On Wed, May 28, 2025 at 9:08 AM Mich Talebzadeh < > >>>>>>> mich.talebza...@gmail.com> wrote: > >>>>>>> > >>>>>>>> Hi, > >>>>>>>> > >>>>>>>> My point about "in real time application or data, there is nothing > >>>>>>>> as an answer which is supposed to be late and correct. The timeliness is > >>>>>>>> part of the application. If I get the right answer too slowly it becomes > >>>>>>>> useless or wrong" is actually fundamental to *why* we need this > >>>>>>>> Spark Structured Streaming proposal. > >>>>>>>> > >>>>>>>> The proposal is precisely about enabling Spark to power > >>>>>>>> applications where, as I define it, the *timeliness* of the answer > >>>>>>>> is as critical as its *correctness*.
Spark's current streaming > >>>>>>>> engine, primarily operating on micro-batches, often delivers results that > >>>>>>>> are technically "correct" but arrive too late to be truly useful for > >>>>>>>> certain high-stakes, real-time scenarios. This makes them "useless or > >>>>>>>> wrong" in a practical, business-critical sense. > >>>>>>>> > >>>>>>>> For example, in *real-time fraud detection* and in *high-frequency > >>>>>>>> trading,* market data or trade execution commands must be > >>>>>>>> delivered with minimal latency. Even a slight delay can mean missed > >>>>>>>> opportunities or significant financial losses, making a "correct" price > >>>>>>>> update useless if it's not instantaneous. This proposal is about making Spark suitable for these demanding > >>>>>>>> use cases, where a "late but correct" answer is simply not good enough. As > >>>>>>>> a corollary, it is a fundamental concept, so it has to be treated as such, not > >>>>>>>> as a comment in the SPIP. > >>>>>>>> > >>>>>>>> Hope this clarifies the connection in practical terms. > >>>>>>>> Dr Mich Talebzadeh, > >>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >>>>>>>> > >>>>>>>> view my Linkedin profile > >>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Wed, 28 May 2025 at 16:32, Denny Lee <denny.g....@gmail.com> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hey Mich, > >>>>>>>>> > >>>>>>>>> Sorry, I may be missing something here, but what does your > >>>>>>>>> definition have to do with the SPIP? Perhaps add comments directly > >>>>>>>>> to the SPIP to provide context, as the code snippet below is a direct copy > >>>>>>>>> from the SPIP itself.
> >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Denny > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, May 28, 2025 at 06:48 Mich Talebzadeh < > >>>>>>>>> mich.talebza...@gmail.com> wrote: > >>>>>>>>> > >>>>>>>>>> just to add > >>>>>>>>>> > >>>>>>>>>> A stronger definition of real time. The engineering definition of > >>>>>>>>>> real time is roughly "fast enough to be interactive". > >>>>>>>>>> > >>>>>>>>>> However, I put a stronger definition. In real time application or > >>>>>>>>>> data, there is nothing as an answer which is supposed to be late and > >>>>>>>>>> correct. The timeliness is part of the application. If I get the right > >>>>>>>>>> answer too slowly, it becomes useless or wrong. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Dr Mich Talebzadeh, > >>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >>>>>>>>>> > >>>>>>>>>> view my Linkedin profile > >>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Wed, 28 May 2025 at 11:10, Mich Talebzadeh < > >>>>>>>>>> mich.talebza...@gmail.com> wrote: > >>>>>>>>>> > >>>>>>>>>>> The current limitations in SSS come from micro-batching. If you > >>>>>>>>>>> are going to reduce micro-batching, this reduction must be balanced against > >>>>>>>>>>> the available processing capacity of the cluster to prevent backpressure > >>>>>>>>>>> and instability.
In the case of Continuous Processing mode, the choice of a > >>>>>>>>>>> specific continuous trigger with a desired checkpoint interval, quote: > >>>>>>>>>>> > >>>>>>>>>>> " > >>>>>>>>>>> df.writeStream > >>>>>>>>>>> .format("...") > >>>>>>>>>>> .option("...") > >>>>>>>>>>> .trigger(Trigger.RealTime("300 Seconds")) // new trigger > >>>>>>>>>>> type to enable real-time Mode > >>>>>>>>>>> .start() > >>>>>>>>>>> This Trigger.RealTime signals that the query should run in the > >>>>>>>>>>> new ultra low-latency execution mode. A time interval can also be > >>>>>>>>>>> specified, e.g. "300 Seconds", to indicate how long each micro-batch should > >>>>>>>>>>> run for. > >>>>>>>>>>> " > >>>>>>>>>>> > >>>>>>>>>>> will inevitably depend on many factors. Not that simple. > >>>>>>>>>>> HTH > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Dr Mich Talebzadeh, > >>>>>>>>>>> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR > >>>>>>>>>>> > >>>>>>>>>>> view my Linkedin profile > >>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Wed, 28 May 2025 at 05:13, Jerry Peng < > >>>>>>>>>>> jerry.boyang.p...@gmail.com> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi all, > >>>>>>>>>>>> > >>>>>>>>>>>> I want to start a discussion thread for the SPIP titled > >>>>>>>>>>>> “Real-Time Mode in Apache Spark Structured Streaming” that I've been > >>>>>>>>>>>> working on with Siying Dong, Indrajit Roy, Chao Sun, Jungtaek Lim, and > >>>>>>>>>>>> Michael Armbrust: [JIRA > >>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-52330>] [Doc > >>>>>>>>>>>> <https://docs.google.com/document/d/1CvJvtlTGP6TwQIT4kW6GFT1JbdziAYOBvt60ybb7Dw8/edit?usp=sharing> > >>>>>>>>>>>> ].
> >>>>>>>>>>>> > >>>>>>>>>>>> The SPIP proposes a new execution mode called “Real-time Mode” > >>>>>>>>>>>> in Spark Structured Streaming that significantly lowers > >>>>>>>>>>>> end-to-end latency > >>>>>>>>>>>> for processing streams of data. > >>>>>>>>>>>> > >>>>>>>>>>>> A key principle of this proposal is compatibility. Our goal is > >>>>>>>>>>>> to make Spark capable of handling streaming jobs that need > >>>>>>>>>>>> results almost > >>>>>>>>>>>> immediately (within O(100) milliseconds). We want to achieve > >>>>>>>>>>>> this without > >>>>>>>>>>>> changing the high-level DataFrame/Dataset API that users already > >>>>>>>>>>>> use – so > >>>>>>>>>>>> existing streaming queries can run in this new ultra-low-latency > >>>>>>>>>>>> mode by > >>>>>>>>>>>> simply turning it on, without rewriting their logic. > >>>>>>>>>>>> > >>>>>>>>>>>> In short, we’re trying to enable Spark to power real-time > >>>>>>>>>>>> applications (like instant anomaly alerts or live > >>>>>>>>>>>> personalization) that > >>>>>>>>>>>> today cannot meet their latency requirements with Spark’s > >>>>>>>>>>>> current streaming > >>>>>>>>>>>> engine. > >>>>>>>>>>>> > >>>>>>>>>>>> We'd greatly appreciate your feedback, thoughts, and > >>>>>>>>>>>> suggestions on this approach! > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>> > >>>>>> -- > >>>>>> Best, > >>>>>> Yanbo > >>>>>> > >>>>> > >> > >> -- > >> John Zhuge > >> > >> >
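[Editor's note] As a rough illustration of the latency gap discussed above: a toy model, plain Python with no Spark dependency and entirely hypothetical numbers, of why a micro-batch boundary puts a floor on end-to-end latency that per-record execution avoids. This is a sketch of the general argument, not of the SPIP's actual implementation.

```python
import math

# Toy latency model (no Spark required; all numbers are hypothetical).
# In micro-batch mode, a record arriving at time t is only emitted when
# its batch closes; in a per-record ("real-time") mode it is emitted
# after a small fixed processing overhead.

BATCH_INTERVAL_MS = 1000   # hypothetical micro-batch trigger interval
PER_RECORD_COST_MS = 5     # hypothetical per-record processing overhead

def microbatch_latency(arrival_ms: float) -> float:
    """Latency = wait until the next batch boundary after arrival."""
    batch_end = math.ceil((arrival_ms + 1e-9) / BATCH_INTERVAL_MS) * BATCH_INTERVAL_MS
    return batch_end - arrival_ms

def per_record_latency(arrival_ms: float) -> float:
    """Latency = fixed processing overhead, independent of batch boundaries."""
    return PER_RECORD_COST_MS

arrivals = list(range(0, 10_000, 100))   # one event every 100 ms for 10 s
avg_mb = sum(microbatch_latency(t) for t in arrivals) / len(arrivals)
avg_rt = sum(per_record_latency(t) for t in arrivals) / len(arrivals)
print(f"avg micro-batch latency: {avg_mb:.0f} ms")   # dominated by the batch interval
print(f"avg per-record latency:  {avg_rt:.0f} ms")
```

Under these toy numbers the trigger interval, not the processing work itself, dominates latency; that is the floor the proposed real-time mode aims to remove in order to reach O(100) ms results.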
--------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org