Hi Jacek,

Yes, I am assuming that data streams in consistently at the same rate (for example, 100 MB/s).
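To put rough numbers on that assumption, here is a minimal sketch of the backlog arithmetic. It uses the 10-second batch interval and 30-second processing time from my earlier message quoted below, together with the 100 MB/s rate. This is plain Python, not Spark code. The function names and the simplified model (the backlog grows by processing time minus batch interval for every completed batch) are my own assumptions for illustration only.

```python
def backlog_seconds(batches_done, batch_interval_s=10, processing_time_s=30):
    """Seconds' worth of received data that has piled up after `batches_done` batches.

    Simplified model (an assumption): each completed batch adds
    (processing time - batch interval) seconds of unprocessed data,
    and nothing piles up if processing keeps pace with the interval.
    """
    excess_per_batch = max(0, processing_time_s - batch_interval_s)
    return batches_done * excess_per_batch


def backlog_mb(batches_done, rate_mb_per_s=100, **timing):
    """Backlog expressed in MB for a constant ingest rate (hypothetical helper)."""
    return backlog_seconds(batches_done, **timing) * rate_mb_per_s


if __name__ == "__main__":
    for k in range(1, 4):
        print(f"after batch #{k}: {backlog_seconds(k)} s of data, {backlog_mb(k)} MB")
```

Under these assumptions the backlog grows by 20 seconds' worth of data (about 2,000 MB at 100 MB/s) per completed batch, and it only stops growing if the processing time drops to the batch interval or below.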
BTW, even if the persistence level for streaming data is set to MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will spill to disk. That will make the application performance even worse.

Mohammed

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Saturday, August 6, 2016 1:54 AM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: RE: Explanation regarding Spark Streaming

Hi,

Thanks for the explanation, but it does not prove Spark will OOM at some point. You assume there is enough data to store, but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:

Assume the batch interval is 10 seconds and the batch processing time is 30 seconds. So while Spark Streaming is processing the first batch, the receiver will have a backlog of 20 seconds' worth of data. By the time Spark Streaming finishes batch #2, the receiver will have 40 seconds' worth of data in its memory buffer. This backlog will keep growing as time passes, assuming data streams in consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist every RDD in the DStream in memory.

Mohammed

-----Original Message-----
From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek