The thread owner's question is:

Q1. What will happen if a Spark Streaming job has a batch duration of 60
sec and the processing time of the complete pipeline is greater than 60
sec?


This basically means that you will gradually build a backlog, and
regardless of whether you blow up the buffer or not, your data analysis
will have a serious flaw!

In the example below I have a batch interval of 2 seconds, streaming in
10,000 rows/events. The window length is 4 sec (twice the batch interval)
and the sliding interval is set at 2 sec. I have deliberately set the
streaming volume high in this case.
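
Here is a minimal sketch of that setup. The socket source, host/port, app
name and checkpoint path are my assumptions for illustration, not the
actual job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedCountSketch")
    val ssc = new StreamingContext(conf, Seconds(2)) // batch interval = 2 sec
    ssc.checkpoint("/tmp/checkpoint")                // windowed ops need this

    // Assumed source: a socket stream; the original used a high-volume feed.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Window length = 4 sec (twice the batch interval), slide = 2 sec.
    val counts = lines.countByWindow(Seconds(4), Seconds(2))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}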

As you can see from the Streaming graphs, there are serious issues here,
with an average of 213 events/sec and a scheduling delay of 5 seconds.

[image: Spark Streaming UI graphs]

Technically the app may not crash, but its business value is practically
nil. If I were doing this for some form of complex event processing or
fraud detection, I would have to look for a new job :(

So monitor the processing to make sure that it is running smoothly and
ensure that there is no backlog. Also look at your memory usage. For
example, if you are receiving a single stream at 100 MB/second and you want
to do 60-second batches (window length), then you will need a buffer of
100 * 60 MB = 6,000 MB, or 6 GB, at the very least. Note that if you are
using a single receiver, then all the data arrives at a single Spark worker
machine, so that machine needs at least that much memory available. Add to
that the other overheads of running Spark, etc. Calculate the memory usage
accordingly, then double or triple the number to be on the safe side, and
keep monitoring the processing.
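
As a rough illustration of keeping that buffer from growing without bound,
Spark's receiver-based streaming has two relevant knobs; the rate-cap value
below is an assumption, not a recommendation:

import org.apache.spark.SparkConf

// Back-of-the-envelope sizing from the paragraph above:
//   100 MB/sec * 60 sec window = 6,000 MB ~= 6 GB buffered on the
//   receiver's worker before Spark's own overheads; double or triple
//   that for safety.
val conf = new SparkConf()
  .setAppName("StreamSizingSketch") // assumed app name
  // Let Spark throttle ingest dynamically when batches fall behind
  // (available from Spark 1.5 onwards).
  .set("spark.streaming.backpressure.enabled", "true")
  // Hard cap on records/sec per receiver; 10,000 is an illustrative value.
  .set("spark.streaming.receiver.maxRate", "10000")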

HTH



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 August 2016 at 09:53, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> Thanks for the explanation, but it does not prove Spark will OOM at some
> point. You assume there is enough data to store, but there could be none.
>
> Jacek
>
> On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:
>
>> Assume the batch interval is 10 seconds and batch processing time is 30
>> seconds. So while Spark Streaming is processing the first batch, the
>> receiver will have a backlog of 20 seconds worth of data. By the time Spark
>> Streaming finishes batch #2, the receiver will have 40 seconds worth of
>> data in its memory buffer. This backlog will keep growing as time passes
>> assuming data streams in consistently at the same rate.
>>
>> Also keep in mind that windowing operations on a DStream implicitly
>> persist every RDD in a DStream in memory.
>>
>> Mohammed
>>
>> -----Original Message-----
>> From: Jacek Laskowski [mailto:ja...@japila.pl]
>> Sent: Thursday, August 4, 2016 4:25 PM
>> To: Mohammed Guller
>> Cc: Saurav Sinha; user
>> Subject: Re: Explanation regarding Spark Streaming
>>
>> On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com>
>> wrote:
>> > and eventually you will run out of memory.
>>
>> Why? Mind elaborating?
>>
>> Jacek
>>
>
