Actually, I want to reset my counters every 24 hours, so shouldn't the window and slide intervals both be 24 hours? If so, how do I send updates to the real-time dashboard every second? Isn't the trigger interval the same as the slide interval?
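For concreteness, a minimal sketch of the setup being asked about (everything here is assumed, not from the thread: a hypothetical `transactions` stream parsed from JSON with an event-time column `ts` and a numeric `amount`, plus placeholder Kafka settings). It uses the Spark 2.1-era ProcessingTime trigger:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.ProcessingTime
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("daily-counters").getOrCreate()
import spark.implicits._

// Hypothetical message schema: an event time plus a numeric field to sum.
val schema = new StructType()
  .add("ts", TimestampType)
  .add("amount", DoubleType)

// Placeholder source; a Kafka source is assumed purely for illustration.
val transactions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumed endpoint
  .option("subscribe", "transactions")                   // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("t"))
  .select($"t.ts", $"t.amount")

// A tumbling window (window duration == slide duration == 24 hours):
// each event falls into exactly one daily bucket, so the counters
// naturally "reset" every day.
val dailyCounters = transactions
  .groupBy(window($"ts", "24 hours"))
  .agg(count(lit(1)).as("txns"), sum($"amount").as("totalAmount"))

// The trigger is set on the sink and is independent of the window/slide:
// a 1-second processing-time trigger refreshes the output every second.
val query = dailyCounters.writeStream
  .outputMode("complete")
  .format("console")                   // swap in the real dashboard sink
  .trigger(ProcessingTime("1 second"))
  .start()

In other words, the trigger interval is a processing-time setting on the sink, separate from the window and slide durations, so a 24-hour tumbling window can still push refreshed counts to a dashboard every second.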
On Wed, Apr 5, 2017 at 7:17 AM, kant kodali <kanth...@gmail.com> wrote:
> One of our requirements is that we need to maintain a counter for a 24-hour
> period, such as the number of transactions processed in the past 24 hours.
> After each day these counters can start from zero again, so we just need to
> maintain a running count during the 24-hour period. Also, since we want to
> show these stats on a real-time dashboard, we want those charts to be
> updated every second, so I guess this would translate to a window interval
> of 24 hours and a slide/trigger interval of 1 second. First of all, is this
> okay?
>
> Secondly, we push about 5000 JSON messages/sec into Spark Streaming, and
> each message is about 2KB. We just need to parse those messages and compute,
> say, the sum of certain fields from each message, and the result needs to be
> stored somewhere such that each run takes its result and adds it to the
> previous run's. This state has to be maintained for 24 hours, and then we
> can reset it back to zero. So any advice on how to best approach this
> scenario?
>
> Thanks much!
>
> On Wed, Apr 5, 2017 at 12:39 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Hi!
>>
>> I am talking about "stateful operations like aggregations". Does this
>> happen on heap or off heap by default? I came across an article where I
>> saw that both on and off heap are possible, but I am not sure what happens
>> by default and when Spark or Spark Structured Streaming decides to store
>> off heap.
>>
>> I don't even know what mapGroupsWithState does, since it's not part of
>> Spark 2.1, which is what we currently use. Any pointers would be great.
>>
>> Thanks!
>>
>> On Tue, Apr 4, 2017 at 5:34 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Are you referring to the memory usage of stateful operations like
>>> aggregations, or the new mapGroupsWithState?
>>> The current implementation of the internal state store (which maintains
>>> the stateful aggregates) keeps all the data in the memory of the
>>> executor. It does use an HDFS-compatible file system for checkpointing,
>>> but as of now it keeps all the data in executor memory. This is
>>> something we will improve in the future.
>>>
>>> That said, you can enable watermarking in your query, which would
>>> automatically clear old, unnecessary state, thus limiting the total
>>> memory used for stateful operations.
>>> Furthermore, you can also monitor the size of the state and get alerts
>>> if the state is growing too large.
>>>
>>> Read more in the programming guide.
>>> Watermarking - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
>>> Monitoring - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
>>>
>>> In case you were referring to something else, please give us more
>>> context and details - what query, and what symptoms are you observing?
>>>
>>> On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> Why do we ever run out of memory in Spark Structured Streaming,
>>>> especially when memory can always spill to disk? Until the disk is full
>>>> we shouldn't be out of memory, should we? Sure, thrashing will happen
>>>> more frequently and degrade performance, but do we ever run out of
>>>> memory even in the case of maintaining state for 6 months or a year?
>>>>
>>>> Thanks!
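To make the watermarking and monitoring suggestions above concrete, here is a rough sketch (all names illustrative; it reuses the hypothetical `transactions` stream and imports from the first sketch, and note that the `update` output mode is only exposed in Spark 2.1.1 and later):

// A watermark lets Spark drop state for day-windows that can no longer
// receive late data, bounding the total memory used by the daily counters.
val watermarked = transactions
  .withWatermark("ts", "25 hours")     // a little slack beyond the 24h window
  .groupBy(window($"ts", "24 hours"))
  .agg(sum($"amount").as("totalAmount"))

val query = watermarked.writeStream
  .outputMode("update")                // update mode, so expired state is purged
  .format("console")
  .trigger(ProcessingTime("1 second"))
  .start()

// Monitoring: lastProgress exposes per-operator state metrics, which can
// feed an alert if the state grows unexpectedly large.
val p = query.lastProgress
if (p != null) {
  p.stateOperators.foreach { s =>
    println(s"state rows = ${s.numRowsTotal}, memory = ${s.memoryUsedBytes} bytes")
  }
}

The watermark threshold only needs to be long enough that a 24-hour window is kept until it can no longer receive late events; in production the progress metrics would be polled or captured via a StreamingQueryListener rather than printed.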