One of our requirements is that we need to maintain counters over a 24-hour period, such as the number of transactions processed in the past 24 hours. After each day these counters can start from zero again, so we just need to maintain a running count during the 24-hour period. Also, since we want to show these stats on a real-time dashboard, we want the charts to be updated every second, so I guess this would translate to a window interval of 24 hours and a slide/trigger interval of 1 second. First of all, is this okay?
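To make the window/slide arithmetic concrete, here is a minimal plain-Python sketch (standard library only, not the Spark API; the names `windows_per_event` and `DailyCounter` are hypothetical and exist only for illustration). It assumes the 1-second slide is read literally as an event-time sliding window, in which case every event falls into window/slide overlapping windows. It also assumes events arrive in time order and that "a day" means a UTC calendar day; a counter that resets daily then behaves like a single non-overlapping one-day bucket that is simply updated as events arrive.

```python
from datetime import datetime, timedelta, timezone


def windows_per_event(window: timedelta, slide: timedelta) -> int:
    """Number of overlapping sliding windows that contain any single event.

    Assumes the window length is a whole multiple of the slide.
    """
    return window // slide


class DailyCounter:
    """Running sum that resets to zero whenever the UTC calendar day changes.

    Assumption: events are timezone-aware and arrive in time order.
    """

    def __init__(self) -> None:
        self.day = None
        self.total = 0

    def add(self, event_time: datetime, amount: int) -> int:
        day = event_time.astimezone(timezone.utc).date()
        if day != self.day:
            # First event of a new day: start again from zero.
            self.day = day
            self.total = 0
        self.total += amount
        return self.total


if __name__ == "__main__":
    # 24-hour window sliding every second
    print(windows_per_event(timedelta(hours=24), timedelta(seconds=1)))  # 86400
```

So a literal 24-hour window with a 1-second slide would mean each event contributes to 86,400 windows, whereas the reset-every-day requirement only needs one bucket per day that is refreshed every second.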
Secondly, we push about 5000 JSON messages/sec into Spark Streaming, and each message is about 2 KB. We just need to parse those messages and compute, say, the sum of certain fields from each message. The result needs to be stored somewhere so that each run takes the previous result and adds to it; this state has to be maintained for 24 hours, after which we can reset it back to zero. Any advice on how best to approach this scenario?

Thanks much!

On Wed, Apr 5, 2017 at 12:39 AM, kant kodali <kanth...@gmail.com> wrote:

> Hi!
>
> I am talking about "stateful operations like aggregations". Does this
> happen on heap or off heap by default? I came across an article where I saw
> that both on and off heap are possible, but I am not sure what happens by default
> and when Spark or Spark Structured Streaming decides to store off heap.
>
> I don't even know what mapGroupsWithState does, since it's not part of
> Spark 2.1, which is what we currently use. Any pointers would be great.
>
> Thanks!
>
> On Tue, Apr 4, 2017 at 5:34 PM, Tathagata Das <tathagata.das1...@gmail.com>
> wrote:
>
>> Are you referring to the memory usage of stateful operations like
>> aggregations, or the new mapGroupsWithState?
>> The current implementation of the internal state store (that maintains
>> the stateful aggregates) is such that it keeps all the data in memory of
>> the executor. It does use an HDFS-compatible file system for checkpointing,
>> but as of now, it keeps all the data in memory of the executor.
>> This is something we will improve in the future.
>>
>> That said, you can enable watermarking in your query, which would
>> automatically clear old, unnecessary state, thus limiting the total memory
>> used for stateful operations.
>> Furthermore, you can also monitor the size of the state and get alerts if
>> the state is growing too large.
>>
>> Read more in the programming guide.
>> Watermarking - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
>> Monitoring - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
>>
>> In case you were referring to something else, please give us more context
>> and details: which query, and what symptoms you are observing.
>>
>> On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Why do we ever run out of memory in Spark Structured Streaming,
>>> especially when memory can always spill to disk? Until the disk is full we
>>> shouldn't be out of memory, should we? Sure, thrashing will happen more
>>> frequently and degrade performance, but why do we ever run out of memory, even
>>> when maintaining state for 6 months or a year?
>>>
>>> Thanks!