Actually, I want to reset my counters every 24 hours, so shouldn't the window and slide intervals both be 24 hours? If so, how do I send updates to the real-time dashboard every second? Isn't the trigger interval the same as the slide interval?
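For concreteness, a minimal sketch of the setup being asked about (everything here is assumed, not from the thread: a hypothetical `transactions` stream parsed from JSON with an event-time column `ts` and a numeric `amount`, plus placeholder Kafka settings). It uses the Spark 2.1-era ProcessingTime trigger:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.ProcessingTime
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("daily-counters").getOrCreate()
import spark.implicits._

// Hypothetical message schema: an event time plus a numeric field to sum.
val schema = new StructType()
  .add("ts", TimestampType)
  .add("amount", DoubleType)

// Placeholder source; a Kafka source is assumed purely for illustration.
val transactions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // assumed endpoint
  .option("subscribe", "transactions")                   // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", schema).as("t"))
  .select($"t.ts", $"t.amount")

// A tumbling window (window duration == slide duration == 24 hours):
// each event falls into exactly one daily bucket, so the counters
// naturally "reset" every day.
val dailyCounters = transactions
  .groupBy(window($"ts", "24 hours"))
  .agg(count(lit(1)).as("txns"), sum($"amount").as("totalAmount"))

// The trigger is set on the sink and is independent of the window/slide:
// a 1-second processing-time trigger refreshes the output every second.
val query = dailyCounters.writeStream
  .outputMode("complete")
  .format("console")                   // swap in the real dashboard sink
  .trigger(ProcessingTime("1 second"))
  .start()

In other words, the trigger interval is a processing-time setting on the sink, separate from the window and slide durations, so a 24-hour tumbling window can still push refreshed counts to a dashboard every second.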
On Wed, Apr 5, 2017 at 7:17 AM, kant kodali <kanth...@gmail.com> wrote:
> One of our requirements is that we need to maintain a counter for a 24-hour
> period, such as the number of transactions processed in the past 24 hours.
> After each day these counters can start from zero again, so we just need to
> maintain a running count during the 24-hour period. Also, since we want to
> show these stats on a real-time dashboard, we want those charts to be
> updated every second, so I guess this would translate to a window interval
> of 24 hours and a slide/trigger interval of 1 second. First of all, is this
> okay?
>
> Secondly, we push about 5000 JSON messages/sec into Spark Streaming, and
> each message is about 2KB. We just need to parse those messages and compute,
> say, the sum of certain fields from each message, and the result needs to be
> stored somewhere such that each run takes its result and adds it to the
> previous run's. This state has to be maintained for 24 hours, and then we
> can reset it back to zero. So any advice on how to best approach this
> scenario?
>
> Thanks much!
>
> On Wed, Apr 5, 2017 at 12:39 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Hi!
>>
>> I am talking about "stateful operations like aggregations". Does this
>> happen on heap or off heap by default? I came across an article where I
>> saw that both on and off heap are possible, but I am not sure what happens
>> by default and when Spark or Spark Structured Streaming decides to store
>> off heap.
>>
>> I don't even know what mapGroupsWithState does, since it's not part of
>> Spark 2.1, which is what we currently use. Any pointers would be great.
>>
>> Thanks!
>>
>> On Tue, Apr 4, 2017 at 5:34 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>>
>>> Are you referring to the memory usage of stateful operations like
>>> aggregations, or the new mapGroupsWithState?
>>> The current implementation of the internal state store (which maintains
>>> the stateful aggregates) keeps all the data in the memory of the
>>> executor. It does use an HDFS-compatible file system for checkpointing,
>>> but as of now it keeps all the data in executor memory. This is
>>> something we will improve in the future.
>>>
>>> That said, you can enable watermarking in your query, which would
>>> automatically clear old, unnecessary state, thus limiting the total
>>> memory used for stateful operations.
>>> Furthermore, you can also monitor the size of the state and get alerts
>>> if the state is growing too large.
>>>
>>> Read more in the programming guide.
>>> Watermarking - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
>>> Monitoring - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
>>>
>>> In case you were referring to something else, please give us more
>>> context and details - what query, and what symptoms are you observing?
>>>
>>> On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> Why do we ever run out of memory in Spark Structured Streaming,
>>>> especially when memory can always spill to disk? Until the disk is full
>>>> we shouldn't be out of memory, should we? Sure, thrashing will happen
>>>> more frequently and degrade performance, but do we ever run out of
>>>> memory even in the case of maintaining state for 6 months or a year?
>>>>
>>>> Thanks!
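To make the watermarking and monitoring suggestions above concrete, here is a rough sketch (all names illustrative; it reuses the hypothetical `transactions` stream and imports from the first sketch, and note that the `update` output mode is only exposed in Spark 2.1.1 and later):

// A watermark lets Spark drop state for day-windows that can no longer
// receive late data, bounding the total memory used by the daily counters.
val watermarked = transactions
  .withWatermark("ts", "25 hours")     // a little slack beyond the 24h window
  .groupBy(window($"ts", "24 hours"))
  .agg(sum($"amount").as("totalAmount"))

val query = watermarked.writeStream
  .outputMode("update")                // update mode, so expired state is purged
  .format("console")
  .trigger(ProcessingTime("1 second"))
  .start()

// Monitoring: lastProgress exposes per-operator state metrics, which can
// feed an alert if the state grows unexpectedly large.
val p = query.lastProgress
if (p != null) {
  p.stateOperators.foreach { s =>
    println(s"state rows = ${s.numRowsTotal}, memory = ${s.memoryUsedBytes} bytes")
  }
}

The watermark threshold only needs to be long enough that a 24-hour window is kept until it can no longer receive late events; in production the progress metrics would be polled or captured via a StreamingQueryListener rather than printed.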