Are you referring to the memory usage of stateful operations like aggregations, or the new mapGroupsWithState? The current implementation of the internal state store (which maintains the stateful aggregates) keeps all the state in the memory of the executors. It uses an HDFS-compatible file system for checkpointing, but the working state itself currently lives entirely in executor memory. This is something we will improve in the future.
That said, you can enable watermarking in your query, which automatically clears old, unnecessary state and thus limits the total memory used for stateful operations. Furthermore, you can monitor the size of the state and get alerts if it is growing too large. Read more in the programming guide:

Watermarking - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
Monitoring - http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries

In case you were referring to something else, please give us more context - what query you are running and what symptoms you are observing.

On Tue, Apr 4, 2017 at 5:17 PM, kant kodali <kanth...@gmail.com> wrote:
> Why do we ever run out of memory in Spark Structured Streaming especially
> when Memory can always spill to disk ? until the disk is full we shouldn't
> be out of memory.isn't it? sure thrashing will happen more frequently and
> degrades performance but we do we ever run out Memory even in case of
> maintaining a state for 6 months or a year?
>
> Thanks!
>
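
To make the watermarking point concrete, here is a minimal sketch of a windowed aggregation with a watermark. The source, column names ("word", "eventTime"), and window/watermark durations are illustrative assumptions, not part of the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()
import spark.implicits._

// Hypothetical streaming source; assume each row carries a "word" column
// and an event-time column named "eventTime".
val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .selectExpr("CAST(value AS STRING) AS word",
              "current_timestamp() AS eventTime")

// A 10-minute watermark tells Spark it may drop state for windows older
// than (max event time seen) - 10 minutes, bounding total state size.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"word")
  .count()

val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()

// Monitoring: lastProgress exposes per-operator state metrics
// (e.g. stateOperators.numRowsTotal) that you can alert on.
println(query.lastProgress)
```

Without the withWatermark call, the aggregation above would accumulate state for every window ever seen, which is exactly the unbounded-memory situation described earlier.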