My thought was to handle only the "current" state of the sliding window (i.e. the "last" window). In the cases I have encountered, at least, the sliding window is used to "forget" irrelevant information, so once a step falls out of the current window its data becomes irrelevant. I agree this use case is just one example, and that it breaks down when windows are combined. My main concern is that if we need a relatively large per-window buffer (such as a full distinct count), the memory overhead can be very high.
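A minimal sketch of the "forget on expiry" idea described above (illustrative Python, not Spark code; the class and method names are hypothetical): per-step state is kept in a queue and dropped as soon as a step falls out of the current window, so memory is bounded by the number of steps in one window rather than by all historical windows.

```python
from collections import deque

class CurrentWindowDistinct:
    """Hypothetical sketch: distinct count over only the *current*
    sliding window, made of `window_steps` equally sized steps.
    Per-step state is evicted as soon as the step leaves the window."""

    def __init__(self, window_steps):
        self.window_steps = window_steps
        self.steps = deque()  # one set of values per step, oldest first

    def advance_step(self):
        # A new step begins; the step that just left the window is forgotten.
        self.steps.append(set())
        if len(self.steps) > self.window_steps:
            self.steps.popleft()  # evict expired step's state

    def add(self, value):
        # Record a value in the newest (current) step.
        self.steps[-1].add(value)

    def distinct_count(self):
        # Distinct values across the steps still inside the window.
        return len(set().union(*self.steps)) if self.steps else 0
```

For example, a 1-day window with a 1-minute step would keep at most 1440 per-step sets alive, instead of state for every window ever produced.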
As for the map example you gave: if I understand correctly how this works behind the scenes, it only exposes the map, while the memory cost of keeping multiple versions of the data remains. As I said, my issue is the high memory overhead. Consider a simple example: a sliding window of 1 day with a 1-minute step. There are 1440 minutes in a day, so the groupBy multiplies the cost of every aggregation by 1440. For a count or a sum this may not be a big issue, but if the state is, say, a 100-element array per window, it quickly becomes very costly. Again, this is just an optimization idea for specific use cases.

From: Michael Armbrust [via Apache Spark Developers List]
Sent: Thursday, October 20, 2016 11:16 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status

> let's say we would have implemented distinct count by saving a map with the key being the distinct value and the value being the last time we saw this value. This would mean that we wouldn't really need to save all the steps in the middle and copy the data, we could only save the last portion.

I don't think you can calculate count distinct correctly in each event-time window using this map if there is late data, which is one of the key problems we are trying to solve with this API. If you only track the last time you saw a value, how do you know whether a late data item was already accounted for in a given window that is earlier than this "last time"?

We would currently need to track the items seen in each window (though much less space is required for approximate count distinct). However, the state eviction I mentioned above should also let you give us a bound on how late data can be, and thus on how many windows we need to retain state for. You can also group by processing time instead of event time if you want something closer to the semantics of DStreams.
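The late-data objection can be made concrete with a small sketch (illustrative Python, not Spark code; the window size and variable names are my own). A value-to-last-seen-time map collapses history, so when a late item arrives it cannot tell you whether an earlier window already counted that value, whereas full per-window state can:

```python
# Why a value -> last-seen-time map loses the information needed for
# correct per-window distinct counts once late data is allowed.

last_seen = {}   # the proposed compact state: value -> latest event time
windows = {}     # ground truth: window start -> set of values seen in it
WINDOW = 10      # tumbling 10-unit event-time windows, for simplicity

def observe(value, event_time):
    last_seen[value] = max(last_seen.get(value, event_time), event_time)
    windows.setdefault(event_time // WINDOW * WINDOW, set()).add(value)

# "x" occurs in window [0, 10) and again in window [20, 30).
observe("x", 5)
observe("x", 25)

# The map only remembers the latest sighting...
assert last_seen["x"] == 25
# ...so when a late event for "x" at event_time 7 arrives, the map cannot
# say whether window [0, 10) already counted "x". The per-window state can:
assert "x" in windows[0]
```

This is exactly why per-window item tracking (or an approximate sketch per window) is still needed, bounded in practice by how late data is allowed to be.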
Finally, you can already construct the map you describe using Structured Streaming and use its result to output statistics at each trigger:

    df.groupBy($"value")
      .agg(max($"eventTime") as 'lastSeen)
      .writeStream
      .outputMode("complete")
      .trigger(ProcessingTime("5 minutes"))
      .foreach( <emit results using values from the map> )