My thought was to handle only the "current" state of the sliding window (i.e. the "last" window). In the cases I have encountered, at least, the sliding window is used to "forget" irrelevant information, so once a step falls out of the current window its data becomes irrelevant. I agree this use case is just one example, and that it breaks down when windows are combined. My main concern is that if we need a relatively large per-window buffer (such as a full distinct count), the memory overhead can be very high.
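A minimal sketch of the "forget on expiry" idea described above (illustrative Python, not Spark code; the class and method names are hypothetical): per-step state is kept in a queue and dropped as soon as a step falls out of the current window, so memory is bounded by the number of steps in one window rather than by all historical windows.

```python
from collections import deque

class CurrentWindowDistinct:
    """Hypothetical sketch: distinct count over only the *current*
    sliding window, made of `window_steps` equally sized steps.
    Per-step state is evicted as soon as the step leaves the window."""

    def __init__(self, window_steps):
        self.window_steps = window_steps
        self.steps = deque()  # one set of values per step, oldest first

    def advance_step(self):
        # A new step begins; the step that just left the window is forgotten.
        self.steps.append(set())
        if len(self.steps) > self.window_steps:
            self.steps.popleft()  # evict expired step's state

    def add(self, value):
        # Record a value in the newest (current) step.
        self.steps[-1].add(value)

    def distinct_count(self):
        # Distinct values across the steps still inside the window.
        return len(set().union(*self.steps)) if self.steps else 0
```

For example, a 1-day window with a 1-minute step would keep at most 1440 per-step sets alive, instead of state for every window ever produced.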
As for the map example you gave: if I understand correctly how this works behind the scenes, it only exposes the map, while the memory cost of keeping multiple versions of the data remains. As I said, my issue is the high memory overhead. Consider a simple example: a sliding window of 1 day with a 1-minute step. There are 1440 minutes in a day, so the groupBy multiplies the cost of every aggregation by 1440. For a count or a sum this may not be a big issue, but if the state is, say, a 100-element array per window, it quickly becomes very costly. Again, this is just an optimization idea for specific use cases.

From: Michael Armbrust [via Apache Spark Developers List]
Sent: Thursday, October 20, 2016 11:16 AM
To: Mendelson, Assaf
Subject: Re: StructuredStreaming status

> let's say we would have implemented distinct count by saving a map with the key being the distinct value and the value being the last time we saw this value. This would mean that we wouldn't really need to save all the steps in the middle and copy the data, we could only save the last portion.

I don't think you can calculate count distinct correctly in each event-time window using this map if there is late data, which is one of the key problems we are trying to solve with this API. If you only track the last time you saw a value, how do you know whether a late data item was already accounted for in a given window that is earlier than this "last time"?

We would currently need to track the items seen in each window (though much less space is required for approximate count distinct). However, the state eviction I mentioned above should also let you give us a bound on how late data can be, and thus on how many windows we need to retain state for. You can also group by processing time instead of event time if you want something closer to the semantics of DStreams.
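The late-data objection can be made concrete with a small sketch (illustrative Python, not Spark code; the window size and variable names are my own). A value-to-last-seen-time map collapses history, so when a late item arrives it cannot tell you whether an earlier window already counted that value, whereas full per-window state can:

```python
# Why a value -> last-seen-time map loses the information needed for
# correct per-window distinct counts once late data is allowed.

last_seen = {}   # the proposed compact state: value -> latest event time
windows = {}     # ground truth: window start -> set of values seen in it
WINDOW = 10      # tumbling 10-unit event-time windows, for simplicity

def observe(value, event_time):
    last_seen[value] = max(last_seen.get(value, event_time), event_time)
    windows.setdefault(event_time // WINDOW * WINDOW, set()).add(value)

# "x" occurs in window [0, 10) and again in window [20, 30).
observe("x", 5)
observe("x", 25)

# The map only remembers the latest sighting...
assert last_seen["x"] == 25
# ...so when a late event for "x" at event_time 7 arrives, the map cannot
# say whether window [0, 10) already counted "x". The per-window state can:
assert "x" in windows[0]
```

This is exactly why per-window item tracking (or an approximate sketch per window) is still needed, bounded in practice by how late data is allowed to be.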
Finally, you can already construct the map you describe using Structured Streaming and use its result to output statistics at each trigger:

    df.groupBy($"value")
      .agg(max($"eventTime") as 'lastSeen)
      .writeStream
      .outputMode("complete")
      .trigger(ProcessingTime("5 minutes"))
      .foreach( <emit results using values from the map> )