You don’t need to “launch batches” every 5 minutes. You can trigger micro-batches every 2 seconds and aggregate over a 5-minute window: Spark will read data from the topic every 2 seconds and keep the window state in memory for 5 minutes.
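
For example, here is a minimal sketch of that setup (the broker address localhost:9092, the topic name "temperature", and the JSON payload with timestamp and temperature fields are my assumptions; adjust them to your producer, and note the Kafka source needs the spark-sql-kafka package on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("TemperatureAverage").getOrCreate()

# Assumed message payload: {"timestamp": "...", "temperature": 21.5}
schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("temperature", DoubleType()),
])

# Read the topic as a stream and parse the JSON value.
df_temp = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "temperature")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*"))

# Average temperature over a 5-minute tumbling window (the watermark is
# explained in point 2 below).
avg_df = (df_temp
    .withWatermark("timestamp", "2 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .agg(avg("temperature").alias("avg_temperature")))

# A micro-batch fires every 2 seconds; awaitTermination() keeps the
# query running instead of exiting after one batch.
query = (avg_df.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .trigger(processingTime="2 seconds")
    .start())
query.awaitTermination()

A missing awaitTermination() is also the usual reason a streaming job “works one time and finishes”.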

You need to make a few decisions:

  1.  Do you want a tumbling window or a sliding window? A tumbling window of 5 minutes will produce an aggregate every 5 minutes, covering the previous 5 minutes of data. A sliding window of 5 minutes with a 1-minute slide will produce an aggregate every 1 minute, each covering the previous 5 minutes of data. For example, let’s say you have data arriving every 2 seconds. A tumbling window will produce a result at minute 5, 10, 15, 20, and so on: the minute-5 result will have data from minutes 0-5, the minute-10 result from minutes 5-10, and so on. A sliding window will produce a result at minute 5, 6, 7, 8, and so on: the minute-5 result will aggregate minutes 0-5, the minute-6 result minutes 1-6, and so on. This defines your window. In your code you have


window(df_temp.timestamp, "2 minutes", "1 minutes")

This is a sliding window. Here the second parameter ("2 minutes") is the window duration, and the third parameter ("1 minutes") is the slide duration. In the example above, it will produce an aggregate every 1 minute over the previous 2 minutes’ worth of data.

If you define


window(df_temp.timestamp, "2 minutes", "2 minutes")

This is a tumbling window. It will produce an aggregate every 2 minutes, each covering 2 minutes’ worth of data.
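
Since you want one average every 5 minutes, a tumbling window is probably what you’re after. A minimal sketch, assuming df_temp has the event-time column timestamp from your code and a numeric column (temperature here is a guess at your column name):

from pyspark.sql.functions import avg, window

# Omitting the slide duration (or making it equal to the window
# duration) gives a tumbling window.
avg_df = (df_temp
    .groupBy(window(df_temp.timestamp, "5 minutes"))
    .agg(avg("temperature").alias("avg_temperature")))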

  2.  Can you have late data, and how late can it arrive? Streaming systems often deliver data out of order. For example, you could get data for t=11:00:00 AM and then get data for t=10:59:59 AM; that data is late by 1 second. What’s the worst case for late data? You need to define a watermark for late data. In your code, you have defined a watermark of 2 minutes. For aggregations, the watermark also defines which windows Spark will keep in memory: if you define a watermark of 2 minutes and you have a sliding window with a slide interval of 1 minute, Spark will keep 2 windows in memory. The watermark interval therefore affects how much memory Spark will use.
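
In code, the watermark is attached before the windowed aggregation. A sketch with your 2-minute window and 2-minute watermark (again, temperature is an assumed column name):

from pyspark.sql.functions import avg, window

avg_df = (df_temp
    # Accept events arriving up to 2 minutes late; state for older
    # windows can then be dropped.
    .withWatermark("timestamp", "2 minutes")
    .groupBy(window(df_temp.timestamp, "2 minutes", "1 minutes"))
    .agg(avg("temperature").alias("avg_temperature")))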

It might help to follow the example in this guide very carefully:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
It is a pretty good example, but you need to follow it event by event very carefully to get all the nuances.

From: Giuseppe Ricci <peppepega...@gmail.com>
Date: Monday, May 10, 2021 at 11:19 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: [EXTERNAL] Calculate average from Spark stream


Hi, I'm new to Apache Spark.
I'm trying to read data from an Apache Kafka topic (I have a simulated temperature sensor producer which sends data every 2 seconds), and every 5 minutes I need to calculate the average temperature. Reading the documentation I understand I need to use windows, but I'm not able to finalize my code. Can someone help me?
How can I launch batches every 5 minutes? My code works one time and finishes. Why can't I find any helpful information in the console about correct execution? See attached picture.

This is my code:
https://pastebin.com/4S31jEeP

Thanks for your precious help.



PhD. Giuseppe Ricci
