Thanks.
What if I use batch computation instead of stream processing? Do I still need
that much memory? For example, if the 24-hour data set is 100 GB, do I also
need 100 GB of RAM to do the one-time batch calculation?
At 2016-05-09 15:14:47, "Saisai Shao" wrote:
For window-related operators, Spark Streaming caches the data that falls within
the window in memory. In your case the window is up to 24 hours long, which
means data has to stay in the executors' memory for more than one day. This may
cause problems when there is not enough memory.
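To put a rough number on this, here is a back-of-the-envelope sketch in Scala. It assumes data arrives at a constant rate and counts raw input size only, so it ignores object overhead and any replication. The names WindowSizeEstimate and windowGb are made up for illustration, and the 100 GB per day figure is just the example used in this thread.

```scala
// Hypothetical helper for a rough estimate; not part of Spark.
object WindowSizeEstimate {
  /** Raw gigabytes covered by a window, assuming a constant arrival rate. */
  def windowGb(dailyGb: Double, windowHours: Double): Double =
    dailyGb * windowHours / 24.0
}
```

With 100 GB per day, WindowSizeEstimate.windowGb(100.0, 1.0) gives about 4.17 GB for the 1 hour window, and windowGb(100.0, 24.0) gives the full 100 GB for the 24 hour window.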
On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh wrote:
OK, some terms for Spark Streaming.
"Batch interval" is the basic interval at which the system will receive the
data in batches.
This is the interval set when creating a StreamingContext. For example, if you
set the batch interval as 300 seconds, then any input DStream will generate
RDDs of received data at 300 seconds intervals.
A window operator is defined by two parameters -
- WindowDuration / WindowsLength - the length of the window
- SlideDuration / SlidingInterval - the interval at which the window will slide
or move forward
OK, so your batch interval is 5 minutes. That is the interval at which incoming
messages from the source are grouped into batches.
Then you have these two params
// window length - the duration of the window; it must be a multiple of the
// batch interval n, as in StreamingContext(sparkConf, Seconds(n))
val windowLength = x      // x = m * n, for some whole number m
// sliding interval - the interval at which the window operation is performed;
// in other words, how often a new result is produced
val slidingInterval = y   // y = k * n, for some whole number k
Both the window length and the slidingInterval duration must be multiples of
the batch interval, as received data is divided into batches of duration "batch
interval".
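The "multiple of the batch interval" rule can be checked with plain arithmetic before starting a job. The following is a minimal sketch working on second counts only. WindowParams and its methods are hypothetical names for illustration, not part of the Spark API.

```scala
// Hypothetical helper, not part of Spark: checks the "multiple of the batch
// interval" rule on plain second counts.
object WindowParams {
  /** True when `seconds` is a positive whole multiple of the batch interval. */
  def isMultipleOfBatch(seconds: Long, batchSeconds: Long): Boolean =
    batchSeconds > 0 && seconds > 0 && seconds % batchSeconds == 0

  /** Number of batches covered by one window, or None if the lengths do not line up. */
  def batchesPerWindow(windowSeconds: Long, batchSeconds: Long): Option[Long] =
    if (isMultipleOfBatch(windowSeconds, batchSeconds)) Some(windowSeconds / batchSeconds)
    else None
}
```

For a 300 second batch interval, batchesPerWindow(3600L, 300L) is Some(12), batchesPerWindow(86400L, 300L) is Some(288), and a 400 second window is rejected with None.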
If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60 seconds
If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5 * 60
seconds
Your sliding interval should be set to the batch interval = 5 * 60 seconds. In
other words, that is how often the aggregates and summaries for your report
are produced.
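Put together, the settings above might look like the following sketch. The data source has not been stated in this thread, so the socket source on localhost:9999 is only a placeholder assumption, and counting records stands in for whatever the real reports compute. WindowedReports is a made-up name.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object WindowedReports {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("WindowedReports")

    // Batch interval of 5 minutes: one RDD of received data every 300 seconds.
    val ssc = new StreamingContext(sparkConf, Seconds(5 * 60))

    // Placeholder source (assumption): text lines read from a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Sliding interval equal to the batch interval, so every window
    // produces a result once per batch.
    val slidingInterval = Minutes(5)
    val lastHour = lines.window(Minutes(60), slidingInterval)
    val lastDay  = lines.window(Minutes(24 * 60), slidingInterval)

    // Stand-in "reports": record counts for 5 minutes, 1 hour and 24 hours.
    lines.count().print()
    lastHour.count().print()
    lastDay.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The master is left unset here on the assumption that the job is launched with spark-submit. Note that the 24 hour window keeps a full day of received data cached, which is where the memory concern comes from.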
What is your data source here?
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 9 May 2016 at 04:19, kramer2...@126.com wrote:
We have some stream data that needs to be processed, and we are considering
using Spark Streaming to do it.
We need to generate three kinds of reports. The reports are based on
1. The last 5 minutes data
2. The last 1 hour data
3. The last 24 hour data
The frequency of reports is 5 minutes.
After reading the docs, the most obvious way to solve this seems to be to set
up a Spark Streaming job with a 5 minute batch interval and two windows of
1 hour and 1 day.
But I am worried that windows of one hour and one day are too big. I do not
have much experience with Spark Streaming, so what window lengths do you use
in your environment?
Any official docs talking about this?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html