Re: How big the spark stream window could be?
… on different streaming sources (Kafka or socket).

> 1. The stream data cached will be distributed among all nodes of Spark,
>    among the executors.
> 2. As I understand it, each Spark worker node has one executor that
>    includes a cache, so the streaming data is distributed among these
>    worker-node caches. For example, if I have 4 worker nodes, each cache
>    will hold a quarter of the data (this assumes the cache size is the
>    same on all worker nodes).

Ideally it will be distributed evenly across the executors; this is also a
target for tuning. Normally it depends on several conditions, such as
receiver distribution and partition distribution.

> The issue arises if the amount of streaming data does not fit into these
> 4 caches. Will the job crash?

On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:

No, each executor only stores part of the data in memory (it depends on
how the partitions are distributed and how many receivers you have).

For a WindowedDStream the data is already cached in memory; from my
understanding you don't need to call cache() again.

On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

Hi,

If I have 10 GB of streaming data coming in, does it require 10 GB of
memory on each node?

Also, in that case, why do we need to use

dstream.cache()

Thanks

On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:

It depends on how you write the Spark application. Normally, if the data
is already on persistent storage, there is no need to put it into memory.
The reason Spark Streaming data has to be stored in memory is that a
streaming source is not a persistent source, so you need a place to keep
the data.

On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:

Thanks.
What if I use a batch calculation instead of stream computing? Do I still
need that much memory? For example, if the 24-hour data set is 100 GB, do
I also need 100 GB of RAM to do the one-time batch calculation?

At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:

For window-related operators, Spark Streaming will cache the data in
memory for the duration of the window. In your case the window size is up
to 24 hours, which means the data has to stay in executor memory for more
than a day; this may introduce several problems when memory is not enough.

On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:

OK, some terms for Spark Streaming.

The "batch interval" is the basic interval at which the system will
receive the data in batches. This is the interval set when creating a
StreamingContext. For example, if you set the batch interval to 300
seconds, then any input DStream will generate RDDs of received data at
300-second intervals.

A window operator is defined by two parameters:
- windowDuration / windowLength - the length of the window
- slideDuration / slidingInterval - the interval at which the window will
  slide or move forward

OK, so your batch interval is 5 minutes. That is the rate at which
messages are coming in from the source.
Then you have these two params:

// window length - the duration of the window; must be a multiple of the
// batch interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = x   // x = m * n
// sliding interval - the interval at which the window operation is
// performed; in other words, data is collected within this "previous
// interval"
val slidingInterval = y   // x/y must be a whole number

Both the window length and the sliding interval must be multiples of the
batch interval, as received data is divided into batches of duration
"batch interval".

If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60
seconds.
If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5 *
60 seconds.

Your sliding interval should be set to the batch interval = 5 * 60
seconds. In other words, that is the cadence at which the aggregates and
summaries are produced for your report.

What is your data source here?

HTH

Dr Mich Talebzadeh

LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:

We have some stream data that needs to be calculated, and we are
considering using Spark Streaming to do it.

We need to generate three kinds of reports. The reports are based on:

1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data

The frequency of the reports is 5 minutes.
After reading the docs, the most obvious way to solve this seems to be to
set up a Spark stream with a 5-minute interval and two windows of 1 hour
and 1 day.

But I am worried that one-day and one-hour windows may be too big. I do
not have much experience with Spark Streaming, so what window lengths do
you use in your environments?

Are there any official docs talking about this?

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
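Mich's sizing formulas above can be checked with a few lines of plain Scala (no Spark dependency needed). The names `batchIntervalSec`, `isValidWindow`, and so on are illustrative, not Spark API; in a real job the resulting durations would be passed to the DStream `window` operator as `Seconds(...)` values.

```scala
// Window sizing arithmetic from the thread, in plain Scala.
// batchIntervalSec corresponds to StreamingContext(sparkConf, Seconds(300)).
object WindowSizing {
  val batchIntervalSec = 5 * 60                 // 5-minute batch interval

  // Spark requires both the window length and the slide duration to be
  // multiples of the batch interval.
  def isValidWindow(windowSec: Int, slideSec: Int): Boolean =
    windowSec % batchIntervalSec == 0 && slideSec % batchIntervalSec == 0

  val oneHourWindowSec   = 12 * batchIntervalSec      // 12 * 5 * 60 = 3600
  val oneDayWindowSec    = 24 * 12 * batchIntervalSec // 24 * 12 * 5 * 60 = 86400
  val slidingIntervalSec = batchIntervalSec           // report every 5 minutes
}

println(WindowSizing.oneHourWindowSec)  // 3600
println(WindowSizing.oneDayWindowSec)   // 86400
println(WindowSizing.isValidWindow(WindowSizing.oneDayWindowSec,
                                   WindowSizing.slidingIntervalSec)) // true
```

With a 5-minute slide, both windows land exactly on report boundaries, which is why the sliding interval equals the batch interval in this setup.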
Re:Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?
on different streaming sources (Kafka or socket). The stream data cached will be distributed among all nodes of Spark among executors As I understand each Spark worker node has one executor that includes cache. So the streaming data is distributed among these work node caches. For example if I have 4 worker nodes each cache will have a quarter of data (this assumes that cache size among worker nodes is the same.) Ideally, it will distributed evenly across the executors, also this is target for tuning. Normally it depends on several conditions like receiver distribution, partition distribution. The issue raises if the amount of streaming data does not fit into these 4 caches? Will the job crash? On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote: No, each executor only stores part of data in memory (it depends on how the partition are distributed and how many receivers you have). For WindowedDStream, it will obviously cache the data in memory, from my understanding you don't need to call cache() again. On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote: hi, so if i have 10gb of streaming data coming in does it require 10gb of memory in each node? also in that case why do we need using dstream.cache() thanks On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote: It depends on you to write the Spark application, normally if data is already on the persistent storage, there's no need to be put into memory. The reason why Spark Streaming has to be stored in memory is that streaming source is not persistent source, so you need to have a place to store the data. On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote: Thanks. What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB. Do I also need a 100GB RAM to do the one time batch calculation ? 
At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote: For window related operators, Spark Streaming will cache the data into memory within this window, in your case your window size is up to 24 hours, which means data has to be in Executor's memory for more than 1 day, this may introduce several problems when memory is not enough. On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: ok terms for Spark Streaming "Batch interval" is the basic interval at which the system with receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 seconds intervals. A window operator is defined by two parameters - - WindowDuration / WindowsLength - the length of the window - SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. In other words that where the aggregates and summaries come for your report. What is your data source here? 
HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 9 May 2016 at 04:19, kramer2...@126.com<kramer2...@126.com> wrote: We have some stream data need to be calculated and considering use spark stream to do it. We need to generate three kinds of reports. The reports are based on 1. The last 5 minutes data 2. The last 1 hour data 3. The last 24 hour data The frequency of reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to set up a spark stream with 5 minutes interval and two window which are 1 hour and 1 day. But I am worrying that if the window is too big for one day and one hour. I do not have much experience on spark stream, so what is the window length in your environment? Any official docs talking about this? -- View this message in context: http://apache-spark-user-list.1001560.n3.na
Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?
d will be distributed among all nodes of >>>>>Spark among executors >>>>>2. As I understand each Spark worker node has one executor that >>>>>includes cache. So the streaming data is distributed among these work >>>>> node >>>>>caches. For example if I have 4 worker nodes each cache will have a >>>>> quarter >>>>>of data (this assumes that cache size among worker nodes is the same.) >>>>> >>>>> Ideally, it will distributed evenly across the executors, also this is >>>> target for tuning. Normally it depends on several conditions like receiver >>>> distribution, partition distribution. >>>> >>>> >>>>> >>>>> The issue raises if the amount of streaming data does not fit into >>>>> these 4 caches? Will the job crash? >>>>> >>>>> >>>>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> >>>>> wrote: >>>>> >>>>> >>>>> No, each executor only stores part of data in memory (it depends on >>>>> how the partition are distributed and how many receivers you have). >>>>> >>>>> For WindowedDStream, it will obviously cache the data in memory, from >>>>> my understanding you don't need to call cache() again. >>>>> >>>>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> >>>>> wrote: >>>>> >>>>> hi, >>>>> >>>>> so if i have 10gb of streaming data coming in does it require 10gb of >>>>> memory in each node? >>>>> >>>>> also in that case why do we need using >>>>> >>>>> dstream.cache() >>>>> >>>>> thanks >>>>> >>>>> >>>>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> >>>>> wrote: >>>>> >>>>> >>>>> It depends on you to write the Spark application, normally if data is >>>>> already on the persistent storage, there's no need to be put into memory. >>>>> The reason why Spark Streaming has to be stored in memory is that >>>>> streaming >>>>> source is not persistent source, so you need to have a place to store the >>>>> data. >>>>> >>>>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote: >>>>> >>>>> Thanks. 
>>>>> What if I use batch calculation instead of stream computing? Do I >>>>> still need that much memory? For example, if the 24 hour data set is 100 >>>>> GB. Do I also need a 100GB RAM to do the one time batch calculation ? >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote: >>>>> >>>>> For window related operators, Spark Streaming will cache the data into >>>>> memory within this window, in your case your window size is up to 24 >>>>> hours, >>>>> which means data has to be in Executor's memory for more than 1 day, this >>>>> may introduce several problems when memory is not enough. >>>>> >>>>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh < >>>>> mich.talebza...@gmail.com> wrote: >>>>> >>>>> ok terms for Spark Streaming >>>>> >>>>> "Batch interval" is the basic interval at which the system with >>>>> receive the data in batches. >>>>> This is the interval set when creating a StreamingContext. For >>>>> example, if you set the batch interval as 300 seconds, then any input >>>>> DStream will generate RDDs of received data at 300 seconds intervals. >>>>> A window operator is defined by two parameters - >>>>> - WindowDuration / WindowsLength - the length of the window >>>>> - SlideDuration / SlidingInterval - the interval at which the window >>>>> will slide or move forward >>>>> >>>>> >>>>> Ok so your batch interval is 5 minutes. That is the rate messages are >>>>> coming in from the source. 
>>>>> >>>>> Then you have these two params >>>>> >>>>> // window length - The duration of the window below that must be >>>>> multiple of batch interval n in = > StreamingContext(sparkConf, >>>>> Seconds(n)) >>>>> val windowLength = x = m * n >>>>> // sliding interval - The interval at which the window operation is >>>>> performed in other words data is collected within this "previous interval' >>>>> val slidingInterval = y l x/y = even number >>>>> >>>>> Both the window length and the slidingInterval duration must be >>>>> multiples of the batch interval, as received data is divided into batches >>>>> of duration "batch interval". >>>>> >>>>> If you want to collect 1 hour data then windowLength = 12 * 5 * 60 >>>>> seconds >>>>> If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * >>>>> 60 >>>>> >>>>> You sliding window should be set to batch interval = 5 * 60 seconds. >>>>> In other words that where the aggregates and summaries come for your >>>>> report. >>>>> >>>>> What is your data source here? >>>>> >>>>> HTH >>>>> >>>>> >>>>> Dr Mich Talebzadeh >>>>> >>>>> LinkedIn * >>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>>> >>>>> http://talebzadehmich.wordpress.com >>>>> >>>>> >>>>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote: >>>>> >>>>> We have some stream data need to be calculated and considering use >>>>> spark >>>>> stream to do it. >>>>> >>>>> We need to generate three kinds of reports. The reports are based on >>>>> >>>>> 1. The last 5 minutes data >>>>> 2. The last 1 hour data >>>>> 3. The last 24 hour data >>>>> >>>>> The frequency of reports is 5 minutes. >>>>> >>>>> After reading the docs, the most obvious way to solve this seems to >>>>> set up a >>>>> spark stream with 5 minutes interval and two window which are 1 hour >>>>> and 1 >>>>> day. 
>>>>> >>>>> >>>>> But I am worrying that if the window is too big for one day and one >>>>> hour. I >>>>> do not have much experience on spark stream, so what is the window >>>>> length in >>>>> your environment? >>>>> >>>>> Any official docs talking about this? >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> View this message in context: >>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html >>>>> Sent from the Apache Spark User List mailing list archive at >>>>> Nabble.com. >>>>> >>>>> - >>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> >>> >>> >>> >>> >>> >> >> >> >> >> > > > > >
Re:Re: Re: Re: Re: Re: How big the spark stream window could be ?
source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. In other words that where the aggregates and summaries come for your report. What is your data source here? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 9 May 2016 at 04:19, kramer2...@126.com<kramer2...@126.com> wrote: We have some stream data need to be calculated and considering use spark stream to do it. We need to generate three kinds of reports. The reports are based on 1. The last 5 minutes data 2. The last 1 hour data 3. The last 24 hour data The frequency of reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to set up a spark stream with 5 minutes interval and two window which are 1 hour and 1 day. But I am worrying that if the window is too big for one day and one hour. I do not have much experience on spark stream, so what is the window length in your environment? Any official docs talking about this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
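Saisai's memory point (a 24-hour window keeps roughly a full day's ingest pinned in executor memory) can be made concrete with a back-of-the-envelope calculation. The 70 MB/min ingest rate below is purely hypothetical, chosen only to show how quickly a day-long window approaches the 100 GB figure mentioned in the thread.

```scala
// Rough estimate of how much data a 24-hour window holds in executor
// memory, for an assumed (hypothetical) steady ingest rate.
object WindowMemory {
  def bytesInWindow(bytesPerMinute: Long, windowMinutes: Long): Long =
    bytesPerMinute * windowMinutes
}

val perMin = 70L * 1024 * 1024                       // assume ~70 MB arrives per minute
val total  = WindowMemory.bytesInWindow(perMin, 24 * 60)
println(total / (1024.0 * 1024 * 1024))              // ~98.4 GB spread across executors
```

This is the aggregate across all executors, not per node; as noted above, each executor stores only its partitions of the windowed data.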
Re: Re: Re: Re: Re: How big the spark stream window could be ?
gt; will slide or move forward >>>> >>>> >>>> Ok so your batch interval is 5 minutes. That is the rate messages are >>>> coming in from the source. >>>> >>>> Then you have these two params >>>> >>>> // window length - The duration of the window below that must be >>>> multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) >>>> val windowLength = x = m * n >>>> // sliding interval - The interval at which the window operation is >>>> performed in other words data is collected within this "previous interval' >>>> val slidingInterval = y l x/y = even number >>>> >>>> Both the window length and the slidingInterval duration must be >>>> multiples of the batch interval, as received data is divided into batches >>>> of duration "batch interval". >>>> >>>> If you want to collect 1 hour data then windowLength = 12 * 5 * 60 >>>> seconds >>>> If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * >>>> 60 >>>> >>>> You sliding window should be set to batch interval = 5 * 60 seconds. In >>>> other words that where the aggregates and summaries come for your report. >>>> >>>> What is your data source here? >>>> >>>> HTH >>>> >>>> >>>> Dr Mich Talebzadeh >>>> >>>> LinkedIn * >>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>>> >>>> http://talebzadehmich.wordpress.com >>>> >>>> >>>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote: >>>> >>>> We have some stream data need to be calculated and considering use spark >>>> stream to do it. >>>> >>>> We need to generate three kinds of reports. The reports are based on >>>> >>>> 1. The last 5 minutes data >>>> 2. The last 1 hour data >>>> 3. The last 24 hour data >>>> >>>> The frequency of reports is 5 minutes. 
>>>> >>>> After reading the docs, the most obvious way to solve this seems to set >>>> up a >>>> spark stream with 5 minutes interval and two window which are 1 hour >>>> and 1 >>>> day. >>>> >>>> >>>> But I am worrying that if the window is too big for one day and one >>>> hour. I >>>> do not have much experience on spark stream, so what is the window >>>> length in >>>> your environment? >>>> >>>> Any official docs talking about this? >>>> >>>> >>>> >>>> >>>> -- >>>> View this message in context: >>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html >>>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>>> >>>> - >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>> For additional commands, e-mail: user-h...@spark.apache.org >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>> >>> >>> >>> >> >> >> >> >> > > > > >
Re:Re: Re: Re: Re: How big the spark stream window could be ?
ent storage, there's no need to be put into memory. The reason why Spark Streaming has to be stored in memory is that streaming source is not persistent source, so you need to have a place to store the data. On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote: Thanks. What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24 hour data set is 100 GB. Do I also need a 100GB RAM to do the one time batch calculation ? At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote: For window related operators, Spark Streaming will cache the data into memory within this window, in your case your window size is up to 24 hours, which means data has to be in Executor's memory for more than 1 day, this may introduce several problems when memory is not enough. On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: ok terms for Spark Streaming "Batch interval" is the basic interval at which the system with receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval as 300 seconds, then any input DStream will generate RDDs of received data at 300 seconds intervals. A window operator is defined by two parameters - - WindowDuration / WindowsLength - the length of the window - SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. 
Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. In other words that where the aggregates and summaries come for your report. What is your data source here? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 9 May 2016 at 04:19, kramer2...@126.com<kramer2...@126.com> wrote: We have some stream data need to be calculated and considering use spark stream to do it. We need to generate three kinds of reports. The reports are based on 1. The last 5 minutes data 2. The last 1 hour data 3. The last 24 hour data The frequency of reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to set up a spark stream with 5 minutes interval and two window which are 1 hour and 1 day. But I am worrying that if the window is too big for one day and one hour. I do not have much experience on spark stream, so what is the window length in your environment? Any official docs talking about this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Re: Re: Re: How big the spark stream window could be ?
ry (it depends on how >>> the partition are distributed and how many receivers you have). >>> >>> For WindowedDStream, it will obviously cache the data in memory, from my >>> understanding you don't need to call cache() again. >>> >>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> >>> wrote: >>> >>> hi, >>> >>> so if i have 10gb of streaming data coming in does it require 10gb of >>> memory in each node? >>> >>> also in that case why do we need using >>> >>> dstream.cache() >>> >>> thanks >>> >>> >>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote: >>> >>> >>> It depends on you to write the Spark application, normally if data is >>> already on the persistent storage, there's no need to be put into memory. >>> The reason why Spark Streaming has to be stored in memory is that streaming >>> source is not persistent source, so you need to have a place to store the >>> data. >>> >>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote: >>> >>> Thanks. >>> What if I use batch calculation instead of stream computing? Do I still >>> need that much memory? For example, if the 24 hour data set is 100 GB. Do I >>> also need a 100GB RAM to do the one time batch calculation ? >>> >>> >>> >>> >>> >>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote: >>> >>> For window related operators, Spark Streaming will cache the data into >>> memory within this window, in your case your window size is up to 24 hours, >>> which means data has to be in Executor's memory for more than 1 day, this >>> may introduce several problems when memory is not enough. >>> >>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh < >>> mich.talebza...@gmail.com> wrote: >>> >>> ok terms for Spark Streaming >>> >>> "Batch interval" is the basic interval at which the system with receive >>> the data in batches. >>> This is the interval set when creating a StreamingContext. 
For example, >>> if you set the batch interval as 300 seconds, then any input DStream will >>> generate RDDs of received data at 300 seconds intervals. >>> A window operator is defined by two parameters - >>> - WindowDuration / WindowsLength - the length of the window >>> - SlideDuration / SlidingInterval - the interval at which the window >>> will slide or move forward >>> >>> >>> Ok so your batch interval is 5 minutes. That is the rate messages are >>> coming in from the source. >>> >>> Then you have these two params >>> >>> // window length - The duration of the window below that must be >>> multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) >>> val windowLength = x = m * n >>> // sliding interval - The interval at which the window operation is >>> performed in other words data is collected within this "previous interval' >>> val slidingInterval = y l x/y = even number >>> >>> Both the window length and the slidingInterval duration must be >>> multiples of the batch interval, as received data is divided into batches >>> of duration "batch interval". >>> >>> If you want to collect 1 hour data then windowLength = 12 * 5 * 60 >>> seconds >>> If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * >>> 60 >>> >>> You sliding window should be set to batch interval = 5 * 60 seconds. In >>> other words that where the aggregates and summaries come for your report. >>> >>> What is your data source here? >>> >>> HTH >>> >>> >>> Dr Mich Talebzadeh >>> >>> LinkedIn * >>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >>> >>> http://talebzadehmich.wordpress.com >>> >>> >>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote: >>> >>> We have some stream data need to be calculated and considering use spark >>> stream to do it. >>> >>> We need to generate three kinds of reports. 
The reports are based on >>> >>> 1. The last 5 minutes data >>> 2. The last 1 hour data >>> 3. The last 24 hour data >>> >>> The frequency of reports is 5 minutes. >>> >>> After reading the docs, the most obvious way to solve this seems to set >>> up a >>> spark stream with 5 minutes interval and two window which are 1 hour and >>> 1 >>> day. >>> >>> >>> But I am worrying that if the window is too big for one day and one >>> hour. I >>> do not have much experience on spark stream, so what is the window >>> length in >>> your environment? >>> >>> Any official docs talking about this? >>> >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >> >> >> >> > > > > >
Re:Re: Re: Re: How big the spark stream window could be ?
- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward Ok so your batch interval is 5 minutes. That is the rate messages are coming in from the source. Then you have these two params // window length - The duration of the window below that must be multiple of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) val windowLength = x = m * n // sliding interval - The interval at which the window operation is performed in other words data is collected within this "previous interval' val slidingInterval = y l x/y = even number Both the window length and the slidingInterval duration must be multiples of the batch interval, as received data is divided into batches of duration "batch interval". If you want to collect 1 hour data then windowLength = 12 * 5 * 60 seconds If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 You sliding window should be set to batch interval = 5 * 60 seconds. In other words that where the aggregates and summaries come for your report. What is your data source here? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com On 9 May 2016 at 04:19, kramer2...@126.com<kramer2...@126.com> wrote: We have some stream data need to be calculated and considering use spark stream to do it. We need to generate three kinds of reports. The reports are based on 1. The last 5 minutes data 2. The last 1 hour data 3. The last 24 hour data The frequency of reports is 5 minutes. After reading the docs, the most obvious way to solve this seems to set up a spark stream with 5 minutes interval and two window which are 1 hour and 1 day. But I am worrying that if the window is too big for one day and one hour. I do not have much experience on spark stream, so what is the window length in your environment? Any official docs talking about this? 
Re: Re: Re: How big the spark stream window could be ?
introduce several problems when memory is not enough.

>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> [snip - Mich's explanation and the original question, quoted in full earlier in this thread]
Re: Re: Re: How big the spark stream window could be ?
Your sliding interval should be set to the batch interval = 5 * 60 seconds. In other words, that is where the aggregates and summaries come from for your report.

What is your data source here?

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
[snip - original question quoted in full earlier in this thread]
Re: How big the spark stream window could be ?
I agree with Jorn et al on this. An alternative approach would be best, as a 24-hour operation sounds like a classic batch job more suitable for later reporting. This happens all the time in an RDBMS. As I understand it, within Spark the sliding interval can only be used after that window length (in this case 24 hours) has elapsed. You might as well use normal storage for it. It may be slower but would be far more manageable. Otherwise use the other suggestions I made.

HTH

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

On 9 May 2016 at 15:15, firemonk9 <dhiraj.peech...@gmail.com> wrote:

> I have not come across official docs in this regard, however if you use a 24
> hour window size, you will need to have memory big enough to fit the stream
> data for 24 hours. Usually memory is the limiting factor for the window
> size.
>
> Dhiraj Peechara
Re: How big the spark stream window could be ?
I have not come across official docs in this regard, however if you use a 24 hour window size, you will need to have memory big enough to fit the stream data for 24 hours. Usually memory is the limiting factor for the window size.

Dhiraj Peechara

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899p26903.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How big the spark stream window could be ?
I do not recommend large windows. You can have small windows, store the data, and then run the one-hour or one-day reports on the stored data.

> On 09 May 2016, at 05:19, "kramer2...@126.com" <kramer2...@126.com> wrote:
> [snip - original question quoted in full earlier in this thread]
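The approach above (small windows plus stored data) can be sketched without Spark at all: persist only a small per-batch aggregate, and roll the longer reports up from storage rather than holding a giant in-memory window. A toy Python sketch, where a bounded list stands in for a real table or file store:

```python
from collections import deque

# Toy sketch of the "small windows + storage" approach from the advice above:
# each 5-minute batch produces one small aggregate, and the 1-hour / 24-hour
# reports are rollups over stored aggregates, not a 24-hour in-memory window.

stored = deque(maxlen=288)  # at most 24h of 5-minute aggregates (288 * 5 min)

def on_batch(values):
    """Called once per 5-minute batch: persist only the aggregate, not raw data."""
    stored.append(sum(values))

def report(last_n_batches):
    """Rollup over the last N stored aggregates (12 -> 1 hour, 288 -> 24 hours)."""
    recent = list(stored)[-last_n_batches:]
    return sum(recent)

for batch in ([1, 2], [3], [4, 5, 6]):  # three simulated 5-minute batches
    on_batch(batch)

print(report(1))   # last 5 minutes: 15
print(report(12))  # "last hour" rollup: only 3 batches stored so far, so 21
```

The memory footprint is then proportional to the number of stored aggregates, not to the raw 24-hour data volume.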
Re: Re: How big the spark stream window could be ?
have a place to store the data.

On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:

> Thanks.
> What if I use batch calculation instead of stream computing? Do I still
> need that much memory? For example, if the 24 hour data set is 100 GB, do I
> also need 100 GB of RAM to do the one-time batch calculation?
>
> [snip - earlier messages quoted in full elsewhere in this thread]
Re: Re: How big the spark stream window could be ?
[snip - duplicate of Mich Talebzadeh's explanation and the original question, quoted in full earlier in this thread]
Re: Re: How big the spark stream window could be ?
[snip - duplicate of Mich Talebzadeh's signature and the original question, quoted in full earlier in this thread]
Re: Re: How big the spark stream window could be ?
Thank you. So if I create a Spark stream, then:

- Will the streams always need to be cached? Can they not be stored in persistent storage?
- The stream data cached will be distributed among all nodes of Spark, among the executors
- As I understand it, each Spark worker node has one executor that includes a cache. So the streaming data is distributed among these worker node caches. For example, if I have 4 worker nodes, each cache will have a quarter of the data (this assumes that cache size among worker nodes is the same).

The issue arises if the amount of streaming data does not fit into these 4 caches. Will the job crash?

On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:

No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have).

For WindowedDStream, it will obviously cache the data in memory; from my understanding you don't need to call cache() again.

On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
[snip - earlier messages quoted in full elsewhere in this thread]
Re: Re: How big the spark stream window could be ?
No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have).

For WindowedDStream, it will obviously cache the data in memory; from my understanding you don't need to call cache() again.

On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

> hi,
>
> so if i have 10gb of streaming data coming in does it require 10gb of
> memory in each node?
>
> also in that case why do we need to use
>
> dstream.cache()
>
> thanks
>
> [snip - earlier messages quoted in full elsewhere in this thread]
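The point above, that each executor holds only a share of the windowed data rather than all of it, can be put into rough numbers. A back-of-envelope Python sketch (the 10 GB / 4 executor figures come from the question; the even split is an idealization that ignores skew, serialization overhead and storage-level replication):

```python
# Rough per-executor memory estimate for windowed stream data, assuming an
# even partition spread across executors. Real usage also depends on receiver
# placement, partition distribution and the chosen storage level.

def per_executor_gb(window_data_gb, num_executors):
    """Idealized share of the window's data held by each executor."""
    return window_data_gb / num_executors

window_data_gb = 10.0  # hypothetical: 10 GB arriving within the window
executors = 4

print(per_executor_gb(window_data_gb, executors))  # 2.5 GB each, not 10 GB per node
```

So 10 GB of windowed data does not mean 10 GB of memory on every node; it means roughly the total divided across the executors, which is also why uneven receiver or partition distribution becomes a tuning target.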
Re: Re: How big the spark stream window could be ?
hi,

so if i have 10gb of streaming data coming in does it require 10gb of memory in each node?

also in that case why do we need to use

dstream.cache()

thanks

On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:

It depends on how you write the Spark application; normally if data is already on persistent storage, there's no need to put it into memory. The reason why Spark Streaming data has to be stored in memory is that a streaming source is not a persistent source, so you need a place to store the data.

[snip - earlier messages quoted in full elsewhere in this thread]
Re: Re: How big the spark stream window could be ?
It depends on how you write the Spark application; normally if data is already on persistent storage, there's no need to put it into memory. The reason why Spark Streaming data has to be stored in memory is that a streaming source is not a persistent source, so you need a place to store the data.

On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:

> Thanks.
> What if I use batch calculation instead of stream computing? Do I still
> need that much memory? For example, if the 24 hour data set is 100 GB, do I
> also need 100 GB of RAM to do the one-time batch calculation?
>
> [snip - earlier messages quoted in full elsewhere in this thread]
Re: Re: How big the spark stream window could be ?
Thanks. What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24-hour data set is 100 GB, do I also need 100 GB of RAM to do the one-time batch calculation?
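As a sketch of the batch alternative asked about above (assuming the 24 hours of data have already landed on persistent storage; the HDFS paths and the CSV layout here are illustrative, not from the thread), a one-off batch job generally does not need to hold the whole data set in executor memory, because Spark reads and aggregates it partition by partition unless you explicitly cache it:

```scala
// Hypothetical batch job aggregating 24 hours of already-persisted data.
import org.apache.spark.{SparkConf, SparkContext}

object DailyBatchReport {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyBatchReport"))
    // Partitions stream through the executors one at a time; nothing
    // forces the full 100 GB into memory unless cache() is called.
    val counts = sc.textFile("hdfs:///data/events/2016-05-09/*")
      .map(line => (line.split(",")(0), 1L))   // key on the first CSV column
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///reports/daily/2016-05-09")
    sc.stop()
  }
}
```

This is why the answer below distinguishes persistent sources from streaming sources: a streaming source has nowhere else to keep the data, so the window must live in memory.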
Re: How big the spark stream window could be ?
What do you mean by swap space, the system swap space or Spark's block manager disk space?

If you mean system swap space, I think you should first think about the JVM heap size and the YARN container size before running out of system memory. If you mean block manager disk space, the StorageLevel of WindowedDStream is MEMORY_ONLY_SER, so the data will not be put on disk when executor memory is not enough.
Re: Re: How big the spark stream window could be ?
Hi, Have you thought of other alternatives like collecting data in a database (over 24 hours period)? I mean do you require reports of 5 min interval *after 24 hours data collection* from t0, t0+5m, t0+10 min? You can only do so after collecting data then you can partition your table into 5 minutes timeslot? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 9 May 2016 at 08:15, 李明伟 <kramer2...@126.com> wrote: > Thanks Mich > > I guess I did not make my question clear enough. I know the terms like > interval or window. I also know how to use them. The problem is that in my > case, I need to set the window to cover data for 24 hours or 1 hours. I am > not sure if it is a good way because the window is just too big. I am > expecting my program to be a long running service. So I am worrying the > stability of the program. > > > > > > > > At 2016-05-09 15:01:57, "Mich Talebzadeh" <mich.talebza...@gmail.com> > wrote: > > ok terms for Spark Streaming > > "Batch interval" is the basic interval at which the system with receive > the data in batches. > This is the interval set when creating a StreamingContext. For example, if > you set the batch interval as 300 seconds, then any input DStream will > generate RDDs of received data at 300 seconds intervals. > A window operator is defined by two parameters - > - WindowDuration / WindowsLength - the length of the window > - SlideDuration / SlidingInterval - the interval at which the window will > slide or move forward > > > Ok so your batch interval is 5 minutes. That is the rate messages are > coming in from the source. 
> > Then you have these two params > > // window length - The duration of the window below that must be multiple > of batch interval n in = > StreamingContext(sparkConf, Seconds(n)) > val windowLength = x = m * n > // sliding interval - The interval at which the window operation is > performed in other words data is collected within this "previous interval' > val slidingInterval = y l x/y = even number > > Both the window length and the slidingInterval duration must be multiples > of the batch interval, as received data is divided into batches of duration > "batch interval". > > If you want to collect 1 hour data then windowLength = 12 * 5 * 60 > seconds > If you want to collect 24 hour data then windowLength = 24 * 12 * 5 * 60 > > You sliding window should be set to batch interval = 5 * 60 seconds. In > other words that where the aggregates and summaries come for your report. > > What is your data source here? > > HTH > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote: > >> We have some stream data need to be calculated and considering use spark >> stream to do it. >> >> We need to generate three kinds of reports. The reports are based on >> >> 1. The last 5 minutes data >> 2. The last 1 hour data >> 3. The last 24 hour data >> >> The frequency of reports is 5 minutes. >> >> After reading the docs, the most obvious way to solve this seems to set >> up a >> spark stream with 5 minutes interval and two window which are 1 hour and 1 >> day. >> >> >> But I am worrying that if the window is too big for one day and one hour. >> I >> do not have much experience on spark stream, so what is the window length >> in >> your environment? >> >> Any official docs talking about this? 
>> >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > > > >
Re: How big the spark stream window could be ?
That is a valid point, Shao. However, it will start using disk space as memory storage, akin to swap space. I believe it will not crash; it will just be slow, and this assumes that you do not run out of disk space.

Dr Mich Talebzadeh

http://talebzadehmich.wordpress.com
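For what it's worth, a minimal sketch reconciling the two views above (assuming the Spark 1.x DStream API; the socket source, host, and port are illustrative): the windowed DStream defaults to MEMORY_ONLY_SER, so spilling to disk only happens if you explicitly request a disk-backed storage level.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowSpillSketch")
val ssc = new StreamingContext(conf, Seconds(300))       // 5-minute batches
val lines = ssc.socketTextStream("localhost", 9999)      // illustrative source

val windowed = lines.window(Seconds(24 * 60 * 60), Seconds(300))
// Override the MEMORY_ONLY_SER default so blocks can spill to disk
// instead of failing when executor memory runs short.
windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)
windowed.count().print()
```

Whether the extra disk I/O keeps the job fast enough for 5-minute reports is a separate tuning question.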
Re: Re: How big the spark stream window could be ?
Thanks Mich.

I guess I did not make my question clear enough. I know the terms like interval and window, and I also know how to use them. The problem is that in my case I need to set the window to cover 24 hours or 1 hour of data, and I am not sure that is a good way because the window is just too big. I am expecting my program to be a long-running service, so I am worried about the stability of the program.
Re: How big the spark stream window could be ?
For window-related operators, Spark Streaming will cache the data within the window in memory. In your case the window size is up to 24 hours, which means the data has to stay in the executors' memory for more than a day; this may introduce several problems when memory is not enough.
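Concretely, the caching described above comes from the window operator itself; a minimal sketch (the 5-minute batch and 24-hour window are from this thread, while the socket source, host, and port are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowCachingSketch")
val ssc = new StreamingContext(conf, Seconds(300))       // 5-minute batches
val events = ssc.socketTextStream("localhost", 9999)     // illustrative source

// Every batch RDD that falls inside the last 24 hours is retained
// (serialized in executor memory by default) so the window can be
// materialized each time it slides.
val daily = events.window(Seconds(24 * 60 * 60), Seconds(300)).count()
daily.print()
ssc.start()
ssc.awaitTermination()
```

With 288 five-minute batches retained at once, the executors' aggregate memory has to hold a full day of input, which is exactly the risk raised here.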
Re: How big the spark stream window could be ?
ok, terms for Spark Streaming

"Batch interval" is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval to 300 seconds, then any input DStream will generate RDDs of received data at 300-second intervals.

A window operator is defined by two parameters:
- WindowDuration / WindowLength - the length of the window
- SlideDuration / SlidingInterval - the interval at which the window will slide or move forward

OK, so your batch interval is 5 minutes. That is the rate at which messages are coming in from the source.

Then you have these two parameters:

// the batch interval n is set when the StreamingContext is created:
// val ssc = new StreamingContext(sparkConf, Seconds(n))
// window length - the duration of the window; must be a multiple m of the batch interval n
val windowLength = x // x = m * n
// sliding interval - the interval at which the window operation is performed,
// in other words how often data is collected within this window
val slidingInterval = y // x/y must be a whole number

Both the window length and the sliding interval must be multiples of the batch interval, as received data is divided into batches of duration "batch interval".

If you want to collect 1 hour of data then windowLength = 12 * 5 * 60 seconds.
If you want to collect 24 hours of data then windowLength = 24 * 12 * 5 * 60 seconds.

Your sliding interval should be set to the batch interval = 5 * 60 seconds. In other words, that is where the aggregates and summaries for your report come from.

What is your data source here?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
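Putting the arithmetic above into one sketch (the batch interval and both window lengths come from this thread; the socket source, host, and port are illustrative stand-ins for whatever the real source turns out to be):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val n = 5 * 60                                        // batch interval: 300 s
val conf = new SparkConf().setAppName("WindowedReports")
val ssc = new StreamingContext(conf, Seconds(n))
val events = ssc.socketTextStream("localhost", 9999)  // illustrative source

// Both window lengths are multiples of the batch interval:
val hourly = events.window(Seconds(12 * n), Seconds(n))        // 1 hour = 12 batches
val daily  = events.window(Seconds(24 * 12 * n), Seconds(n))   // 24 hours = 288 batches

events.count().print()   // last 5 minutes (a single batch)
hourly.count().print()   // last hour, refreshed every 5 minutes
daily.count().print()    // last 24 hours, refreshed every 5 minutes

ssc.start()
ssc.awaitTermination()
```

Each `print()` fires once per slide, i.e. every 5 minutes, which matches the reporting frequency asked about in the original question.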
How big the spark stream window could be ?
We have some stream data that needs to be calculated, and we are considering using Spark Streaming to do it.

We need to generate three kinds of reports. The reports are based on:

1. The last 5 minutes of data
2. The last 1 hour of data
3. The last 24 hours of data

The frequency of the reports is 5 minutes.

After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5-minute interval and two windows of 1 hour and 1 day.

But I am worried that windows of one day and one hour are too big. I do not have much experience with Spark Streaming, so what window lengths do you use in your environment?

Any official docs talking about this?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org