Hm, this is standalone mode.
When you are running Spark in standalone mode, you only have one worker, which lives within the driver JVM process that you start when you run spark-shell or spark-submit. However, since the driver-memory setting encapsulates the JVM, you will need to set any non-default amount of driver memory before starting the JVM, by providing the new value:

${SPARK_HOME}/bin/spark-submit --driver-memory 5g

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 11 May 2016 at 01:22, 李明伟 <kramer2...@126.com> wrote:

> I actually provided them in the submit command here:
>
> nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log &
>
>
> At 2016-05-10 21:19:06, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>
> Hi Mingwei,
>
> In your Spark conf settings, what are you providing for these parameters? Are you capping them?
>
> For example:
>
> val conf = new SparkConf().
>   setAppName("AppName").
>   setMaster("local[2]").
>   set("spark.executor.memory", "4G").
>   set("spark.cores.max", "2").
>   set("spark.driver.allowMultipleContexts", "true")
> val sc = new SparkContext(conf)
>
> I assume you are running in standalone mode, so each worker (aka slave) grabs all the available cores and allocates the remaining memory on each host. Do not provide new values for these parameters, in other words do not overwrite them, in
>
> ${SPARK_HOME}/bin/spark-submit --
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> On 10 May 2016 at 03:12, 李明伟 <kramer2...@126.com> wrote:
>
>> Hi Mich,
>>
>> I added some more info (the spark-env.sh settings and the top command output) in that thread. Can you help to check, please?
>>
>> Regards,
>> Mingwei
>>
>>
>> At 2016-05-09 23:45:19, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>
>> I had a look at the thread.
>>
>> This is what you have, which I gather is a standalone box, in other words one worker node:
>>
>> bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log
>>
>> But what I don't understand is why it is using 80% of your RAM as opposed to 25% of it (4GB/16GB), right?
>>
>> Where else have you set up these parameters, for example in $SPARK_HOME/conf/spark-env.sh?
>>
>> Can you send the output of /usr/bin/free and top?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
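Since the same knobs can come from SparkConf in the code, from the spark-submit command line, and from spark-env.sh or spark-defaults.conf, one way to untangle a setup like the one above is to read the values back from the running context and see which ones actually took effect. A minimal sketch, purely for illustration (the app name is a placeholder and the job does nothing else):

import org.apache.spark.{SparkConf, SparkContext}

object ShowEffectiveConf {
  def main(args: Array[String]): Unit = {
    // Only the app name is set here; everything else is taken from spark-submit,
    // spark-defaults.conf or spark-env.sh, so the printout shows what was really applied.
    val sc = new SparkContext(new SparkConf().setAppName("ShowEffectiveConf"))

    val keys = Seq("spark.master", "spark.driver.memory",
                   "spark.executor.memory", "spark.cores.max")
    keys.foreach { key =>
      println(s"$key = ${sc.getConf.getOption(key).getOrElse("<not set, default applies>")}")
    }
    sc.stop()
  }
}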
>>
>> On 9 May 2016 at 16:19, 李明伟 <kramer2...@126.com> wrote:
>>
>>> Thanks for all the information, guys.
>>>
>>> I wrote some code to do the test, not using a window, so it only calculates data for each batch interval. I set the interval to 30 seconds and also reduced the size of the data to about 30,000 lines of CSV. That means my code should do the calculation on 30,000 lines of CSV within 30 seconds. I think it is not a very heavy workload, but my Spark Streaming code still crashes.
>>>
>>> I sent another post to the user list here:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>>>
>>> Is it possible for you to have a look, please? Much appreciated.
>>>
>>>
>>> At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>>
>>> Please see the inline comments.
>>>
>>> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>
>>>> Thank you.
>>>>
>>>> So if I create a Spark stream, then:
>>>>
>>>> 1. Will the streams always need to be cached? Can they not be stored in persistent storage?
>>>>
>>> You don't need to cache the stream explicitly if you don't have a specific requirement; Spark will do it for you depending on the streaming source (Kafka or socket).
>>>
>>>> 2. The stream data cached will be distributed among all nodes of Spark, among the executors.
>>>> 3. As I understand it, each Spark worker node has one executor that includes a cache, so the streaming data is distributed among these worker node caches. For example, if I have 4 worker nodes, each cache will hold a quarter of the data (assuming the cache size is the same on every worker node).
>>>>
>>> Ideally it will be distributed evenly across the executors; this is also a target for tuning. Normally it depends on several conditions, such as receiver distribution and partition distribution.
>>>
>>>> The issue arises if the amount of streaming data does not fit into these 4 caches. Will the job crash?
>>>>
>>>>
>>>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>>>
>>>> No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have).
>>>>
>>>> For WindowedDStream, it will obviously cache the data in memory; from my understanding you don't need to call cache() again.
>>>>
>>>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> So if I have 10 GB of streaming data coming in, does it require 10 GB of memory in each node?
>>>>
>>>> Also, in that case, why do we need to use
>>>>
>>>> dstream.cache()
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>>>
>>>> It depends on how you write the Spark application. Normally, if the data is already on persistent storage, there is no need to put it into memory. The reason why Spark Streaming has to keep data in memory is that a streaming source is not a persistent source, so you need a place to store the data.
>>>>
>>>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>>>>
>>>> Thanks.
>>>> What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24-hour data set is 100 GB, do I also need 100 GB of RAM to do the one-time batch calculation?
>>>>
>>>>
>>>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>>>
>>>> For window-related operators, Spark Streaming will cache the data into memory within the window. In your case the window size is up to 24 hours, which means the data has to stay in executor memory for more than one day; this may introduce several problems when memory is not enough.
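On Mingwei's question above about doing the 24-hour calculation as a one-off batch job: when the data already sits on persistent storage, a batch job reads and reduces it partition by partition, so the whole 100 GB never has to be in executor memory at once. A rough Scala sketch to illustrate the point only; the input path and the CSV layout (a numeric value in the third column) are assumptions, not taken from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object DailyBatchSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyBatchSketch"))

    // Hypothetical input path and column layout, for illustration only.
    val lines = sc.textFile("hdfs:///data/2016-05-09/*.csv")

    // No cache()/persist(): each partition is read from disk, reduced and discarded,
    // so the full 24-hour data set never needs to fit in memory at once.
    val total = lines
      .map(_.split(","))
      .filter(_.length > 2)
      .map(_(2).toDouble)
      .sum()

    println(s"24-hour total = $total")
    sc.stop()
  }
}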
>>>>
>>>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> OK, some terms for Spark Streaming.
>>>>
>>>> The "batch interval" is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval to 300 seconds, then any input DStream will generate RDDs of received data at 300-second intervals.
>>>>
>>>> A window operator is defined by two parameters:
>>>> - windowDuration / windowLength - the length of the window
>>>> - slideDuration / slidingInterval - the interval at which the window will slide or move forward
>>>>
>>>> OK, so your batch interval is 5 minutes. That is the rate at which messages are coming in from the source.
>>>>
>>>> Then you have these two parameters:
>>>>
>>>> // window length - the duration of the window; must be a multiple of the
>>>> // batch interval n set in StreamingContext(sparkConf, Seconds(n))
>>>> val windowLength = x    // x = m * n
>>>> // sliding interval - the interval at which the window operation is performed,
>>>> // in other words how often data is collected within this window
>>>> val slidingInterval = y // where x / y is a whole number
>>>>
>>>> Both the window length and the sliding interval must be multiples of the batch interval, as received data is divided into batches of duration "batch interval".
>>>>
>>>> If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60 seconds.
>>>> If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5 * 60 seconds.
>>>>
>>>> Your sliding interval should be set to the batch interval = 5 * 60 seconds. In other words, that is where the aggregates and summaries for your report come from.
>>>>
>>>> What is your data source here?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>>>>
>>>> We have some stream data that needs to be calculated, and we are considering using Spark Streaming to do it.
>>>>
>>>> We need to generate three kinds of reports. The reports are based on:
>>>>
>>>> 1. The last 5 minutes of data
>>>> 2. The last 1 hour of data
>>>> 3. The last 24 hours of data
>>>>
>>>> The frequency of the reports is 5 minutes.
>>>>
>>>> After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5-minute interval and two windows of 1 hour and 1 day.
>>>>
>>>> But I am worried that windows of one day and one hour may be too big. I do not have much experience with Spark Streaming, so what window lengths do you use in your environment?
>>>>
>>>> Are there any official docs talking about this?
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
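Putting the terminology above together, here is a minimal Spark Streaming sketch for the three reports. The socket source, host and port are placeholders purely for illustration (the original poster never says what the real source is), and the counts stand in for whatever aggregation the reports actually need:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object ReportWindowsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReportWindowsSketch")
    // Batch interval n = 5 minutes: every input DStream produces one RDD per 300 seconds.
    val ssc = new StreamingContext(conf, Seconds(300))

    // Hypothetical source; replace with Kafka, files, etc. as appropriate.
    val lines = ssc.socketTextStream("localhost", 9999)

    // 5-minute report: just the current batch.
    lines.count().print()

    // 1-hour report: windowLength = 12 * 5 minutes, slidingInterval = 5 minutes,
    // both multiples of the batch interval.
    lines.window(Minutes(60), Minutes(5)).count().print()

    // 24-hour report: windowLength = 24 * 12 * 5 minutes; note that the whole
    // window's data stays in executor memory, which is the concern raised above.
    lines.window(Minutes(24 * 60), Minutes(5)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If executor memory for the 24-hour window is the worry, the one-off batch job sketched earlier in the thread, run against data that has already landed on persistent storage, may be the safer way to produce the daily report.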