Hm, this is standalone mode.
When you are running Spark in standalone mode, you only have one worker, which lives within the driver JVM process that you start when you run spark-shell or spark-submit. However, since the driver-memory setting encapsulates the JVM, you will need to set any non-default amount of driver memory before starting the JVM, by providing the new value:

${SPARK_HOME}/bin/spark-submit --driver-memory 5g

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 11 May 2016 at 01:22, 李明伟 <kramer2...@126.com> wrote:

> I actually provided them in the submit command here:
>
> nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log &
>
>
> At 2016-05-10 21:19:06, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>
> Hi Mingwei,
>
> In your Spark conf settings, what are you providing for these parameters? Are you capping them?
>
> For example:
>
> val conf = new SparkConf().
>   setAppName("AppName").
>   setMaster("local[2]").
>   set("spark.executor.memory", "4G").
>   set("spark.cores.max", "2").
>   set("spark.driver.allowMultipleContexts", "true")
> val sc = new SparkContext(conf)
>
> I assume you are running in standalone mode, so each worker (aka slave) grabs all the available cores and allocates the remaining memory on each host. Do not provide new values for these parameters, in other words do not overwrite them, in
>
> ${SPARK_HOME}/bin/spark-submit --
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> On 10 May 2016 at 03:12, 李明伟 <kramer2...@126.com> wrote:
>
>> Hi Mich,
>>
>> I added some more info (the spark-env.sh settings and the top command output) in that thread. Can you help to check, please?
>>
>> Regards,
>> Mingwei
>>
>>
>> At 2016-05-09 23:45:19, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>
>> I had a look at the thread.
>>
>> This is what you have, which I gather is a standalone box, in other words one worker node:
>>
>> bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log
>>
>> But what I don't understand is why it is using 80% of your RAM as opposed to 25% of it (4GB/16GB), right?
>>
>> Where else have you set up these parameters, for example in $SPARK_HOME/conf/spark-env.sh?
>>
>> Can you send the output of /usr/bin/free and top?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
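Since the same knobs can come from SparkConf in the code, from the spark-submit command line, and from spark-env.sh or spark-defaults.conf, one way to untangle a setup like the one above is to read the values back from the running context and see which ones actually took effect. A minimal sketch, purely for illustration (the app name is a placeholder and the job does nothing else):

import org.apache.spark.{SparkConf, SparkContext}

object ShowEffectiveConf {
  def main(args: Array[String]): Unit = {
    // Only the app name is set here; everything else is taken from spark-submit,
    // spark-defaults.conf or spark-env.sh, so the printout shows what was really applied.
    val sc = new SparkContext(new SparkConf().setAppName("ShowEffectiveConf"))

    val keys = Seq("spark.master", "spark.driver.memory",
                   "spark.executor.memory", "spark.cores.max")
    keys.foreach { key =>
      println(s"$key = ${sc.getConf.getOption(key).getOrElse("<not set, default applies>")}")
    }
    sc.stop()
  }
}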
>>
>> On 9 May 2016 at 16:19, 李明伟 <kramer2...@126.com> wrote:
>>
>>> Thanks for all the information, guys.
>>>
>>> I wrote some code to do the test, not using a window, so it only calculates data for each batch interval. I set the interval to 30 seconds and also reduced the size of the data to about 30,000 lines of CSV. That means my code should do the calculation on 30,000 lines of CSV within 30 seconds. I think it is not a very heavy workload, but my Spark Streaming code still crashes.
>>>
>>> I sent another post to the user list here:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>>>
>>> Is it possible for you to have a look, please? Much appreciated.
>>>
>>>
>>> At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>>
>>> Please see the inline comments.
>>>
>>> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>
>>>> Thank you.
>>>>
>>>> So if I create a Spark stream, then:
>>>>
>>>> 1. Will the streams always need to be cached? Can they not be stored in persistent storage?
>>>>
>>> You don't need to cache the stream explicitly if you don't have a specific requirement; Spark will do it for you depending on the streaming source (Kafka or socket).
>>>
>>>> 2. The stream data cached will be distributed among all nodes of Spark, among the executors.
>>>> 3. As I understand it, each Spark worker node has one executor that includes a cache, so the streaming data is distributed among these worker node caches. For example, if I have 4 worker nodes, each cache will hold a quarter of the data (assuming the cache size is the same on every worker node).
>>>>
>>> Ideally it will be distributed evenly across the executors; this is also a target for tuning. Normally it depends on several conditions, such as receiver distribution and partition distribution.
>>>
>>>> The issue arises if the amount of streaming data does not fit into these 4 caches. Will the job crash?
>>>>
>>>>
>>>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>>>
>>>> No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have).
>>>>
>>>> For WindowedDStream, it will obviously cache the data in memory; from my understanding you don't need to call cache() again.
>>>>
>>>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> So if I have 10 GB of streaming data coming in, does it require 10 GB of memory in each node?
>>>>
>>>> Also, in that case, why do we need to use
>>>>
>>>> dstream.cache()
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>>>
>>>> It depends on how you write the Spark application. Normally, if the data is already on persistent storage, there is no need to put it into memory. The reason why Spark Streaming has to keep data in memory is that a streaming source is not a persistent source, so you need a place to store the data.
>>>>
>>>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>>>>
>>>> Thanks.
>>>> What if I use batch calculation instead of stream computing? Do I still need that much memory? For example, if the 24-hour data set is 100 GB, do I also need 100 GB of RAM to do the one-time batch calculation?
>>>>
>>>>
>>>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>>>
>>>> For window-related operators, Spark Streaming will cache the data into memory within the window. In your case the window size is up to 24 hours, which means the data has to stay in executor memory for more than one day; this may introduce several problems when memory is not enough.
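On Mingwei's question above about doing the 24-hour calculation as a one-off batch job: when the data already sits on persistent storage, a batch job reads and reduces it partition by partition, so the whole 100 GB never has to be in executor memory at once. A rough Scala sketch to illustrate the point only; the input path and the CSV layout (a numeric value in the third column) are assumptions, not taken from the thread:

import org.apache.spark.{SparkConf, SparkContext}

object DailyBatchSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyBatchSketch"))

    // Hypothetical input path and column layout, for illustration only.
    val lines = sc.textFile("hdfs:///data/2016-05-09/*.csv")

    // No cache()/persist(): each partition is read from disk, reduced and discarded,
    // so the full 24-hour data set never needs to fit in memory at once.
    val total = lines
      .map(_.split(","))
      .filter(_.length > 2)
      .map(_(2).toDouble)
      .sum()

    println(s"24-hour total = $total")
    sc.stop()
  }
}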
>>>>
>>>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>> OK, some terms for Spark Streaming.
>>>>
>>>> The "batch interval" is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval to 300 seconds, then any input DStream will generate RDDs of received data at 300-second intervals.
>>>>
>>>> A window operator is defined by two parameters:
>>>> - windowDuration / windowLength - the length of the window
>>>> - slideDuration / slidingInterval - the interval at which the window will slide or move forward
>>>>
>>>> OK, so your batch interval is 5 minutes. That is the rate at which messages are coming in from the source.
>>>>
>>>> Then you have these two parameters:
>>>>
>>>> // window length - the duration of the window; must be a multiple of the
>>>> // batch interval n set in StreamingContext(sparkConf, Seconds(n))
>>>> val windowLength = x    // x = m * n
>>>> // sliding interval - the interval at which the window operation is performed,
>>>> // in other words how often data is collected within this window
>>>> val slidingInterval = y // where x / y is a whole number
>>>>
>>>> Both the window length and the sliding interval must be multiples of the batch interval, as received data is divided into batches of duration "batch interval".
>>>>
>>>> If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60 seconds.
>>>> If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5 * 60 seconds.
>>>>
>>>> Your sliding interval should be set to the batch interval = 5 * 60 seconds. In other words, that is where the aggregates and summaries for your report come from.
>>>>
>>>> What is your data source here?
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>>>>
>>>> We have some stream data that needs to be calculated, and we are considering using Spark Streaming to do it.
>>>>
>>>> We need to generate three kinds of reports. The reports are based on:
>>>>
>>>> 1. The last 5 minutes of data
>>>> 2. The last 1 hour of data
>>>> 3. The last 24 hours of data
>>>>
>>>> The frequency of the reports is 5 minutes.
>>>>
>>>> After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5-minute interval and two windows of 1 hour and 1 day.
>>>>
>>>> But I am worried that windows of one day and one hour may be too big. I do not have much experience with Spark Streaming, so what window lengths do you use in your environment?
>>>>
>>>> Are there any official docs talking about this?
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
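Putting the terminology above together, here is a minimal Spark Streaming sketch for the three reports. The socket source, host and port are placeholders purely for illustration (the original poster never says what the real source is), and the counts stand in for whatever aggregation the reports actually need:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object ReportWindowsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReportWindowsSketch")
    // Batch interval n = 5 minutes: every input DStream produces one RDD per 300 seconds.
    val ssc = new StreamingContext(conf, Seconds(300))

    // Hypothetical source; replace with Kafka, files, etc. as appropriate.
    val lines = ssc.socketTextStream("localhost", 9999)

    // 5-minute report: just the current batch.
    lines.count().print()

    // 1-hour report: windowLength = 12 * 5 minutes, slidingInterval = 5 minutes,
    // both multiples of the batch interval.
    lines.window(Minutes(60), Minutes(5)).count().print()

    // 24-hour report: windowLength = 24 * 12 * 5 minutes; note that the whole
    // window's data stays in executor memory, which is the concern raised above.
    lines.window(Minutes(24 * 60), Minutes(5)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If executor memory for the 24-hour window is the worry, the one-off batch job sketched earlier in the thread, run against data that has already landed on persistent storage, may be the safer way to produce the daily report.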