I actually provided them in submit command here:

nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G \
--num-executors 1 --total-executor-cores 1 --conf \
"spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log &


At 2016-05-10 21:19:06, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

Hi Mingwei,


In your Spark conf settings, what are you providing for these parameters? Are 
you capping them?


For example


  val conf = new SparkConf().
               setAppName("AppName").
               setMaster("local[2]").
               set("spark.executor.memory", "4G").
               set("spark.cores.max", "2").
               set("spark.driver.allowMultipleContexts", "true")
  val sc = new SparkContext(conf)


I assume you are running in standalone mode, so each worker (aka slave) grabs 
all the available cores and allocates the remaining memory on each host. Do 
not provide new values for these parameters, i.e. do not overwrite them, in

${SPARK_HOME}/bin/spark-submit  --
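For what it's worth, a rough PySpark equivalent of the Scala SparkConf example above might look like the following. This is only a sketch (it assumes pyspark is installed, and the values are the same illustrative ones as in the Scala snippet, not recommendations); the point is that these settings should live in one place, either here or on the spark-submit command line, but not both:

```python
# Sketch: set resource limits once, in the application's SparkConf,
# rather than repeating them as spark-submit flags (which would
# otherwise override or conflict with these values).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("AppName")
        .setMaster("local[2]")
        .set("spark.executor.memory", "4g")
        .set("spark.cores.max", "2"))
sc = SparkContext(conf=conf)
```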




HTH


Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 10 May 2016 at 03:12, 李明伟 <kramer2...@126.com> wrote:

Hi Mich


I added some more info (the spark-env.sh settings and top command output) in 
that thread. Can you help to check, please?


Regards
Mingwei






At 2016-05-09 23:45:19, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

I had a look at the thread.


This is what you have, which I gather is a standalone box, in other words one 
worker node:


bin/spark-submit   --master spark://ES01:7077 --executor-memory 4G 
--num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log


But what I don't understand is why it is using 80% of your RAM as opposed to 
25% of it (4GB/16GB), right?


Where else have you set up these parameters, for example in 
$SPARK_HOME/conf/spark-env.sh?


Can you send the output of /usr/bin/free and top?


HTH



Dr Mich Talebzadeh

 


 



On 9 May 2016 at 16:19, 李明伟 <kramer2...@126.com> wrote:

Thanks for all the information guys. 


I wrote some code to do the test, not using windows, so it only calculates 
data for each batch interval. I set the interval to 30 seconds and also 
reduced the size of data to about 30,000 lines of CSV.
That means my code should run its calculation on 30,000 lines of CSV within 
30 seconds. I think it is not a very heavy workload, but my Spark Streaming 
code still crashes.


I send another post to the user list here 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
 
Is it possible for you to have a look, please? It would be much appreciated.






At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:

Please see the inline comments.




On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

Thank you.


> So if I create a Spark stream, then:
>
> The streams will always need to be cached? They cannot be stored in
> persistent storage?

You don't need to cache the stream explicitly if you don't have a specific 
requirement; Spark will do it for you, depending on the streaming source 
(Kafka or socket).

> The stream data cached will be distributed among all nodes of Spark, among
> executors. As I understand it, each Spark worker node has one executor
> that includes a cache, so the streaming data is distributed among these
> worker node caches. For example, if I have 4 worker nodes, each cache will
> have a quarter of the data (this assumes that cache size among worker
> nodes is the same).

Ideally it will be distributed evenly across the executors; this is also a 
target for tuning. Normally it depends on several conditions, such as 
receiver distribution and partition distribution.

> The issue arises if the amount of streaming data does not fit into these 4
> caches? Will the job crash?
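Not something stated in this thread, but one commonly used mitigation when cached stream data may not fit in executor memory is to persist the DStream with a storage level that can spill to disk. A rough PySpark sketch (the host and port are placeholders, and pyspark is assumed to be installed):

```python
# Sketch: persist windowed/cached stream data with MEMORY_AND_DISK so
# blocks that do not fit in memory spill to disk instead of being
# recomputed or causing failures.
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SpillExample")
ssc = StreamingContext(sc, batchDuration=300)  # 5-minute batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
lines.persist(StorageLevel.MEMORY_AND_DISK)
```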



On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:




No, each executor only stores part of the data in memory (it depends on how 
the partitions are distributed and how many receivers you have).


For WindowedDStream, it will obviously cache the data in memory; from my 
understanding you don't need to call cache() again.


On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

hi,


so if i have 10gb of streaming data coming in does it require 10gb of memory in 
each node?


also in that case why do we need using


dstream.cache()


thanks



On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:




It depends on how you write the Spark application; normally, if data is 
already on persistent storage, there's no need to put it into memory. The 
reason why Spark Streaming data has to be stored in memory is that a 
streaming source is not a persistent source, so you need a place to store 
the data.


On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:

Thanks.
What if I use batch calculation instead of stream computing? Do I still need 
that much memory? For example, if the 24-hour data set is 100 GB, do I also 
need 100 GB of RAM to do the one-time batch calculation?






At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:

For window-related operators, Spark Streaming will cache the data in memory 
within the window. In your case the window size is up to 24 hours, which 
means data has to stay in executor memory for more than one day; this may 
introduce several problems when memory is not enough.


On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

OK, some terms for Spark Streaming:


"Batch interval" is the basic interval at which the system with receive the 
data in batches.
This is the interval set when creating a StreamingContext. For example, if you 
set the batch interval as 300 seconds, then any input DStream will generate 
RDDs of received data at 300 seconds intervals.
A window operator is defined by two parameters -
- WindowDuration / WindowsLength - the length of the window
- SlideDuration / SlidingInterval - the interval at which the window will slide 
or move forward




OK, so your batch interval is 5 minutes. That is the rate at which messages 
are coming in from the source.


Then you have these two params


// window length - the duration of the window; must be a multiple of the
// batch interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = m * n
// sliding interval - the interval at which the window operation is
// performed, in other words data is collected within this interval;
// windowLength / slidingInterval must be a whole number
val slidingInterval = y   // where windowLength / y is an integer


Both the window length and the slidingInterval duration must be multiples of 
the batch interval, as received data is divided into batches of duration "batch 
interval".


If you want to collect 1 hour of data then windowLength = 12 * 5 * 60 seconds.
If you want to collect 24 hours of data then windowLength = 24 * 12 * 5 * 60 
seconds.


Your sliding interval could be set to the batch interval = 5 * 60 seconds. In 
other words, that is where the aggregates and summaries come in for your 
report.
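To sanity-check the arithmetic above (plain Python, just illustrating the figures discussed in this thread):

```python
# With a 5-minute (300 s) batch interval, both the window length and
# the sliding interval must be whole multiples of the batch interval.
batch_interval = 5 * 60                      # 300 seconds

one_hour_window = 12 * batch_interval        # 1 hour of data
one_day_window = 24 * 12 * batch_interval    # 24 hours of data
sliding_interval = batch_interval            # report every 5 minutes

# Both window durations are exact multiples of the batch interval
assert one_hour_window % batch_interval == 0
assert one_day_window % batch_interval == 0

print(one_hour_window, one_day_window, sliding_interval)  # 3600 86400 300
```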


What is your data source here?


HTH




Dr Mich Talebzadeh
 
 


On 9 May 2016 at 04:19, kramer2...@126.com<kramer2...@126.com> wrote:
We have some stream data that needs to be calculated, and we are considering 
using Spark Streaming to do it.

We need to generate three kinds of reports. The reports are based on

1. The last 5 minutes data
2. The last 1 hour data
3. The last 24 hour data

The frequency of reports is 5 minutes.

After reading the docs, the most obvious way to solve this seems to be to set 
up a Spark stream with a 5-minute batch interval and two windows of 1 hour 
and 1 day.


But I am worried that windows of one hour and one day may be too big. I do 
not have much experience with Spark Streaming, so what window lengths do you 
use in your environment?

Any official docs talking about this?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

