Re: Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-11 Thread Mich Talebzadeh
OK, you can see that process 10603, Worker, is running as the worker/slave of master spark://ES01:7077, with its GUI on webui-port 8081, which you can access through the web. Also you have process 12420 running as SparkSubmit; that is the JVM for the job you have submitted for this
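(If that reading is right, the Worker's web page would presumably be reachable in a browser at http://ES01:8081; ES01 and webui-port 8081 are taken from the output above, and the master UI conventionally sits on port 8080.)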

Re: Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread 李明伟
[root@ES01 test]# jps
10409 Master
12578 CoarseGrainedExecutorBackend
24089 NameNode
17705 Jps
24184 DataNode
10603 Worker
12420 SparkSubmit
[root@ES01 test]# ps -awx | grep -i spark | grep java
10409 ?  Sl  1:52 java -cp

Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh
What does jps return?

jps
16738 ResourceManager
14786 Worker
17059 JobHistoryServer
12421 QuorumPeerMain
9061 RunJar
9286 RunJar
5190 SparkSubmit
16806 NodeManager
16264 DataNode
16138 NameNode
16430 SecondaryNameNode
22036 SparkSubmit
9557 Jps
13240 Kafka
2522 Master

and ps -awx | grep -i

Re: Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread 李明伟
Hi Mich, From the ps command I can find four processes. 10409 is the master and 10603 is the worker; 12420 is the driver program and 12578 should be the executor (worker). Am I right? So you mean 12420 is actually running both the driver and the worker roles? [root@ES01 ~]# ps -awx | grep

Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh
Hm, this is standalone mode. When you are running Spark in standalone mode, you only have one worker that lives within the driver JVM process that you start when you start spark-shell or spark-submit. However, since the driver-memory setting encapsulates the JVM, you will need to set the amount
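One practical consequence, sketched below with placeholder values: because the driver JVM's heap is fixed when the process launches, spark.driver.memory only takes effect if it is supplied before launch (spark-submit's --driver-memory or spark-defaults.conf), not from inside the already-running application.

    // Sketch with placeholder values: the driver heap is fixed at JVM launch,
    // so prefer `spark-submit --driver-memory 4G`; setting it here is too late
    // once the driver JVM is already running.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("WindowTest")            // hypothetical app name
      .set("spark.driver.memory", "4G")    // effective only if read pre-launch
    val sc = new SparkContext(conf)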

Re: Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread 李明伟
I actually provided them in the submit command here: nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 --conf "spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log & At 2016-05-10 21:19:06, "Mich Talebzadeh"
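For comparison, a sketch of the same caps expressed through SparkConf rather than submit flags (note that in Spark 1.x --num-executors is a YARN option; on a standalone master the closest knob is spark.cores.max):

    import org.apache.spark.SparkConf

    // Sketch mirroring the flags in the submit command above (Spark 1.x era keys).
    val conf = new SparkConf()
      .setMaster("spark://ES01:7077")
      .set("spark.executor.memory", "4g")           // --executor-memory 4G
      .set("spark.cores.max", "1")                  // --total-executor-cores 1
      .set("spark.storage.memoryFraction", "0.2")   // cap the cache share of the heap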

Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-10 Thread Mich Talebzadeh
Hi Mingwei, In your Spark conf setting, what are you providing for these parameters? *Are you capping them?* For example:

val conf = new SparkConf().
  setAppName("AppName").
  setMaster("local[2]").
  set("spark.executor.memory", "4G").
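A quick way to check what actually took effect, assuming an existing SparkContext named sc (a sketch; the keys are standard Spark conf keys):

    // Sketch: print the effective values to confirm whether the caps took hold.
    sc.getConf.getOption("spark.executor.memory").foreach(v => println(s"executor memory = $v"))
    sc.getConf.getOption("spark.cores.max").foreach(v => println(s"cores cap = $v"))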

Re: Re: Re: Re: How big the spark stream window could be ?

2016-05-09 Thread 李明伟
Hi Mich, I added some more info (the spark-env.sh settings and top command output) in that thread. Can you help to check, please? Regards, Mingwei At 2016-05-09 23:45:19, "Mich Talebzadeh" wrote: I had a look at the thread. This is what you have, which I gather

Re: Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Mich Talebzadeh
I had a look at the thread. This is what you have, which I gather is a standalone box, in other words one worker node: bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log But what I don't understand is why it is using

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Mich Talebzadeh
Actually it is interesting to understand how Spark allocates cache size within each worker node. If it is allocated dynamically, then won't memory errors only occur once all cache memory is exhausted? Also it really depends on the operation; for example, I would not use Spark for this purpose, to get
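For what it is worth, under the Spark 1.x legacy memory manager the cache pool is a fixed fraction of the executor heap rather than fully dynamic, and MEMORY_ONLY blocks that no longer fit are evicted or dropped rather than erroring outright. A rough arithmetic sketch for the 4G executor in this thread:

    // Rough sketch, assuming Spark 1.x legacy memory management:
    // storage pool ~= executor heap * spark.storage.memoryFraction * safety fraction
    val heapGb         = 4.0   // --executor-memory 4G
    val memoryFraction = 0.2   // spark.storage.memoryFraction from the submit command
    val safetyFraction = 0.9   // default spark.storage.safetyFraction
    val storagePoolGb  = heapGb * memoryFraction * safetyFraction
    println(f"approx cache pool: $storagePoolGb%.2f GB")  // ~0.72 GB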

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
Great. So in the simplest of forms, let us assume that I have a standalone host that runs Spark and receives topics from a source, say Kafka. So basically I have one executor, one cache on the node, and if my streaming data is too much, I anticipate there will not be execution as I don't have memory.

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
Please see the inline comments. On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar wrote:
> Thank you.
> So if I create spark streaming then
> 1. The streams will always need to be cached? It cannot be stored in persistent storage

You don't need to cache the
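A minimal sketch of that point: each batch of a DStream can be flushed to persistent storage instead of being kept cached. The source, path and batch interval below are placeholders, and conf is assumed to be a SparkConf like the ones earlier in the thread:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: write each micro-batch out, so it need not stay in executor memory.
    val ssc   = new StreamingContext(conf, Seconds(300))     // 5-minute batches
    val lines = ssc.socketTextStream("localhost", 9999)      // placeholder source
    lines.saveAsTextFiles("hdfs:///tmp/stream-out/batch")    // placeholder path
    ssc.start()
    ssc.awaitTermination()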

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
Thank you. So if I create spark streaming then
- The streams will always need to be cached? It cannot be stored in persistent storage
- The stream data cached will be distributed among all nodes of Spark among executors
- As I understand, each Spark worker node has one executor that

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
No, each executor only stores part of the data in memory (it depends on how the partitions are distributed and how many receivers you have). A WindowedDStream will obviously cache the data in memory; from my understanding you don't need to call cache() again. On Mon, May 9, 2016 at 5:06 PM,
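To make that concrete, a windowed-stream sketch with no explicit cache() call; the source and intervals are placeholders chosen to echo the 24-hour window with 5-minute reports discussed in this thread, and conf is assumed as before:

    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val ssc    = new StreamingContext(conf, Seconds(60))
    val events = ssc.socketTextStream("ES01", 9999)   // placeholder source/port
    // A WindowedDStream persists its parent internally, so no events.cache() here.
    val windowed = events.window(Minutes(24 * 60), Minutes(5))
    windowed.count().print()
    ssc.start()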

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Ashok Kumar
Hi, so if I have 10 GB of streaming data coming in, does it require 10 GB of memory in each node? Also, in that case, why do we need to use dstream.cache()? Thanks On Monday, 9 May 2016, 9:58, Saisai Shao wrote: It depends on how you write the Spark application,

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Saisai Shao
It depends on how you write the Spark application; normally, if data is already on persistent storage, there is no need to put it into memory. The reason why Spark Streaming data has to be stored in memory is that a streaming source is not a persistent source, so you need to have a place to store the
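A sketch of the contrast being drawn, assuming an existing SparkContext sc and StreamingContext ssc; the path, host and port are placeholders, while MEMORY_AND_DISK_SER_2 is the actual default storage level for receiver-based streams in Spark 1.x:

    import org.apache.spark.storage.StorageLevel

    // Batch case: the data already sits on persistent storage; nothing to cache up front.
    val fromDisk = sc.textFile("hdfs:///data/existing")   // placeholder path

    // Streaming case: the source is transient, so received blocks must be stored first.
    val fromWire = ssc.socketTextStream("ES01", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)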

Re: Re: How big the spark stream window could be ?

2016-05-09 Thread Mich Talebzadeh
Hi, Have you thought of other alternatives, like collecting the data in a database (over a 24-hour period)? I mean, do you require reports at 5-minute intervals *after 24 hours of data collection* from t0, t0+5min, t0+10min? You can only do so after collecting the data; then you can partition your table into 5