OK, you can see that process 10603 (Worker) is running as the worker/slave in
your cluster, registered with the master at spark://ES01:7077, with its web UI
on port 8081 (webui-port), which you can access through a browser.
You also have process 12420 running as SparkSubmit; that is the driver JVM for
the job you submitted.
[root@ES01 test]# jps
10409 Master
12578 CoarseGrainedExecutorBackend
24089 NameNode
17705 Jps
24184 DataNode
10603 Worker
12420 SparkSubmit
[root@ES01 test]# ps -awx | grep -i spark | grep java
10409 ?        Sl     1:52 java -cp
What does jps return?
jps
16738 ResourceManager
14786 Worker
17059 JobHistoryServer
12421 QuorumPeerMain
9061 RunJar
9286 RunJar
5190 SparkSubmit
16806 NodeManager
16264 DataNode
16138 NameNode
16430 SecondaryNameNode
22036 SparkSubmit
9557 Jps
13240 Kafka
2522 Master
and
ps -awx | grep -i
Hi Mich
From the ps command output I can find four processes: 10409 is the master and
10603 is the worker; 12420 is the driver program and 12578 should be the
executor (worker). Am I right?
So you mean that 12420 is actually running both the driver and the worker roles?
[root@ES01 ~]# ps -awx | grep
hm,
This is standalone mode.
When you are running Spark in standalone mode, you only have one worker that
lives within the driver JVM process that you start when you start
spark-shell or spark-submit.
However, since the driver-memory setting encapsulates the JVM, you will need to
set the amount of memory for the driver before the JVM starts (e.g. with
--driver-memory).
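To make that concrete, here is a minimal hedged sketch of capping both pools
explicitly (app name and values are illustrative, not from this thread). Note
that in client mode spark.driver.memory only takes effect if it is set before
the driver JVM launches, which is why --driver-memory on spark-submit is the
reliable route:

import org.apache.spark.{SparkConf, SparkContext}

object MemoryCapSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setAppName("MemoryCapSketch").       // illustrative name
      set("spark.executor.memory", "4g").  // per-executor heap cap
      set("spark.driver.memory", "2g")     // only effective before JVM launch
    val sc = new SparkContext(conf)        // master supplied via spark-submit
    // ... job logic ...
    sc.stop()
  }
}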
I actually provided them in the submit command here:
nohup ./bin/spark-submit --master spark://ES01:7077 --executor-memory 4G \
--num-executors 1 --total-executor-cores 1 --conf \
"spark.storage.memoryFraction=0.2" ./mycode.py 1>a.log 2>b.log &
At 2016-05-10 21:19:06, "Mich Talebzadeh" wrote:
Hi Mingwei,
In your Spark conf settings, what are you providing for these parameters? *Are
you capping them?*
For example
val conf = new SparkConf().
  setAppName("AppName").
  setMaster("local[2]").
  set("spark.executor.memory", "4G")
Hi Mich
I added some more info (the spark-env.sh settings and top command output) in
that thread. Can you help to check, please?
Regards
Mingwei
At 2016-05-09 23:45:19, "Mich Talebzadeh" wrote:
I had a look at the thread.
This is what you have, which I gather is a standalone box, in other words one
worker node:
bin/spark-submit --master spark://ES01:7077 --executor-memory 4G
--num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log
But what I don't understand is why it is using
Actually it is interesting to understand how Spark allocates cache size
within each worker node. If it is allocated dynamically, won't a memory error
only occur once all cache memory is exhausted?
Also it really depends on the operation. For example, I would not use Spark
for this purpose, to get
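On the cache-allocation question above, one hedged sketch (input path
hypothetical): with the default MEMORY_ONLY level, blocks that do not fit in
the storage pool are dropped and recomputed rather than held until everything
is exhausted, whereas MEMORY_AND_DISK spills to disk instead:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLevelSketch {
  def main(args: Array[String]): Unit = {
    // master supplied via spark-submit
    val sc = new SparkContext(new SparkConf().setAppName("CacheLevelSketch"))
    val rdd = sc.textFile("hdfs://ES01:9000/data/input") // hypothetical path
    rdd.persist(StorageLevel.MEMORY_AND_DISK) // spill instead of dropping blocks
    println(rdd.count())
    sc.stop()
  }
}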
Great.
So in the simplest of forms, let us assume that I have a standalone host that
runs Spark and receives topics from a source, say Kafka.
So basically I have one executor and one cache on the node, and if my streaming
data is too much, I anticipate execution will fail as I don't have enough
memory.
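A hedged sketch of that single-host setup, using the Spark 1.6-era direct
Kafka API (broker address and topic name are hypothetical):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SingleHostKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SingleHostKafkaSketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "ES01:9092") // hypothetical broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic")) // hypothetical topic
    stream.map(_._2).count().print() // values only; count per batch
    ssc.start()
    ssc.awaitTermination()
  }
}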
Please see the inline comments.
On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar wrote:
> Thank you.
>
> So if I create Spark streaming then
>
>
> 1. The streams will always need to be cached? They cannot be stored in
> persistent storage?
>
> You don't need to cache the
Thank you.
So if I create Spark streaming then:
- The streams will always need to be cached? They cannot be stored in
persistent storage?
- The cached stream data will be distributed among all nodes of Spark, among
the executors?
- As I understand it, each Spark worker node has one executor that
No, each executor only stores part of the data in memory (it depends on how the
partitions are distributed and how many receivers you have).
For a WindowedDStream, it will obviously cache the data in memory; from my
understanding you don't need to call cache() again.
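A minimal hedged sketch of that point (the socket source and port are
illustrative stand-ins for a real receiver): window() makes Spark retain the
underlying input blocks itself, so no explicit cache() is added:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
    // 5-minute window sliding every 5 seconds; Spark persists the
    // input blocks backing the window automatically.
    val windowed = lines.window(Seconds(300), Seconds(5))
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}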
On Mon, May 9, 2016 at 5:06 PM,
Hi,
So if I have 10 GB of streaming data coming in, does it require 10 GB of memory
on each node?
Also, in that case, why do we need to use
dstream.cache()
Thanks
On Monday, 9 May 2016, 9:58, Saisai Shao wrote:
It depends on how you write the Spark application; normally, if the data is
already on persistent storage, there's no need to put it into memory.
The reason why Spark Streaming data has to be stored in memory is that a
streaming source is not a persistent source, so you need to have a place to
store the data.
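A hedged sketch of that alternative (source and output path hypothetical):
flush each micro-batch to durable storage, after which downstream jobs can
read it back without caching:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PersistBatchesSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("PersistBatchesSketch"), Seconds(5))
    ssc.socketTextStream("localhost", 9999)               // illustrative source
       .saveAsTextFiles("hdfs://ES01:9000/streams/batch") // one dir per batch interval
    ssc.start()
    ssc.awaitTermination()
  }
}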
Hi,
Have you thought of other alternatives, like collecting the data in a database
(over a 24-hour period)?
I mean, do you require reports at 5-minute intervals *after 24 hours of data
collection*, from t0, t0+5m, t0+10m? You can only do so after collecting the
data; then you can partition your table into 5
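One hedged way to sketch that idea with Spark itself rather than an external
database (table layout, column name ts, and paths are all hypothetical; this
uses the later SparkSession API): bucket each event into a 5-minute partition,
then report per bucket after the 24 hours have landed:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object FiveMinuteBucketsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("FiveMinuteBucketsSketch").getOrCreate()
    val events = spark.read.parquet("hdfs://ES01:9000/events/day1") // hypothetical input
    events
      .withColumn("bucket", (unix_timestamp(col("ts")) / 300).cast("long") * 300)
      .write.partitionBy("bucket") // one partition per 5-minute interval
      .parquet("hdfs://ES01:9000/events/day1_bucketed")
    spark.stop()
  }
}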