Re: Spark concurrency question

2015-02-08 Thread Sean Owen
I think I have this right:

You will run one executor per application per worker. Generally there
is one worker per machine, and it manages all of the machine's
resources. So if you want one app to use this whole machine you need
to ask for 48G and 24 cores. That's better than splitting up the
resources such that no executor can use more than 4G.
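For example, something along these lines when launching spark-shell
against your standalone master (the host and URL are placeholders):

  spark-shell --master spark://yourhost:7077 \
    --executor-memory 48g --total-executor-cores 24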

(However, with big heaps, say beyond roughly 32G, it can make sense to
limit the size of an executor. For example, you could configure 3
workers per machine, each controlling 8 cores and 16G, and ask for
smaller executors. Still, I don't think it would make sense to run 12
workers per machine here.)
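A sketch of that 3-worker layout in conf/spark-env.sh on the machine
(the values are just the example numbers above):

  SPARK_WORKER_INSTANCES=3
  SPARK_WORKER_CORES=8
  SPARK_WORKER_MEMORY=16g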

10 tasks (1 per partition) will execute. They generally get assigned
to favor data locality, but here everything's local. If you had 3
executors of 8 cores, I'm not sure if it's guaranteed to balance but
it should be using at least 2 executors, since there are 10 tasks and
8*3=24 slots.
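If you want to verify the partitioning yourself, a quick check from
spark-shell (the HDFS path is a placeholder):

  val rdd = sc.textFile("hdfs:///path/to/1280mb-file")
  rdd.partitions.size   // typically 10 for ten 128MB blocks
  rdd.count()           // one stage with 10 tasks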

In your initial scenario, I think it may be waiting because the single
worker has all of its cores devoted to your first app's single
executor. You can ask for fewer cores in each spark-shell.
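For instance, capping each shell's cores would leave room for the other
(the numbers are only illustrative):

  spark-shell --master spark://yourhost:7077 \
    --total-executor-cores 12 --executor-memory 16g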

Not sure what you mean about threads. Yes of course threads are used
within one JVM / executor. It's not an executor per partition; it's a
task per partition and 1 executor per application per worker (and
usually 1 worker per machine, but not always). One task executes
serially in one thread, and as many tasks as there are slots can run
concurrently; there is 1 slot per core the executor is using. I
suppose in theory you could write a function that starts its own
threads too, but that's generally neither a good idea nor necessary.

Did you read the docs on the site?
http://spark.apache.org/docs/latest/cluster-overview.html
http://spark.apache.org/docs/latest/spark-standalone.html

On Sun, Feb 8, 2015 at 7:18 PM, java8964 java8...@hotmail.com wrote:
 Hi, I have some questions about how Spark runs jobs concurrently.

 For example, suppose I set up Spark on one standalone test box, which has 24
 cores and 64G of memory. I set the worker memory to 48G and the executor
 memory to 4G, and use spark-shell to run some jobs. Here is what is
 confusing me:

 1) Does the above setting mean that I can have up to 12 executors running on
 this box at the same time?
 2) Let's assume I want to do a line count of one 1280M HDFS file, which
 has 10 blocks of 128M each. In this case, when the Spark program starts
 to run, will it kick off one executor using 10 threads to read these 10
 blocks of HDFS data, or 10 executors each reading one block? Or something else?
 I have read the Apache Spark documentation, so I know that this 1280M HDFS file
 will be split into 10 partitions, but how the executors run them is not clear to me.
 3) In my test case, I started one spark-shell to run a very expensive job. I
 saw in the Spark web UI that 8 stages were generated, with 200 to 400 tasks
 in each stage, and the tasks started to run. At that point, I started another
 spark-shell connected to the master and tried to run a small Spark program. The
 spark-shell showed my new small program waiting for resources. Why, and what
 kind of resources is it waiting for? If it is waiting for memory, does this
 mean that there are 12 concurrent tasks running in the first program, taking
 12 * 4G = 48G of the memory given to the worker, so no more resources are
 available? If so, in this case, is one running task one executor?
 4) In MapReduce, the counts of map and reduce tasks are the resources used by
 the cluster. My understanding is that Spark uses multiple threads instead of
 individual JVM processes. In this case, does the executor use its 4G heap to
 run multiple threads? My real question is whether each executor corresponds to
 one RDD partition, or whether an executor can spawn a thread per RDD
 partition. On the other hand, how does the worker decide how many executors
 to create?

 If there is any online documentation answering the above questions, please let
 me know. I searched the Apache Spark site but couldn't find it.

 Thanks

 Yong


Re: Spark concurrency question

2015-02-08 Thread Sean Owen
On Sun, Feb 8, 2015 at 10:26 PM, java8964 java8...@hotmail.com wrote:
 In a standalone one-box environment, if I want to use all 48G of memory
 allocated to the worker for my application, I should ask for 48G of memory for
 the executor in the spark shell, right? Because 48G is too big for a JVM heap
 in the normal case, I can and should consider starting multiple workers on one
 box, to lower the executor memory, but still use all 48G of memory.

Yes.

 In the Spark documentation, about the --cores parameter, the default is all
 available cores; does that mean using all available cores in all workers, even
 in a cluster environment? If so, in the default case, if one client submits a
 huge job, will it use all the available cores of the cluster for all the
 tasks it generates?

Have a look at how cores work in standalone mode:
http://spark.apache.org/docs/latest/job-scheduling.html
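For example, you can cap cores either cluster-wide on the master or per
application, so one app doesn't take everything by default (the value 12
is only illustrative):

  # on the master, in conf/spark-env.sh
  SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=12"

  # or per application, in conf/spark-defaults.conf
  spark.cores.max 12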

 One thing that is still not clear: in the example I gave, 10 tasks (1
 per partition) will execute, but there is one executor per application. In
 this case, I have the following 2 questions, assuming that the worker memory
 is set to 48G, the executor memory is set to 4G, and I use one spark-shell
 to connect to the master to submit my application:

 1) How many executors will be created on this box (or even in the cluster, if
 it is running in a cluster)? I don't see any Spark configuration related to
 setting the number of executors in spark-shell. If it is more than one, how is
 this number calculated?

Again, from http://spark.apache.org/docs/latest/job-scheduling.html: in
standalone mode the default is 1 executor per worker, but you can
change that.

 2) Do you mean that one partition (or the one task for it) will be run by one
 executor? Does that mean one executor runs its tasks sequentially, and job
 concurrency comes from multiple executors running simultaneously?

A partition maps to a task, which is computed serially. Tasks are
executed in parallel in an executor, which can execute many tasks at
once. No, parallelism does not (only) come from running many
executors.
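As a rough illustration, using the 3-worker layout from before: 3
executors * 8 cores = 24 task slots, so a stage of 10 tasks runs all 10
at once, while a stage of 200 tasks runs 24 at a time until all 200
finish.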
