Re: How does one decide no of executors/cores/memory allocation?

2015-06-17 Thread nsalian
Hello shreesh,

That can be quite challenging to estimate up front.
A few things that I think should help estimate those numbers:
1) Understanding the cost of the individual transformations in the
application.
E.g., a flatMap can be more expensive in memory than a map.
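A minimal sketch of the difference (assuming sc is the SparkContext, e.g.
the one provided by spark-shell, and the path is a placeholder):

val lines = sc.textFile("hdfs:///data/input.txt")

// map is one-to-one: the output has exactly as many records as the input.
val lengths = lines.map(line => line.length)

// flatMap is one-to-many: each line may expand into many words, so the
// resulting RDD can hold far more records (and memory) than its parent.
val words = lines.flatMap(line => line.split(" "))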

2) The communication patterns can be helpful to understand the cost. The
four types:

None:
 map, filter
All-to-one:
 reduce
One-to-all:
 broadcast
All-to-all:
 reduceByKey, groupByKey, join
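For illustration, a rough sketch with one operation from each category (the
pair RDD here is a made-up example; sc is the spark-shell SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// None: each partition is processed independently, no shuffle.
val filtered = pairs.filter { case (_, v) => v > 1 }

// All-to-one: partial results from all partitions are combined on the driver.
val total = pairs.map(_._2).reduce(_ + _)

// One-to-all: a small lookup table is shipped once to every executor.
val lookup = sc.broadcast(Map("a" -> "alpha", "b" -> "beta"))

// All-to-all: records with the same key are shuffled to the same partition.
val summed = pairs.reduceByKey(_ + _)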

3) Understanding the cost is only the beginning. Depending on how much data
you have, the partitions need to be created accordingly. More, smaller
partitions improve parallelism, but you will need a lot more executor cores
to work through them. On the other hand, fewer, larger partitions keep the
executor count lower, but each executor will need more memory.
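As a back-of-the-envelope sketch of that trade-off (the ~128 MB
per-partition target below is an assumption for illustration, not a rule):

// Hypothetical sizing: derive a starting partition count from input size.
val inputSizeBytes = 64L * 1024 * 1024 * 1024    // e.g. 64 GB of input
val targetPartitionBytes = 128L * 1024 * 1024    // aim for ~128 MB each
val numPartitions = (inputSizeBytes / targetPartitionBytes).toInt  // = 512

val data = sc.textFile("hdfs:///data/input", minPartitions = numPartitions)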

To begin with, I would plan the partitioning strategy, try a starting
number of partitions, and work from there.
Without looking at your use case, it is hard for me to give you specific
numbers. It would be better to start with a basic strategy and optimize
from there.

Hope that helps.

Thank you.








Re: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread shreesh
I realize that there are a lot of ways to configure my application in Spark.
The part that is not clear is how I decide, for example, into how many
partitions I should divide my data, how much RAM I should have, or how many
workers one should initialize.







RE: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Evo Eftimov
Best is by measuring and recording how the performance of your solution
scales as the workload scales - recording the data points so that you can
then do some time-series statistical analysis and visualization.

For example, you can start with a single box with e.g. 8 CPU cores.

Use e.g. one or two partitions and one executor, which would correspond to
one CPU core (JVM thread) processing your workload - scale the workload,
see how the performance scales, and record all data points.
Then repeat the same for more CPU cores, RAM and boxes - you get the idea?
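A minimal sketch of such a measurement loop (the workload, path and
partition counts are placeholders; sc is the spark-shell SparkContext):

// Run the same workload at several partition counts and record the timings
// as data points for later analysis.
val partitionCounts = Seq(1, 2, 4, 8, 16)
val dataPoints = partitionCounts.map { n =>
  val start = System.nanoTime()
  sc.textFile("hdfs:///data/input", minPartitions = n)
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .count()                                 // action forces the job to run
  val seconds = (System.nanoTime() - start) / 1e9
  (n, seconds)
}
dataPoints.foreach { case (n, s) => println(s"partitions=$n -> $s s") }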

Then analyze your performance datasets in the way explained above.

Basically this stuff is known as Performance Engineering and has nothing to
do with any specific product - read something on PE as well.




Re: How does one decide no of executors/cores/memory allocation?

2015-06-16 Thread Himanshu Mehra
Hi Shreesh,

You can definitely decide how many partitions your data should break
into by passing a 'minPartitions' argument to the method
sc.textFile(input/path, minPartitions) and a 'numSlices' argument to the
method sc.parallelize(localCollection, numSlices). In fact, there is always
an option to specify the number of partitions you want whenever you first
create an RDD.
Moreover, you can change the number of partitions at any point by calling
one of these methods on your RDD (a usage sketch follows the list):

'coalesce(numPartitions)': Decrease the number of partitions in the
RDD to numPartitions. Useful for running operations more efficiently after
filtering down a large dataset.

'repartition(numPartitions)': Reshuffle the data in the RDD randomly
to create either more or fewer partitions and balance it across them. This
always shuffles all data over the network.

'repartitionAndSortWithinPartitions(partitioner)':   Repartition the RDD
according to the given partitioner and, within each resulting partition,
sort records by their keys. This is more efficient than calling repartition
and then sorting within each partition because it can push the sorting down
into the shuffle machinery.
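Put together, a hedged usage sketch of the three methods (paths, counts and
keys are placeholders; sc is the spark-shell SparkContext):

import org.apache.spark.HashPartitioner

val rdd = sc.textFile("hdfs:///data/input", minPartitions = 200)

// coalesce: shrink to fewer partitions, avoiding a full shuffle -
// handy after filtering a large dataset down.
val narrowed = rdd.filter(_.nonEmpty).coalesce(50)

// repartition: grow or rebalance to any partition count; always shuffles.
val rebalanced = rdd.repartition(400)

// repartitionAndSortWithinPartitions: for key-value RDDs, repartition and
// sort by key inside the same shuffle, cheaper than repartition + sort.
val pairs = rdd.map(line => (line.take(1), line))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(100))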

You can set these properties to tune your Spark environment:

spark.driver.cores      Number of cores to use for the driver process, only
in cluster mode.

spark.executor.cores    The number of cores to use on each executor.

spark.driver.memory     Amount of memory to use for the driver process,
i.e. where the SparkContext is initialized.

spark.executor.memory   Amount of memory to use per executor process, in
the same format as JVM memory strings (e.g. 512m, 4g).

You can also set the number of worker processes per node with
SPARK_WORKER_INSTANCES, and the number of executors to start with
SPARK_EXECUTOR_INSTANCES, in the spark_home/conf/spark-env.sh file.
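A hedged sketch of setting these properties programmatically (the values
are illustrative only; driver settings are normally passed to spark-submit
instead, since the driver JVM is already running when this code executes):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ResourceTuningExample")
  .set("spark.executor.cores", "4")      // cores per executor
  .set("spark.executor.memory", "4g")    // heap per executor process
val sc = new SparkContext(conf)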

Thanks 


Himanshu






How does one decide no of executors/cores/memory allocation?

2015-06-15 Thread shreesh
How do I decide how many partitions I break my data up into, and how many
executors I should have? I guess memory and cores will be allocated based
on the number of executors I have.
Thanks






Re: How does one decide no of executors/cores/memory allocation?

2015-06-15 Thread gaurav sharma
When you submit a job, Spark breaks it down into stages, as per the DAG.
The stages run transformations or actions on the RDDs. Each RDD consists of
N partitions. The number of tasks Spark creates to execute a stage equals
the number of partitions. Every task is executed on the cores utilized by
the executors in your cluster.

--conf spark.cores.max=24 defines the max cores you want to utilize. Spark
itself distributes that number of cores among the workers.

The more partitions and the more cores available, the higher the level of
parallelism - and the better the performance.
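For instance, a minimal sketch of setting that cap programmatically (24 is
just an example value):

import org.apache.spark.{SparkConf, SparkContext}

// spark.cores.max caps the total cores this application can claim across
// the cluster (standalone/Mesos); Spark spreads them over the workers.
val conf = new SparkConf()
  .setAppName("CoresMaxExample")
  .set("spark.cores.max", "24")
val sc = new SparkContext(conf)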
