1) Yes, a single node can run multiple workers. SPARK_WORKER_INSTANCES (in
conf/spark-env.sh) sets the number of worker instances to run on each
machine (the default is 1). If you do set this, make sure to also set
SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each
worker will try to use all the cores. The IP address will be the same, but
the ports will differ, so the workers can still communicate through
sockets. Refer here
<http://spark.apache.org/docs/1.0.1/spark-standalone.html>.
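
For example, an 8-core machine could be split into two workers by adding
something like this to conf/spark-env.sh (the numbers are only
illustrative; adjust them to your hardware):

export SPARK_WORKER_INSTANCES=2   # run two worker processes on this machine
export SPARK_WORKER_CORES=4       # cores each worker is allowed to use
export SPARK_WORKER_MEMORY=8g     # memory each worker can hand out to executors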

2) The master runs as one process and each worker runs as its own process.
You would want multiple workers if you have a large machine with many
cores, and in that case the communication between them shouldn't slow
things down much.

3) As this link
<http://spark.apache.org/docs/1.0.0/tuning.html#level-of-parallelism> says,

> set the config property spark.default.parallelism to change the default.
> In general, we recommend 2-3 tasks per CPU core in your cluster.

It can be set like this (say you want the parallelism to be 4 because you
have 2 cores, i.e. 2 tasks per core):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")                    // 2 local threads, matching the 2-core example
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.default.parallelism", "4")    // roughly 2 tasks per core

val sc = new SparkContext(conf)
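
If you use spark-submit rather than hard-coding the configuration, the same
property can also be set outside the code, e.g. in conf/spark-defaults.conf
or (on newer Spark releases) with --conf spark.default.parallelism=4 on the
command line.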


4) We don't have explicit control over the number of tasks per job; as far
as I know, you can only tune the parallelism, as mentioned in 3).
Operations like .map(), .filter() and .foreach() are pipelined into a
single stage with one task set, and all the tasks in that task set should
execute in parallel. The number of tasks per job is decided by Spark's
internal DAG scheduler
<http://zhangjunhd.github.io/assets/2013-09-24-spark/4.png> and depends on
how the data is partitioned. So ideally you should set the number of
partitions to the number of nodes, and the parallelism to two or three
times that (depending on the number of cores per node), so that multiple
tasks can run on each node. This is a good question; I myself don't have a
full picture of how partitions and parallelism interact.
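
As a rough sketch of how the partition count translates into tasks (the
file path and the numbers here are placeholders, not taken from your
setup):

import org.apache.spark.SparkContext._      // pair-RDD operations such as reduceByKey (not needed on newer Spark)

val rdd = sc.textFile("hdfs:///data/input.txt", 8)  // ask for at least 8 partitions on load
println(rdd.partitions.size)                        // partitions created = tasks per stage

val result = rdd
  .map(line => (line.length, 1))
  .reduceByKey(_ + _, 8)                            // shuffles also accept an explicit partition count
println(result.count())

You can also call rdd.repartition(n) on an existing RDD to change its
partitioning after it has been loaded.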

Regards,
Kamal

On Mon, Oct 20, 2014 at 4:25 PM, Dipa Dubhashi <d...@sigmoidanalytics.com>
wrote:

> Please reply in the thread (not direct email) ....
>
>
>
> ---------- Forwarded message ----------
> From: Kamal Banga <ka...@sigmoidanalytics.com>
> Date: Mon, Oct 20, 2014 at 4:20 PM
> Subject: Re: Spark Concepts
> To: nsar...@gmail.com
> Cc: Lalit Yadav <la...@sigmoidanalytics.com>, anish <
> an...@sigmoidanalytics.com>, Dipa Dubhashi <d...@mobipulse.in>, Mayur
> Rustagi <ma...@sigmoidanalytics.com>
>
>
> On Sun, Oct 19, 2014 at 2:37 AM, Mayur Rustagi <mayur.rust...@gmail.com>
> wrote:
>
>> reply to this thread as well.
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>> ---------- Forwarded message ----------
>> From: nsareen <nsar...@gmail.com>
>> Date: Wed, Oct 15, 2014 at 4:39 AM
>> Subject: Spark Concepts
>> To: u...@spark.incubator.apache.org
>>
>>
>> Hi, I'm pretty new to both Big Data and Spark. I've just started POC work
>> on Spark, and my team and I are evaluating it against other in-memory
>> computing tools such as GridGain, BigMemory, Aerospike and a few others,
>> specifically to solve two sets of problems.
>>
>> 1) Data storage: Our current application runs on a single node with a
>> heavy configuration of 24 cores and 350 GB of memory. The application
>> loads all the datamart data, including multiple cubes, into memory,
>> converts it, and keeps it in a Trove collection in the form of a
>> key/value map. This collection is immutable and takes about 15-20 GB of
>> memory. We anticipate the data will grow 10-15 fold over the next year or
>> so, and we are not very confident that Trove can scale to that level.
>>
>> 2) Compute: Ours is a natively analytical application doing predictive
>> analytics with lots of simulations and optimizations of scenarios. At the
>> heart of all this are the Trove collections, over which we run our
>> mathematical algorithms to calculate the end result. In doing so, the
>> application's memory consumption goes beyond 250-300 GB, because of the
>> many intermediate results (collections) that are further broken down to a
>> granular level and then searched in the Trove collection. All this
>> happens on a single node, which obviously starts to perform slowly over
>> time. Given the large volume of data expected over the next year or so,
>> our current architecture will not be able to handle such a massive
>> in-memory data set or provide such computing power. Hence we are aiming
>> to move to a cluster-based, in-memory distributed computing architecture,
>> and we are evaluating all of these products along with Apache Spark. We
>> were very excited by Spark after the videos and some online resources,
>> but when it came down to hands-on work we are facing lots of issues:
>>
>> 1) What are the standalone cluster's limitations? Can I configure a
>> cluster on a single node with multiple worker and executor processes? Is
>> this supported even though the IP address would be the same?
>>
>> 2) Why so many Java processes (worker nodes, executors)? Will the
>> communication between them not slow down performance as a whole?
>>
>> 3) How is parallelism on partitioned data achieved? This one is really
>> important for us to understand, since we are doing our benchmarking on
>> partitioned data and we do not know how to configure partitions in Spark.
>> Any help here would be appreciated. We want to partition the data present
>> in cubes, i.e. we want each cube to be a separate partition.
>>
>> 4) What is the difference between multiple nodes executing jobs and
>> multiple tasks executing jobs? How do these handle partitioning and
>> parallelism?
>>
>> Help with these questions would be really appreciated, to get a better
>> sense of Apache Spark.
>>
>> Thanks,
>> Nitin
>> ------------------------------
>> View this message in context: Spark Concepts
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>>
>
>
