Re: GraphX and Spark

2014-11-04 Thread Kamal Banga
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can.
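
For instance, here is a toy sketch (made-up edge list) of a graph computation
with plain RDDs and no GraphX: computing each vertex's out-degree from an RDD
of (src, dst) edges.

// Toy sketch: out-degree per vertex without GraphX (edge list is made up).
val edges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))
val outDegrees = edges.map { case (src, _) => (src, 1) }.reduceByKey(_ + _)
outDegrees.collect().foreach(println)   // e.g. (1,2), (2,1)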

On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 Can Spark achieve whatever GraphX can?
 Keeping aside the performance comparison between Spark and GraphX, if I
 want to implement any graph algorithm and I do not want to use GraphX, can
 I get the work done with Spark?

 Thank You



Re: about aggregateByKey and standard deviation

2014-11-03 Thread Kamal Banga
I don't think it can be done directly with .aggregateByKey(), because we will
need the count per key (for the average). Maybe we can use .countByKey(), which
returns a map, and .foldByKey(0)(_+_) (or aggregateByKey()), which gives the
sum of values per key. I'm not sure myself how best to proceed.
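
For the average at least, something along these lines might work (untested
sketch, assuming an RDD[(String, Double)] called data; the standard deviation
would additionally need a sum of squares per key):

// Untested sketch: counts per key collected on the driver, sums per key as an RDD.
val counts = data.countByKey()                     // Map[String, Long], on the driver
val countsBc = sc.broadcast(counts)
val sums = data.foldByKey(0.0)(_ + _)              // RDD[(String, Double)]
val means = sums.map { case (k, sum) => (k, sum / countsBc.value(k)) }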

Regards

On Fri, Oct 31, 2014 at 1:26 PM, qinwei wei@dewmobile.net wrote:

 Hi, everyone
 I have an RDD filled with data like
 (k1, v11)
 (k1, v12)
 (k1, v13)
 (k2, v21)
 (k2, v22)
 (k2, v23)
 ...

 I want to calculate the average and standard deviation of (v11, v12,
 v13) and (v21, v22, v23), grouped by their keys.
 For the moment, I have done that by using groupByKey and map. I notice
 that groupByKey is very expensive, but I cannot figure out how to do it
 by using aggregateByKey, so I wonder: is there any better way to do this?

 Thanks!

 --
 qinwei



Re: Scaladoc

2014-10-31 Thread Kamal Banga
In IntelliJ, Tools > Generate Scaladoc.

Kamal

On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta alexbare...@gmail.com
wrote:

 How do I build the scaladoc html files from the spark source distribution?

 Alex Bareta



Re: Using a Database to persist and load data from

2014-10-31 Thread Kamal Banga
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile, which takes an
OutputFormat class.
So you will have to write a custom OutputFormat class that extends
OutputFormat and implements getRecordWriter, which returns a custom
RecordWriter. That custom RecordWriter (extending RecordWriter) will have a
write method that actually writes to the DB.
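
Roughly, a sketch of what that could look like (untested; the table name, JDBC
URL, key/value types and myRdd are all made up, and it assumes the JDBC driver
is on the executors' classpath):

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, OutputFormat,
  RecordWriter, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

// Opens a JDBC connection per task and writes each record as one INSERT.
class DbRecordWriter(url: String) extends RecordWriter[String, String] {
  private val conn: Connection = DriverManager.getConnection(url)
  private val stmt: PreparedStatement =
    conn.prepareStatement("INSERT INTO my_table (k, v) VALUES (?, ?)")

  override def write(key: String, value: String): Unit = {
    stmt.setString(1, key)
    stmt.setString(2, value)
    stmt.executeUpdate()
  }

  override def close(context: TaskAttemptContext): Unit = {
    stmt.close()
    conn.close()
  }
}

class DbOutputFormat extends OutputFormat[String, String] {
  override def getRecordWriter(context: TaskAttemptContext): RecordWriter[String, String] =
    new DbRecordWriter("jdbc:vertica://host:5433/db")  // made-up URL

  override def checkOutputSpecs(context: JobContext): Unit = {}  // nothing to check

  // Borrow a no-op committer, since records go straight to the DB.
  override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter =
    new NullOutputFormat[String, String]().getOutputCommitter(context)
}

// Usage on an RDD[(String, String)]; the path is ignored by this OutputFormat.
myRdd.saveAsNewAPIHadoopFile("/ignored", classOf[String], classOf[String],
  classOf[DbOutputFormat])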

On Fri, Oct 31, 2014 at 11:25 AM, Yanbo Liang yanboha...@gmail.com wrote:

 AFAIK, you can read data from a DB with JdbcRDD, but there is no interface
 for writing to a DB.
 JdbcRDD has some restrictions, such as requiring the SQL to contain a
 WHERE clause.
 For writing to a DB, you can use mapPartitions or foreachPartition.
 You can refer to this example:

 http://stackoverflow.com/questions/24916852/how-can-i-connect-to-a-postgresql-database-into-apache-spark-using-scala

 2014-10-30 23:01 GMT+08:00 Asaf Lahav asaf.la...@gmail.com:

 Hi Ladies and Gents,
 I would like to know what options I have for making the Spark code I have
 already written use a DB (Vertica) as its store/data source.
 The data is tabular in nature, so essentially any relational DB can be
 used.

 Do I need to develop a context? If yes, how? Where can I get a good
 example?


 Thank you,
 Asaf





Re: Batch of updates

2014-10-28 Thread Kamal Banga
Hi Flavio,

Doing batch += ... shouldn't work. It will create a new batch for each
element of myRDD (also, val declares an immutable variable; var is for
mutable variables). You can use something like accumulators:
http://spark.apache.org/docs/latest/programming-guide.html#accumulators.

val accum = sc.accumulator(0, "Some accumulator")
val NUM = 100
val SIZE = myRDD.count()
val LAST = SIZE % NUM
val MULTIPLE = (SIZE / NUM) * NUM // e.g. 300 when SIZE is 350

myRDD.map { x =>
  accum += 1                              // count elements seen so far
  if (accum.localValue <= MULTIPLE) null  // localValue is readable inside a task
  else x
}

So if there are 350 elements, it should return null for the first 300 elements
and the actual values for the last 50 elements. Then we can apply a filter for
null to get the remainder elements (and finally flush the new RDD containing
the remainder elements).

On map vs mapPartitions, mapPartitions is used mainly for efficiency (as
you can see here
http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf,
here
http://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions
and here
http://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/).
So for simpler code, you can go with map, and for efficiency, you can go
with mapPartitions.
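
For the batching itself, a per-partition version could look roughly like this
(untested sketch; foreachPartition is the action counterpart of mapPartitions,
and sendBatch is a made-up helper that flushes one batch to the external store):

val BATCH_SIZE = 100
myRDD.foreachPartition { iter =>
  // group the partition's iterator into chunks and flush one chunk at a time
  iter.grouped(BATCH_SIZE).foreach { batch =>
    sendBatch(batch)   // made-up helper: one round trip per batch, not per element
  }
}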

Regards,
Kamal


Re: What executes on worker and what executes on driver side

2014-10-28 Thread Kamal Banga
Can you please elaborate? I didn't get what you intended for me to read in
that link.

Regards.

On Mon, Oct 20, 2014 at 7:03 PM, Saurabh Wadhawan 
saurabh.wadha...@guavus.com wrote:

  What about:


 http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_U2p3bPVM=-bka18v4s-5-lp...@mail.gmail.com%3E


 Regards
  - Saurabh Wadhawan



  On 20-Oct-2014, at 4:56 pm, Kamal Banga banga.ka...@gmail.com wrote:

  1. All RDD operations are executed on workers. So reading a text file
 or executing val x = 1 will happen on the worker. (link
 http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark)


  2.
 a. Without broadcast: let's say you have 'n' nodes. You can set Hadoop's
 replication factor to n and it will replicate that data across all nodes.
 b. With broadcast: using sc.broadcast() should do it. (link
 http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
 )

 On Mon, Oct 20, 2014 at 1:18 AM, Saurabh Wadhawan 
 saurabh.wadha...@guavus.com wrote:

 Any response for this?

  1. How do I know which statements from the Spark script will be executed
 on the worker side in a stage?
 e.g. if I have
 val x = 1 (or any other code)
 in my driver code, will the same statements be executed on the worker
 side in a stage?

 2. How can I do a map-side join in Spark:
    a. without broadcast (i.e. by reading a file once in each executor)
    b. with broadcast, but by broadcasting the complete RDD to each executor

 Regards
  - Saurabh Wadhawan



  On 19-Oct-2014, at 1:54 am, Saurabh Wadhawan 
 saurabh.wadha...@guavus.com wrote:

  Hi,

   I have following questions:

  1. When I write a Spark script, how do I know what part runs on the
 driver side and what runs on the worker side?
  So let's say I write code to read a plain text file.
  Will it run on the driver side only, on the worker side only, or on
 both sides?

  2. If I want each worker to load a file (let's say for a join), and the
 file is pretty huge (let's say in GBs) so that I don't want to broadcast it,
 then what's the best way to do it?
   Another way to say the same thing would be: how do I load a data
 structure for fast lookup (and not an RDD) on each worker node in the
 executor?

  Regards
 - Saurabh







Re: Spark Concepts

2014-10-20 Thread Kamal Banga
1) Yes, a single node can have multiple workers. SPARK_WORKER_INSTANCES (in
conf/spark-env.sh) is used to set the number of worker instances to run on
each machine (default is 1). If you do set this, make sure to also set
SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each
worker will try to use all the cores. The IP address will be the same, but
the ports will be different, so the workers can communicate through sockets.
Refer here: http://spark.apache.org/docs/1.0.1/spark-standalone.html.

2) The master will have a process and each worker will have a process. You
would want multiple workers if you have a large machine with many cores, and
in that case communication shouldn't slow down a lot.

3) As this link
http://spark.apache.org/docs/1.0.0/tuning.html#level-of-parallelism says,

 set the config property spark.default.parallelism to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.

It could be set something like this (let's say you want parallelism to be 4
when you have 2 cores):

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.default.parallelism", "4")

val sc = new SparkContext(conf)


4) All mapping functions like .map(), .filter(), .foreach() will be
executed in one stage and one taskset, and all the tasks in this taskset
should be executed in parallel. Details regarding the no. of tasks per job
are handled by Spark's internal DAG Scheduler
http://zhangjunhd.github.io/assets/2013-09-24-spark/4.png, and this
depends on the partitions of the data. So ideally you should set the number
of partitions to the number of nodes and parallelism to twice or thrice
that (depending on the number of cores per node), so that multiple tasks
can run on each node. This is a good question; I myself don't have much
idea regarding the interdependence of partitions and parallelism.
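
For reference, a small sketch of controlling partitions directly (the path and
numbers are made up):

val data = sc.textFile("hdfs:///some/input", 8)   // ask for at least 8 partitions
val repartitioned = data.repartition(16)          // or change the partitioning later
println(repartitioned.partitions.length)          // inspect how many partitions you got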

Regards,
Kamal

On Mon, Oct 20, 2014 at 4:25 PM, Dipa Dubhashi d...@sigmoidanalytics.com
wrote:

 Please reply in the thread (not direct email) 


 *Dipa Dubhashi* || Product Manager

 d...@sigmoidanalytics.com || www.sigmoidanalytics.com


 -- Forwarded message --
 From: Kamal Banga ka...@sigmoidanalytics.com
 Date: Mon, Oct 20, 2014 at 4:20 PM
 Subject: Re: Spark Concepts
 To: nsar...@gmail.com
 Cc: Lalit Yadav la...@sigmoidanalytics.com, anish 
 an...@sigmoidanalytics.com, Dipa Dubhashi d...@mobipulse.in, Mayur
 Rustagi ma...@sigmoidanalytics.com


 1) Yes, a single node can have multiple workers. SPARK_WORKER_INSTANCES (in
 conf/spark-env.sh) is used to set the number of worker instances to run on
 each machine (default is 1). If you do set this, make sure to also set
 SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each
 worker will try to use all the cores. The IP address will be the same, but
 the ports will be different, so the workers can communicate through sockets.
 Refer here: http://spark.apache.org/docs/1.0.1/spark-standalone.html.

 2) The master will have a process and each worker will have a process. You
 would want multiple workers if you have a large machine with many cores, and
 in that case communication shouldn't slow down a lot.

 3) As this link
 http://spark.apache.org/docs/1.0.0/tuning.html#level-of-parallelism
  says,

 set the config property spark.default.parallelism to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.

 It could be set something like this (let's say you want parallelism to be
 4 when you have 2 cores):

 val conf = new SparkConf()
   .setMaster("local")
   .setAppName("CountingSheep")
   .set("spark.executor.memory", "1g")
   .set("spark.default.parallelism", "4")

 val sc = new SparkContext(conf)


 4) We don't have explicit control over the no. of tasks per job. All mapping
 functions like .map(), .filter(), .foreach() will be executed in one stage
 and one taskset, and all the tasks in this taskset should be executed in
 parallel. AFAIK you can only tune the no. of nodes that a job will be
 executing on, as I mentioned in 3). Details regarding the no. of tasks per
 job are handled by Spark's internal DAG Scheduler
 http://zhangjunhd.github.io/assets/2013-09-24-spark/4.png, and this depends
 on the partitions of the data. So ideally you should set the number of
 partitions to the number of nodes and parallelism to twice or thrice that
 (depending on the number of cores per node), so that multiple tasks can run
 on each node. This is a good question; I myself don't have much idea
 regarding the interdependence of partitions and parallelism.

 Regards,
 Kamal

 On Sun, Oct 19, 2014 at 2

Re: What executes on worker and what executes on driver side

2014-10-20 Thread Kamal Banga
1. All RDD operations are executed on workers. So reading a text file or
executing val x = 1 will happen on the worker. (link
http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark)

2.
a. Without broadcast: let's say you have 'n' nodes. You can set Hadoop's
replication factor to n and it will replicate that data across all nodes.
b. With broadcast: using sc.broadcast() should do it. (link
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
)
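
For the broadcast case (2.b), a rough sketch of a broadcast (map-side) join,
assuming the small side fits in memory (smallRDD and bigRDD are placeholder
pair RDDs):

val smallAsMap = smallRDD.collectAsMap()        // pull the small side to the driver
val smallBc = sc.broadcast(smallAsMap)          // ship one copy to each executor

val joined = bigRDD.mapPartitions { iter =>
  val lookup = smallBc.value                    // local read on the worker
  iter.flatMap { case (k, v) =>
    lookup.get(k).map(other => (k, (v, other))) // keep only matching keys
  }
}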

On Mon, Oct 20, 2014 at 1:18 AM, Saurabh Wadhawan 
saurabh.wadha...@guavus.com wrote:

  Any response for this?

  1. How do I know which statements from the Spark script will be executed
 on the worker side in a stage?
 e.g. if I have
 val x = 1 (or any other code)
 in my driver code, will the same statements be executed on the worker
 side in a stage?

 2. How can I do a map-side join in Spark:
    a. without broadcast (i.e. by reading a file once in each executor)
    b. with broadcast, but by broadcasting the complete RDD to each executor

 Regards
  - Saurabh Wadhawan



  On 19-Oct-2014, at 1:54 am, Saurabh Wadhawan saurabh.wadha...@guavus.com
 wrote:

  Hi,

   I have following questions:

  1. When I write a Spark script, how do I know what part runs on the
 driver side and what runs on the worker side?
  So let's say I write code to read a plain text file.
  Will it run on the driver side only, on the worker side only, or on
 both sides?

  2. If I want each worker to load a file (let's say for a join), and the
 file is pretty huge (let's say in GBs) so that I don't want to broadcast it,
 then what's the best way to do it?
   Another way to say the same thing would be: how do I load a data
 structure for fast lookup (and not an RDD) on each worker node in the
 executor?

  Regards
 - Saurabh





preservesPartitioning

2014-07-17 Thread Kamal Banga
Hi All,

The function *mapPartitions* in RDD.scala
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
takes a boolean parameter *preservesPartitioning*. It seems that if that
parameter is passed as *false*, the passed function f will operate on the
data only once, whereas if it's passed as *true*, the function will operate
on each partition of the data.
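
Here is a toy sketch (made-up data) of the kind of call I mean, with a println
inside f to see how often it runs:

val rdd = sc.parallelize(1 to 10, 4)

def f(iter: Iterator[Int]): Iterator[Int] = {
  println("f called")    // logged so we can see how many times f runs
  iter.map(_ * 2)
}

rdd.mapPartitions(f, preservesPartitioning = false).count()
rdd.mapPartitions(f, preservesPartitioning = true).count()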

In my case, whatever boolean value I pass, *f* operates on each partition
of data.

Any help regarding why I am getting this unexpected behaviour?