Re: GraphX and Spark
GraphX is built on *top* of Spark, so Spark can achieve whatever GraphX can. On Wed, Nov 5, 2014 at 9:41 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Can Spark achieve whatever GraphX can? Keeping aside the performance comparison between Spark and GraphX, if I want to implement a graph algorithm and I do not want to use GraphX, can I get the work done with Spark? Thank You
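For instance, a toy sketch (not from the original thread) of a graph computation, out-degree per vertex, written directly against RDDs without GraphX; the edge data is made up for illustration and sc is assumed to be an existing SparkContext:

    // Edge list as plain (src, dst) pairs, no GraphX types involved.
    val edges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))

    // Out-degree: count edges per source vertex with a plain reduceByKey.
    val outDegrees = edges.map { case (src, _) => (src, 1) }.reduceByKey(_ + _)

    outDegrees.collect() // Array((1,2), (2,1))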
Re: about aggregateByKey and standard deviation
I don't think it can be done directly with .aggregateByKey(), because we will need the count per key (for the average). Maybe we can use .countByKey(), which returns a map, together with .foldByKey(0)(_ + _) (or aggregateByKey()), which gives the sum of values per key. I myself am not sure how to proceed. Regards On Fri, Oct 31, 2014 at 1:26 PM, qinwei wei@dewmobile.net wrote: Hi, everyone I have an RDD filled with data like (k1, v11) (k1, v12) (k1, v13) (k2, v21) (k2, v22) (k2, v23) ... I want to calculate the average and standard deviation of (v11, v12, v13) and (v21, v22, v23) grouped by their keys. For the moment, I have done that by using groupByKey and map. I notice that groupByKey is very expensive, but I cannot figure out how to do it using aggregateByKey, so I wonder is there any better way to do this? Thanks! -- qinwei
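For what it's worth, a minimal one-pass sketch (not from the original thread) of how the count could be carried inside the aggregateByKey accumulator as (count, sum, sum of squares); the sample data is made up and sc is assumed to be an existing SparkContext:

    val rdd = sc.parallelize(Seq(("k1", 1.0), ("k1", 2.0), ("k1", 3.0),
                                 ("k2", 4.0), ("k2", 5.0), ("k2", 6.0)))

    val stats = rdd
      .aggregateByKey((0L, 0.0, 0.0))(
        // fold one value into the per-partition accumulator
        (acc, v) => (acc._1 + 1, acc._2 + v, acc._3 + v * v),
        // merge accumulators coming from different partitions
        (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
      .mapValues { case (count, sum, sumSq) =>
        val mean = sum / count
        val stdDev = math.sqrt(sumSq / count - mean * mean) // population std dev
        (mean, stdDev)
      }

    stats.collect() // Array((k1,(2.0,0.816...)), (k2,(5.0,0.816...)))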
Re: Scaladoc
In IntelliJ: Tools > Generate Scaladoc. Kamal On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta alexbare...@gmail.com wrote: How do I build the scaladoc html files from the spark source distribution? Alex Bareta
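From the command line, a hedged alternative (assuming the sbt launcher script that ships in the Spark source tree): sbt's standard doc task also generates the scaladoc HTML, run from the source root as

    sbt/sbt doc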
Re: Using a Database to persist and load data from
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile, which takes an OutputFormat class. You will have to write a custom OutputFormat class that extends OutputFormat and implements getRecordWriter, returning a custom RecordWriter; that custom RecordWriter, in turn, extends RecordWriter and has a write method that actually writes to the DB. On Fri, Oct 31, 2014 at 11:25 AM, Yanbo Liang yanboha...@gmail.com wrote: AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD also has some restrictions, such as requiring the SQL to have a WHERE clause. For writing to a DB, you can use mapPartitions or foreachPartition. You can refer to this example: http://stackoverflow.com/questions/24916852/how-can-i-connect-to-a-postgresql-database-into-apache-spark-using-scala 2014-10-30 23:01 GMT+08:00 Asaf Lahav asaf.la...@gmail.com: Hi Ladies and Gents, I would like to know what options I have if I would like to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB can essentially be used. Do I need to develop a context? If yes, how? Where can I get a good example? Thank you, Asaf
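To make the foreachPartition route concrete, a minimal sketch assuming an RDD of (String, String) pairs; the JDBC URL, credentials and table name are placeholders for illustration, not Vertica-specific advice:

    import java.sql.DriverManager

    // Open one connection per partition instead of one per record.
    myRDD.foreachPartition { records =>
      // placeholder URL/credentials; Vertica ships its own JDBC driver
      val conn = DriverManager.getConnection("jdbc:vertica://host:5433/db", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO my_table (k, v) VALUES (?, ?)")
      try {
        records.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setString(2, v)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }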
Re: Batch of updates
Hi Flavio, Doing batch += ... shouldn't work: it will create a new batch for each element in myRDD (also, val initializes an immutable variable; var is for mutable variables). You can use something like accumulators http://spark.apache.org/docs/latest/programming-guide.html#accumulators:

    val accum = sc.accumulator(0, "Some accumulator")
    val NUM = 100
    val SIZE = myRDD.count()
    val LAST = SIZE % NUM
    val MULTIPLE = (SIZE / NUM) * NUM // 300 when SIZE is 350
    myRDD.map(x => { accum += 1; if (accum.value <= MULTIPLE) null else x })

So if there are 350 elements, it should return null for the first 300 elements and the actual values for the last 50. Then we can apply a filter for null and get the remainder elements (and then finally flush the new RDD containing the remainder elements). On map vs mapPartitions: mapPartitions is used mainly for efficiency (as you can see here http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf, here http://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions and here http://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/). So for simpler code you can go with map, and for efficiency you can go with mapPartitions. Regards, Kamal
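If the underlying goal is just to flush every NUM elements, an alternative sketch (not from the thread) that batches within each partition using foreachPartition, the action counterpart of mapPartitions; flushBatch is a hypothetical user function, not a Spark API:

    val NUM = 100

    // Each partition's iterator is consumed in groups of NUM; the last
    // group per partition holds the remainder (fewer than NUM elements).
    myRDD.foreachPartition { iter =>
      iter.grouped(NUM).foreach { batch =>
        flushBatch(batch) // hypothetical: write this batch to the sink
      }
    }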
Re: What executes on worker and what executes on driver side
Can you please elaborate? I didn't get what you intended for me to read in that link. Regards. On Mon, Oct 20, 2014 at 7:03 PM, Saurabh Wadhawan saurabh.wadha...@guavus.com wrote: What about: http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_U2p3bPVM=-bka18v4s-5-lp...@mail.gmail.com%3E Regards - Saurabh Wadhawan On 20-Oct-2014, at 4:56 pm, Kamal Banga banga.ka...@gmail.com wrote: 1. All RDD operations are executed in workers. So reading a text file or executing val x = 1 will happen on a worker. (link http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark) 2. a. Without broadcast: Let's say you have 'n' nodes. You can set Hadoop's replication factor to n and it will replicate that data across all nodes. b. With broadcast: using sc.broadcast() should do it (link http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables); see the sketch below. On Mon, Oct 20, 2014 at 1:18 AM, Saurabh Wadhawan saurabh.wadha...@guavus.com wrote: Any response for this? 1. How do I know what statements will be executed on the worker side out of the spark script in a stage? E.g. if I have val x = 1 (or any other code) in my driver code, will the same statements be executed on the worker side in a stage? 2. How can I do a map-side join in Spark: a. without broadcast (i.e. by reading a file once in each executor) b. with broadcast, but by broadcasting the complete RDD to each executor Regards - Saurabh Wadhawan On 19-Oct-2014, at 1:54 am, Saurabh Wadhawan saurabh.wadha...@guavus.com wrote: Hi, I have the following questions: 1. When I write a spark script, how do I know what part runs on the driver side and what runs on the worker side? Let's say I write code to read a plain text file. Will it run on the driver side only, on the worker side only, or on both sides? 2. If I want each worker to load a file for, let's say, a join, and the file is pretty huge, let's say in GBs, so that I don't want to broadcast it, what's the best way to do it? Another way to say the same thing: how do I load a data structure for fast lookup (and not an RDD) on each worker node in the executor? Regards - Saurabh
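Regarding 2.b, a minimal map-side-join sketch using a broadcast lookup table; the RDD names and the "MISSING" default are illustrative, not from the thread:

    // Collect the small side to the driver, then broadcast it once per executor.
    val lookup: Map[String, String] = smallRDD.collect().toMap
    val lookupBC = sc.broadcast(lookup)

    // Each task reads its executor-local broadcast copy; no shuffle join needed.
    val joined = bigRDD.map { case (k, v) =>
      (k, v, lookupBC.value.getOrElse(k, "MISSING"))
    }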
Re: Spark Concepts
1) Yes, a single node can have multiple workers. SPARK_WORKER_INSTANCES (in conf/spark-env.sh) is used to set the number of worker instances to run on each machine (default is 1). If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores. The IP address will be the same, but the ports will be different, so the workers can communicate through sockets. Refer here http://spark.apache.org/docs/1.0.1/spark-standalone.html. 2) The master will have a process and each worker will have a process. You would want multiple workers if you have a large machine with many cores, and in that case communication shouldn't slow things down a lot. 3) As this link http://spark.apache.org/docs/1.0.0/tuning.html#level-of-parallelism says, set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster. It could be set something like this (let's say you want parallelism to be 4 in case you have 2 cores):

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("CountingSheep")
      .set("spark.executor.memory", "1g")
      .set("spark.default.parallelism", "4")
    val sc = new SparkContext(conf)

4) We don't have explicit control over the number of tasks per job. All mapping functions like .map(), .filter() and .foreach() will be executed in one stage and one taskset, and all the tasks in that taskset should be executed in parallel. AFAIK you can only tune the parallelism, as mentioned in 3); see the sketch after this message. Details regarding the number of tasks per job are handled by Spark's internal DAG scheduler http://zhangjunhd.github.io/assets/2013-09-24-spark/4.png, and this depends on how the data is partitioned. So ideally you should set the number of partitions to the number of nodes, and the parallelism to twice or thrice that (depending on the number of cores per node), so that multiple tasks can run on each node. This is a good question; I myself don't have much idea regarding the interdependence of partitions and parallelism. Regards, Kamal On Mon, Oct 20, 2014 at 4:25 PM, Dipa Dubhashi d...@sigmoidanalytics.com wrote: Please reply in the thread (not direct email) *Dipa Dubhashi* || Product Manager d...@sigmoidanalytics.com || www.sigmoidanalytics.com
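To illustrate points 3) and 4) above, a small sketch (values are illustrative) of how the partition count, and hence the number of tasks per stage, can be inspected and forced; sc is assumed to be an existing SparkContext:

    // sc.parallelize with no explicit count falls back to spark.default.parallelism
    val data = sc.parallelize(1 to 1000)
    println(data.partitions.length)

    // Forcing 8 partitions means 8 tasks in each stage over this RDD.
    val repartitioned = data.repartition(8)
    println(repartitioned.partitions.length) // 8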
preservesPartitioning
Hi All, The function *mapPartitions* in RDD.scala https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala takes a boolean parameter *preservesPartitioning*. It seems that if the parameter is passed as *false*, the passed function f should operate on the data only once, whereas if it's passed as *true*, the function should operate on each partition of the data. In my case, whatever boolean value I pass, *f* operates on each partition of the data. Any help regarding why I am getting this unexpected behaviour?
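For context, a small sketch (not from a reply in the thread) of what the parameter is documented to control: *preservesPartitioning* determines whether the result RDD keeps the parent's partitioner, not how many times f runs; f runs once per partition either way:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
      .partitionBy(new HashPartitioner(4))

    // f runs once per partition in both cases; only the partitioner differs.
    val kept = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v.toUpperCase) },
      preservesPartitioning = true)
    println(kept.partitioner) // Some(HashPartitioner) -- retained

    val dropped = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v.toUpperCase) },
      preservesPartitioning = false)
    println(dropped.partitioner) // None -- partitioner discarded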