problem running multiple executors on large machine

2014-02-10 Thread Yadid Ayzenberg
Hi All, I have a setup which consists of 8 small machines (1 core, 8 GB RAM each) and 1 large machine (8 cores, 100 GB RAM). Is there a way to enable Spark to run multiple executors on the large machine, and a single executor on each of the small machines? Alternatively, is it possible to
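
For reference, a common workaround with the standalone deployment of that era was to start several worker daemons on the big machine, since each worker hosts at most one executor per application. A minimal sketch of the worker-side configuration (spark-env.sh on the large machine; the values are illustrative, not the thread's actual answer):

    # spark-env.sh on the 100 GB machine: start several workers here, each of
    # which can host one executor per application (values are illustrative).
    export SPARK_WORKER_INSTANCES=4
    export SPARK_WORKER_CORES=2
    export SPARK_WORKER_MEMORY=24g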

heterogeneous cluster - problems setting spark.executor.memory

2014-01-23 Thread Yadid Ayzenberg
Hi Community, I'm running Spark in standalone mode, and in my current cluster each slave has 8 GB of RAM. I wanted to add one more powerful machine with 100 GB of RAM as a slave to the cluster and encountered some difficulty. If I don't set spark.executor.memory, all slaves will only allocate
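
For context, spark.executor.memory is a single application-wide setting, so the same cap applies to the 8 GB slaves and the 100 GB slave alike. A minimal sketch of setting it programmatically, assuming the SparkConf API (class name and values are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MemoryConfigExample {
        public static void main(String[] args) {
            // One value for every executor in the application -- it cannot be
            // raised only on the larger slave.
            SparkConf conf = new SparkConf()
                    .setAppName("memory-config-example")
                    .set("spark.executor.memory", "6g");
            JavaSparkContext sc = new JavaSparkContext(conf);  // master URL supplied at launch
            sc.stop();
        }
    }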

Re: Why spark has to use static method?

2013-12-13 Thread Yadid Ayzenberg
Hi Jie, it seems that SparkPrefix is not serializable. You can try adding "implements Serializable" and see if that solves the problem. Yadid On 12/13/13 5:10 AM, Jie Deng wrote: Hi all, Thanks for your time to read this. When I was first trying to write a new Java class and put Spark in it,
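
A minimal sketch of the suggested fix (the field and method are invented for illustration): any class whose instances are referenced inside a Spark closure must be serializable so Spark can ship it to the workers.

    import java.io.Serializable;

    // Only the "implements Serializable" matters here; the members are illustrative.
    public class SparkPrefix implements Serializable {
        private final String prefix;

        public SparkPrefix(String prefix) {
            this.prefix = prefix;
        }

        public String apply(String value) {
            return prefix + value;
        }
    }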

Fwd: Re: reading a specific key-value

2013-12-13 Thread Yadid Ayzenberg
oops, meant to send to the entire list... Original Message Subject: Re: reading a specific key-value Date: Fri, 13 Dec 2013 14:56:22 -0500 From: Yadid Ayzenberg ya...@media.mit.edu To: K. Shankari shank...@eecs.berkeley.edu It says more efficient if the RDD

Re: reading a specific key-value

2013-12-13 Thread Yadid Ayzenberg
partitioned by key, and after that only partition-preserving transformations can have been performed on the RDD[(K,V)]. On Fri, Dec 13, 2013 at 12:07 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: oops, meant to send to the entire list... Original
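
The point about partitioning concerns JavaPairRDD.lookup(key), which can read only the partition that owns the key when the RDD has a known partitioner. A minimal sketch (data and partition count are illustrative):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class LookupExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "lookup-example");
            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<String, Integer>("a", 1),
                    new Tuple2<String, Integer>("b", 2),
                    new Tuple2<String, Integer>("a", 3)));

            // Partition by key once and cache; subsequent lookups scan only the
            // partition that owns the key instead of the whole RDD.
            JavaPairRDD<String, Integer> byKey =
                    pairs.partitionBy(new HashPartitioner(4)).cache();

            List<Integer> values = byKey.lookup("a");   // [1, 3]
            System.out.println(values);
            sc.stop();
        }
    }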

Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
Hi All, I'm trying to understand the performance results I'm getting for the following: rdd = sc.newAPIHadoopRDD( ... ) rdd1 = rdd.keyBy( func1() ) rdd1.count() rdd1.cache() rdd2 = rdd1.map(func2()) rdd2.count() rdd3 = rdd2.map(func2()) rdd3.count() I would expect the 2 maps to be more or

Re: Spark map performance question

2013-12-10 Thread Yadid Ayzenberg
it was marked as cached, so rdd1 does not need to be re-evaluated within the rdd3.count job. On Tue, Dec 10, 2013 at 9:11 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi All, I'm trying to understand the performance results I'm getting for the following
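
A minimal sketch of the caching semantics discussed here (the map functions are illustrative stand-ins): cache() only marks an RDD, so the cache is populated by the next action that computes it, and later jobs can then reuse the cached partitions instead of recomputing them.

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class CacheOrderingExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "cache-ordering");
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));

            JavaRDD<String> rdd1 = rdd.map(new Function<String, String>() {
                public String call(String s) { return s + "-1"; }
            });
            rdd1.cache();   // only marks rdd1 for caching; nothing is stored yet
            rdd1.count();   // first action: computes rdd1 and populates the cache
            JavaRDD<String> rdd2 = rdd1.map(new Function<String, String>() {
                public String call(String s) { return s + "-2"; }
            });
            rdd2.count();   // runs only the second map; rdd1 is read from the cache
            sc.stop();
        }
    }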

JavaPairRDD mapPartitions side effects

2013-12-09 Thread Yadid Ayzenberg
Hi all, I'm noticing some strange behavior when running mapPartitions. Pseudo code: JavaPairRDD<Object, Tuple2<Object, BSONObject>> myRDD = myRDD.mapPartitions( func ) myRDD.count() ArrayList<Tuple2<Integer, Tuple2<List<Tuple2<Double, Double>>, List<Tuple2<Double, Double>>>>> tempRDD =

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Yadid Ayzenberg
-effecting mutator -- you've just changed where in a lineage of transformations you are pointing to with your mutable myRDD reference. On Mon, Dec 9, 2013 at 11:06 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi all, I'm noticing some strange behavior when

Re: JavaPairRDD mapPartitions side effects

2013-12-09 Thread Yadid Ayzenberg
-defined functions, so if you're caching mutable Java objects and mutating them in a transformation, the effects of that mutation might change the cached data and affect other RDDs. Is func2 mutating your cached objects? On Mon, Dec 9, 2013 at 11:25 AM, Yadid Ayzenberg ya...@media.mit.edu
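
A minimal sketch of the pitfall being described (the field name and helper are illustrative): when a cached RDD holds mutable objects and a downstream transformation mutates them in place, the cached copies change as well, because the cache stores references to the same JVM objects.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.bson.BSONObject;

    public class MutationPitfall {
        static JavaRDD<BSONObject> transform(JavaRDD<BSONObject> docs) {
            docs.cache();
            docs.count();   // materialize the cached objects

            return docs.map(new Function<BSONObject, BSONObject>() {
                public BSONObject call(BSONObject d) {
                    // In-place mutation: this also rewrites the object held in the
                    // cache, so later uses of 'docs' see the modified value.
                    // Safer: copy first (e.g. new BasicBSONObject() plus putAll(d))
                    // and mutate the copy instead.
                    d.put("score", 0.0);
                    return d;
                }
            });
        }
    }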

Re: RDD cache question

2013-12-02 Thread Yadid Ayzenberg
. just doing one action after the looped transformations 5. trying to unpersist() prior cached iterations can work, but it is sensitive to where it occurs relative to actions On Sat, Nov 30, 2013 at 6:39 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: step

RDD cache question

2013-11-30 Thread Yadid Ayzenberg
Hi All, I'm trying to implement the following and would like to know in which places I should be calling RDD.cache(): Suppose I have a group of RDDs: RDD1 to RDDn as input. 1. create a single RDD_total = RDD1.union(RDD2)..union(RDDn) 2. for i = 0 to x: RDD_total = RDD_total.map (some
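
A minimal sketch of one way to arrange the caching for this loop (the map function is an illustrative stand-in): cache each new iteration, force it with an action, and only then unpersist the previous one so it never has to be recomputed.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    public class IterativeCacheExample {
        static JavaRDD<Double> iterate(JavaRDD<Double> total, int x) {
            total.cache();
            for (int i = 0; i < x; i++) {
                JavaRDD<Double> next = total.map(new Function<Double, Double>() {
                    public Double call(Double v) { return v * 0.5; }   // stand-in for the real map
                });
                next.cache();
                next.count();        // materialize 'next' before dropping its parent
                total.unpersist();   // safe now: 'next' no longer needs to recompute 'total'
                total = next;
            }
            return total;
        }
    }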

Re: RDD cache question

2013-11-30 Thread Yadid Ayzenberg
hasn't actually done anything yet. On Sat, Nov 30, 2013 at 6:01 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi All, I'm trying to implement the following and would like to know in which places I should be calling RDD.cache(): Suppose I have a group

Re: multiple concurrent jobs

2013-11-19 Thread Yadid Ayzenberg
on worker nodes. Anyway, Prashant's response about spreadOut is appropriate for application-level scheduling. On Tue, Nov 19, 2013 at 8:03 AM, Yadid Ayzenberg ya...@media.mit.edu wrote: My bad - I should have stated that up front. I guess it was kind

Re: multiple concurrent jobs

2013-11-19 Thread Yadid Ayzenberg
up at shuffle boundaries), and stages are associated with task sets containing multiple tasks, the units of work that actually run on worker nodes. Anyway, Prashant's response about spreadOut is appropriate for application-level scheduling. On Tue, Nov 19, 2013 at 8:03 AM, Yadid Ayzenberg
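
For background on the application-level scheduling mentioned here: by default a standalone cluster offers one application every available core (spread across the workers when spark.deploy.spreadOut is true), so capping spark.cores.max is the usual way to leave room for a second application to run concurrently. A minimal sketch (values are illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ConcurrentAppsExample {
        public static void main(String[] args) {
            // Limit the cores this application may take so other applications
            // submitted to the same standalone master can run at the same time.
            SparkConf conf = new SparkConf()
                    .setAppName("concurrent-apps-example")
                    .set("spark.cores.max", "4");
            JavaSparkContext sc = new JavaSparkContext(conf);  // master URL supplied at launch
            sc.stop();
        }
    }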

Re: foreachPartition in Java

2013-11-18 Thread Yadid Ayzenberg
mapPartitions and ignore the result? - Patrick On Sun, Nov 17, 2013 at 4:45 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: Hi, According to the API, foreachPartition() is not yet implemented in Java. Are there any workarounds to get

foreachPartition in Java

2013-11-17 Thread Yadid Ayzenberg
Hi, According to the API, foreachPartition() is not yet implemented in Java. Are there any workarounds to get the same functionality? I have a non-serializable DB connection and instantiating it is pretty expensive, so I prefer to do it on a per-partition basis. thanks, Yadid
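
A minimal sketch of the workaround suggested above (the DB connection class is a hypothetical stand-in, and on newer Spark versions the FlatMapFunction returns an Iterator rather than an Iterable): do the per-partition work inside mapPartitions, return nothing, and run an action to force it.

    import java.util.Collections;
    import java.util.Iterator;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class ForeachPartitionWorkaround {
        static void writeAll(JavaRDD<String> records) {
            records.mapPartitions(new FlatMapFunction<Iterator<String>, Object>() {
                public Iterable<Object> call(Iterator<String> partition) {
                    DbConnection conn = DbConnection.open();   // one connection per partition
                    try {
                        while (partition.hasNext()) {
                            conn.write(partition.next());
                        }
                    } finally {
                        conn.close();
                    }
                    return Collections.<Object>emptyList();   // only the side effect matters
                }
            }).count();   // the action forces every partition to be processed
        }

        // Hypothetical stand-in for the expensive, non-serializable DB connection.
        static class DbConnection {
            static DbConnection open() { return new DbConnection(); }
            void write(String record) { /* write to the database */ }
            void close() { }
        }
    }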

Re: java.io.NotSerializableException on RDD count() in Java

2013-11-03 Thread Yadid Ayzenberg
8:19 PM, Patrick Wendell wrote: Thanks, that would help. This would be consistent with there being a reference to the SparkContext itself inside of the closure. Just want to make sure that's not the case. On Sun, Nov 3, 2013 at 5:13 PM, Yadid Ayzenberg ya...@media.mit.edu wrote: I'm running
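
A minimal sketch of the failure mode described here and one way around it (class and field names are illustrative): an anonymous Function declared inside an object that also holds the JavaSparkContext implicitly captures that object, dragging the non-serializable context into the closure; a static nested function class captures only what it is explicitly given.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class CountMatches {
        private final JavaSparkContext sc;   // non-serializable; must stay on the driver

        public CountMatches(JavaSparkContext sc) {
            this.sc = sc;
        }

        // An anonymous inner Function declared in run() would capture 'this' (and
        // with it the SparkContext), triggering java.io.NotSerializableException
        // when the job is submitted. A static nested class avoids the capture.
        static class ContainsWord implements Function<String, Boolean> {
            private final String word;
            ContainsWord(String word) { this.word = word; }
            public Boolean call(String s) { return s.contains(word); }
        }

        public long run(JavaRDD<String> lines, String word) {
            return lines.filter(new ContainsWord(word)).count();
        }
    }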