Re: Not able to set up Spark standalone cluster (Newbie)

2013-10-10 Thread Meisam Fathi
I haven't tried Spark 0.8, but I had similar problems bringing up the master node on previous versions of Spark (0.7.x). This is the command I use to start the master, and it works for me: ./run spark.deploy.master.Master Thanks, Meisam On Thu, Oct 10, 2013 at 5:14 AM, vinayak navale wrote: > H

How to aggregate data by key

2013-11-11 Thread Meisam Fathi
Hi, I'm trying to use Spark to aggregate data. I am doing something similar to this right now: val groupByRdd = rdd.groupBy(x => x._1) val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum)) This works fine for smaller datasets but runs out of memory (OOM) for larger ones (the groupBy operation runs o
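
A runnable version of the groupBy-then-sum approach described above, as a minimal sketch; the local SparkContext setup and the (String, Int) pair type are illustrative assumptions:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "aggregate-example") // assumed local setup
    val rdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupBy gathers every value of a key into one in-memory Seq,
    // which is what runs out of memory on large datasets
    val groupByRdd = rdd.groupBy(x => x._1)
    val aggregateRdd = groupByRdd.map(x => (x._1, x._2.map(_._2).sum))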

Re: How to aggregate data by key

2013-11-11 Thread Meisam Fathi
yKey(...) without having to manually wrap your RDD into a > PairRDDFunctions; just add import org.apache.spark.SparkContext._ to your > imports. > On Mon, Nov 11, 2013 at 1:35 PM, Meisam Fathi wrote: >> Hi, I'm trying to use
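
A minimal sketch of the fix suggested in this reply, assuming the same (String, Int) pairs as in the question; with the implicit conversion imported, reduceByKey combines values per key without ever materializing a whole group in memory:

    import org.apache.spark.SparkContext._ // brings PairRDDFunctions into scope implicitly

    // sums values per key; partial sums are combined map-side before the shuffle
    val sums = rdd.reduceByKey(_ + _)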

Removing RDDs' data from BlockManager

2013-11-13 Thread Meisam Fathi
Hi Community, When an RDD in the application becomes unreachable and gets garbage collected, how does Spark remove RDD's data from BlockManagers on the worker nodes? Thanks, Meisam

Re: Removing RDDs' data from BlockManager

2013-11-13 Thread Meisam Fathi
The BlockManager removes data from the cache in a least-recently-used > fashion as space fills up. If you’d like to remove an RDD manually before > that, you can call rdd.unpersist(). > Matei > On Nov 13, 2013, at 8:15 PM, Meisam Fathi wrote: >> Hi Community,
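
A short sketch of the manual eviction mentioned in the reply, with an illustrative cached RDD; the input path is a placeholder:

    val data = sc.textFile("hdfs://...").cache()
    data.count()      // action materializes the RDD in the workers' BlockManagers
    data.unpersist()  // drops its blocks before the LRU policy would evict them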

Re: RDD.count() takes a lot of time

2013-11-14 Thread Meisam Fathi
Hi Valentin, data.filter() and rdd.map() do not actually do the computation. When you call count() or collect(), your RDD first does the filter(), then the map(), and then the count() or collect(). See this for more info: https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
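
A small sketch of the lazy evaluation described above, assuming an RDD[String] named data; nothing is computed until the action runs:

    val filtered = data.filter(line => line.nonEmpty) // lazy: only records the transformation
    val mapped   = filtered.map(line => line.length)  // lazy as well
    val total    = mapped.count()                     // action: runs filter, then map, then counts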

Re: Does Spark RDD have a partitionedByKey

2013-11-15 Thread Meisam Fathi
Hi Jiacheng, Each RDD has a partitioner. You can define your own partitioner if the default partitioner does not suit your purpose. You can take a look at this http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Thanks, Meisam On Fri, Nov 15, 20
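
A minimal custom Partitioner along these lines; the Int key type and partition count are illustrative assumptions:

    import org.apache.spark.Partitioner

    class ModPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Int]
        ((k % parts) + parts) % parts // keep the partition index non-negative
      }
      override def equals(other: Any): Boolean = other match {
        case p: ModPartitioner => p.numPartitions == numPartitions
        case _ => false
      }
    }

    // usage, assuming an RDD of (Int, V) pairs:
    // pairs.partitionBy(new ModPartitioner(8))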

Re: RDD.count() takes a lot of time

2013-11-15 Thread Meisam Fathi
less than a second! > The problem comes when working with 1400k elements - > .take(Int.MaxValue).size is not so quick. > Best regards, > Valentin > 2013/11/14 Meisam Fathi : >> Hi Valentin, >> data.filter() and rdd.map() do not actually do the computation. When >
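
The difference Valentin measures can be reproduced with the two counting styles side by side; a hedged sketch, since count() aggregates on the workers while take(Int.MaxValue) ships every element to the driver:

    val n1 = data.count()                 // distributed: each partition counts locally
    val n2 = data.take(Int.MaxValue).size // collects all elements to the driver first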

Re: Does Spark RDD have a partitionedByKey

2013-11-15 Thread Meisam Fathi
).flatMapValues( > x => x). But I'm a bit worried whether this will create additional temp object > collections, as the result is first made into a Seq and then a collection of tuples. > Any suggestions? > Best Regards, > Jiacheng Guo > On Sat, Nov 16, 2013 at 12:24 AM, Me
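
A sketch of the two approaches under discussion, assuming an RDD of key/value pairs named pairs and an illustrative partition count; partitionBy avoids the intermediate per-key Seq that groupByKey builds:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    // builds a Seq per key, then flattens it back into pairs
    val viaGroup = pairs.groupByKey(8).flatMapValues(x => x)

    // repartitions by key directly, with no per-key collections
    val viaPartitioner = pairs.partitionBy(new HashPartitioner(8))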

Re: cache()ing local variables?

2013-12-08 Thread Meisam Fathi
I asked the same question on the Spark community list a while ago (http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3CCAByMnGtm2s2tyqLzw%2BMdGqgNBLbfhE6-kkZ4OPY4ANfZaDSu7Q%40mail.gmail.com%3E). This is my understanding of how Spark works, but I'd like one of the Spark maintainers