Re: Continued performance issues on a small EC2 Spark cluster

2013-11-14 Thread Gary Malouf
I bring this up because the performance we are seeing is dreadful. From CPU usage, it appears the issue is the spark-shell CPU power. We have increased this node from an EC2 medium to an xl; we are seeing slightly better performance, but still not great. My understanding of Spark was that most of

SPARK + YARN the general case

2013-11-14 Thread Bill Sparks
Sorry for the following question, but I just need a little clarity on expectations of Spark using YARN. Is it possible to use the spark-shell with YARN? Or is the only way to submit a Spark job to YARN to write a Java application and submit it via the yarn.Client application? Also, is there

Re: write data into HBase via spark

2013-11-14 Thread Hao REN
Hi, Philip. Basically, we need PairRDDFunctions.saveAsHadoopDataset to do the job; as HBase is not a filesystem, saveAsHadoopFile doesn't work. def saveAsHadoopDataset(conf: JobConf): Unit. This function takes a JobConf parameter which should be configured. Essentially, you need to set the output format
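
A minimal sketch of that configuration, assuming an older (0.9x-era) HBase client with the mapred-API TableOutputFormat on the classpath; the table name, column family, and ZooKeeper quorum below are placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // brings PairRDDFunctions into scope

    val sc = new SparkContext("local[2]", "hbase-write-sketch")

    // Configure the JobConf: the output format and target table are the essential settings.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "zk-host")        // placeholder quorum
    val jobConf = new JobConf(hbaseConf)
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")   // placeholder table name

    // TableOutputFormat expects (ImmutableBytesWritable, Put) records.
    val rows = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))
    val puts = rows.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))  // "cf"/"col" are placeholders
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }

    // Writes through the configured OutputFormat rather than to a filesystem path.
    puts.saveAsHadoopDataset(jobConf)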

RDD.count() takes a lot of time

2013-11-14 Thread Valentin Michajlenko
Hi! I load data from a list (sc.parallelize()) with length about 140 items. After that I run data.filter(func1).map(func2). This operation runs in less than a second. But after that, count() (or collect()) takes about 30 seconds. Please help me reduce this time! Best Regards, Valentin

Recommended way to join 2 RDDs - one large, the other small

2013-11-14 Thread Shay Seng
Hi, Just wondering what people suggest for joining 2 RDDs of very different sizes. I have a sequence of map-reduce steps that will in the end yield an RDD of ~500MB - 800MB that typically has a couple hundred partitions. After that I want to join that RDD with 2 smaller RDDs: 1 will be <50MB, another

Re: RDD.count() takes a lot of time

2013-11-14 Thread Meisam Fathi
Hi Valentin, filter() and map() do not actually do the computation. When you call count() or collect(), your RDD first does the filter(), then the map(), and then the count() or collect(). See this for more info: https://github.com/mesos/spark/wiki/Spark-Programming-Guide#transformations
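
A tiny illustration of that laziness (the data and functions here are made up, not Valentin's): the first action pays for the whole filter/map chain, and cache() lets later actions reuse the result.

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "lazy-eval-sketch")

    val data = sc.parallelize(1 to 1400000)          // made-up data
    val result = data.filter(_ % 2 == 0).map(_ * 2)  // nothing runs yet: filter/map are lazy

    result.cache()           // keep the computed partitions in memory
    val n1 = result.count()  // first action: runs filter + map, then counts
    val n2 = result.count()  // second action: reads from the cache, much faster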

Re: Recommended way to join 2 RDDs - one large, the other small

2013-11-14 Thread Ryan Compton
I've done this with a "broadcast". It worked pretty well. Around 10g (for the smaller dataset) I started having problems (cf http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E ) If it's really only 800M

Re: Recommended way to join 2 RDDs - one large, the other small

2013-11-14 Thread Shay Seng
The starting data set is much larger than that; I start from a couple of ~20GB data sets. Any hints on when it becomes impractical to broadcast... ~ >50MB? Some ballpark? On Thu, Nov 14, 2013 at 11:44 AM, Ryan Compton wrote: > I've done this with a "broadcast". It worked pretty well. Around

Re: Recommended way to join 2 RDDs - one large, the other small

2013-11-14 Thread Ryan Compton
My broadcast joins were working fine when the small data set was several GB, though that probably has more to do with the computer I was using than anything else. Switching code between a regular join and a broadcast join is easy. I basically copied the example here: http://ampcamp.berkeley.edu/wp-
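
For reference, a broadcast ("map-side") join along these lines can be sketched as below; this is not the linked AMP Camp code, and the keys and values are placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local[2]", "broadcast-join-sketch")

    // Large RDD: in practice hundreds of partitions and hundreds of MB (toy data here).
    val large = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))

    // Small RDD (< 50MB): pull it to the driver and broadcast it as a map.
    val small = sc.parallelize(Seq((1, "x"), (3, "y")))
    val smallMap = sc.broadcast(small.collectAsMap())

    // Map-side join: look up each key of the large RDD in the broadcast map,
    // avoiding the shuffle that a regular join() would trigger.
    val joined = large.flatMap { case (k, v) =>
      smallMap.value.get(k).map(w => (k, (v, w)))
    }

    joined.collect().foreach(println)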

Re: RDD.count() takes a lot of time

2013-11-14 Thread Valentin Michajlenko
Thank you, Meisam! But I have found something interesting (for me, as a novice in Spark). Working with 400k elements, count() takes 30 secs and .take(Int.MaxValue).size is less than a second! The problem comes when working with 1400k elements: .take(Int.MaxValue).size is not so quick. Best regards, V

any java api to get spark cluster info

2013-11-14 Thread Hussam_Jarada
I would like to get info like total cores and total memory available in the Spark cluster via a Spark Java API. Any suggestion? This will help me set the right number of partitions when invoking parallelize, for example. Thanks, Hussam

Spark meetup in Boston on Nov 21st

2013-11-14 Thread Matei Zaharia
Hey folks, just a quick announcement -- in case you’re interested in learning more about Spark in the Boston area, I’m going to speak at the Boston Hadoop Meetup next Thursday: http://www.meetup.com/bostonhadoop/events/150875522/. This is a good chance to meet local users and learn more about th

Re: any java api to get spark cluster info

2013-11-14 Thread Aaron Davidson
Hi! I think I have the maximally horrendous solution to this problem. If you just want to know the total cores of a Standalone or Coarse Grained scheduler, and are OK with going "off trail" of the public API, so to speak, you can use something like the following (just beware that it's liable to bre
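
Aaron's own snippet is not shown here; as a tamer sketch that stays closer to the public API, something like the following gives a rough picture, assuming a SparkContext that exposes getExecutorMemoryStatus and defaultParallelism (current versions do; the master URL is a placeholder):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "cluster-info-sketch")  // placeholder master URL

    // Memory: one entry per block manager, reported as (maxMemory, remainingMemory) in bytes.
    sc.getExecutorMemoryStatus.foreach { case (host, (max, remaining)) =>
      println(host + ": " + (max >> 20) + " MB max, " + (remaining >> 20) + " MB free")
    }

    // Cores: on a standalone / coarse-grained cluster, defaultParallelism defaults to the
    // total core count, which is a reasonable hint for sizing parallelize() partitions.
    println("defaultParallelism = " + sc.defaultParallelism)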