Why SparkPi example is slower than LocalPi example

2013-12-12 Thread Jaonary Rabarisoa
Hi all, I'm new to Spark and I'm trying to play with it in order to understand how it works. So I began by running the LocalPi and SparkPi examples on my laptop in local mode. I noticed that LocalPi is 3 times faster than SparkPi, which is supposed to be multi-threaded. Furthermore I have the

Compilation Error with Hadoop 2.2.0

2013-12-12 Thread Pinak Pani
I am trying to set up Spark with YARN 2.2.0. My Hadoop is plain Hadoop from the Apache Hadoop website. When I build with SBT against 2.2.0 it fails, while it compiles (with a lot of warnings) when I try against Hadoop 2.0.5-alpha. How can I compile Spark against YARN 2.2.0? There is a related thread here:

Re: Compilation Error with Hadoop 2.2.0

2013-12-12 Thread Prashant Sharma
I don't think YARN 2.2 is supported in 0.8, and very soon it will not be supported in master either. Read this thread: http://mail-archives.apache.org/mod_mbox/spark-dev/201312.mbox/browser. On Thu, Dec 12, 2013 at 4:24 PM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: I am trying to set up

Re: Compilation Error with Hadoop 2.2.0

2013-12-12 Thread Pinak Pani
Do you mean it has been decided not to support YARN 2.2 in any future release of version 0.8? http://mail-archives.apache.org has a big usability issue: you do not get a URL at the thread level, only at the month level. Can you please tell me the subject of the mail you are referring to? I will search in the

Re: Compilation Error with Hadoop 2.2.0

2013-12-12 Thread Prashant Sharma
Hey, On Thu, Dec 12, 2013 at 5:10 PM, Pinak Pani nishant.has.a.quest...@gmail.com wrote: Do you mean it has been decided not to support YARN 2.2 in any future release of version 0.8? Well, AFAIK. But it might get into 0.9. http://mail-archives.apache.org has a big usability issue: you do not

Re: Compilation Error with Hadoop 2.2.0

2013-12-12 Thread Pinak Pani
Alright. Thanks, guys. So, what version of Hadoop is currently supported by Spark? Also, I am not a Hadoop person; is it possible to access HDFS in Spark without YARN? On Thu, Dec 12, 2013 at 5:19 PM, Prashant Sharma scrapco...@gmail.com wrote: Hey, On Thu, Dec 12, 2013 at 5:10 PM, Pinak
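Accessing HDFS from Spark does not require YARN: a local or standalone Spark deployment talks to HDFS directly as a client, provided Spark was built against a compatible Hadoop version. A minimal sketch, assuming a hypothetical standalone master at spark://master:7077 and a NameNode at namenode:8020:

    import org.apache.spark.SparkContext

    object HdfsWithoutYarn {
      def main(args: Array[String]) {
        // Standalone Spark reads HDFS directly as a client; YARN is not involved.
        // The master URL and HDFS path below are placeholders for your cluster.
        val sc = new SparkContext("spark://master:7077", "HdfsWithoutYarn")
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
        println("line count: " + lines.count())
        sc.stop()
      }
    }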

some slaves don't actually start

2013-12-12 Thread Walrus theCat
Hi, I'm reading through the STDERR logs of my slaves, and about 1/4 of them don't actually start. Instead, the only thing on the log is the command that should have launched the process. Thoughts? Thanks

Re: Scala driver, Python workers?

2013-12-12 Thread Ewen Cheslack-Postava
Obviously it depends on what is missing, but if I were you, I'd try monkey patching pyspark with the functionality you need first (along with submitting a pull request, of course). The pyspark code is very readable, and a lot of functionality just builds on top of a few primitives, as in the

spark 0.8.0 fails on larger data set (Failed to run reduce at GradientDescent.scala:144)

2013-12-12 Thread Walrus theCat
Hi all, I've had smashing success with Spark 0.7.x with this code, and this same code on Spark 0.8.0 using a smaller data set. However, when I try to use a larger data set, some strange behavior occurs. I'm trying to do L2 regularization with Logistic Regression using the new ML Lib. Reading
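The poster's code is not shown; for context, here is a minimal sketch of the kind of MLlib call whose reduce runs at GradientDescent.scala:144, assuming the 0.8-era API (LabeledPoint takes an Array[Double] of features) and a hypothetical input file of "label,f1 f2 f3" lines:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local[4]", "LogRegSketch")
    // Parse "label,f1 f2 f3 ..." lines into labeled points and cache them,
    // since gradient descent makes one pass over the data per iteration.
    val points = sc.textFile("hdfs:///data/points.txt").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split(' ').map(_.toDouble))
    }.cache()
    // The reduce inside GradientDescent aggregates per-partition gradients here.
    val model = LogisticRegressionWithSGD.train(points, 100)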

Reading from filesystem blocks

2013-12-12 Thread Milos Nikolic
Hello, When trying to read from a file, sc.textFile() hangs for exactly one minute. From the spark-shell: scala> val v = sc.textFile("README.txt") // Hangs for one minute. After one minute the command successfully returns the result. Now, v.count also blocks for one minute but returns the

Re: spark 0.8.0 fails on larger data set (Failed to run reduce at GradientDescent.scala:144)

2013-12-12 Thread Taka Shinagawa
How big is your data set? Did you set the SPARK_MEM and SPARK_WORKER_MEMORY environment variables? On Thu, Dec 12, 2013 at 9:07 AM, Walrus theCat walrusthe...@gmail.com wrote: Hi all, I've had smashing success with Spark 0.7.x with this code, and this same code on Spark 0.8.0 using a smaller
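For reference, 0.8-era memory settings could also be passed as Java system properties before the SparkContext is created; a sketch, assuming the spark.executor.memory property name from the 0.8 configuration docs (the value is illustrative, not a recommendation):

    import org.apache.spark.SparkContext

    // Must be set before the SparkContext is constructed to take effect.
    System.setProperty("spark.executor.memory", "4g")
    val sc = new SparkContext("spark://master:7077", "BigJob")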

Re: Scala driver, Python workers?

2013-12-12 Thread Matei Zaharia
Yeah, I’m curious which APIs you found missing in Python. I know we have a lot on the Scala side that aren’t yet in there, but I’m not sure how to prioritize them. If you do want to call Python from Scala, you can also use the RDD.pipe() operation to pass data through an external process. However
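A minimal sketch of the pipe() approach from the spark-shell (where sc is predefined); my_script.py is a hypothetical program, present on every worker node, that reads records on stdin and writes one result per line to stdout:

    // Each partition's elements are streamed through the external process.
    val piped = sc.textFile("hdfs:///data/input.txt").pipe("python my_script.py")
    piped.take(5).foreach(println)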

Re: Why SparkPi example is slower than LocalPi example

2013-12-12 Thread Matei Zaharia
How long did they run for? The JVM takes a few seconds to start up and compile code, not to mention that Spark takes some time to initialize too, so you won’t see a major difference unless the application is taking longer. One other problem in this job is that it might use Math.random(), which
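Math.random() draws from a single shared generator, so parallel tasks contend on it. A sketch of a SparkPi-style estimate that sidesteps this by giving each partition its own generator (spark-shell style; the counts and seed are illustrative):

    import java.util.Random

    val n = 1000000
    val inside = sc.parallelize(1 to n, 4).mapPartitionsWithIndex { (idx, iter) =>
      // One generator per partition: no cross-thread contention.
      val rand = new Random(42 + idx)
      iter.map { _ =>
        val x = rand.nextDouble() * 2 - 1
        val y = rand.nextDouble() * 2 - 1
        if (x * x + y * y < 1) 1 else 0
      }
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / n)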

Re: spark avro: caching leads to identical records?

2013-12-12 Thread Matei Zaharia
The hadoopFile method reuses the Writable object between records that it reads by default, so you get back the same object. You should clone them if you need to cache them. This is kind of an unintuitive behavior that we’ll probably need to turn off by default; it’s helpful when you don’t need
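A sketch of the clone-before-cache advice, shown with text records rather than Avro for brevity (spark-shell style; the path is a placeholder). Copying each reused Writable into an immutable value before cache() ensures the cache holds distinct records rather than many references to one object:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
    // Copy the contents out of the reused Writables before caching.
    val cached = raw.map { case (k, v) => (k.get, v.toString) }.cache()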

RE: Writing to HBase

2013-12-12 Thread Benjamin Kim
Hi Philip, I got this bit of code to work in the spark-shell using scala against our dev hbase cluster. -bash-4.1$ export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/cloudera/parcels/CDH/lib/hbase/hbase.jar:/opt/cloudera/parcels/CDH/lib/hbase/conf:/opt/cloudera/parcels/CDH/lib/hadoop/conf
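With those jars on the classpath, the usual write path in that era went through the old-API TableOutputFormat and saveAsHadoopDataset. A hedged sketch from the spark-shell (the table, column family, and column names are placeholders, and this is the generic pattern rather than Benjamin's exact code):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf

    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")

    // Turn each (rowKey, value) pair into an HBase Put and write it out.
    sc.parallelize(Seq(("row1", "a"), ("row2", "b"))).map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), put)
    }.saveAsHadoopDataset(jobConf)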

Re: Scala driver, Python workers?

2013-12-12 Thread Patrick Grinaway
I'm going to try what Ewen suggested--the Python wrappers seem pretty straightforward to understand and very readable. In particular, I am interested in SparkContext.hadoopRDD() and RDD.saveAsTextFile() (with compression). To elaborate on the first count, I'd like to be able to take XML files in
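For the first item, a sketch of the Scala-side hadoopRDD() call that a pyspark wrapper would need to expose, reading via the old mapred API (the path is a placeholder; a real XML job would substitute an XML-aware InputFormat for TextInputFormat):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

    val jobConf = new JobConf()
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/xml")
    val records = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text])
    // Writables are reused between records, so copy values out before collecting.
    records.map { case (_, text) => text.toString }.take(3).foreach(println)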

Re: problems with standalone cluster

2013-12-12 Thread Aaron Davidson
You might also check the spark/work/ directory for application (Executor) logs on the slaves. On Tue, Nov 19, 2013 at 6:13 PM, Umar Javed umarj.ja...@gmail.com wrote: I have a scala script that I'm trying to run on a Spark standalone cluster with just one worker (existing on the master node).

Re: spark 0.8.0 fails on larger data set (Failed to run reduce at GradientDescent.scala:144)

2013-12-12 Thread Patrick Wendell
See if there are any logs on the slaves that suggest why the tasks are failing. Right now the master log is just saying some stuff is failing but it's not clear why. On Thu, Dec 12, 2013 at 9:36 AM, Taka Shinagawa taka.epsi...@gmail.com wrote: How big is your data set? Did you set SPARK_MEM

Re: Spark Vs R (Univariate Kernel Density Estimation)

2013-12-12 Thread Imran Rashid
Ah, got it, makes a lot more sense now. I couldn't figure out what w was; I should have figured it was weights. As Evan suggested, using zip is almost certainly what you want. val pointsAndWeights: RDD[(Double, Double)] = ... Zipping together id_x and id_w will give you exactly that, but maybe
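A minimal spark-shell sketch of the zip approach, with small parallelized stand-ins for id_x and id_w (zip requires both RDDs to have the same number of partitions and the same number of elements in each partition):

    // Illustrative stand-ins for the points (id_x) and weights (id_w).
    val xs = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), 2)
    val ws = sc.parallelize(Seq(0.5, 0.25, 0.15, 0.10), 2)
    val pointsAndWeights = xs.zip(ws)  // RDD[(Double, Double)]
    pointsAndWeights.collect().foreach(println)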

Re: spark avro: caching leads to identical records?

2013-12-12 Thread Robert Fink
Thanks, Matei. I expected something along these lines. Robert On Fri, Dec 13, 2013 at 5:28 AM, Matei Zaharia matei.zaha...@gmail.com wrote: The hadoopFile method reuses the Writable object between records that it reads by default, so you get back the same object. You should clone them if

writing to HDFS with a given username

2013-12-12 Thread Philip Ogren
When I call rdd.saveAsTextFile(hdfs://...) it uses my username to write to the HDFS drive. If I try to write to an HDFS directory that I do not have permissions to, then I get an error like this: Permission denied: user=me, access=WRITE, inode=/user/you/:you:us:drwxr-xr-x I can obviously

Re: writing to HDFS with a given username

2013-12-12 Thread Koert Kuipers
Hey Philip, how do you get Spark to write to HDFS with your user name? When I use Spark it writes to HDFS as the user that runs the Spark services... I wish it read and wrote as me. On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote: When I call
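On a cluster without Kerberos, HDFS trusts the client-supplied identity, so a common workaround of that era was setting the HADOOP_USER_NAME environment variable, or wrapping the call in UserGroupInformation.doAs. A sketch of the latter (spark-shell style; "you" and the path are placeholders), with the caveat that saveAsTextFile actually writes from the workers, so the identity may need to be set on them as well:

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    val rdd = sc.parallelize(1 to 10)
    // Assume the identity "you" for Hadoop client calls made inside run().
    val ugi = UserGroupInformation.createRemoteUser("you")
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      def run() {
        rdd.saveAsTextFile("hdfs:///user/you/output")
      }
    })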