extremely slow k-means version

2014-04-19 Thread ticup
Hi, I was playing around with other k-means implementations in Scala/Spark in order to test performance and get a better grasp of Spark. Now, I made one similar to the one from the examples

Re: ui broken in latest 1.0.0

2014-04-19 Thread Koert Kuipers
Got it. Makes sense. I am surprised it worked before... On Apr 18, 2014 9:12 PM, Andrew Or and...@databricks.com wrote: Hi Koert, I've tracked down what the bug is. The caveat is that each StageInfo only keeps around the RDDInfo of the last RDD associated with the Stage. More concretely, if

Re: ui broken in latest 1.0.0

2014-04-19 Thread Andrew Or
The reason why it worked before was because the UI would directly access sc.getStorageStatus, instead of getting it through Task and Stage events. This is not necessarily the best design, however, because the SparkContext and the SparkUI are closely coupled, and there is no way to create a SparkUI

Re: extremely slow k-means version

2014-04-19 Thread Matei Zaharia
The problem is that groupByKey means “bring all the points with this same key to the same JVM”. Your input is a Seq[Point], so you have to have all the points there. This means that a) all points will be sent across the network in a cluster, which is slow (and Spark goes through this sending
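The contrast described above can be sketched without Spark. The following is a minimal plain-Scala illustration, not the code from the thread: the object name, the helper names, and the 2-D Point alias are all hypothetical. It compares grouping every point of a cluster into one collection with keeping only a running (sumX, sumY, count) per cluster. In Spark, the second shape is roughly what reduceByKey over such tuples gives you, since partial sums can be combined before anything is sent over the network; whether that matches the fix suggested in the truncated message is an assumption.

```scala
// Plain-Scala sketch (no Spark dependency); all names here are hypothetical.
object KMeansAggregationSketch {
  type Point = (Double, Double)

  // Index of the centroid closest to p (squared Euclidean distance).
  def closest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy { i =>
      val (cx, cy) = centroids(i)
      val dx = p._1 - cx
      val dy = p._2 - cy
      dx * dx + dy * dy
    }

  // Grouping approach: every point of a cluster is materialized in one
  // collection before the mean is computed (the groupByKey shape).
  def updateByGrouping(points: Seq[Point], centroids: Vector[Point]): Map[Int, Point] =
    points.groupBy(closest(_, centroids)).map { case (k, ps) =>
      k -> (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
    }

  // Combining approach: only a running (sumX, sumY, count) is kept per
  // cluster, so no per-cluster collection of points is ever built.
  def updateByCombining(points: Seq[Point], centroids: Vector[Point]): Map[Int, Point] = {
    val sums = points.foldLeft(Map.empty[Int, (Double, Double, Long)]) { (acc, p) =>
      val k = closest(p, centroids)
      val (sx, sy, n) = acc.getOrElse(k, (0.0, 0.0, 0L))
      acc.updated(k, (sx + p._1, sy + p._2, n + 1))
    }
    sums.map { case (k, (sx, sy, n)) => k -> (sx / n, sy / n) }
  }

  def main(args: Array[String]): Unit = {
    val points = Seq((0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0))
    val centroids = Vector((1.0, 1.0), (9.0, 9.0))
    println(updateByGrouping(points, centroids))
    println(updateByCombining(points, centroids))
  }
}
```

Both functions return the same new centroids; the difference is only in how much data has to be held (or, on a cluster, shipped) per key.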

Re: extremely slow k-means version

2014-04-19 Thread ticup
Thanks a lot for the explanation Matei. As a matter of fact, I was just reading up on the paper on the Narrow and Wide Dependencies and saw that groupByKey is indeed a wide dependency which, as you explained, is the problem. Maybe it wouldn't be a bad thing to have a section in the docs on the

Re: Anyone using value classes in RDDs?

2014-04-19 Thread kamatsuoka
No, you can wrap other types in value classes as well. You can try it in the REPL: scala> case class ID(val id: String) extends AnyVal defined class ID scala> val i = ID("foo") i: ID = ID(foo) On Fri, Apr 18, 2014 at 4:14 PM, Koert Kuipers [via Apache Spark User List]
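As a compilable version of the REPL session above, here is a small sketch of value classes wrapping a String and a Double. The enclosing object and the Meters class are hypothetical additions for illustration; only ID comes from the message.

```scala
// Sketch only: ValueClassSketch and Meters are hypothetical names.
object ValueClassSketch {
  // A value class wraps exactly one val and extends AnyVal.
  // It may be defined at the top level or inside an object, as here.
  case class ID(id: String) extends AnyVal

  // Value classes can wrap non-String types and define methods.
  case class Meters(value: Double) extends AnyVal {
    def +(other: Meters): Meters = Meters(value + other.value)
  }

  def main(args: Array[String]): Unit = {
    val i = ID("foo")
    println(i)                         // prints ID(foo)
    println(Meters(1.5) + Meters(2.0)) // prints Meters(3.5)
  }
}
```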

Help with error initializing SparkR.

2014-04-19 Thread tongzzz
I can't initialize the sc context after a successful install on the Cloudera QuickStart VM. This is the error message: library(SparkR) Loading required package: rJava [SparkR] Initializing with classpath /usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar sc <- sparkR.init("local") Error in

questions about toArray and ClassTag

2014-04-19 Thread wxhsdp
Hi all, I am quite new to Scala. I ran some tests in the Spark shell: val b = a.mapPartitions{D => val p = D.toArray . p.toIterator } When a is an RDD of type RDD[Int], b.collect() works, but when I change a to RDD[MyOwnType], b.collect() returns an error: 14/04/20 10:14:46 ERROR

Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically for a HadoopRDD, who determines which worker has to get which task?