Hi,
I was playing around with other k-means implementations in Scala/Spark in
order to test performance and get a better grasp of Spark.
Now, I made one similar to the one from the examples
got it. makes sense. i am surprised it worked before...
On Apr 18, 2014 9:12 PM, Andrew Or and...@databricks.com wrote:
Hi Koert,
I've tracked down what the bug is. The caveat is that each StageInfo only
keeps around the RDDInfo of the last RDD associated with the Stage. More
concretely, if
The reason why it worked before was because the UI would directly access
sc.getStorageStatus, instead of getting it through Task and Stage events.
This is not necessarily the best design, however, because the SparkContext
and the SparkUI are closely coupled, and there is no way to create a
SparkUI
The problem is that groupByKey means “bring all the points with this same key
to the same JVM”. Your input is a Seq[Point], so you have to have all the
points there. This means that a) all points will be sent across the network in
a cluster, which is slow (and Spark goes through this sending
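If the goal is just a per-cluster aggregate (as in the k-means discussion above), reduceByKey avoids shipping every point: partial sums are combined map-side, so only one (sum, count) pair per key per partition crosses the network. Below is a minimal sketch of that pattern for the spark-shell; the Point class and the sample input are purely illustrative.

import org.apache.spark.SparkContext._   // pair-RDD operations like reduceByKey

case class Point(x: Double, y: Double)

// illustrative input: each point paired with the id of the cluster it was assigned to
val assigned = sc.parallelize(Seq((0, Point(1, 2)), (0, Point(3, 4)), (1, Point(5, 6))))

val centroids = assigned
  .mapValues(p => (p, 1))                        // carry a count along with each point
  .reduceByKey { case ((p1, n1), (p2, n2)) =>    // partial sums are combined map-side,
    (Point(p1.x + p2.x, p1.y + p2.y), n1 + n2)   // so whole points never pile up in one JVM
  }
  .mapValues { case (sum, n) => Point(sum.x / n, sum.y / n) }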
Thanks a lot for the explanation Matei.
As a matter of fact, I was just reading up on narrow and wide dependencies in
the paper and saw that groupByKey is indeed a wide dependency which, as you
explained, is the problem.
Maybe it wouldn't be a bad thing to have a section in the docs on the
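A quick way to see the narrow/wide distinction from the shell is toDebugString, which prints the lineage and shows where a shuffle (i.e. a wide dependency) is introduced. A small sketch, with the caveat that the exact output format differs between Spark versions:

val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))

val narrow = pairs.mapValues(_ + 1)   // narrow: each output partition reads one parent partition
val wide   = pairs.groupByKey()       // wide: each output partition may read from all parent partitions

println(narrow.toDebugString)         // lineage stays within a single stage
println(wide.toDebugString)           // a ShuffledRDD appears, marking the stage boundary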
No, you can wrap other types in value classes as well. You can try it in
the REPL:
scala> case class ID(val id: String) extends AnyVal
defined class ID

scala> val i = ID("foo")
i: ID = ID(foo)
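As an aside (illustrative names only): a value class can also wrap a numeric type and carry methods, and because it extends AnyVal the wrapper is usually erased at runtime, so it costs roughly the same as the bare value.

case class UserId(value: Long) extends AnyVal {
  def asString: String = value.toString
}

val u = UserId(42L)
println(u.asString)   // "42"; the method call is compiled to a static call, no wrapper allocated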
On Fri, Apr 18, 2014 at 4:14 PM, Koert Kuipers [via Apache Spark User List]
I can't initialize the Spark context (sc) after a successful install on the
Cloudera quickstart VM.
This is the error message:
library(SparkR)
Loading required package: rJava
[SparkR] Initializing with classpath
/usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar
sc <- sparkR.init("local")
Error in
Hi, all
I'm quite new to Scala and am doing some tests in the spark-shell.
val b = a.mapPartitions { D =>
  val p = D.toArray
  ...
  p.toIterator
}
When a is an RDD of type RDD[Int], b.collect() works, but when I change a to
RDD[MyOwnType], b.collect() fails with an error:
14/04/20 10:14:46 ERROR
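Hard to say more without the full stack trace, but a common cause when the element type changes from Int to a user-defined class is that the class is not serializable (or not on the workers' classpath). Here is a self-contained sketch of the same pattern that should work in the spark-shell; MyOwnType is just an illustrative stand-in.

case class MyOwnType(value: Int)   // case classes are Serializable out of the box

val a = sc.parallelize(1 to 10).map(MyOwnType(_))

val b = a.mapPartitions { iter =>
  val p = iter.toArray    // materialize the whole partition
  // ... per-partition work on p ...
  p.toIterator            // hand the elements back as an Iterator
}

b.collect().foreach(println)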
During a Spark stage, how are tasks split among the workers? Specifically
for a HadoopRDD, who determines which worker has to get which task?
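Roughly: for a HadoopRDD there is one task per partition, each partition wraps one Hadoop InputSplit, and the scheduler consults each partition's preferred locations (the split's block locations) when deciding where to run its task. You can inspect this from the shell; the path below is illustrative.

val rdd = sc.textFile("hdfs:///some/path")   // illustrative path

rdd.partitions.foreach { part =>
  // preferredLocations is derived from the underlying InputSplit's block locations
  println("partition " + part.index + " prefers " + rdd.preferredLocations(part).mkString(", "))
}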