Re: KMeans Input Format
Thank you for your help. After restructuring my code per Sean's suggestion, it worked without changing the Spark context. I have now taken the same file format, just a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge instances and Spark 1.0.2. Unluckily, my task freezes again after a short time. I tried it with cached and uncached RDDs. Are there any configurations to be made for such big files and MLlib?

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.clustering.KMeansModel

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> // Load and parse the data

scala> val data = sc.textFile("s3n://ampcamp-arigge/large_file.new.txt")
14/08/09 14:58:31 INFO storage.MemoryStore: ensureFreeSpace(35666) called with curMem=0, maxMem=309225062
14/08/09 14:58:31 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.8 KB, free 294.9 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :15

scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :17

scala> val train = parsedData.repartition(20).cache()
train: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[6] at repartition at :19

scala> // Set model and run it

scala> val model = new KMeans().
     | setInitializationMode("k-means||").
     | setK(2).setMaxIterations(2).
     | setEpsilon(1e-4).
     | setRuns(1).
     | run(parsedData)
14/08/09 14:58:33 WARN snappy.LoadSnappy: Snappy native library is available
14/08/09 14:58:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/09 14:58:33 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/09 14:58:34 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/09 14:58:34 INFO spark.SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 2 output partitions (allowLocal=false)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Missing parents: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[10] at map at KMeans.scala:123), which has no missing parents
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[10] at map at KMeans.scala:123)
14/08/09 14:58:34 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 5: ip-172-31-16-25.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 2215 bytes in 2 ms
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 4: ip-172-31-16-24.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 2215 bytes in 1 ms

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11834.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
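For what it's worth, the transcript above defines the repartitioned, cached `train` RDD but then passes `parsedData` to `run`, so the cache is never used by the k-means iterations. Below is a minimal sketch of the same session that trains on the cached RDD; it assumes spark-shell was started with enough executor memory (e.g. `--executor-memory 4g`), and whether that alone fixes the freeze on 2.7 GB is not certain.

```scala
// Sketch only: requires a live spark-shell SparkContext (sc), Spark 1.0.x.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("s3n://ampcamp-arigge/large_file.new.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val train = parsedData.repartition(20).cache()

// Trailing periods keep each REPL line an incomplete statement.
val model = new KMeans().
  setInitializationMode("k-means||").
  setK(2).setMaxIterations(2).
  setEpsilon(1e-4).
  setRuns(1).
  run(train) // train (cached), not parsedData
```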
Re: KMeans Input Format
(-incubator, +user)

It's a method of KMeansModel, not KMeans. At first glance it looks like model should be a KMeansModel, but Scala says it's not. The problem is:

val model = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)

The first line is a complete statement, and you end up executing just 'val model = new KMeans()'. I forget in which cases the Scala compiler will interpret it the way you intend and when it won't, but, to avoid doubt, I put the periods at the end of the preceding line. That way the lines can't be interpreted as concluding a statement.

On Fri, Aug 8, 2014 at 3:18 PM, AlexanderRiggers wrote:
> Thanks for your answers. I added some lines to my code and it went through,
> but I get an error message for my compute cost function now...
>
> scala> val WSSSE = model.computeCost(train)
> 14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(, 192.168.0.33, 49242, 0) with no recent heart beats: 156207ms exceeds 45000ms
> 14/08/08 15:48:42 INFO BlockManager: BlockManager re-registering with master
> 14/08/08 15:48:42 INFO BlockManagerMaster: Trying to register BlockManager
> 14/08/08 15:48:42 INFO BlockManagerInfo: Registering block manager 192.168.0.33:49242 with 303.4 MB RAM
> 14/08/08 15:48:42 INFO BlockManagerMaster: Registered BlockManager
> 14/08/08 15:48:42 INFO BlockManager: Reporting 0 blocks to the master.
>
> :30: error: value computeCost is not a member of org.apache.spark.mllib.clustering.KMeans
>        val WSSSE = model.computeCost(train)
>
> computeCost should be a member of KMeans, isn't it?
>
> My whole code is here:
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("Kmeans")
>   .set("spark.executor.memory", "2g")
> val sc = new SparkContext(conf)
>
> import org.apache.spark.mllib.clustering.KMeans
> import org.apache.spark.mllib.clustering.KMeansModel
> import org.apache.spark.mllib.linalg.Vectors
>
> // Load and parse the data
> val data = sc.textFile("data/outkmeanssm.txt")
> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
> val train = parsedData.repartition(20).cache()
>
> // Set model and run it
> val model = new KMeans()
>   .setInitializationMode("k-means||")
>   .setK(2)
>   .setMaxIterations(2)
>   .setEpsilon(1e-4)
>   .setRuns(1)
>   .run(train)
>
> // Evaluate clustering by computing Within Set Sum of Squared Errors
> val WSSSE = model.computeCost(train)
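To see why the trailing-period style is REPL-safe, here is a tiny self-contained sketch with a hypothetical builder class (not a Spark class): when the period ends the line, every line is an incomplete expression, so a line-by-line interpreter must keep reading instead of closing the statement early.

```scala
// Hypothetical builder, used only to illustrate the chaining style.
class Counter(val n: Int) {
  def incr(): Counter = new Counter(n + 1)
}

// Trailing periods: the whole chain is parsed as one statement,
// so c holds the fully-built value, not just `new Counter(0)`.
val c = new Counter(0).
  incr().
  incr()

println(c.n) // 2
```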
Re: KMeans Input Format
Thanks for your answers. I added some lines to my code and it went through, but I get an error message for my compute cost function now...

scala> val WSSSE = model.computeCost(train)
14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(, 192.168.0.33, 49242, 0) with no recent heart beats: 156207ms exceeds 45000ms
14/08/08 15:48:42 INFO BlockManager: BlockManager re-registering with master
14/08/08 15:48:42 INFO BlockManagerMaster: Trying to register BlockManager
14/08/08 15:48:42 INFO BlockManagerInfo: Registering block manager 192.168.0.33:49242 with 303.4 MB RAM
14/08/08 15:48:42 INFO BlockManagerMaster: Registered BlockManager
14/08/08 15:48:42 INFO BlockManager: Reporting 0 blocks to the master.

:30: error: value computeCost is not a member of org.apache.spark.mllib.clustering.KMeans
       val WSSSE = model.computeCost(train)

computeCost should be a member of KMeans, isn't it?

My whole code is here:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("Kmeans")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/outkmeanssm.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val train = parsedData.repartition(20).cache()

// Set model and run it
val model = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = model.computeCost(train)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11788.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
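As a side note, the convenience method KMeans.train does the whole builder dance in one statement and hands back a KMeansModel, which is the type that carries computeCost. A sketch, assuming a running spark-shell on Spark 1.0.x with the `train` RDD from the code above:

```scala
// Sketch only: requires a live SparkContext and the train RDD.
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// KMeans.train returns a KMeansModel; computeCost lives on the model,
// not on the KMeans estimator itself.
val model: KMeansModel = KMeans.train(train, 2, 20)
val WSSSE: Double = model.computeCost(train)
```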
Re: KMeans Input Format
Besides durin's suggestion, please also confirm the driver and executor memory in the Web UI, since they look small according to the log:

14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)

-Xiangrui
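For reference, a sketch of raising executor memory in the application itself. Note that setting driver memory from inside the program usually comes too late, since the driver JVM is already running, so the spark-shell/spark-submit flag (--driver-memory 2g) is the safer route for that one:

```scala
// Config sketch, Spark 1.0.x property names.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Kmeans")
  // Executors get this much heap each; adjust to the instance size.
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
```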
Re: KMeans Input Format
Not all memory can be used for Java heap space, so maybe it does run out. Could you try repartitioning the data? To my knowledge you shouldn't be thrown out as long as a single partition fits into memory, even if the whole dataset does not. To do that, exchange

val train = parsedData.cache()

with

val train = parsedData.repartition(20).cache()

Best regards,
Simon

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11719.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
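The 20 in repartition(20) is just an example. One rough way to pick a count is to target a fixed number of megabytes per partition; the helper below is made up for illustration, not a Spark API:

```scala
// Back-of-envelope sketch: partitions needed so each one holds
// roughly targetMBPerPartition megabytes of the input file.
def partitionsFor(fileMB: Int, targetMBPerPartition: Int): Int =
  math.max(1, math.ceil(fileMB.toDouble / targetMBPerPartition).toInt)

println(partitionsFor(400, 20))  // 400 MB file at ~20 MB/partition -> 20
println(partitionsFor(2700, 64)) // the 2.7 GB file at ~64 MB/partition
```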
Re: KMeans Input Format
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 20:00:40 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/08/07 20:00:41 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 20:00:41 ERROR TaskSetManager: Task 0.0:0 failed 1 times; aborting job
Chairs-MacBook-Pro:spark-1.0.0 admin$

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11698.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
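The OutOfMemoryError above fires inside CacheManager.getOrCompute, i.e. while materializing a cached partition. One hedged workaround besides adding memory: persist with a storage level that spills to disk, instead of cache(), which is MEMORY_ONLY. A sketch against the Spark 1.0 API:

```scala
// Sketch only: requires a live SparkContext and the parsedData RDD.
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK writes partitions that do not fit on the heap to
// local disk rather than failing the task with an OOM.
val train = parsedData
  .repartition(20)
  .persist(StorageLevel.MEMORY_AND_DISK)
```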
Re: KMeans Input Format
It's not running out of memory on the driver though, right? The executors may need more memory, or use more executors. --executor-memory would let you increase from the default of 512MB.

On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz wrote:
> Hi,
>
> Could you try running spark-shell with the flag --driver-memory 2g or more if
> you have more RAM available and try again?
>
> Thanks,
> Burak
>
> ----- Original Message -----
> From: "AlexanderRiggers"
> To: u...@spark.incubator.apache.org
> Sent: Thursday, August 7, 2014 7:37:40 AM
> Subject: KMeans Input Format
>
> I want to perform a K-Means task and fail training the model and get kicked
> out of Spark's Scala shell before I get my result metrics. I am not sure if
> the input format is the problem or something else. I use Spark 1.0.0 and my
> input text file (400 MB) looks like this:
>
> 86252 3711 15.4 4.18
> 86252 3504 28 1.25
> [...]
>
> It is ID, Category, ProductSize, PurchaseAmount. I am not sure if I can use
> the first two, because the MLlib example file only uses floats. So I
> also tried the last two:
>
> 16 2.49
> 64 3.29
> [...]
>
> My error code in both cases is here: [...]
Re: KMeans Input Format
Hi,

Could you try running spark-shell with the flag --driver-memory 2g or more if you have more RAM available and try again?

Thanks,
Burak

----- Original Message -----
From: "AlexanderRiggers"
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 7:37:40 AM
Subject: KMeans Input Format

I want to perform a K-Means task and fail training the model and get kicked out of Spark's Scala shell before I get my result metrics. I am not sure if the input format is the problem or something else. I use Spark 1.0.0 and my input text file (400 MB) looks like this:

86252 3711 15.4 4.18
86252 3504 28 1.25
86252 3703 10.75 8.85
86252 3703 10.5 5.55
86252 2201 64 2.79
12262064 7203 32 8.49
12262064 2119 32 1.99
12262064 3405 8.5 2.99
12262064 2119 23 0
12262064 2119 33.8 1.5
12262064 3611 23.7 1.95
etc.

It is ID, Category, ProductSize, PurchaseAmount. I am not sure if I can use the first two, because the MLlib example file only uses floats. So I also tried the last two:

16 2.49
64 3.29
56 1
16 3.29
6 4.99
10.75 0.79
4.6 3.99
11 1.18
5.8 1.25
15 0.99

My error code in both cases is here:

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> // Load and parse the data

scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 16:15:37 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :14

scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :16

scala> // Cluster the data into two classes using KMeans

scala> val numClusters = 2
numClusters: Int = 2

scala> val numIterations = 20
numIterations: Int = 20

scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded
14/08/07 16:15:38 INFO FileInputFormat: Total input paths to process : 1
14/08/07 16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
14/08/07 16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
14/08/07 16:15:38 INFO DAGScheduler: Parents of final stage: List()
14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List()
14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
14/08/07 16:15:39 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:2 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:3 as 2221 bytes in 1 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO Executor: Running task ID 4
14/08/07 16:15:39 INFO Executor: Running task ID 1
14/08/07 16:15:39 INFO Executor: Running task ID 5
14/08/07 16:15:39 INFO Executor: Running task ID 6
14/08/07 16:15:39 INFO Executor: Running task ID 0
14/08/07 16:15:39 INFO Executor: Running task ID 3
14/08/07 16:15:39 INFO Executor: Running task ID 2
14/08
KMeans Input Format
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 16:17:00 WARN TaskSetManager: Lost TID 5 (task 0.0:5)
Chairs-MacBook-Pro:spark-1.0.0 admin$
Chairs-MacBook-Pro:spark-1.0.0 admin$ // Evaluate clustering by computing Within Set Sum of Squared Errors
-bash: //: is a directory
Chairs-MacBook-Pro:spark-1.0.0 admin$ val WSSSE = clusters.computeCost(parsedData)
-bash: syntax error near unexpected token `('
Chairs-MacBook-Pro:spark-1.0.0 admin$ println("Within Set Sum of Squared Errors = " + WSSSE)

What am I missing?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
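On the input-format question in the original post: if the rows really are "ID Category ProductSize PurchaseAmount" and only the last two numeric columns should feed KMeans, the IDs can be dropped at parse time instead of preparing a second file. A small sketch of just the parsing step, in plain Scala (under Spark, wrap the result in Vectors.dense inside the data.map(...) call); the helper name is made up:

```scala
// Keep only the feature columns (drop ID and Category).
def parseFeatures(line: String): Array[Double] =
  line.split(' ').drop(2).map(_.toDouble)

println(parseFeatures("86252 3711 15.4 4.18").mkString(",")) // 15.4,4.18
```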