It's not running out of memory on the driver though, right? The executors may need more memory, or you could use more executors. --executor-memory would let you increase it from the default of 512MB.
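
As a rough sketch of what that launch could look like, here is the flag above combined with the --driver-memory suggestion quoted below. The 2g values and the ./bin/spark-shell path are assumptions for illustration only, not recommendations; size them to the RAM the machine actually has.

```shell
# Assumed values: 2g is a placeholder, adjust to the RAM actually available.
# Assumed location: run from the Spark installation directory.
./bin/spark-shell --driver-memory 2g --executor-memory 2g
```

If the shell is running in local mode, which the "executor localhost" entries in the quoted log suggest, the tasks most likely run inside the driver JVM, so --driver-memory is probably the setting that takes effect there.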

On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz <[email protected]> wrote:
> Hi,
>
> Could you try running spark-shell with the flag --driver-memory 2g or more if
> you have more RAM available and try again?
>
> Thanks,
> Burak
>
> ----- Original Message -----
> From: "AlexanderRiggers" <[email protected]>
> To: [email protected]
> Sent: Thursday, August 7, 2014 7:37:40 AM
> Subject: KMeans Input Format
>
> I want to perform a K-Means task, but training the model fails and I get kicked
> out of Spark's Scala shell before I get my result metrics. I am not sure if
> the input format is the problem or something else. I use Spark 1.0.0 and my
> input text file (400MB) looks like this:
>
> 86252 3711 15.4 4.18
> 86252 3504 28 1.25
> 86252 3703 10.75 8.85
> 86252 3703 10.5 5.55
> 86252 2201 64 2.79
> 12262064 7203 32 8.49
> 12262064 2119 32 1.99
> 12262064 3405 8.5 2.99
> 12262064 2119 23 0
> 12262064 2119 33.8 1.5
> 12262064 3611 23.7 1.95
> etc.
>
> The columns are ID, Category, ProductSize, PurchaseAmount. I am not sure if I can
> use the first two, because the MLlib example file only uses floats.
> So I also tried the last two:
>
> 16 2.49
> 64 3.29
> 56 1
> 16 3.29
> 6 4.99
> 10.75 0.79
> 4.6 3.99
> 11 1.18
> 5.8 1.25
> 15 0.99
>
> My error code in both cases is here:
>
> scala> import org.apache.spark.mllib.clustering.KMeans
> import org.apache.spark.mllib.clustering.KMeans
>
> scala> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.Vectors
>
> scala>
>
> scala> // Load and parse the data
>
> scala> val data = sc.textFile("data/outkmeanssm.txt")
> 14/08/07 16:15:37 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
> 14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
> data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :14
>
> scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :16
>
> scala>
>
> scala> // Cluster the data into two classes using KMeans
>
> scala> val numClusters = 2
> numClusters: Int = 2
>
> scala> val numIterations = 20
> numIterations: Int = 20
>
> scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
> 14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded
> 14/08/07 16:15:38 INFO FileInputFormat: Total input paths to process : 1
> 14/08/07 16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
> 14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
> 14/08/07 16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
> 14/08/07 16:15:38 INFO DAGScheduler: Parents of final stage: List()
> 14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List()
> 14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
> 14/08/07 16:15:39 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
> 14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:2 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:3 as 2221 bytes in 1 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO Executor: Running task ID 4
> 14/08/07 16:15:39 INFO Executor: Running task ID 1
> 14/08/07 16:15:39 INFO Executor: Running task ID 5
> 14/08/07 16:15:39 INFO Executor: Running task ID 6
> 14/08/07 16:15:39 INFO Executor: Running task ID 0
> 14/08/07 16:15:39 INFO Executor: Running task ID 3
> 14/08/07 16:15:39 INFO Executor: Running task ID 2
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_0 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_2 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_1 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_4 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_6 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_3 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_5 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
> 14/08/07 16:16:53 ERROR Executor: Exception in task ID 5
> java.lang.OutOfMemoryError: Java heap space
>     at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
>     at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>     at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:695)
> 14/08/07 16:16:59 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-5,5,main]
> java.lang.OutOfMemoryError: Java heap space
>     at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
>     at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>     at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:695)
> 14/08/07 16:17:00 WARN TaskSetManager: Lost TID 5 (task 0.0:5)
> Chairs-MacBook-Pro:spark-1.0.0 admin$
> Chairs-MacBook-Pro:spark-1.0.0 admin$ // Evaluate clustering by computing Within Set Sum of Squared Errors
> -bash: //: is a directory
> Chairs-MacBook-Pro:spark-1.0.0 admin$ val WSSSE = clusters.computeCost(parsedData)
> -bash: syntax error near unexpected token `('
> Chairs-MacBook-Pro:spark-1.0.0 admin$ println("Within Set Sum of Squared Errors = " + WSSSE)
>
> What am I missing?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
