Re: KMeans Input Format

Sean Owen Thu, 07 Aug 2014 10:25:46 -0700

It's not running out of memory on the driver though, right? The
executors may need more memory, or you could use more executors.
--executor-memory would let you increase it from the default of 512MB.
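
For a standalone application rather than spark-shell, the same setting can
be given through SparkConf. The following is only a minimal sketch under
assumptions: the object name KMeansWithMoreMemory and the value 2g are made
up for illustration, the file path and KMeans parameters are copied from the
quoted session, and the input is assumed to hold one record per line. Note
that the quoted log shows tasks running on "executor localhost", which
suggests local mode; there the tasks run inside the driver JVM, so the
driver heap (set at launch, e.g. with --driver-memory) is the one that
counts and spark.executor.memory would likely make no difference.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical object name, used only for this illustration.
object KMeansWithMoreMemory {
  def main(args: Array[String]): Unit = {
    // "2g" is an assumed value; size it to the RAM actually available.
    // The master URL is left unset so that spark-submit can supply it.
    val conf = new SparkConf()
      .setAppName("KMeansWithMoreMemory")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)

    // Same parsing as in the quoted session.
    // Assumes one space-separated numeric record per line.
    val parsedData = sc.textFile("data/outkmeanssm.txt")
      .map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

    // Two clusters, 20 iterations, as in the quoted session.
    val clusters = KMeans.train(parsedData, 2, 20)
    println("Within Set Sum of Squared Errors = " + clusters.computeCost(parsedData))

    sc.stop()
  }
}
```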


On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz <[email protected]> wrote:
> Hi,
>
> Could you try running spark-shell with the flag --driver-memory 2g or more if 
> you have more RAM available and try again?
>
> Thanks,
> Burak
>
> ----- Original Message -----
> From: "AlexanderRiggers" <[email protected]>
> To: [email protected]
> Sent: Thursday, August 7, 2014 7:37:40 AM
> Subject: KMeans Input Format
>
> I want to perform a K-Means task, but training the model fails and I get
> kicked out of Spark's Scala shell before I get my result metrics. I am not
> sure if the input format is the problem or something else. I use Spark 1.0.0
> and my input text file (400MB) looks like this:
>
> 86252 3711 15.4 4.18 86252 3504 28 1.25 86252 3703 10.75 8.85 86252 3703
> 10.5 5.55 86252 2201 64 2.79 12262064 7203 32 8.49 12262064 2119 32 1.99
> 12262064 3405 8.5 2.99 12262064 2119 23 0 12262064 2119 33.8 1.5 12262064
> 3611 23.7 1.95 etc.
>
> It is ID, Category, ProductSize, PurchaseAmount. I am not sure if I can use
> the first two, because the MLlib example file only uses floats. So I
> also tried the last two:
>
> 16 2.49 64 3.29 56 1 16 3.29 6 4.99 10.75 0.79 4.6 3.99 11 1.18 5.8 1.25 15
> 0.99
>
> My error code in both cases is here:
>
> scala> import org.apache.spark.mllib.clustering.KMeans
> import org.apache.spark.mllib.clustering.KMeans
>
> scala> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.Vectors
>
> scala>
>
> scala> // Load and parse the data
>
> scala> val data = sc.textFile("data/outkmeanssm.txt")
> 14/08/07 16:15:37 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
> 14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
> data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :14
>
> scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at :16
>
> scala>
>
> scala> // Cluster the data into two classes using KMeans
>
> scala> val numClusters = 2
> numClusters: Int = 2
>
> scala> val numIterations = 20
> numIterations: Int = 20
>
> scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
> 14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded
> 14/08/07 16:15:38 INFO FileInputFormat: Total input paths to process : 1
> 14/08/07 16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
> 14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
> 14/08/07 16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
> 14/08/07 16:15:38 INFO DAGScheduler: Parents of final stage: List()
> 14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List()
> 14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
> 14/08/07 16:15:39 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
> 14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:2 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:3 as 2221 bytes in 1 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
> 14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms
> 14/08/07 16:15:39 INFO Executor: Running task ID 4
> 14/08/07 16:15:39 INFO Executor: Running task ID 1
> 14/08/07 16:15:39 INFO Executor: Running task ID 5
> 14/08/07 16:15:39 INFO Executor: Running task ID 6
> 14/08/07 16:15:39 INFO Executor: Running task ID 0
> 14/08/07 16:15:39 INFO Executor: Running task ID 3
> 14/08/07 16:15:39 INFO Executor: Running task ID 2
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_0 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_2 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_1 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_4 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_6 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_3 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
> 14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_5 not found, computing it
> 14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
> 14/08/07 16:16:53 ERROR Executor: Exception in task ID 5
> java.lang.OutOfMemoryError: Java heap space
>     at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
>     at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>     at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:695)
> 14/08/07 16:16:59 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-5,5,main]
> java.lang.OutOfMemoryError: Java heap space
>     at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
>     at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
>     at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>     at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:695)
> 14/08/07 16:17:00 WARN TaskSetManager: Lost TID 5 (task 0.0:5)
> Chairs-MacBook-Pro:spark-1.0.0 admin$
> Chairs-MacBook-Pro:spark-1.0.0 admin$ // Evaluate clustering by computing Within Set Sum of Squared Errors
> -bash: //: is a directory
> Chairs-MacBook-Pro:spark-1.0.0 admin$ val WSSSE = clusters.computeCost(parsedData)
> -bash: syntax error near unexpected token `('
> Chairs-MacBook-Pro:spark-1.0.0 admin$ println("Within Set Sum of Squared Errors = " + WSSSE)
>
> What am I missing?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
