KMeans Input Format

2014-08-07 Thread AlexanderRiggers
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 16:17:00 WARN TaskSetManager: Lost TID 5 (task 0.0:5)
Chairs-MacBook-Pro:spark-1.0.0 admin$
Chairs-MacBook-Pro:spark-1.0.0 admin$ // Evaluate clustering by computing Within Set Sum of Squared Errors
-bash: //: is a directory
Chairs-MacBook-Pro:spark-1.0.0 admin$ val WSSSE = clusters.computeCost(parsedData)
-bash: syntax error near unexpected token `('
Chairs-MacBook-Pro:spark-1.0.0 admin$ println("Within Set Sum of Squared Errors = " + WSSSE)

What am I missing?






Re: KMeans Input Format

2014-08-07 Thread Burak Yavuz
Hi,

Could you try running spark-shell with the flag --driver-memory 2g (or more, if
you have more RAM available) and try again?
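
For reference, assuming spark-shell is started from the Spark home directory, the launch command would look roughly like this:

./bin/spark-shell --driver-memory 2g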

Thanks,
Burak

- Original Message -
From: "AlexanderRiggers" 
To: u...@spark.incubator.apache.org
Sent: Thursday, August 7, 2014 7:37:40 AM
Subject: KMeans Input Format

I want to perform a K-Means task, but training the model fails and I get kicked
out of Spark's Scala shell before I get my result metrics. I am not sure whether
the input format is the problem or something else. I use Spark 1.0.0, and my
input text file (400 MB) looks like this:

86252 3711 15.4 4.18 86252 3504 28 1.25 86252 3703 10.75 8.85 86252 3703
10.5 5.55 86252 2201 64 2.79 12262064 7203 32 8.49 12262064 2119 32 1.99
12262064 3405 8.5 2.99 12262064 2119 23 0 12262064 2119 33.8 1.5 12262064
3611 23.7 1.95 etc.

The columns are ID, Category, ProductSize, PurchaseAmount. I am not sure if I can
use the first two, because the MLlib example file only uses floats, so I also
tried just the last two:

16 2.49 64 3.29 56 1 16 3.29 6 4.99 10.75 0.79 4.6 3.99 11 1.18 5.8 1.25 15
0.99
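
For reference, one way to keep only the last two columns at parse time, instead of preparing a separate file, would be something like the following sketch (assuming each line holds one ID, Category, ProductSize, PurchaseAmount record):

val parsedData = data.map { s =>
  val cols = s.split(' ').map(_.toDouble)
  Vectors.dense(cols.drop(2))  // drop ID and Category, keep ProductSize and PurchaseAmount
}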

The error I get in both cases is below:

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala>

scala> // Load and parse the data

scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 16:15:37 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14

scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:16

scala>

scala> // Cluster the data into two classes using KMeans

scala> val numClusters = 2
numClusters: Int = 2

scala> val numIterations = 20
numIterations: Int = 20

scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded
14/08/07 16:15:38 INFO FileInputFormat: Total input paths to process : 1
14/08/07 16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
14/08/07 16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
14/08/07 16:15:38 INFO DAGScheduler: Parents of final stage: List()
14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List()
14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
14/08/07 16:15:39 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:2 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:3 as 2221 bytes in 1 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO Executor: Running task ID 4
14/08/07 16:15:39 INFO Executor: Running task ID 1
14/08/07 16:15:39 INFO Executor: Running task ID 5
14/08/07 16:15:39 INFO Executor: Running task ID 6
14/08/07 16:15:39 INFO Executor: Running task ID 0
14/08/07 16:15:39 INFO Executor: Running task ID 3
14/08/07 16:15:39 INFO Executor: Running task ID 2
14/08

Re: KMeans Input Format

2014-08-07 Thread Sean Owen
It's not running out of memory on the driver though, right? The
executors may need more memory, or you may need more executors.
--executor-memory would let you increase from the default of 512MB.
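
For example, building on the earlier suggestion, a launch line that sets both would look roughly like this (assuming the Spark home directory; adjust the sizes to your machine):

./bin/spark-shell --driver-memory 2g --executor-memory 2g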

On Thu, Aug 7, 2014 at 5:07 PM, Burak Yavuz  wrote:
> Hi,
>
> Could you try running spark-shell with the flag --driver-memory 2g or more if 
> you have more RAM available and try again?
>
> Thanks,
> Burak

Re: KMeans Input Format

2014-08-07 Thread AlexanderRiggers
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 20:00:40 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/08/07 20:00:41 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 20:00:41 ERROR TaskSetManager: Task 0.0:0 failed 1 times; aborting job
Chairs-MacBook-Pro:spark-1.0.0 admin$ 







Re: KMeans Input Format

2014-08-07 Thread durin
Not all memory can be used for Java heap space, so maybe it does run out.
Could you try repartitioning the data? To my knowledge you shouldn't be
thrown out as long as a single partition fits into memory, even if the whole
dataset does not.

To do that, replace

val train = parsedData.cache()

with

val train = parsedData.repartition(20).cache()


Best regards,
Simon






Re: KMeans Input Format

2014-08-07 Thread Xiangrui Meng
Besides durin's suggestion, please also confirm driver and executor
memory in the WebUI, since they are small according to the log:

14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
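
As a quick sanity check from the shell itself, something along these lines should print the configured executor memory (the "512m" fallback here is just the assumed stock default if nothing was set):

scala> println(sc.getConf.get("spark.executor.memory", "512m"))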

-Xiangrui




Re: KMeans Input Format

2014-08-08 Thread AlexanderRiggers
Thanks for your answers. I added some lines to my code and it went through,
but now I get an error message for my computeCost call...

scala> val WSSSE = model.computeCost(train)
14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(, 192.168.0.33, 49242, 0) with no recent heart beats: 156207ms exceeds 45000ms
14/08/08 15:48:42 INFO BlockManager: BlockManager re-registering with master
14/08/08 15:48:42 INFO BlockManagerMaster: Trying to register BlockManager
14/08/08 15:48:42 INFO BlockManagerInfo: Registering block manager 192.168.0.33:49242 with 303.4 MB RAM
14/08/08 15:48:42 INFO BlockManagerMaster: Registered BlockManager
14/08/08 15:48:42 INFO BlockManager: Reporting 0 blocks to the master.

<console>:30: error: value computeCost is not a member of org.apache.spark.mllib.clustering.KMeans
       val WSSSE = model.computeCost(train)

computeCost should be a member of KMeans, shouldn't it?

My whole code is here:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf()
.setMaster("local")
.setAppName("Kmeans")
.set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)



import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/outkmeanssm.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val train = parsedData.repartition(20).cache() 

// Set model and run it
val model = new KMeans()
.setInitializationMode("k-means||")
.setK(2)
.setMaxIterations(2)
.setEpsilon(1e-4)
.setRuns(1)
.run(train)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = model.computeCost(train)






Re: KMeans Input Format

2014-08-08 Thread Sean Owen
(-incubator, +user)

It's a method of KMeansModel, not KMeans. At first glance it looks
like model should be a KMeansModel, but Scala says it's not. The
problem is...

val model = new KMeans()
.setInitializationMode("k-means||")
.setK(2)
.setMaxIterations(2)
.setEpsilon(1e-4)
.setRuns(1)
.run(train)

The first line is a complete statement on its own, so you end up
executing just 'val model = new KMeans()'.

I forget in which cases the Scala compiler will interpret it the way
you intend and when it won't, but, to avoid doubt, I put the periods
at the end of the preceding line. That way no line can be interpreted
as concluding the statement.
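
Concretely, the same call with the dots moved to the ends of the lines, so that no line reads as a complete statement on its own, would look like:

val model = new KMeans().
  setInitializationMode("k-means||").
  setK(2).
  setMaxIterations(2).
  setEpsilon(1e-4).
  setRuns(1).
  run(train)

val WSSSE = model.computeCost(train)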




Re: KMeans Input Format

2014-08-09 Thread AlexanderRiggers
Thank you for your help. After restructuring my code per Sean's suggestion, it
worked without changing the Spark context. I then took the same file format,
just a bigger file (2.7 GB) from S3, to my cluster with 4 c3.xlarge instances
and Spark 1.0.2. Unluckily, my task freezes again after a short time. I tried
it with both cached and uncached RDDs. Are there any configuration settings
needed for such big files and MLlib?

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.clustering.KMeansModel

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> 

scala> // Load and parse the data

scala> val data = sc.textFile("s3n://ampcamp-arigge/large_file.new.txt")
14/08/09 14:58:31 INFO storage.MemoryStore: ensureFreeSpace(35666) called
with curMem=0, maxMem=309225062
14/08/09 14:58:31 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 34.8 KB, free 294.9 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
:15

scala> val parsedData = data.map(s => Vectors.dense(s.split('
').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] =
MappedRDD[2] at map at :17

scala> val train = parsedData.repartition(20).cache() 
train: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] =
MappedRDD[6] at repartition at :19

scala> 

scala> // Set model and run it

scala> val model = new KMeans().
 | setInitializationMode("k-means||").
 | setK(2).setMaxIterations(2).
 | setEpsilon(1e-4).
 | setRuns(1).
 | run(parsedData)
14/08/09 14:58:33 WARN snappy.LoadSnappy: Snappy native library is available
14/08/09 14:58:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/09 14:58:33 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/09 14:58:34 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/09 14:58:34 INFO spark.SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 2 output partitions (allowLocal=false)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Missing parents: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[10] at map at KMeans.scala:123), which has no missing parents
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[10] at map at KMeans.scala:123)
14/08/09 14:58:34 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 5: ip-172-31-16-25.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 2215 bytes in 2 ms
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 4: ip-172-31-16-24.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 2215 bytes in 1 ms




