Re: KMeans Input Format

2014-08-09 Thread AlexanderRiggers
Thank you for your help. After restructuring my code according to Sean's input,
it worked without changing the Spark context. I then took the same file format,
just a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge instances
and Spark 1.0.2. Unluckily, my task freezes again after a short time. I tried it
with both cached and uncached RDDs. Are there any configurations to be made for
such big files and MLlib? (One possibility is sketched after the transcript below.)

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.clustering.KMeansModel

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala>

scala> // Load and parse the data

scala> val data = sc.textFile("s3n://ampcamp-arigge/large_file.new.txt")
14/08/09 14:58:31 INFO storage.MemoryStore: ensureFreeSpace(35666) called
with curMem=0, maxMem=309225062
14/08/09 14:58:31 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 34.8 KB, free 294.9 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
console:15

scala> val parsedData = data.map(s => Vectors.dense(s.split('
').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] =
MappedRDD[2] at map at console:17

scala> val train = parsedData.repartition(20).cache()
train: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] =
MappedRDD[6] at repartition at console:19

scala>

scala> // Set model and run it

scala> val model = new KMeans().
 | setInitializationMode("k-means||").
 | setK(2).setMaxIterations(2).
 | setEpsilon(1e-4).
 | setRuns(1).
 | run(parsedData)
14/08/09 14:58:33 WARN snappy.LoadSnappy: Snappy native library is available
14/08/09 14:58:33 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
14/08/09 14:58:33 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/09 14:58:34 INFO mapred.FileInputFormat: Total input paths to process
: 1
14/08/09 14:58:34 INFO spark.SparkContext: Starting job: takeSample at
KMeans.scala:260
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Got job 0 (takeSample at
KMeans.scala:260) with 2 output partitions (allowLocal=false)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Final stage: Stage
0(takeSample at KMeans.scala:260)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Parents of final stage:
List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Missing parents: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting Stage 0
(MappedRDD[10] at map at KMeans.scala:123), which has no missing parents
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting 2 missing tasks
from Stage 0 (MappedRDD[10] at map at KMeans.scala:123)
14/08/09 14:58:34 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with
2 tasks
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID
0 on executor 5: ip-172-31-16-25.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as
2215 bytes in 2 ms
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID
1 on executor 4: ip-172-31-16-24.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as
2215 bytes in 1 ms
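
One configuration sketch for the question above, hedged rather than definitive:
with a 2.7 GB input on four c3.xlarge nodes, the cached partitions may not all
fit in storage memory, so persisting with MEMORY_AND_DISK instead of cache()
and raising spark.executor.memory are the usual first things to try. A minimal
sketch, assuming the same parsedData RDD as in the transcript:

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
// instead of recomputing them from S3 on every KMeans iteration.
val train = parsedData
  .repartition(20)
  .persist(StorageLevel.MEMORY_AND_DISK)

// spark.executor.memory can be raised where the SparkConf is created,
// e.g. .set("spark.executor.memory", "4g") before the context is built.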





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11834.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: KMeans Input Format

2014-08-08 Thread AlexanderRiggers
Thanks for your answers. I added some lines to my code and it went through,
but I get an error message for my computeCost function now...

scala> val WSSSE = model.computeCost(train)
14/08/08 15:48:42 WARN
BlockManagerMasterActor: Removing BlockManager BlockManagerId(driver,
192.168.0.33, 49242, 0) with no recent heart beats: 156207ms exceeds 45000ms
14/08/08 15:48:42 INFO BlockManager: BlockManager re-registering with master
14/08/08 15:48:42 INFO BlockManagerMaster: Trying to register BlockManager
14/08/08 15:48:42 INFO BlockManagerInfo: Registering block manager
192.168.0.33:49242 with 303.4 MB RAM
14/08/08 15:48:42 INFO BlockManagerMaster: Registered BlockManager
14/08/08 15:48:42 INFO BlockManager: Reporting 0 blocks to the master.

console:30: error: value computeCost is not a member of
org.apache.spark.mllib.clustering.KMeans
   val WSSSE = model.computeCost(train)

computeCost should be a member of KMeans, shouldn't it?

My whole code is here:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf()
.setMaster("local")
.setAppName("Kmeans")
.set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)



import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/outkmeanssm.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val train = parsedData.repartition(20).cache() 

// Set model and run it
val model = new KMeans()
.setInitializationMode("k-means||")
.setK(2)
.setMaxIterations(2)
.setEpsilon(1e-4)
.setRuns(1)
.run(train)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = model.computeCost(train)
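
A hedged note on the computeCost error above: in MLlib, computeCost is defined
on KMeansModel, the object returned by KMeans.run(...), not on the KMeans
estimator itself. If the builder chain is split across separate REPL lines, the
model returned by run(train) lands in a throwaway resN value and the name
model stays bound to the KMeans instance, which would produce exactly this
error. A minimal sketch of the intended usage, assuming the train RDD above:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// run(...) returns a KMeansModel; computeCost lives on the model, not on KMeans.
val model: KMeansModel = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)

// Within Set Sum of Squared Errors over the training data.
val WSSSE = model.computeCost(train)
println("WSSSE = " + WSSSE)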



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-Input-Format-tp11654p11788.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: KMeans Input Format

2014-08-07 Thread AlexanderRiggers
Thanks for your answers. The dataset is only 400 MB, so I shouldn't run out of
memory. I have restructured my code now, because I forgot to cache my dataset,
and set the number of iterations down to 2, but I still get kicked out of
Spark. Did I cache the data wrong (sorry, not an expert)?

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala>

scala> // Load and parse the data

scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 19:59:10 INFO MemoryStore: ensureFreeSpace(35456) called with
curMem=0, maxMem=318111744
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
console:14

scala> val parsedData = data.map(s => Vectors.dense(s.split('
').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] =
MappedRDD[2] at map at console:16

scala> val train = parsedData.cache()
train: parsedData.type = MappedRDD[2] at map at console:16

scala>

scala> // Set model

scala> val model = new KMeans()
model: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setInitializationMode("k-means||")
res0: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setK(2)
res1: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setMaxIterations(2)
res2: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setEpsilon(1e-4)
res3: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setRuns(1)
res4: org.apache.spark.mllib.clustering.KMeans =
org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .run(train)
14/08/07 19:59:22 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/08/07 19:59:22 WARN LoadSnappy: Snappy native library not loaded
14/08/07 19:59:22 INFO FileInputFormat: Total input paths to process : 1
14/08/07 19:59:22 INFO SparkContext: Starting job: takeSample at
KMeans.scala:260
14/08/07 19:59:22 INFO DAGScheduler: Got job 0 (takeSample at
KMeans.scala:260) with 7 output partitions (allowLocal=false)
14/08/07 19:59:22 INFO DAGScheduler: Final stage: Stage 0(takeSample at
KMeans.scala:260)
14/08/07 19:59:22 INFO DAGScheduler: Parents of final stage: List()
14/08/07 19:59:22 INFO DAGScheduler: Missing parents: List()
14/08/07 19:59:22 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map
at KMeans.scala:123), which has no missing parents
14/08/07 19:59:22 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0
(MappedRDD[6] at map at KMeans.scala:123)
14/08/07 19:59:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:0 as 2224 bytes
in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:1 as 2224 bytes
in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:2 as 2224 bytes
in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:3 as 2224 bytes
in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:4 as 2224 bytes
in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:5 as 2224 bytes
in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on
executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:6 as 2224 bytes
in 0 ms
14/08/07 19:59:22 INFO Executor: Running task ID 1
14/08/07 19:59:22 INFO Executor: Running task ID 4
14/08/07 19:59:22 INFO Executor: Running task ID 3
14/08/07 19:59:22 INFO Executor: Running task ID 2
14/08/07 19:59:22 INFO Executor: Running task ID 0
14/08/07 19:59:22 INFO Executor: Running task ID 5
14/08/07 19:59:22 INFO Executor: Running task ID 6
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO BlockManager: 
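
A small, hedged aside on the caching question above: cache() is lazy, so
nothing is actually stored until the first action runs, and the first KMeans
pass therefore still reads the raw file. Forcing materialization first, as
sketched below with the parsedData RDD from the transcript, makes it easier to
check in the web UI whether the 400 MB dataset really fits in storage memory:

// cache() only marks the RDD for storage; no data is read or cached yet.
val train = parsedData.cache()

// count() is an action: it materializes and caches the full dataset
// before KMeans starts iterating over it.
val numRows = train.count()
println("cached " + numRows + " rows")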

GraphX Pagerank application

2014-08-06 Thread AlexanderRiggers
I want to use PageRank on a 3 GB text file, which contains a bipartite edge
list with the variables id and brand.

Example:
id,brand
86246,15343
86246,27873
86246,14647
86246,55172
86246,3293
86246,2820
86246,3830
86246,2820
86246,5603
86246,72482

To perform the PageRank I have to create a graph object, adding the edges
by setting sourceID=id and dstID=brand. In GraphLab there is a function: g =
SGraph().add_edges(data, src_field='id', dst_field='brand')

Is there something similar in GraphX? In the GraphX docs there is an example
where a separate edgelist and usernames are joined, but I couldn't find a
use case for my problem.
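
Not an authoritative answer, but a minimal GraphX sketch of one way this could
look, assuming the shell's SparkContext sc, a hypothetical input path
edges.csv with the id,brand header shown above, and assuming that id and brand
values never collide numerically (if they can overlap, one side would need an
offset so the two vertex sets stay disjoint):

import org.apache.spark.graphx.Graph

// Drop the header row and turn each "id,brand" line into a (srcId, dstId) pair.
val edgeTuples = sc.textFile("edges.csv")   // hypothetical path
  .filter(line => line != "id,brand")
  .map { line =>
    val Array(id, brand) = line.split(',')
    (id.toLong, brand.toLong)
  }

// Build the graph directly from the pair RDD; every vertex gets the default
// attribute 1, roughly analogous to GraphLab's add_edges call above.
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)

// Run PageRank until the per-vertex change falls below the tolerance.
val ranks = graph.pageRank(0.0001).vertices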






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-Pagerank-application-tp11562.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Terminal freeze during SVM

2014-07-16 Thread AlexanderRiggers
So I need to reconfigure my SparkContext this way:

val conf = new SparkConf()
 .setMaster("local")
 .setAppName("CountingSheep")
 .set("spark.executor.memory", "1g")
 .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)

And start a new cluster with the setup scripts from Spark 1.0.1. Is this the
right approach?
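
Two hedged remarks on the snippet above rather than a definitive answer:
setMaster("local") keeps all tasks on the driver machine, so a freshly
launched cluster would sit idle unless the master URL points at it, and
spark.akka.frameSize is interpreted in megabytes. A sketch of the cluster
variant (the master host below is a placeholder, not a real address):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://<master-host>:7077") // placeholder standalone master URL
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.akka.frameSize", "20")       // megabytes
val sc = new SparkContext(conf)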



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9941.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Terminal freeze during SVM

2014-07-10 Thread AlexanderRiggers
I tried the newest branch, but it still gets stuck on the same task: (kill)
runJob at SlidingRDD.scala:74





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9304.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Terminal freeze during SVM

2014-07-09 Thread AlexanderRiggers
By latest branch, do you mean Apache Spark 1.0.0? And what do you mean by
master? Because I am using v1.0.0. - Alex



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Hello!

I want to play around with several different cluster settings and measure
performance for MLlib and GraphX, and was wondering if anybody here could
point me to datasets for these applications of 5 GB and larger.

I am mostly interested in SVM and Triangle Count, but would be glad for any
help.

Best regards,
Alex



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Nick Pentreath wrote
 Take a look at Kaggle competition datasets
 - https://www.kaggle.com/competitions

I was looking for files in LIBSVM format and never found anything of a larger
size on Kaggle. Most competitions I've seen need data processing and feature
generation, but maybe I'll have to take a second look.


Nick Pentreath wrote
 For graph stuff the SNAP has large network
 data: https://snap.stanford.edu/data/

Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.