Re: KMeans Input Format
Thank you for your help. After restructuring my code to Sean's input, it worked without changing the SparkContext. I then took the same file format, just a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge instances and Spark 1.0.2. Unluckily, my task freezes again after a short time. I tried it with cached and uncached RDDs. Are there some configurations to be made for such big files and MLlib?

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.clustering.KMeansModel

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> // Load and parse the data
scala> val data = sc.textFile("s3n://ampcamp-arigge/large_file.new.txt")
14/08/09 14:58:31 INFO storage.MemoryStore: ensureFreeSpace(35666) called with curMem=0, maxMem=309225062
14/08/09 14:58:31 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.8 KB, free 294.9 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:15

scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:17

scala> val train = parsedData.repartition(20).cache()
train: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[6] at repartition at <console>:19

scala> // Set model and run it
scala> val model = new KMeans().
     |   setInitializationMode("k-means||").
     |   setK(2).setMaxIterations(2).
     |   setEpsilon(1e-4).
     |   setRuns(1).
     |   run(parsedData)
14/08/09 14:58:33 WARN snappy.LoadSnappy: Snappy native library is available
14/08/09 14:58:33 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/08/09 14:58:33 INFO snappy.LoadSnappy: Snappy native library loaded
14/08/09 14:58:34 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/09 14:58:34 INFO spark.SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 2 output partitions (allowLocal=false)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Final stage: Stage 0 (takeSample at KMeans.scala:260)
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Missing parents: List()
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[10] at map at KMeans.scala:123), which has no missing parents
14/08/09 14:58:34 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[10] at map at KMeans.scala:123)
14/08/09 14:58:34 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor 5: ip-172-31-16-25.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 2215 bytes in 2 ms
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor 4: ip-172-31-16-24.ec2.internal (PROCESS_LOCAL)
14/08/09 14:58:34 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 2215 bytes in 1 ms
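[Editor's note] One detail worth flagging in the transcript above, independent of the freeze: the repartitioned and cached train RDD is created but run(parsedData) then trains on the uncached original, so every k-means pass re-reads and re-parses the 2.7 GB file from S3. A minimal sketch of the presumably intended call, using the same settings as in the post (this alone may not fix the freeze):

// Train on the repartitioned, cached RDD so iterations read from
// memory instead of re-reading the file from S3 on each pass.
val model = new KMeans().
  setInitializationMode("k-means||").
  setK(2).setMaxIterations(2).
  setEpsilon(1e-4).
  setRuns(1).
  run(train)  // note: `train`, not `parsedData`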
Re: KMeans Input Format
Thanks for your answers. I added some lines to my code and it went through, but now I get an error message for my computeCost call...

scala> val WSSSE = model.computeCost(train)
14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(driver, 192.168.0.33, 49242, 0) with no recent heart beats: 156207ms exceeds 45000ms
14/08/08 15:48:42 INFO BlockManager: BlockManager re-registering with master
14/08/08 15:48:42 INFO BlockManagerMaster: Trying to register BlockManager
14/08/08 15:48:42 INFO BlockManagerInfo: Registering block manager 192.168.0.33:49242 with 303.4 MB RAM
14/08/08 15:48:42 INFO BlockManagerMaster: Registered BlockManager
14/08/08 15:48:42 INFO BlockManager: Reporting 0 blocks to the master.
<console>:30: error: value computeCost is not a member of org.apache.spark.mllib.clustering.KMeans
       val WSSSE = model.computeCost(train)

computeCost should be a member of KMeansModel, shouldn't it? My whole code is here:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("Kmeans")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/outkmeanssm.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val train = parsedData.repartition(20).cache()

// Set model and run it
val model = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = model.computeCost(train)
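[Editor's note] For context, a minimal sketch of the type flow involved, assuming Spark 1.0 MLlib (the helper name trainAndScore is made up for illustration): computeCost is defined on KMeansModel, which run() returns, not on the KMeans estimator itself. The compiler error above says model is still a KMeans, which suggests model was never bound to the result of run() in that REPL session (for instance, because the line defining it failed after the heartbeat timeout).

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainAndScore(train: RDD[Vector]): Double = {
  // KMeans itself has no computeCost; it is only the estimator.
  val estimator = new KMeans()
    .setInitializationMode("k-means||")
    .setK(2)
    .setMaxIterations(2)

  // run() returns a KMeansModel, and computeCost lives on the model.
  val model: KMeansModel = estimator.run(train)
  model.computeCost(train)
}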
Re: KMeans Input Format
Thanks for your answers. The dataset is only 400 MB, so I shouldn't run out of memory. I restructured my code, because I had forgotten to cache my dataset, and set the number of iterations down to 2, but I still get kicked out of Spark. Did I cache the data wrong (sorry, not an expert)?

scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors

scala> // Load and parse the data
scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 19:59:10 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14

scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:16

scala> val train = parsedData.cache()
train: parsedData.type = MappedRDD[2] at map at <console>:16

scala> // Set model
scala> val model = new KMeans()
model: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setInitializationMode("k-means||")
res0: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setK(2)
res1: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setMaxIterations(2)
res2: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setEpsilon(1e-4)
res3: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .setRuns(1)
res4: org.apache.spark.mllib.clustering.KMeans = org.apache.spark.mllib.clustering.KMeans@4c5fa12d

scala> .run(train)
14/08/07 19:59:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/07 19:59:22 WARN LoadSnappy: Snappy native library not loaded
14/08/07 19:59:22 INFO FileInputFormat: Total input paths to process : 1
14/08/07 19:59:22 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/07 19:59:22 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
14/08/07 19:59:22 INFO DAGScheduler: Final stage: Stage 0 (takeSample at KMeans.scala:260)
14/08/07 19:59:22 INFO DAGScheduler: Parents of final stage: List()
14/08/07 19:59:22 INFO DAGScheduler: Missing parents: List()
14/08/07 19:59:22 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
14/08/07 19:59:22 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
14/08/07 19:59:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:0 as 2224 bytes in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:1 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:2 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:3 as 2224 bytes in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:4 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:5 as 2224 bytes in 1 ms
14/08/07 19:59:22 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 19:59:22 INFO TaskSetManager: Serialized task 0.0:6 as 2224 bytes in 0 ms
14/08/07 19:59:22 INFO Executor: Running task ID 1
14/08/07 19:59:22 INFO Executor: Running task ID 4
14/08/07 19:59:22 INFO Executor: Running task ID 3
14/08/07 19:59:22 INFO Executor: Running task ID 2
14/08/07 19:59:22 INFO Executor: Running task ID 0
14/08/07 19:59:22 INFO Executor: Running task ID 5
14/08/07 19:59:22 INFO Executor: Running task ID 6
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO BlockManager: Found block broadcast_0 locally
14/08/07 19:59:22 INFO BlockManager:
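[Editor's note] One detail visible in the transcript above: because each setter was entered on its own REPL line, the setters still take effect (they mutate and return the same KMeans instance, as the identical @4c5fa12d in each resN shows), but the KMeansModel returned by .run(train) ends up bound to an anonymous resN rather than to model. A minimal sketch of the same configuration written as a single expression, so the trained model gets a name, assuming Spark 1.0 MLlib:

import org.apache.spark.mllib.clustering.KMeans

// Same settings as in the transcript, as one expression so the
// KMeansModel returned by run() is captured in a named val.
val model = new KMeans()
  .setInitializationMode("k-means||")
  .setK(2)
  .setMaxIterations(2)
  .setEpsilon(1e-4)
  .setRuns(1)
  .run(train)  // `train` is the cached RDD[Vector] defined above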
GraphX PageRank application
I want to use PageRank on a 3 GB text file, which contains a bipartite edge list with the variables id and brand. Example:

id,brand
86246,15343
86246,27873
86246,14647
86246,55172
86246,3293
86246,2820
86246,3830
86246,2820
86246,5603
86246,72482

To perform the PageRank I have to create a graph object, adding the edges by setting sourceId=id and dstId=brand. In GraphLab there is a function:

g = SGraph().add_edges(data, src_field='id', dst_field='brand')

Is there something similar in GraphX? In the GraphX docs there is an example where a separate edge list and usernames are joined, but I couldn't find a use case for my problem.
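[Editor's note] For what it's worth, a minimal sketch of one way to do this in GraphX, assuming the CSV format shown above (the file path and the convergence tolerance are made-up placeholders): parse each line into a (srcId, dstId) pair and build the graph with Graph.fromEdgeTuples, then call pageRank.

import org.apache.spark.graphx.Graph

// Hypothetical path; skip the "id,brand" header row and parse each
// line into a (sourceId, destinationId) tuple of Longs.
val lines = sc.textFile("data/id_brand.csv")
val edgeTuples = lines
  .filter(line => !line.startsWith("id"))
  .map { line =>
    val fields = line.split(',')
    (fields(0).toLong, fields(1).toLong)
  }

// Build a graph directly from (srcId, dstId) pairs; every vertex
// gets the default attribute 1.
val graph = Graph.fromEdgeTuples(edgeTuples, defaultValue = 1)

// Run PageRank until the per-vertex change drops below the tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.take(10).foreach(println)

One caveat: ids and brands share a single vertex-ID space here, so a brand numerically equal to a user id would collapse into one vertex; offsetting one side (e.g. adding a large constant to brand ids) keeps the bipartite structure intact.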
Re: Terminal freeze during SVM
So I need to reconfigure my SparkContext this way:

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)

And start a new cluster with the setup scripts from Spark 1.0.1. Is this the right approach?
Re: Terminal freeze during SVM
Tried the newest branch, but still get stuck on the same task: (kill) runJob at SlidingRDD.scala:74
Re: Terminal freeze during SVM
By latest branch, do you mean Apache Spark 1.0.0? And what do you mean by master? Because I am using v1.0.0.

- Alex
Sample datasets for MLlib and GraphX
Hello! I want to play around with several different cluster settings and measure performance for MLlib and GraphX, and was wondering if anybody here could hit me up with datasets for these applications from 5 GB onwards? I am mostly interested in SVM and triangle count, but would be glad for any help.

Best regards,
Alex
Re: Sample datasets for MLlib and GraphX
Nick Pentreath wrote:
> Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions

I was looking for files in LIBSVM format and never found anything of a bigger size on Kaggle. Most competitions I've seen need data processing and feature generation, but maybe I have to take a second look.

Nick Pentreath wrote:
> For graph stuff the SNAP has large network data: https://snap.stanford.edu/data/

Thanks