Hello!
I want to play around with several different cluster settings and measure
performance for MLlib and GraphX, and was wondering if anybody here could
hit me up with datasets for these applications from 5 GB onwards?
I'm mostly interested in SVM and Triangle Count, but would be glad for any suggestions.
Nick Pentreath wrote:
Take a look at Kaggle competition datasets
- https://www.kaggle.com/competitions
I was looking for files in LIBSVM format and never found anything of a
bigger size on Kaggle. Most competitions I've seen need data processing and
feature generation, but maybe I'll have to take a closer look.
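For anyone converting a Kaggle CSV to LIBSVM by hand, the target line format itself is simple. Here is a generic sketch, not tied to any particular competition; the helper name is mine:

```scala
// LIBSVM text format: one example per line,
//   <label> <index1>:<value1> <index2>:<value2> ...
// Indices are 1-based, and zero-valued features are conventionally omitted.
def toLibSvmLine(label: Double, features: Array[Double]): String = {
  val parts = features.zipWithIndex.collect {
    case (v, i) if v != 0.0 => s"${i + 1}:$v"
  }
  (label.toString +: parts).mkString(" ")
}

val line = toLibSvmLine(1.0, Array(0.0, 2.5, 0.0, 3.0))
// -> "1.0 2:2.5 4:3.0"
```

Mapping each row of a Kaggle CSV through a function like this (after whatever feature generation the competition needs) yields a file MLlib's LIBSVM loader can read.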
By "latest branch" do you mean Apache Spark 1.0.0? And what do you mean by
master? Because I am using v1.0.0. - Alex
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Tried the newest branch, but it still gets stuck on the same task: (kill) runJob
at SlidingRDD.scala:74
so I need to reconfigure my SparkContext this way:

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
  .set("spark.akka.frameSize", "20")
val sc = new SparkContext(conf)
And start a new cluster
I want to run PageRank on a 3 GB text file, which contains a bipartite edge
list with the variables id and brand.
Example:
id,brand
86246,15343
86246,27873
86246,14647
86246,55172
86246,3293
86246,2820
86246,3830
86246,2820
86246,5603
86246,72482
To perform the PageRank I have to create a graph object,
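For reference, what PageRank computes on such a pair list can be sketched in plain Scala (GraphX's Graph.fromEdgeTuples plus graph.pageRank does the same thing distributed). The sample pairs are from the post above; everything else is illustrative. One thing to watch with a bipartite list: id and brand share one vertex-ID space in GraphX, so overlapping numeric ranges would merge distinct nodes.

```scala
// A few (id, brand) pairs from the sample above, read as directed edges id -> brand.
val edges = Seq((86246L, 15343L), (86246L, 27873L), (86246L, 14647L))

val nodes  = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
val outDeg = edges.groupBy(_._1).map { case (src, es) => src -> es.size }

// Classic damped PageRank, the same formulation GraphX uses:
// rank(v) = 0.15 + 0.85 * (sum of rank(u)/outDeg(u) over in-neighbours u).
var ranks = nodes.map(_ -> 1.0).toMap
for (_ <- 1 to 10) {
  val contribs = edges.groupBy(_._2).map { case (dst, es) =>
    dst -> es.map { case (src, _) => ranks(src) / outDeg(src) }.sum
  }
  ranks = nodes.map(n => n -> (0.15 + 0.85 * contribs.getOrElse(n, 0.0))).toMap
}
// Vertices with no in-links settle at 0.15; each brand here converges near 0.1925.
```

On the real 3 GB file the same edges would come from sc.textFile plus a split on the comma, feeding Graph.fromEdgeTuples.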
Thanks for your answers. The dataset is only 400 MB, so I shouldn't run out of
memory. I restructured my code now, because I forgot to cache my dataset, and
set the number of iterations down to 2, but still get kicked out of Spark. Did
I cache the data wrong (sorry, not an expert)?

scala> import
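For what it's worth, caching usually just means calling .cache() on the training RDD before the iterative algorithm first touches it. A sketch against the MLlib API, with a made-up path and parameters; I'm assuming KMeans here since computeCost comes up below:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.util.MLUtils

// Hypothetical input path, shown only to illustrate where cache() goes.
val data  = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
val train = data.map(_.features).cache()  // cache BEFORE training

// Iterative MLlib algorithms re-read the input every iteration,
// so caching avoids re-parsing the file each time.
val model = KMeans.train(train, 10, 2)    // k = 10, maxIterations = 2
val wssse = model.computeCost(train)
```

If the RDD is cached after training starts, or a different (uncached) RDD is passed in, each iteration goes back to the source file.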
Thanks for your answers. I added some lines to my code and it went through,
but I get an error message for my computeCost function now...

scala> val WSSSE = model.computeCost(train)
14/08/08 15:48:42 WARN BlockManagerMasterActor: Removing BlockManager
BlockManagerId(driver, 192.168.0.33, 49242, 0)
Thank you for your help. After restructuring my code according to Sean's
input, it worked without changing the Spark context. I now took the same file
format, just a bigger file (2.7 GB), from S3 to my cluster with 4 c3.xlarge
instances and Spark 1.0.2. Unluckily my task freezes again after a short
time. I tried it