Sample datasets for MLlib and Graphx
Hello! I want to play around with several different cluster settings and measure the performance of MLlib and GraphX, and was wondering if anybody here could point me to datasets for these applications from 5GB onwards? I am mostly interested in SVM and Triangle Count, but would be glad for any help. Best regards, Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Sample datasets for MLlib and Graphx
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions For SVM there are a couple of ad click prediction datasets of pretty large size. For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ (In reply to AlexanderRiggers, Thu, Jul 3, 2014 at 3:25 PM)
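For anyone trying the SNAP route for Triangle Count: the SNAP downloads are typically plain-text edge lists, with `#` comment lines at the top and one whitespace-separated pair of node IDs per line. Before pushing a multi-GB file through a cluster, it can help to check a small slice locally so there is a known triangle count to compare against. The sketch below is plain Python, not GraphX; the function names and the tiny sample graph are made up for illustration, and it assumes the edges should be treated as undirected with self-loops ignored.

```python
from itertools import combinations


def parse_edge_list(text):
    """Build an undirected adjacency map from SNAP-style edge-list text.

    Assumes '#' starts a comment line and each data line holds two
    whitespace-separated integer node IDs.
    """
    adjacency = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = (int(token) for token in line.split()[:2])
        if src == dst:
            continue  # ignore self-loops, they cannot form a triangle
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency


def count_triangles(adjacency):
    """Count triangles, attributing each one to its smallest node ID."""
    total = 0
    for u, neighbors in adjacency.items():
        for v, w in combinations(sorted(neighbors), 2):
            # sorted() guarantees v < w; requiring u < v counts each triangle once
            if u < v and w in adjacency[v]:
                total += 1
    return total


# Hypothetical sample in the usual SNAP layout (not taken from a real dataset).
sample = "# FromNodeId\tToNodeId\n1\t2\n2\t3\n3\t1\n3\t4\n4\t1\n"
graph = parse_edge_list(sample)
triangles = count_triangles(graph)
```

This only scales to graphs that fit in memory on one machine, so it is a sanity check for a sample, not a replacement for the distributed job.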
Re: Sample datasets for MLlib and Graphx
Nick Pentreath wrote:
> Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions

I was looking for files in LIBSVM format and never found anything of a bigger size on Kaggle. Most competitions I've seen need data processing and feature generation, but maybe I have to take a second look.

Nick Pentreath wrote:
> For graph stuff the SNAP has large network data: https://snap.stanford.edu/data/

Thanks
Re: Sample datasets for MLlib and Graphx
The Kaggle data is not in LIBSVM format, so you'd have to do some transformation. The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad prediction data is around 2-3GB compressed, I think. To my knowledge these are the largest binary classification datasets I've come across that are easily publicly available (very happy to be proved wrong about this though :) (In reply to AlexanderRiggers, Thu, Jul 3, 2014 at 4:39 PM)
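To give an idea of the transformation mentioned above, here is a minimal sketch of turning CSV rows into LIBSVM lines (`<label> <index>:<value> ...`, with 1-based ascending indices and zero values left out). It assumes every column is already numeric and that the label sits in the first column; the helper names and the three-feature sample are hypothetical. Real ad click data usually has categorical fields that would need encoding or hashing first, which this does not attempt.

```python
import csv
import io


def row_to_libsvm(label, features):
    """Format one example as a LIBSVM line, skipping zero-valued features."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):  # LIBSVM indices are 1-based
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def csv_to_libsvm(csv_text, label_column=0):
    """Convert CSV text (header row, numeric columns) into LIBSVM lines.

    Assumption: empty cells are treated as 0 and therefore omitted.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    lines = []
    for row in reader:
        values = [float(cell) if cell else 0.0 for cell in row]
        label = int(values.pop(label_column))
        lines.append(row_to_libsvm(label, values))
    return lines


# Hypothetical sample: a click label followed by three numeric features.
sample = "click,f1,f2,f3\n1,0.5,0,3\n0,0,2,0\n"
libsvm_lines = csv_to_libsvm(sample)
```

Files written this way should be in the shape that MLlib's `MLUtils.loadLibSVMFile` expects, though for a multi-GB input the same per-row logic would more sensibly run inside a Spark job than on a single machine.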