The Kaggle data is not in LIBSVM format, so you'd have to do some transformation yourself.
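For what it's worth, here is a rough sketch of that kind of transformation in Python. It assumes a purely numeric CSV with no header row and the label in the first column, which most Kaggle files will not match without some cleanup first. The function name `csv_to_libsvm` and the `label_col` parameter are made up for illustration. The output uses the usual LIBSVM layout of `label index:value ...` with 1-based, ascending indices and zero values left out.

```python
import csv
import io


def csv_to_libsvm(rows, label_col=0):
    """Yield one LIBSVM-formatted line per CSV row.

    Assumes every column is numeric and that `label_col` holds the label.
    Feature indices are 1-based and ascending; zero-valued features are skipped.
    """
    for row in rows:
        label = row[label_col]
        # Everything except the label column is treated as a feature.
        features = [v for i, v in enumerate(row) if i != label_col]
        pairs = []
        for idx, value in enumerate(features, start=1):
            x = float(value)
            if x != 0.0:
                pairs.append(f"{idx}:{x:g}")
        yield " ".join([label] + pairs)


# Tiny in-memory example; a real file would be opened with open(path, newline="").
sample = "1,0.5,0,3\n0,0,2.0,0\n"
for line in csv_to_libsvm(csv.reader(io.StringIO(sample))):
    print(line)
```

Categorical columns would need to be encoded (one-hot or hashed, for example) before a step like this, which is the feature generation work mentioned in the quoted message below.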
The Criteo and KDD Cup datasets are, if I recall correctly, fairly large. The Criteo ad prediction data is around 2-3 GB compressed, I think. To my knowledge these are the largest easily available public binary classification datasets I've come across (very happy to be proved wrong about this though :)

On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers <alexander.rigg...@gmail.com> wrote:

> Nick Pentreath wrote
>> Take a look at Kaggle competition datasets
>> - https://www.kaggle.com/competitions
>
> I was looking for files in LIBSVM format and never found something on Kaggle
> in bigger size. Most competitions I've seen need data processing and feature
> generating, but maybe I've to take a second look.
>
> Nick Pentreath wrote
>> For graph stuff the SNAP has large network
>> data: https://snap.stanford.edu/data/
>
> Thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.