Dear Spark users,

I would like to draw your attention to a dataset that we recently released, which is, as of now, the largest machine learning dataset ever released. See the following blog announcements:

- http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
- http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx

The characteristics of this dataset are:

- 1 TB of data
- binary classification
- 13 integer features
- 26 categorical features, some of them taking millions of values
- 4 billion rows

Hopefully this dataset will be useful to assess and push further the scalability of Spark and MLlib.

Cheers,
Olivier