Re: Dataset announcement
Greetings!

How about medical datasets, specifically longitudinal vital signs? Can people send good pointers?

Thanks in advance,
-- ttfn
Simon Edelhaus
California 2015

On Wed, Apr 15, 2015 at 6:01 PM, Matei Zaharia wrote:
> Very neat, Olivier; thanks for sharing this.
>
> Matei
Re: Dataset announcement
Thanks, Olivier. Good work. Interesting in more than one way, including training, benchmarking, and testing new releases.

One quick question: do you plan to make it available as an S3 bucket?

Cheers

On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle wrote:
> Dear Spark users,
>
> I would like to draw your attention to a dataset that we recently released, [...]
Re: Dataset announcement
Very neat, Olivier; thanks for sharing this.

Matei

> On Apr 15, 2015, at 5:58 PM, Olivier Chapelle wrote:
>
> Dear Spark users,
>
> I would like to draw your attention to a dataset that we recently released, [...]
Dataset announcement
Dear Spark users,

I would like to draw your attention to a dataset that we recently released, which is as of now the largest machine learning dataset ever released; see the following blog announcements:
- http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
- http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx

The characteristics of this dataset are:
- 1 TB of data
- binary classification
- 13 integer features
- 26 categorical features, some of them taking millions of values
- 4B rows

Hopefully this dataset will be useful to assess and push further the scalability of Spark and MLlib.

Cheers,
Olivier

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-tp22507.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
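For anyone who wants to experiment with the characteristics listed in the announcement (one binary label, 13 integer features, 26 categorical features with up to millions of distinct values), here is a minimal sketch of how a single record might be parsed and its categorical values hashed into a fixed index space. Several things here are assumptions not stated in the announcement: that fields are tab-separated, that the label comes first, that missing values appear as empty fields, and the names `parse_record` and `NUM_BUCKETS`, which are made up for illustration. The hashing step is the generic hashing trick, not necessarily what Criteo or MLlib uses.

```python
import zlib

NUM_INT = 13            # integer features, per the announcement
NUM_CAT = 26            # categorical features, per the announcement
NUM_BUCKETS = 2 ** 20   # hypothetical hash-space size, picked only for illustration


def parse_record(line, num_buckets=NUM_BUCKETS):
    """Return (label, integer features, hashed categorical indices).

    ASSUMPTION: tab-separated fields, label first, empty field = missing value.
    The announcement does not specify the file layout.
    """
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError("expected %d fields, got %d" % (expected, len(fields)))

    label = int(fields[0])

    # Missing integer features become None instead of a made-up default.
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT]]

    cats = []
    for col, value in enumerate(fields[1 + NUM_INT:]):
        if not value:
            cats.append(None)
            continue
        # Include the column index in the key so that the same string in two
        # different columns is hashed as two different features.
        key = ("%d=%s" % (col, value)).encode("utf-8")
        cats.append(zlib.crc32(key) % num_buckets)

    return label, ints, cats
```

Hashing keeps the feature space bounded at `NUM_BUCKETS` regardless of how many distinct categorical values appear, at the cost of occasional collisions; a larger bucket count reduces them.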