Re: Dataset announcement

2015-04-15 Thread Simon Edelhaus
Greetings!

How about medical data sets, specifically longitudinal vital signs?

Can people send good pointers?

Thanks in advance,


-- ttfn
Simon Edelhaus
California 2015

On Wed, Apr 15, 2015 at 6:01 PM, Matei Zaharia wrote:

> Very neat, Olivier; thanks for sharing this.
>
> Matei
>
> > On Apr 15, 2015, at 5:58 PM, Olivier Chapelle wrote:
> >
> > Dear Spark users,
> >
> > I would like to draw your attention to a dataset that we recently released,
> > which is as of now the largest machine learning dataset ever released; see
> > the following blog announcements:
> > - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
> > - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx
> >
> > The characteristics of this dataset are:
> > - 1 TB of data
> > - binary classification
> > - 13 integer features
> > - 26 categorical features, some of them taking millions of values.
> > - 4B rows
> >
> > Hopefully this dataset will be useful to assess and push further the
> > scalability of Spark and MLlib.
> >
> > Cheers,
> > Olivier
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-tp22507.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >


Re: Dataset announcement

2015-04-15 Thread Krishna Sankar
Thanks Olivier. Good work.
Interesting in more ways than one: training, benchmarking, testing new
releases, etc.
One quick question: do you plan to make it available as an S3 bucket?

Cheers


On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle wrote:

> Dear Spark users,
>
> I would like to draw your attention to a dataset that we recently released,
> which is as of now the largest machine learning dataset ever released; see
> the following blog announcements:
>  - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
>  - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx
>
> The characteristics of this dataset are:
>  - 1 TB of data
>  - binary classification
>  - 13 integer features
>  - 26 categorical features, some of them taking millions of values.
>  - 4B rows
>
> Hopefully this dataset will be useful to assess and push further the
> scalability of Spark and MLlib.
>
> Cheers,
> Olivier


Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this.

Matei

> On Apr 15, 2015, at 5:58 PM, Olivier Chapelle wrote:
> 
> Dear Spark users,
> 
> I would like to draw your attention to a dataset that we recently released,
> which is as of now the largest machine learning dataset ever released; see
> the following blog announcements:
> - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
> - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx
> 
> The characteristics of this dataset are:
> - 1 TB of data
> - binary classification
> - 13 integer features
> - 26 categorical features, some of them taking millions of values.
> - 4B rows
> 
> Hopefully this dataset will be useful to assess and push further the
> scalability of Spark and MLlib.
> 
> Cheers,
> Olivier



Dataset announcement

2015-04-15 Thread Olivier Chapelle
Dear Spark users,

I would like to draw your attention to a dataset that we recently released,
which is as of now the largest machine learning dataset ever released; see
the following blog announcements:
 - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
 - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx

The characteristics of this dataset are:
 - 1 TB of data
 - binary classification
 - 13 integer features
 - 26 categorical features, some of them taking millions of values.
 - 4B rows

Hopefully this dataset will be useful to assess and push further the
scalability of Spark and MLlib.

Cheers,
Olivier


