I think the PII impact of releasing a dataset with only numerical feature
vectors is extremely low.

The privacy impact is greater, but having the original query would be
useful for folks wanting to create their own query level features & query
dependent features. You do have a great set of features listed
<https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
action, and release what's possible currently, letting folks play with the
dataset.

I'd recommend having a groupId which is unique for each instance of a user
running a query. This groups together all of the results in a viewed SERP,
and allows the ranking function to worry only about rank order instead of
absolute scoring; i.e., the scoring only matters relative to the other
viewed documents.
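A minimal sketch of what that grouping could look like; the field names and
input shape here are assumptions for illustration, not the actual dataset
schema:

```python
import uuid
from itertools import groupby

# Hypothetical input: each entry is one SERP impression (a user running a
# query once and viewing a list of results).
serp_impressions = [
    {"user": "u1", "query": "apple", "results": ["d1", "d2", "d3"]},
    {"user": "u2", "query": "apple", "results": ["d2", "d4"]},
]

rows = []
for imp in serp_impressions:
    group_id = uuid.uuid4().hex  # unique per (user, query) instance
    for rank, doc in enumerate(imp["results"]):
        rows.append({"groupId": group_id, "doc": doc, "rank": rank})

# Ranking libraries typically want rows sorted so groups are contiguous,
# plus a vector of group sizes.
group_sizes = [len(list(g)) for _, g in groupby(rows, key=lambda r: r["groupId"])]
print(group_sizes)  # [3, 2]
```

Note that two different users issuing the same query text get two different
groupIds, since each viewed SERP is its own ranking instance.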

I'd try out LightGBM & XGBoost in their ranking modes for creating a model.
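Both libraries' ranking modes optimize a per-group metric such as NDCG,
which is exactly the "rank order only" property above. A toy NDCG
computation (pure Python, purely illustrative) shows that the metric
depends only on the ordering of relevances within a group:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: graded relevance, discounted by position.
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A group already in best-first order scores 1.0; worse orderings score less.
print(ndcg([3, 2, 0]))  # 1.0
print(ndcg([0, 2, 3]) < 1.0)  # True
```

In LightGBM the ranking mode is `LGBMRanker` (lambdarank objective) with a
`group=` array of group sizes; in XGBoost it's the `rank:ndcg` objective
with `DMatrix.set_group`.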

--justin

On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <
[email protected]> wrote:

> gh it with 100 normalized queries to get a count, and there are 4852
> features. Lots of them are probably useless, but choosing which ones is
> probably half the battle. These are ~230MB in pickle format, which stores
> the floats in binary. This can then be compressed to ~20MB with gzip, so
> the data size isn't particularly insane. In a released dataset i would
> probably use 10k normalized queries, meaning about 100x this size. Could
> plausibly release as CSVs instead of pickled numpy arrays. That will
> probably increase the data size further,
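To illustrate the size figures Erik quotes, here is a quick stdlib-only
sketch; a plain list of floats stands in for the pickled numpy arrays, and
the sparsity level is an assumption (sparse feature vectors are what lets
~230MB of binary floats gzip down to ~20MB):

```python
import gzip
import pickle
import random

# Toy stand-in for the feature matrix: 100 rows of 4852 mostly-zero floats.
random.seed(0)
rows = [[random.random() if random.random() < 0.05 else 0.0
         for _ in range(4852)]
        for _ in range(100)]

raw = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)  # binary floats
packed = gzip.compress(raw)

# The repeated zeros compress away, so the gzipped form is much smaller.
print(len(packed) < len(raw))  # True
```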
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
