Quite understandable. It's also possible to augment the dataset with some
percentage (perhaps ~5%) of the rows including the original query, human
reviewed and PII-safe.

On the PII topic, one missing feature is user geolocation. This will help
disambiguate user intent for queries that are geolocal. For instance, [civic
center <https://en.wikipedia.org/w/index.php?search=civic+center>]
(location search), [john marks
<https://en.wikipedia.org/w/index.php?search=john+marks>] (people query),
or [air marshal <https://en.wikipedia.org/w/index.php?search=air+marshal>]
(different meanings in the US vs. UK). Reducing the lat/lng to the
metropolitan area, or even to the state level, may mitigate the PII impact.
You can likely see examples of Google/Bing/DDG doing geo-based ranking by using a VPN and
running [xyz site:wikipedia.org] queries.
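To make the PII mitigation concrete, here's a minimal sketch of coarsening
coordinates before using them as features. The decimal thresholds are
approximations (1 decimal degree is ~11 km, roughly metro scale; 0 decimals
is ~111 km, closer to state scale), and the function name is made up:

```python
def coarsen_location(lat, lng, decimals=1):
    """Round lat/lng to a coarse grid cell to reduce PII risk.

    decimals=1 keeps roughly metro-area resolution (~11 km);
    decimals=0 is closer to state-level resolution (~111 km).
    """
    return (round(lat, decimals), round(lng, decimals))

# San Francisco Civic Center, coarsened to a metro-level cell
print(coarsen_location(37.7793, -122.4193))     # (37.8, -122.4)
print(coarsen_location(37.7793, -122.4193, 0))  # (38.0, -122.0)
```

A real pipeline would more likely map the cell to a named region (metro area
or state) via reverse geocoding, but a grid cell already avoids releasing
precise coordinates.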

Another feature I'd like to try: one-hot encoding of the top 1-5k page
categories. That is, create N binary columns (one for each of the top
categories across enwiki) in the dataset, where each column is 1/0 depending
on whether the page for that training row is in that column's category. This
would help uprank certain types of pages, and could usefully interact with
the word embedding (word2vec) you're using.
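As a quick illustration of the column layout described above (the category
names here are placeholders, not the actual top enwiki categories):

```python
# One binary column per top-N category; a row gets 1 for each top
# category its page belongs to, 0 otherwise. Placeholder category list.
TOP_CATEGORIES = ["Living people", "American films", "Villages in Poland"]

def one_hot_categories(page_categories, top_categories=TOP_CATEGORIES):
    """Return a 0/1 feature vector over the top categories for one page."""
    cats = set(page_categories)
    return [1 if c in cats else 0 for c in top_categories]

print(one_hot_categories(["Living people", "1960 births"]))
# -> [1, 0, 0]
```

At 1-5k columns the vectors would be very sparse, so a sparse matrix
representation would keep the released dataset small.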

--justin


On Thu, Jan 5, 2017 at 12:48 PM, Trey Jones <[email protected]> wrote:

> The privacy impact is greater, but having the original query would be
>> useful for folks wanting to create their own query level features & query
>> dependent features. You do have a great set of features listed
>> <https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
>> action, and release what's possible currently, letting folks play with the
>> dataset.
>
>
> Right now the standard is that all queries that are released must be
> reviewed by humans. A query data dump had to be retracted in the past for
> containing PII, so I don't see us getting around that (nor would I want to,
> really, having seen the kind of info that can be in there).
>
> We did the manual review for the Discernatron query data, but it's not
> scalable for the size of dataset needed to do machine learning. However, if
> anyone has any good ideas for features, please let us know, and maybe we
> can generate those features and share them, too, time permitting.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Fri, Dec 30, 2016 at 2:28 PM, Justin Ormont <[email protected]>
> wrote:
>
>> I think the PII impact in releasing a dataset w/ only numerical feature
>> vectors is extremely low.
>>
>> The privacy impact is greater, but having the original query would be
>> useful for folks wanting to create their own query level features & query
>> dependent features. You do have a great set of features listed
>> <https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
>> action, and release what's possible currently, letting folks play with the
>> dataset.
>>
>> I'd recommend having a groupId which is uniq for each instance of a user
>> running a query. This is used to group together all of the results in a
>> viewed SERP, and allows the ranking function to worry only about rank order
>> instead of absolute scoring; aka, the scoring only matters relative to the
>> other viewed documents.
>>
>> I'd try out LightGBM & XGBoost in their ranking modes for creating a
>> model.
>>
>> --justin
>>
>> On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <
>> [email protected]> wrote:
>>
>>> gh it with 100 normalized queries to get a count, and there are 4852
>>> features. Lots of them are probably useless, but choosing which ones is
>>> probably half the battle. These are ~230MB in pickle format, which stores
>>> the floats in binary. This can then be compressed to ~20MB with gzip, so
>>> the data size isn't particularly insane. In a released dataset i would
>>> probably use 10k normalized queries, meaning about 100x this size. Could
>>> plausibly release as CSVs instead of pickled numpy arrays. That will
>>> probably increase the data size further,
>>
>>
>>
>>
>> _______________________________________________
>> discovery mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>
>>
>
>
