Quite understandable. It's also possible to augment the dataset w/ some percentage (perhaps ~5%) of the data including the human-reviewed, PII-safe query.
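A minimal sketch of pulling that ~5% slice for human review (assuming the queries live in a plain list; `sample_for_review` is an illustrative name, not part of any existing pipeline):

```python
import random

def sample_for_review(queries, frac=0.05, seed=0):
    """Randomly pick ~frac of the queries to route through human
    PII review, so those rows can ship with the raw query text."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = max(1, round(len(queries) * frac))
    return rng.sample(queries, k)

queries = [f"query {i}" for i in range(200)]
reviewed = sample_for_review(queries)
print(len(reviewed))  # 10
```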
On the PII topic, one missing feature is user geolocation. This would help disambiguate user intent for queries that are geolocal, for instance [civic center <https://en.wikipedia.org/w/index.php?search=civic+center>] (location search), [john marks <https://en.wikipedia.org/w/index.php?search=john+marks>] (people query), or [air marshal <https://en.wikipedia.org/w/index.php?search=air+marshal>] (different meanings in the US/UK). Reducing the lat/lng to the metropolitan area, or even the state level, may mitigate the PII impact. You can likely see examples of Google/Bing/DDG doing geo-based ranking by using a VPN and running [xyz site:wikipedia.org] queries.

Another feature I'd like to try: one-hot encoding of the top 1-5k page categories. Aka, create N binary columns (one for each of the top categories across enwiki) in the dataset, where each column holds a 1/0 indicating whether the page for that training row is in that column's category. This would help uprank certain types of pages, and can usefully interact w/ the word embeddings (word2vec) you're using.

--justin

On Thu, Jan 5, 2017 at 12:48 PM, Trey Jones <[email protected]> wrote:

>> The privacy impact is greater, but having the original query would be
>> useful for folks wanting to create their own query level features & query
>> dependent features. You do have a great set of features listed
>> <https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
>> action, and release what's possible currently, letting folks play with the
>> dataset.
>
> Right now the standard is that all queries that are released must be
> reviewed by humans. A query data dump had to be retracted in the past for
> containing PII, so I don't see us getting around that (nor would I want to,
> really, having seen the kind of info that can be in there).
>
> We did the manual review for the Discernatron query data, but it's not
> scalable for the size of dataset needed to do machine learning.
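The category one-hot encoding suggested above could be sketched as follows (a hypothetical illustration: the `page_cats` mapping and function names are made up for the example, and a real run would use the top 1-5k enwiki categories rather than four):

```python
from collections import Counter

def top_categories(page_cats, n=1000):
    """Count category frequency across all pages and keep the top n."""
    counts = Counter(c for cats in page_cats.values() for c in cats)
    return [c for c, _ in counts.most_common(n)]

def encode(page, page_cats, cats):
    """Return a 0/1 vector: 1 if the page is in that category."""
    members = page_cats.get(page, set())
    return [1 if c in members else 0 for c in cats]

# toy page -> categories mapping, purely illustrative
page_cats = {
    "Civic Center": {"Government buildings", "Architecture"},
    "Air marshal": {"Military ranks", "Aviation"},
}
cats = top_categories(page_cats, n=4)
vec = encode("Air marshal", page_cats, cats)  # 4 columns, two of them 1
```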
> However, if anyone has any good ideas for features, please let us know,
> and maybe we can generate those features and share them, too, time
> permitting.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Fri, Dec 30, 2016 at 2:28 PM, Justin Ormont <[email protected]> wrote:
>
>> I think the PII impact in releasing a dataset w/ only numerical feature
>> vectors is extremely low.
>>
>> The privacy impact is greater, but having the original query would be
>> useful for folks wanting to create their own query level features & query
>> dependent features. You do have a great set of features listed
>> <https://phabricator.wikimedia.org/P4677> there. As always, I'd bias for
>> action, and release what's possible currently, letting folks play with the
>> dataset.
>>
>> I'd recommend having a groupId which is unique for each instance of a user
>> running a query. This is used to group together all of the results in a
>> viewed SERP, and allows the ranking function to worry only about rank
>> order instead of absolute scoring; aka, the scoring only matters relative
>> to the other viewed documents.
>>
>> I'd try out LightGBM & XGBoost in their ranking modes for creating a
>> model.
>>
>> --justin
>>
>> On Thu, Dec 22, 2016 at 4:00 PM, Erik Bernhardson <[email protected]> wrote:
>>
>>> gh it with 100 normalized queries to get a count, and there are 4852
>>> features. Lots of them are probably useless, but choosing which ones is
>>> probably half the battle. These are ~230MB in pickle format, which stores
>>> the floats in binary. This can then be compressed to ~20MB with gzip, so
>>> the data size isn't particularly insane. In a released dataset I would
>>> probably use 10k normalized queries, meaning about 100x this size. Could
>>> plausibly release as csv's instead of pickled numpy arrays.
>>> That will probably increase the data size further.
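The groupId suggestion above maps directly onto the ranking modes Justin mentions: XGBoost and LightGBM both take per-query group *sizes* rather than a per-row id. A minimal sketch of that conversion, assuming rows are already sorted by groupId (a made-up helper, not from any existing pipeline):

```python
def group_sizes(group_ids):
    """Collapse a per-row groupId column (rows sorted by group) into the
    list of group sizes that XGBoost's set_group / LightGBM's group
    parameter expect."""
    sizes = []
    prev = object()  # sentinel distinct from any real id
    for gid in group_ids:
        if gid == prev:
            sizes[-1] += 1  # still in the same SERP
        else:
            sizes.append(1)  # new SERP starts here
            prev = gid
    return sizes

# one SERP per groupId: 3 results for query A, 2 for B, 4 for C
print(group_sizes(["A", "A", "A", "B", "B", "C", "C", "C", "C"]))
# [3, 2, 4]
```

The resulting list would then be handed to e.g. `DMatrix.set_group()` in XGBoost (with a ranking objective such as `rank:pairwise`) or the `group` argument of LightGBM's `Dataset`.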
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
