Hi Gerard, I chatted with Trey (who did the analysis) for his opinion on your concerns. Here is his response:
> Hi Gerard, I wasn't trying to pass judgement on notability when the
> search referred to a particular person, place, or thing, but I did take
> it as a sign of non-notability when a page had been created and then
> deleted for a particular person or website. Those items could become
> notable in the future, and any of them might be notable enough for
> Wikidata—but the original discussion seemed to be mainly about queries
> to English Wikipedia.
>
> My conclusion, for English Wikipedia, is that there is not some gold mine
> of super high-frequency typos or new topics that we are missing out on.
>
> More importantly, there are real privacy concerns, and simple fixes—like
> requiring some number of unique IP addresses to have searched for
> something—are not enough.
>
> I have looked at thousands of queries from about a dozen other language
> Wikipedias—some in more depth than others, and admittedly not usually
> sorted by frequency—but my intuition is the same as it was for English
> Wikipedia: not enough of value there to override privacy concerns.
>
> Automation is out for privacy reasons and manual review is not worth it,
> so this isn't a priority for Discovery right now.

I hope that helps to further explain what we found and why we're not
acting further on this issue at this time.

Cheers,

Deb

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

On Sat, Jul 30, 2016 at 1:30 AM, Gerard Meijssen <gerard.meijs...@gmail.com> wrote:

> Hoi,
> So what do we have? It is what the most missed searches are for the
> English Wikipedia. Arguably the searches include content that is "iffie".
> But when many people seek info on a porn site, on what basis is it not
> notable? This is only for en.wp and the results for other languages can
> be quite different. The problem with dismissing the need for this data in
> this way is that it supports the status quo for all Wikipedias. It does
> not suggest what we can do with a porn site.
> We could for instance have a Wikidata item stating that it is a porn
> site and leave it at that.
>
> When you compare Wikidata with Wikipedia, Wikidata has significantly more
> data about whatever than Wikipedia does. All subjects that are notable by
> Wikidata standards and many are notable by English Wikipedia standards.
> Knowing what subjects are missed in Wikipedia and what people are looking
> for is important because they are the people Wikipedia misses.
>
> NB thanks for the data, the project.
> Thanks,
> GerardM
>
> On 29 July 2016 at 23:48, Deborah Tankersley <dtankers...@wikimedia.org>
> wrote:
>
> > Forwarding to the Wikimedia mailing list, I'm sorry for the lateness!
> >
> > --
> > Deb Tankersley
> > Product Manager, Discovery
> > IRC: debt
> > Wikimedia Foundation
> >
> > ---------- Forwarded message ----------
> > From: Trey Jones <tjo...@wikimedia.org>
> > Date: Mon, Jul 25, 2016 at 11:58 AM
> > Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
> > To: A public mailing list about Wikimedia Search and Discovery projects <discov...@lists.wikimedia.org>
> > Cc: James Heilman <jmh...@gmail.com>
> >
> > I decided to look into this as my 10% project last week. It ended up
> > being a 15% project, but I wanted to finish it up.
> >
> > I carefully reviewed and categorized the top 100 "unsuccessful" (i.e.,
> > zero-results) queries from May 2016, and skimmed the top 1,000 from May,
> > and skimmed and compared the top 100 / 1,000 for June.
> >
> > The top result (with several variants in the top 100) is a porn site that
> > has had a wiki page created and deleted several times. Various websites
> > round out the top 10. Internet personalities and websites dominate the
> > top 100 and several have had pages created and deleted over the years.
> > There's strong evidence of links being used for some queries—though I
> > didn't try to track them down.
> > There's plenty of personally identifiable information in the top 1000
> > most frequent queries. More than 10% of the queries (by volume) get good
> > results from the completion suggester or "did you mean" spelling
> > suggestions, and more than 10% have some results approximately two
> > months later (i.e., late last week).
> >
> > Obvious refinements to the search strategy would eliminate so many
> > high-frequency queries that any useful mining would be down to slogging
> > through the low-impact long tail.
> >
> > I don’t think there’s a lot here worth extracting, though others may
> > disagree. The privacy concerns expressed earlier are genuine, and simple
> > attempts to filter PII (using patterns, minimum IP counts, etc) are not
> > guaranteed to be effective.
> >
> > For lots more details (but no actual queries), see here:
> >
> > https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries
> >
> > —Trey
> >
> > Trey Jones
> > Software Engineer, Discovery
> > Wikimedia Foundation
> >
> > On Fri, Jul 15, 2016 at 11:31 AM, Trey Jones <tjo...@wikimedia.org> wrote:
> >
> > > Finally, if this is important enough and the task gets prioritized, I'd
> > > be willing to dive back in and go through the process once and pull out
> > > the top zero-results queries, this time with basic bot exclusion and IP
> > > deduplication—which we didn't do early on because we didn't realize
> > > what a mess the data was. We could process a week or a month of data and
> > > categorize the top 100 to 500 results in terms of personal info, junk,
> > > porn, and whatever other categories we want or that bubble up from the
> > > data, and perhaps publish the non-personal-info part of the list as an
> > > example, either to persuade ourselves that this is worth pursuing, or
> > > as a clearer counter to future calls to do so.
> > > —Trey
> > >
> > >> ---------- Forwarded message ----------
> > >> From: "James Heilman" <jmh...@gmail.com>
> > >> Date: Jul 15, 2016 06:33
> > >> Subject: [Wikimedia-l] Improving search (sort of)
> > >> To: "Wikimedia Mailing List" <wikimedia-l@lists.wikimedia.org>
> > >> Cc:
> > >>
> > >> A while ago I requested a list of the "most frequently searched for
> > >> terms for which no Wikipedia articles are returned". This would allow
> > >> the community to then create redirects or new pages as appropriate and
> > >> help address the "zero results rate" of about 30%.
> > >>
> > >> While we are still waiting for this data I have recently come across a
> > >> list of the most frequently clicked on redlinks on En WP produced by
> > >> Andrew West
> > >> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
> > >> Many of these can be reasonably addressed with a redirect as the issue
> > >> is often capitals.
> > >>
> > >> Does anyone know where things are at with respect to producing the list
> > >> of most searched-for terms that return nothing?
> > >>
> > >> --
> > >> James Heilman
> > >> MD, CCFP-EM, Wikipedian
> > >>
> >
> > _______________________________________________
> > discovery mailing list
> > discov...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/discovery
> > _______________________________________________
> > Wikimedia-l mailing list, guidelines at:
> > https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> > New messages to: Wikimedia-l@lists.wikimedia.org
> > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>