Re: [Wikimedia-l] Fwd: [discovery] Fwd: Improving search (sort of)

Deborah Tankersley Wed, 03 Aug 2016 12:13:10 -0700

Hi Gerard,

I chatted with Trey (who did the analysis) for his opinion on your
concerns. Here is his response:


> Hi Gerard,
>
> I wasn't trying to pass judgement on notability when the search referred to
> a particular person, place, or thing, but I did take it as a sign of
> non-notability when a page had been created and then deleted for a
> particular person or website. Those items could become notable in the
> future, and any of them might be notable enough for Wikidata—but the
> original discussion seemed to be mainly about queries to English Wikipedia.
> My conclusion, for English Wikipedia, is that there is not some gold mine
> of super high-frequency typos or new topics that we are missing out on.
> More importantly, there are real privacy concerns, and simple fixes—like
> requiring some number of unique IP addresses to have searched for
> something—are not enough.
> I have looked at thousands of queries from about a dozen other language
> Wikipedias—some in more depth than others, and admittedly not usually
> sorted by frequency—but my intuition is the same as it was for English
> Wikipedia: not enough of value there to override privacy concerns.
> Automation is out for privacy reasons and manual review is not worth it,
> so this isn't a priority for Discovery right now.


I hope that helps to further explain what we found and why we're not acting
further on this issue at this time.

Cheers,

Deb

--
Deb Tankersley
Product Manager, Discovery
IRC: debt
Wikimedia Foundation

On Sat, Jul 30, 2016 at 1:30 AM, Gerard Meijssen <gerard.meijs...@gmail.com>
wrote:

> Hoi,
> So what do we have? A list of the most missed searches for the English
> Wikipedia. Arguably the searches include content that is "iffy". But when
> many people seek info on a porn site, on what basis is it not notable? This
> is only for en.wp and the results for other languages can be quite
> different. The problem with dismissing the need for this data in this way is
> that it supports the status quo for all Wikipedias. It does not suggest
> what we can do with a porn site. We could for instance have a Wikidata item
> stating that it is a porn site and leave it at that.
>
> When you compare Wikidata with Wikipedia, Wikidata has significantly more
> data about whatever than Wikipedia does. All of these subjects are notable by
> Wikidata standards and many are notable by English Wikipedia standards.
> Knowing what subjects are missed in Wikipedia and what people are looking
> for is important because they are the people Wikipedia misses.
>
> NB thanks for the data, the project.
> Thanks,
>       GerardM
>
> On 29 July 2016 at 23:48, Deborah Tankersley <dtankers...@wikimedia.org>
> wrote:
>
> > Forwarding to the Wikimedia mailing list, I'm sorry for the lateness!
> >
> >
> > --
> > Deb Tankersley
> > Product Manager, Discovery
> > IRC: debt
> > Wikimedia Foundation
> >
> > ---------- Forwarded message ----------
> > From: Trey Jones <tjo...@wikimedia.org>
> > Date: Mon, Jul 25, 2016 at 11:58 AM
> > Subject: Re: [discovery] Fwd: [Wikimedia-l] Improving search (sort of)
> > To: A public mailing list about Wikimedia Search and Discovery projects <
> > discov...@lists.wikimedia.org>
> > Cc: James Heilman <jmh...@gmail.com>
> >
> >
> > I decided to look into this as my 10% project last week. It ended up being
> > a 15% project, but I wanted to finish it up.
> >
> > I carefully reviewed and categorized the top 100 "unsuccessful" (i.e.,
> > zero-results) queries from May 2016, and skimmed the top 1,000 from May,
> > and skimmed and compared the top 100 / 1,000 for June.
> >
> > The top result (with several variants in the top 100) is a porn site that
> > has had a wiki page created and deleted several times. Various websites
> > round out the top 10. Internet personalities and websites dominate the top
> > 100 and several have had pages created and deleted over the years. There's
> > strong evidence of links being used for some queries, though I didn't try
> > to track them down. There's plenty of personally identifiable information
> > in the top 1,000 most frequent queries. More than 10% of the queries (by
> > volume) get good results from the completion suggester or "did you mean"
> > spelling suggestions, and more than 10% have some results approximately
> > two months later (i.e., late last week).
> >
> > Obvious refinements to the search strategy would eliminate so many
> > high-frequency queries that any useful mining would be down to slogging
> > through the low-impact long tail.
> >
> > I don’t think there’s a lot here worth extracting, though others may
> > disagree. The privacy concerns expressed earlier are genuine, and simple
> > attempts to filter PII (using patterns, minimum IP counts, etc) are not
> > guaranteed to be effective.
> >
> > For lots more details (but no actual queries), see here:
> >
> > https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Top_Unsuccessful_Search_Queries
> >
> > —Trey
> >
> > Trey Jones
> > Software Engineer, Discovery
> > Wikimedia Foundation
> >
> > On Fri, Jul 15, 2016 at 11:31 AM, Trey Jones <tjo...@wikimedia.org> wrote:
> >
> > > Finally, if this is important enough and the task gets prioritized, I'd be
> > > willing to dive back in and go through the process once and pull out the
> > > top zero-results queries, this time with basic bot exclusion and IP
> > > deduplication, which we didn't do early on because we didn't realize what
> > > a mess the data was. We could process a week or a month of data and
> > > categorize the top 100 to 500 results in terms of personal info, junk,
> > > porn, and whatever other categories we want or that bubble up from the
> > > data, and perhaps publish the non-personal-info part of the list as an
> > > example, either to persuade ourselves that this is worth pursuing, or as
> > > a clearer counter to future calls to do so.
> > > —Trey
> > >
> > >> ---------- Forwarded message ----------
> > >> From: "James Heilman" <jmh...@gmail.com>
> > >> Date: Jul 15, 2016 06:33
> > >> Subject: [Wikimedia-l] Improving search (sort of)
> > >> To: "Wikimedia Mailing List" <wikimedia-l@lists.wikimedia.org>
> > >> Cc:
> > >>
> > >> A while ago I requested a list of the "most frequently searched for terms
> > >> for which no Wikipedia articles are returned". This would allow the
> > >> community to then create redirects or new pages as appropriate and help
> > >> address the "zero results rate" of about 30%.
> > >>
> > >> While we are still waiting for this data I have recently come across a
> > >> list of the most frequently clicked on redlinks on En WP produced by
> > >> Andrew West:
> > >> https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks
> > >> Many of these can be reasonably addressed with a redirect as the issue is
> > >> often capitalization.
> > >>
> > >> Does anyone know where things are at with respect to producing the list of
> > >> most searched-for terms that return nothing?
> > >>
> > >> --
> > >> James Heilman
> > >> MD, CCFP-EM, Wikipedian
> > >>
> > >
> > _______________________________________________
> > discovery mailing list
> > discov...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/discovery
> > _______________________________________________
> > Wikimedia-l mailing list, guidelines at:
> > https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
> > New messages to: Wikimedia-l@lists.wikimedia.org
> > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > <mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>
> >
