Forwarding...

---------- Forwarded message ----------
From: Trey Jones <tjo...@wikimedia.org>
Date: Thu, Aug 4, 2016 at 1:25 PM
Subject: [discovery] Stripping Question Marks From Wiki Searches is Now
Live!
To: A public mailing list about Wikimedia Search and Discovery projects <
discov...@lists.wikimedia.org>


*Stripping Question Marks From Wiki Searches*
*Do you ask questions on Wikipedia? Would you like better results?*

*Summary:* Because the large majority of question marks are used to ask
questions by users unfamiliar with bash-style wildcards
<https://en.wikipedia.org/wiki/Glob_(programming)>, the default behavior
for CirrusSearch will now be to ignore question marks (replacing them with
a space). Escaping them with a backslash (\?) will preserve their wildcard
meaning. Regular expressions in *insource:* will not be affected and should
not be escaped. This option can be modified on a per-wiki basis if needed
(see $wgCirrusSearchStripQuestionMarks
<https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/CirrusSearch.php>
).


When people ask *how old is tom cruise?* on Wikipedia they almost certainly
don’t expect the question mark in *cruise?* to match an additional letter.
They aren’t looking for the words *cruised*, *cruiser*, or *cruises*, but
that’s what they get, and it keeps them from finding the information they
are really after.

Search on Wikipedia (and other Wikimedia projects) includes a lot of
features that most users don’t know about. Most require special keywords,
and some even require specialized knowledge, such as familiarity with
regular expressions. It’s pretty difficult to invoke these special features
by accident.

But search also supports two particular bash-style wildcards without any
special syntax: an asterisk (*) will match any number of characters, and a
question mark (?) will match exactly one. Asterisks do come up from time to
time, but people use question marks all the time, because they like to ask
questions!
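
To make the wildcard behavior concrete, here is a small sketch using Python's standard fnmatch module. It only illustrates the glob semantics described above; it is not how CirrusSearch itself evaluates wildcards, and the word list is made up for the example.

```python
from fnmatch import fnmatchcase

# Glob semantics: "?" matches exactly one character,
# "*" matches any run of characters (including none).
words = ["cruise", "cruised", "cruiser", "cruises", "cruisers"]

one_char = [w for w in words if fnmatchcase(w, "cruise?")]
any_run = [w for w in words if fnmatchcase(w, "cruise*")]

print(one_char)  # ['cruised', 'cruiser', 'cruises']
print(any_run)   # all five words
```

Note that the plain word "cruise" is not matched by "cruise?", which is why a question typed with a trailing question mark misses the very word the user meant.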

A recent review of query-string features
<https://commons.wikimedia.org/wiki/File:From_Zero_to_Hero_-_Anticipating_Zero_Results_From_Query_Features,_Ignoring_Content.pdf>
called out quotes and question marks as the two largest-impact predictors of
unsuccessful queries on Wikipedia. In a follow-up survey of queries with
question marks
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Dropping_Final_Question_Marks_in_the_Top_10_Wikipedias>
in six of the top ten Wikipedias (by search volume), most question marks
were being used to ask questions (the other four of the top ten were not
reviewed). In all ten of the top ten, stripping final question marks
dramatically decreased the number of ?-final queries that got either no
results or very few results (i.e., fewer than 3). The improvement was
around 10-45% for ?-final queries, depending on the wiki. The overall
impact is much more modest (less than 0.5%) because queries with question
marks are not terribly common.

As a result of this analysis, we’ve implemented a change to search which
will by default replace question marks with spaces (to preserve the word
boundaries they intend in queries like *how?why?*). This setting can be
changed on a per-wiki basis, and other options include (i) only stripping
question marks at a clear word boundary (such as before a space), (ii) only
stripping question marks at the end of the query, and (iii) leaving the
question marks alone.
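
The four behaviors above can be sketched roughly as follows. This is a minimal illustration in Python, not the CirrusSearch code: the function name and the mode labels ("all", "break", "final", "no") are assumptions chosen for the example and are not necessarily the values accepted by $wgCirrusSearchStripQuestionMarks. Escaping and *insource:* handling are left out here.

```python
import re

def strip_question_marks(query: str, mode: str = "all") -> str:
    """Hypothetical helper; mode labels are illustrative assumptions."""
    if mode == "no":
        # Leave the question marks alone.
        return query
    if mode == "final":
        # Only strip question marks at the very end of the query.
        return re.sub(r"\?+\s*$", "", query)
    if mode == "break":
        # Only strip question marks at a clear word boundary:
        # followed by whitespace or by the end of the query.
        return re.sub(r"\?+(?=\s|$)", " ", query).strip()
    if mode == "all":
        # Default: replace every question mark with a space,
        # so "how?why?" keeps its word boundary.
        return re.sub(r"\?+", " ", query).strip()
    raise ValueError(f"unknown mode: {mode!r}")

print(strip_question_marks("how old is tom cruise?"))  # how old is tom cruise
print(strip_question_marks("how?why?"))                # how why
print(strip_question_marks("how?why?", "break"))       # how?why
```

Replacing with a space rather than deleting is what keeps *how?why?* from collapsing into a single token.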

For the rarer users who do use question marks as a one-letter wildcard,
when question mark stripping is enabled, question marks can be escaped with
a backslash (e.g., *wiki\?edia*) to preserve their original wildcard
meaning. Power searchers who use *insource:* won’t need to do anything
special; queries with *insource:* will not be modified.
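
As a rough sketch of how escaping and the *insource:* exemption might fit together (again in Python, with a hypothetical function name, and assuming that an escaped \? is passed along as a plain ? wildcard):

```python
import re

def rewrite_query(query: str) -> str:
    """Hypothetical sketch: strip bare question marks, keep escaped ones."""
    # Assumption: any query using insource: is left completely untouched.
    if "insource:" in query:
        return query

    def replace(match: re.Match) -> str:
        # An escaped question mark keeps its wildcard meaning;
        # a bare one is replaced with a space.
        return "?" if match.group(0) == "\\?" else " "

    # Try the escaped form first so the backslash is consumed with its "?".
    return re.sub(r"\\\?|\?", replace, query).strip()

print(rewrite_query(r"wiki\?edia"))          # wiki?edia
print(rewrite_query("are viruses living?"))  # are viruses living
print(rewrite_query("insource:/colou?r/"))   # insource:/colou?r/
```

Edge cases such as a doubled backslash before a question mark are ignored in this sketch.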

Here's a screenshot
<https://commons.wikimedia.org/wiki/File:Old-are_viruses_living%3F.png> of
the former question mark behavior, where it is treated as a wildcard.
Note that “living?” only matches the name “Livings”, leading to two very
unsatisfactory results.

Here's a screenshot
<https://commons.wikimedia.org/wiki/File:New-are_viruses_living%3F.png> of
the new question mark behavior, where it is ignored. Now the question
and part of the answer can be seen in the snippet for the very first
result, and all of the top three results seem relevant.

(Sorry I can't embed the screenshots—the mailing list won't allow messages
over 40K.)

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

_______________________________________________
discovery mailing list
discov...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery
_______________________________________________
Wikitech-ambassadors mailing list
Wikitech-ambassadors@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
