Hi Trey,

Cool analysis. I'm curious whether the infrastructure lets you look at query sessions: do these queries with special symbols occur late in a multi-query sequence that included simpler versions earlier in the sequence?
Maybe you can separate users who are confused about the query language from power users who are iteratively refining a query. The latter seem likely to generate low-result-count queries that are more acceptable, because the user tweaked the query intentionally.

John

Sent from +1-617-899-2066

> On May 27, 2016, at 5:17 PM, Trey Jones <[email protected]> wrote:
>
> Hi everyone,
>
> Mikhail, Data Analyst Extraordinaire, recently published his report, "From
> Zero to Hero"[1] on the relationship between various features of queries as
> strings (rather than the content of the query) and those queries getting no
> results.
>
> Today for my 10% project I took a quick look at the two most impactful
> features, quotes and question marks. These two features stood out in
> Mikhail's report as having both relatively high volume and a relatively
> higher chance of getting no results.
>
> I'm not planning on doing a more formal report right now, though I will
> probably copy this email to my Notes page.
>
> Quotes make sense, as we try to get an exact match for strings inside quotes,
> which limits our options for making a match. Question marks are actually a
> little-known, little-used, poorly documented, and poorly understood wildcard:
> they stand for any single character. Most users use them to ask questions.
>
> I took a random sample of 50,000 English Wikipedia queries (using my
> now-favorite criteria at [2]: basically, full-text queries from normal humans,
> as best as we can tell, with fewer than 3 results). I extracted all the
> queries with quotes (170) and all the queries that ended in question marks,
> that is, looked like questions (274). There were 4 queries that were all
> question marks and spaces (e.g., ???? ???????? ????); these are very
> expensive queries that repeatedly failed on the test cluster, so I
> discarded them. I also took a random sub-sample of 1K queries from the larger
> sample of 50K.
>
> All samples had plenty of gibberish queries (e.g., "fhdsfhsdjkfgdsjklgsdl"?),
> queries in other languages, and the other usual cruft.
>
> For the sample with quotes, I used Relevance Forge to compare the results of
> running queries as is vs replacing quotes with spaces. The summary stats are
> below. The zero results rate for queries with quotes went down by almost
> half, and more than half of queries had changes in their top 5 results. The
> TotalHits stats are wildly skewed by one query that increased its results by
> over 300,000. (There always seems to be an outlier!)
>
> Metrics:
> Query Count: 170
> Num TotalHits Changed: μ: 3049.99; σ: 26435.14; median: 1.00
>
> Zero Results: 38.2% (-37.1%)
> Top 5 Sorted Results Differ: 51.8%
> Top 5 Unsorted Results Differ: 51.2%
> Num Top 5 Results Changed: μ: 2.14; σ: 2.30; median: 1.00
>
> For the sample with question marks, I used Relevance Forge to compare the
> results of running queries as is vs dropping all trailing question marks and
> spaces. Some queries ended in multiple question marks (removed), and some
> queries had other question marks in the middle of the query (kept). The
> summary stats are below. The summary is similar to that for quotes: almost
> half of the zero results queries got results, more than half of all
> queries had changes to their top 5 results, and the mean number of total hits
> is blown out by one query that got more than 300K additional results.
>
> Metrics:
> Query Count: 274
> Num TotalHits Changed: μ: 1875.48; σ: 19885.60; median: 1.00
>
> Zero Results: 43.1% (-39.1%)
> Top 5 Sorted Results Differ: 53.3%
> Top 5 Unsorted Results Differ: 53.3%
> Num Top 5 Results Changed: μ: 2.22; σ: 2.33; median: 1.00
>
> For the 1K query sample, I used Relevance Forge to compare the results of
> running queries as is vs (a) replacing quotes with spaces, (b) dropping all
> trailing question marks and spaces, and (c) doing both (there are even a very
> few queries with both quotes and trailing question marks!).
>
> Keep in mind that these are all poorly performing queries (fewer than 3
> results). Summary results:
>
> (a) quotes
> Metrics:
> Query Count: 1000
> Num TotalHits Changed: μ: 0.31; σ: 9.70; median: 0.00
> Zero Results: 79.5% (-0.1%)
> Top 5 Sorted Results Differ: 0.1%
> Top 5 Unsorted Results Differ: 0.1%
> Num Top 5 Results Changed: μ: 0.01; σ: 0.16; median: 0.00
>
> (b) question marks
> Metrics:
> Query Count: 1000
> Num TotalHits Changed: μ: 0.16; σ: 3.45; median: 0.00
> Zero Results: 79.4% (-0.2%)
> Top 5 Sorted Results Differ: 0.4%
> Top 5 Unsorted Results Differ: 0.4%
> Num Top 5 Results Changed: μ: 0.02; σ: 0.32; median: 0.00
>
> (c) quotes and question marks (pretty much the sum of the previous two!)
> Metrics:
> Query Count: 1000
> Num TotalHits Changed: μ: 0.47; σ: 10.30; median: 0.00
> Zero Results: 79.3% (-0.3%)
> Top 5 Sorted Results Differ: 0.5%
> Top 5 Unsorted Results Differ: 0.5%
> Num Top 5 Results Changed: μ: 0.03; σ: 0.35; median: 0.00
>
> Overall, it's a pretty small effect, and the results are not always
> great when quotes are dropped, but it's a very small effort to make the
> change.
>
> A quick look at the queries with question marks didn't show any that were
> obviously intended to be used as wildcards (except maybe all-question-marks,
> like ????, but who knows what that is supposed to be?).
>
> It has been suggested before, and I would also now recommend, disabling ? as a
> wildcard: it causes many more problems than it solves.
>
> Re-running poor-performing queries that have quotes without the quotes is an
> easy win. We should do that too!
>
>
> Thoughts, comments, and suggestions welcome!
>
> --Trey
>
> [1] https://github.com/wikimedia-research/Discovery-Search-Adhoc-QueryFeatures/blob/master/report.pdf
> [2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki#Random_sampling
>
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
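
For reference, the two query rewrites tested in the quoted message (replacing quotes with spaces, and dropping trailing question marks and spaces) plus the suggested re-run of poorly performing queries could be sketched roughly as below. This is only an illustration in Python, not the Relevance Forge or CirrusSearch code: the function names, the `search` callable, the handling of curly quotes, and the whitespace collapsing are all assumptions. The threshold of 3 mirrors the "fewer than 3 results" sampling criterion.

```python
from typing import Callable, List

# Assumption: treat straight and curly double quotes alike.
QUOTE_CHARS = '"\u201c\u201d'
_QUOTE_TABLE = str.maketrans({c: " " for c in QUOTE_CHARS})


def replace_quotes(query: str) -> str:
    """Replace every double-quote character with a space."""
    return query.translate(_QUOTE_TABLE)


def drop_trailing_question_marks(query: str) -> str:
    """Drop trailing question marks and spaces; mid-query '?' is kept."""
    return query.rstrip("? ")


def relax(query: str) -> str:
    """Apply both rewrites, then collapse runs of whitespace (an added choice)."""
    rewritten = drop_trailing_question_marks(replace_quotes(query))
    return " ".join(rewritten.split())


def search_with_fallback(
    query: str,
    search: Callable[[str], List[str]],  # hypothetical search backend
    min_results: int = 3,
) -> List[str]:
    """Re-run a poorly performing query in relaxed form and keep the better result set."""
    results = search(query)
    if len(results) >= min_results:
        return results
    relaxed = relax(query)
    if relaxed and relaxed != query:
        retry = search(relaxed)
        if len(retry) > len(results):
            return retry
    return results
```

A query made up only of question marks and spaces relaxes to the empty string, so the fallback skips the retry for it rather than issuing an empty search.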
