Re: [Wikimedia-search] Completion suggestion API demo

2015-08-26 Thread Stas Malyshev
Hi!

 I uploaded a small HTML page to compare both approaches:
 http://cirrus-browser-bot.wmflabs.org/suggest.html

This is very cool! From my very short testing, it seems to work
pretty nicely.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search


Re: [Wikimedia-search] Completion suggestion API demo

2015-08-26 Thread Erik Bernhardson
I ran some zero result rate tests against this API today. It gives a huge
reduction in the zero result rate over the existing prefix search: from
32% to 19% (on a 1% sample of prefix searches for an entire day).
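
For illustration only - this is not the actual test harness, and the
suggester endpoint below is a placeholder - a minimal sketch of the idea:
replay a sample of prefix queries against the existing prefix search and a
candidate suggester, then compare their zero result rates.

import requests

PREFIX_API = "https://en.wikipedia.org/w/api.php"   # existing prefix search
SUGGEST_API = "https://example.org/suggest"         # placeholder for the new suggester

def prefix_search_hits(query):
    """Number of titles returned by the standard opensearch prefix search."""
    r = requests.get(PREFIX_API, params={
        "action": "opensearch", "search": query, "limit": 10, "format": "json",
    })
    return len(r.json()[1])

def suggester_hits(query):
    """Number of suggestions from the candidate suggester (placeholder API)."""
    r = requests.get(SUGGEST_API, params={"q": query, "limit": 10})
    return len(r.json().get("suggestions", []))

def zero_result_rate(queries, hit_counter):
    zero = sum(1 for q in queries if hit_counter(q) == 0)
    return zero / len(queries)

# sample = [line.strip() for line in open("prefix_query_sample.txt") if line.strip()]
# print("prefix search ZRR:", zero_result_rate(sample, prefix_search_hits))
# print("suggester ZRR:   ", zero_result_rate(sample, suggester_hits))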

On Wed, Aug 26, 2015 at 12:34 PM, Stas Malyshev smalys...@wikimedia.org
wrote:

 Hi!

  I uploaded a small HTML page to compare both approaches:
  http://cirrus-browser-bot.wmflabs.org/suggest.html

 This is very cool! From my very short testing, it seems to work
 pretty nicely.

 --
 Stas Malyshev
 smalys...@wikimedia.org


___
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search


[Wikimedia-search] Final results of the first A/B test

2015-08-26 Thread Oliver Keyes
Hey all,

Several weeks ago we ran an A/B test to try to decrease the number of
searches on Wikipedia returning zero results. This consisted of a
small config change that reduced the confidence needed for our systems
to provide search results, along with a change to the smoothing
algorithm used to bump the quality of the results now provided.

Our intent was not only to reduce the zero results rate but also to
prototype the actual process of A/B testing and identify issues we
could fix to make future tests easier and more reliable - this was the
first A/B test we had run.

5% of searches were registered as a control group, and an additional
5% were subject to the reduced-confidence and smoothing-algorithm changes.
An initial analysis over the first day's 7 million events was run on 7
August, and a final analysis looking at an entire week of data was
completed yesterday. You can read the full results at
https://github.com/wikimedia-research/SuggestItOff/blob/master/initial_analysis/presentation.Rmd
and 
https://github.com/wikimedia-research/SuggestItOff/blob/master/final_analysis/presentation.pdf
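
The sampling itself lives in the instrumentation code; purely as an
illustration of the bucketing idea - not the actual implementation - here
is one way to deterministically assign 5% of search sessions to a control
bucket and 5% to a test bucket.

import hashlib

def ab_bucket(session_id, control_pct=5, test_pct=5):
    """Return 'control', 'test' or None for a given session id.
    Hash-based, so the same session always lands in the same bucket."""
    digest = hashlib.sha1(session_id.encode("utf-8")).hexdigest()
    slot = int(digest[:8], 16) % 100   # roughly uniform over 0..99
    if slot < control_pct:
        return "control"
    if slot < control_pct + test_pct:
        return "test"
    return None

# ab_bucket("some-session-token") -> 'control', 'test' or None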

Based on what we've seen, we conclude that there is, at best, a
negligible effect from this change - and it's hard to tell whether there
is any effect at all. Accordingly, we recommend the default
behaviour be used for all users and the experiment be disabled.

This may sound like a failure, but it's actually not. For one thing,
we've learned that the config variables here probably aren't our avenue
for dramatic changes - the defaults are pretty sensible. If we're
looking for dramatic changes in the rate, we should be applying
dramatic changes to the system's behaviour.

In addition, we identified a lot of process issues we can fix for the
next round of A/B tests, making it easier to analyse the data that
comes in and making the results easier to rely on. These include:

1. Whom the gods would destroy, they first bless with real-time
analytics. The dramatic difference between the outcome of the initial
and final analysis speaks partly to the small size of the effect seen
and the power analysis issues mentioned below, but it's also a good
reminder that a single day of data, however many datapoints it
contains, is rarely the answer. User behaviour varies dramatically
depending on the day of the week or the month of the year - a week
should be seen as the minimum testing period for us to have any
confidence in what we see.
2. Power analysis is a must-have. Our hypothesis for why the effect was
negligible and so variable in size is simply the amount of data we
had; when you're looking at millions upon millions of events for a
pair of options, you're going to see patterns - because with enough
data stared at for long enough, pretty much anything can happen. That
doesn't mean it's /real/. In the future we need to be setting our
sample size using proper, a priori power analysis - or switching to
Bayesian methods, where this sort of problem doesn't appear (a sketch
of what such a power calculation might look like follows below).
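
As a concrete illustration of what an a priori power calculation looks
like (the numbers below are made up for the example, not taken from this
experiment): how many events per bucket would we need to reliably detect
a given absolute change in the zero results rate?

from statistics import NormalDist

def required_n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Sample size per group for a two-sided two-proportion z-test
    comparing rates p1 and p2 (unpooled-variance approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ((z_alpha + z_beta) ** 2 * variance) / (p1 - p2) ** 2

# e.g. detecting a drop in the zero results rate from 25% to 24.5%
# needs roughly 117,000 events per bucket:
# required_n_per_group(0.25, 0.245) -> ~117,000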

These are not new lessons to the org as a whole (at least, I hope not)
but they are nice reminders, and I hope that sharing them allows us to
start building up an org-wide understanding of how we A/B test and the
costs of not doing things in a very deliberate way.

Thanks,

-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search


Re: [Wikimedia-search] Measuring user satisfaction while reducing it at the same time?

2015-08-26 Thread Oliver Keyes
Can you think of a way of consistently identifying a user from page to
page, but only in the trace following them landing on the search page,
that does not include page parameters?
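
Sketching one possible direction, purely as an illustration and not a
proposal for the actual instrumentation: if the search results page sets a
client-side session token (e.g. in sessionStorage) and subsequent pageviews
log that token via EventLogging instead of a URL parameter, article URLs
stay cacheable, and time-on-page can be reconstructed offline from the
event stream. The field names below are invented for the example.

from collections import defaultdict

def dwell_times(events):
    """events: iterable of dicts with 'session', 'page' and 'ts' (unix seconds).
    Returns {session: [(page, seconds_on_page), ...]}, diffing each pageview
    against the next one in the same session (the last pageview is dropped,
    since it has no successor to diff against)."""
    by_session = defaultdict(list)
    for e in events:
        by_session[e["session"]].append((e["ts"], e["page"]))
    result = {}
    for session, views in by_session.items():
        views.sort()
        result[session] = [
            (page, next_ts - ts)
            for (ts, page), (next_ts, _) in zip(views, views[1:])
        ]
    return result

# dwell_times([{"session": "abc", "page": "Foo", "ts": 100},
#              {"session": "abc", "page": "Bar", "ts": 160}])
# -> {"abc": [("Foo", 60)]}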

On 26 August 2015 at 16:30, Max Semenik maxsem.w...@gmail.com wrote:
 While doing CR for
 https://gerrit.wikimedia.org/r/#/c/232896/3/modules/ext.wikimediaEvents.search.js
 I came to have serious doubts about this approach.

 In brief, it attempts to track user satisfaction with search results by
 measuring how long people stay on pages. It does that by appending
 fromsearch=1 to links for 0.5% of users. However, this results in page views
 being uncached and thus increasing HTML load time by a factor of 4-5 and,
 consequently, kicking even short pages' first paint outside the comfort
 zone of 1 second - and that's measured from the office, with a ping of 2-3 ms
 to ulsfo. My concern here is that, as a result, we're trying to measure the
 very metric we're screwing with, making the experiment inaccurate.

 Can we come up with a way of measurement that's less intrusive or alter the
 requirements of the experiment?

 --
 Best regards,
 Max Semenik ([[User:MaxSem]])





-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search


[Wikimedia-search] Zero Results Rate—One Month Followup

2015-08-26 Thread Trey Jones
Hey everyone,

I've re-run my big wiki zero result rate numbers to see what has changed
in the last month. The results are here:

https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#One_Month_Followup

Since I was only looking at the 52 big wikis (100K+ articles), the zero
results rate is under 20% (good news), but it hasn't gone down in a month
(bad news).
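
A generic sketch of this kind of computation (not the actual pipeline
behind these numbers, and the file layout is invented for the example):
given a tab-separated log of wiki, query and result count, restrict to the
big wikis and compute the zero results rate.

import csv

def zero_results_rate(log_path, big_wikis):
    """Zero results rate over rows of: wiki <TAB> query <TAB> result_count,
    restricted to the given set of wikis."""
    total = zero = 0
    with open(log_path, newline="", encoding="utf-8") as f:
        for wiki, _query, result_count in csv.reader(f, delimiter="\t"):
            if wiki not in big_wikis:
                continue
            total += 1
            if int(result_count) == 0:
                zero += 1
    return zero / total if total else float("nan")

# zero_results_rate("cirrus_queries.tsv", {"enwiki", "dewiki", "frwiki"})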

I looked very briefly at the full-text zero results rate for the rest of
the wikis for yesterday. That zero results rate was 56.6%! Lots of hits
from Wikidata and itwiki-things! There were some DOI queries, but none of
the other usual suspects.

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
___
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search


Re: [Wikimedia-search] Measuring user satisfaction while reducing it at the same time?

2015-08-26 Thread Dan Garry
Nice catch, Max. Thanks for reporting it. Do you have any suggestions for
how we could alleviate this issue?

Thanks,
Dan

On 26 August 2015 at 13:30, Max Semenik maxsem.w...@gmail.com wrote:

 While doing CR for
 https://gerrit.wikimedia.org/r/#/c/232896/3/modules/ext.wikimediaEvents.search.js
 I came to have serious doubts about this approach.

 In brief, it attempts to track user satisfaction with search results by
 measuring how long people stay on pages. It does that by appending
 fromsearch=1 to links for 0.5% of users. However, this results in page
 views being uncached and thus increasing HTML load time by a factor of 4-5
 and, consequently, kicking even short pages' first paint outside the
 comfort zone of 1 second - and that's measured from the office, with a ping
 of 2-3 ms to ulsfo. My concern here is that, as a result, we're trying to
 measure the very metric we're screwing with, making the experiment
 inaccurate.

 Can we come up with a way of measurement that's less intrusive or alter
 the requirements of the experiment?

 --
 Best regards,
 Max Semenik ([[User:MaxSem]])





-- 
Dan Garry
Lead Product Manager, Discovery
Wikimedia Foundation
___
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search