Re: [WikimediaMobile] Similar articles feature performance in CirrusSearch for apps and mobile web

Jon Katz Thu, 18 Feb 2016 16:00:59 -0800

Hi,
Can someone on this list point me to where the more-like code sits? Or,
better yet, could someone document the rules that govern the
prioritization of suggestions?


I would like to document the logic for our communities so that we can have
an open discussion about what variables and weighting we should use to
suggest articles.
-J

On Mon, Feb 15, 2016 at 11:26 AM, Dmitry Brant <dbr...@wikimedia.org> wrote:

> Just a quick note that our latest production release (just published)
> contains this A/B test, in addition to the other updates.
> Looking forward to seeing the numbers from this!
>
> -Dmitry
>
>
> On Sun, Jan 31, 2016 at 9:35 PM, Dmitry Brant <dbr...@wikimedia.org>
> wrote:
>
>> Roger that! I think we could squeeze it in -- the change would be pretty
>> straightforward. We'll be able to release a Beta with this A/B test in
>> short order, but it will probably be a couple weeks until our next
>> production release. I hope that's all right.
>>
>>
>> On Sat, Jan 30, 2016 at 1:02 PM, Gabriel Wicke <gwi...@wikimedia.org>
>> wrote:
>>
>>> We are also happy to add cached entry points for high-traffic end
>>> points in the REST API. I commented to that effect at
>>> https://phabricator.wikimedia.org/T124216#1984206. Let us know if you
>>> think this would be useful for this use case.
>>>
>>> On Sat, Jan 30, 2016 at 8:11 AM, Adam Baso <ab...@wikimedia.org> wrote:
>>> > Okay. As per https://phabricator.wikimedia.org/T124225#1984080 I think
>>> > if we're doing near term experimentation with a controlled A/B test,
>>> > the Android app is the only logical place to start. Dmitry, can that
>>> > work for you? It's not required, but I think it would be neat to see if
>>> > we can move the needle even more. Of course your quarterly goals take
>>> > top priority...but what do you think?
>>> >
>>> > On Sat, Jan 23, 2016 at 5:58 AM, Adam Baso <ab...@wikimedia.org> wrote:
>>> >>
>>> >> Hey all, am planning to look at Phabricator tasks and provide a reply
>>> >> during the upcoming weekdays. Just wanted to acknowledge I saw your
>>> >> replies!
>>> >>
>>> >>
>>> >> On Friday, January 22, 2016, Erik Bernhardson
>>> >> <ebernhard...@wikimedia.org> wrote:
>>> >>>
>>> >>> On Thu, Jan 21, 2016 at 1:29 AM, Joaquin Oltra Hernandez
>>> >>> <jhernan...@wikimedia.org> wrote:
>>> >>>>
>>> >>>> Regarding the caching, we would need to agree between apps and web
>>> >>>> on the URL and smaxage parameter, as Adam noted, so that the URLs
>>> >>>> are exactly the same; that avoids bloating Varnish and lets us reuse
>>> >>>> the same cached objects across platforms.
>>> >>>>
>>> >>>> It is an extremely ad hoc and brittle solution, but it seems like it
>>> >>>> would be the greatest win.
>>> >>>>
>>> >>>> 20% of the traffic from searches, while being only in Android and
>>> >>>> web beta, seems a lot to me, and we should work on reducing it;
>>> >>>> otherwise, when it hits web stable, we're going to crush the
>>> >>>> servers. So caching seems the highest priority.
>>> >>>>
>>> >>> To clarify, it's 20% of the load, as opposed to 20% of the traffic.
>>> >>> But same difference :)
>>> >>>
>>> >>>>
>>> >>>> Let's chime in at https://phabricator.wikimedia.org/T124216 and
>>> >>>> continue the cache discussion there.
>>> >>>>
>>> >>>> Regarding the validity of results with opening text only, how should
>>> >>>> we proceed? Adam?
>>> >>>>
>>> >>> I've put together https://phabricator.wikimedia.org/T124258 to track
>>> >>> putting together an A/B test that measures the difference in
>>> >>> click-through rates for the two approaches.
>>> >>>
>>> >>>
>>> >>>>
>>> >>>> On Wed, Jan 20, 2016 at 9:34 PM, David Causse
>>> >>>> <dcau...@wikimedia.org> wrote:
>>> >>>>>
>>> >>>>> Hi,
>>> >>>>>
>>> >>>>> Yes, we can combine many factors: templates (quality, but also
>>> >>>>> disambiguation/stubs), size, and others.
>>> >>>>> Today cirrus uses mostly the number of incoming links, which (imho)
>>> >>>>> is not very good for morelike.
>>> >>>>> On enwiki, results will also be scored according to the weights
>>> >>>>> defined in
>>> >>>>> https://en.wikipedia.org/wiki/MediaWiki:Cirrussearch-boost-templates.
>>> >>>>>
>>> >>>>> I wrote a small bash script to compare results:
>>> >>>>> https://gist.github.com/nomoa/93c5097e3c3cb3b6ebad
>>> >>>>> Here are some random results from the list (sometimes better,
>>> >>>>> sometimes worse):
>>> >>>>>
>>> >>>>> $ sh morelike.sh Revolution_Muslim
>>> >>>>> Defaults
>>> >>>>>         "title": "Chess",
>>> >>>>>         "title": "Suicide attack",
>>> >>>>>         "title": "Zachary Adam Chesser",
>>> >>>>> =======
>>> >>>>> Opening text no boost links
>>> >>>>>         "title": "Hungarian Revolution of 1956",
>>> >>>>>         "title": "Muslims for America",
>>> >>>>>         "title": "Salafist Front",
>>> >>>>>
>>> >>>>> $ sh morelike.sh Chesser
>>> >>>>> Defaults
>>> >>>>>         "title": "Chess",
>>> >>>>>         "title": "Edinburgh",
>>> >>>>>         "title": "Edinburgh Corn Exchange",
>>> >>>>> =======
>>> >>>>> Opening text no boost links
>>> >>>>>         "title": "Dreghorn Barracks",
>>> >>>>>         "title": "Edinburgh Chess Club",
>>> >>>>>         "title": "Threipmuir Reservoir",
>>> >>>>>
>>> >>>>> $ sh morelike.sh Time_%28disambiguation%29
>>> >>>>> Defaults
>>> >>>>>         "title": "Atlantis: The Lost Empire",
>>> >>>>>         "title": "Stargate",
>>> >>>>>         "title": "Stargate SG-1",
>>> >>>>> =======
>>> >>>>> Opening text no boost links
>>> >>>>>         "title": "Father Time (disambiguation)",
>>> >>>>>         "title": "The Last Time",
>>> >>>>>         "title": "Time After Time",
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 20/01/2016 19:34, Jon Robson wrote:
>>> >>>>>>
>>> >>>>>> I'm actually interested to see whether this yields better results
>>> >>>>>> in certain examples where the algorithm is lacking [1]. If it's
>>> >>>>>> done as an A/B test we could even measure things such as click
>>> >>>>>> throughs in the related article feature (whether they go up or
>>> >>>>>> not).
>>> >>>>>>
>>> >>>>>> Out of interest, is it also possible to take article size and type
>>> >>>>>> into account and not return any morelike results for things like
>>> >>>>>> disambiguation pages and stubs?
>>> >>>>>>
>>> >>>>>> [1] https://www.mediawiki.org/wiki/Topic:Swsjajvdll3pf8ya
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Wed, Jan 20, 2016 at 9:47 AM, Adam Baso <ab...@wikimedia.org>
>>> >>>>>> wrote:
>>> >>>>>>>
>>> >>>>>>> One thing we could do regarding the quality of the output is check
>>> >>>>>>> results against a random sample of popular articles (example
>>> >>>>>>> approach to find some articles) on mdot Wikipedia. Presuming that
>>> >>>>>>> improves the quality of the recommendations or at least does not
>>> >>>>>>> degrade them, we should consider adding the enhancement task to a
>>> >>>>>>> future sprint, with further instrumentation and A/B testing /
>>> >>>>>>> timeboxed beta test, etc.
>>> >>>>>>>
>>> >>>>>>> Joaquin, smaxage (e.g., 24 hour cached responses) does seem a good
>>> >>>>>>> fix for now for further reduction of client perceived wait, at
>>> >>>>>>> least for non-cold cache requests, even if we stop beating up the
>>> >>>>>>> backend. Does anyone know of a compelling reason to not do that for
>>> >>>>>>> the time being? The main thing that comes to mind as always is
>>> >>>>>>> growing the Varnish cache object pool - probably not a huge deal
>>> >>>>>>> while the thing is only in beta, but on the stable channel maybe
>>> >>>>>>> noteworthy because it would run on probably most pages (but that's
>>> >>>>>>> what edge caches are for, after all).
>>> >>>>>>>
>>> >>>>>>> Erik, from your perspective does use of smaxage relieve the backend
>>> >>>>>>> sufficiently?
>>> >>>>>>>
>>> >>>>>>> If we do smaxage, then Web, Android, iOS should standardize their
>>> >>>>>>> URLs so we get more cache hits at the edge across all clients.
>>> >>>>>>> Here's the URL I see being used on the web today from mobile web
>>> >>>>>>> beta:
>>> >>>>>>>
>>> >>>>>>> https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> -Adam
>>> >>>>>>>
>>> >>>>>>> On Wed, Jan 20, 2016 at 7:45 AM, Joaquin Oltra Hernandez
>>> >>>>>>> <jhernan...@wikimedia.org> wrote:
>>> >>>>>>>>
>>> >>>>>>>> I'd be up to it if we manage to cram it up in a following sprint
>>> >>>>>>>> and it is worth it.
>>> >>>>>>>>
>>> >>>>>>>> We could run a controlled test against production with a long
>>> >>>>>>>> batch of articles and check median/percentiles response time
>>> >>>>>>>> with repeated runs and highlight the different results for human
>>> >>>>>>>> inspection regarding quality.
>>> >>>>>>>>
>>> >>>>>>>> It's been noted previously that the results are far from ideal
>>> >>>>>>>> (which they are because it is just morelike), and I think it
>>> >>>>>>>> would be a great idea to change the endpoint to a specific one
>>> >>>>>>>> that is smarter and has some cache (we could do much more to get
>>> >>>>>>>> relevant results besides text similarity, take into account
>>> >>>>>>>> links, or see also links if there are, etc...).
>>> >>>>>>>>
>>> >>>>>>>> As a note, in mobile web the related articles extension allows
>>> >>>>>>>> editors to specify articles to show in the section, which would
>>> >>>>>>>> avoid queries to cirrussearch if it was more used (once rolled
>>> >>>>>>>> into stable I guess).
>>> >>>>>>>>
>>> >>>>>>>> I remember that the performance related task was closed as
>>> >>>>>>>> resolved (https://phabricator.wikimedia.org/T121254#1907192),
>>> >>>>>>>> should we reopen it or create a new one?
>>> >>>>>>>>
>>> >>>>>>>> I'm not sure if we ended up adding the smaxage parameter (I think
>>> >>>>>>>> we didn't), should we? To me it seems a no-brainer that we should
>>> >>>>>>>> be caching these results in varnish since they don't need to be
>>> >>>>>>>> completely up to date for this use case.
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Jan 19, 2016 at 11:54 PM, Erik Bernhardson
>>> >>>>>>>> <ebernhard...@wikimedia.org> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> Both mobile apps and web are using CirrusSearch's morelike:
>>> >>>>>>>>> feature, which is showing some performance issues on our end. We
>>> >>>>>>>>> would like to make a performance optimization to it, but before
>>> >>>>>>>>> we do, we would prefer to run an A/B test to see if the results
>>> >>>>>>>>> are still "about as good" as they are currently.
>>> >>>>>>>>>
>>> >>>>>>>>> The optimization is basically this: currently, more like this
>>> >>>>>>>>> takes the entire article into account; we would like to change
>>> >>>>>>>>> it to take only the opening text of an article into account.
>>> >>>>>>>>> This should reduce the amount of work we have to do on the
>>> >>>>>>>>> backend, saving both server load and the latency the user sees
>>> >>>>>>>>> running the query.
>>> >>>>>>>>>
>>> >>>>>>>>> This can be triggered by adding these two query parameters to
>>> >>>>>>>>> the search API request that is being performed:
>>> >>>>>>>>>
>>> >>>>>>>>> cirrusMltUseFields=yes&cirrusMltFields=opening_text
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> The API will give a warning that these parameters do not exist,
>>> >>>>>>>>> but the warning is safe to ignore. Would any of you be willing
>>> >>>>>>>>> to run this test? We would basically want to look at user
>>> >>>>>>>>> perceived latency along with click-through rates for the current
>>> >>>>>>>>> default setup along with the restricted setup using only
>>> >>>>>>>>> opening_text.
>>> >>>>>>>>>
>>> >>>>>>>>> Erik B.
>>> >>>>>>>>>
>>> >>>>>>>>> _______________________________________________
>>> >>>>>>>>> Mobile-l mailing list
>>> >>>>>>>>> Mobile-l@lists.wikimedia.org
>>> >>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>>> >>>>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Gabriel Wicke
>>> Principal Engineer, Wikimedia Foundation
>>>
>>>
>>
>>
>>
>> --
>> Dmitry Brant
>> Mobile Apps Team (Android)
>> Wikimedia Foundation
>> https://www.mediawiki.org/wiki/Wikimedia_mobile_engineering
>>
>>
>
>
> --
> Dmitry Brant
> Mobile Apps Team (Android)
> Wikimedia Foundation
> https://www.mediawiki.org/wiki/Wikimedia_mobile_engineering
>
>
>
>

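For reference, the request discussed in the quoted thread can be sketched
roughly as below. This is only an illustrative sketch, not code from
CirrusSearch or any of the clients: the query parameters are copied from
the mobile web beta URL and from the two cirrusMlt* parameters quoted
above, while the function name morelike_url, its arguments, and the choice
to sort parameters alphabetically (so that every client would emit a
byte-identical URL and share one cached object) are assumptions made for
this example.

```python
from urllib.parse import urlencode

# Host and path taken from the mobile web beta URL quoted in the thread.
API_ENDPOINT = "https://en.m.wikipedia.org/w/api.php"


def morelike_url(title, opening_text_only=False, smaxage=None):
    """Build a morelike: search URL with a deterministic parameter order.

    Hypothetical helper: the name and signature are illustrative only.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "pageimages|pageterms",
        "piprop": "thumbnail",
        "pithumbsize": "80",
        "wbptterms": "description",
        "pilimit": "3",
        "generator": "search",
        "gsrsearch": "morelike:" + title,
        "gsrnamespace": "0",
        "gsrlimit": "3",
    }
    if opening_text_only:
        # The two parameters proposed in the thread for the A/B test arm.
        params["cirrusMltUseFields"] = "yes"
        params["cirrusMltFields"] = "opening_text"
    if smaxage is not None:
        # Shared-cache lifetime in seconds, e.g. 86400 for 24 hours.
        params["smaxage"] = str(smaxage)
    # Assumption: sorting keys is one simple way to make all clients
    # produce the same URL string for the same logical request.
    return API_ENDPOINT + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    print(morelike_url("Come_Share_My_Love"))
    print(morelike_url("Come_Share_My_Love", opening_text_only=True,
                       smaxage=86400))
```

Whatever ordering is chosen matters less than every platform agreeing on
the same one, since the edge cache keys on the exact URL.
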
_______________________________________________
Mobile-l mailing list
Mobile-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mobile-l
