This is great, thank you all for your input!

It does seem like ElasticSearch (and likely MoreLikeThis) are the way to
go, and I'm very happy to hear that this could be integrated with other use
cases relatively easily. I'll definitely keep those in mind and I hope to
come back to this in a few weeks.
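
As a rough sketch of what that could look like, the snippet below builds an
Elasticsearch request body that combines a more_like_this query with a
category filter. The field names "text" and "category" and the helper name
build_mlt_query are assumptions made for illustration; nothing in this thread
says how such an index would actually be laid out.

```python
import json


def build_mlt_query(page_text, category, size=5):
    """Build an Elasticsearch request body (hypothetical index layout).

    Assumes each indexed article has a full-text field "text" and a keyword
    field "category"; both names are made up for this sketch.
    """
    return {
        "size": size,  # number of similar articles to return
        "query": {
            "bool": {
                # Rank articles by similarity to the submitted page text.
                "must": [
                    {
                        "more_like_this": {
                            "fields": ["text"],
                            "like": page_text,
                            "min_term_freq": 1,
                            "max_query_terms": 25,
                        }
                    }
                ],
                # Restrict the candidates to a single category.
                "filter": [{"term": {"category": category}}],
            }
        },
    }


if __name__ == "__main__":
    body = build_mlt_query(
        "Portugal decriminalized the personal use of drugs in 2001.",
        "All articles needing additional references",
    )
    print(json.dumps(body, indent=2))
```

The resulting dictionary is only the request body; sending it to a real
cluster, and tuning min_term_freq and max_query_terms, is left out here.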

Thanks again!

2017-10-06 19:17 GMT+01:00 Morten Wang <nett...@gmail.com>:

> In my experience, the problem you're trying to solve boils down to finding
> articles in a given category that are similar to a given search query.
> Trying to outsmart Lucene on that kind of problem is going to be
> challenging, given that it is used as a benchmark in research, for
> example[1], so switching over to ElasticSearch is arguably the way to go.
>
> There's a specific feature in Lucene called "MoreLikeThis", and it's also
> exposed in WP's search API to find articles similar to other articles. The
> documentation[2] of that feature provides a fairly good explanation of how
> it works, making it a possible starting point on how to filter a given
> document to improve the search results.
>
> If I remember correctly, there are a couple of research papers that study
> how to recommend sources for articles (or articles for a given source), but
> I'd have to go look for them. You might want to consider searching the
> Research Newsletter archives and Google Scholar, as that might give you a
> couple of existing approaches.
>
>
> Footnotes:
> 1: A paper I reviewed for the Research Newsletter used it:
> https://meta.wikimedia.org/wiki/Research:Newsletter/2016/May#Evaluating_link-based_recommendations_for_Wikipedia
> 2: https://lucene.apache.org/core/3_0_3/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html
>
>
> Cheers,
> Morten
>
>
> On 1 October 2017 at 18:36, Mukunda Modell <mmod...@wikimedia.org> wrote:
>
>> I think this is a really cool idea. I don't know of other similar tools
>> but it does sound like something that should be a good fit for
>> elasticsearch.
>>
>> On Fri, Sep 29, 2017 at 9:34 AM Guilherme Gonçalves <guilherme.p.g...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I've been hacking on a new tool and I thought I'd share what (little) I
>>> have so far to get some comments and learn of related approaches from the
>>> community.
>>>
>>> The basic idea would be to have a browser extension that tells the user
>>> if the current page they're viewing looks like a good reference for a
>>> Wikipedia article, for some whitelisted domains like news websites. This
>>> would hopefully prompt casual/opportunistic edits, especially for articles
>>> that may be overlooked normally.
>>>
>>> As a proof of concept for a backend, I built a simple bag-of-words model
>>> of the TextExtracts of the articles in enwiki's
>>> Category:All_articles_needing_additional_references.
>>> I then set up a tool [1] to receive HTML input and retrieve the 5 most
>>> similar articles to that input. You can try it out in your browser [2], or
>>> on the command line [3]. The results could definitely be better, but having
>>> tried it on a few different articles over the past few days, I think
>>> there's some potential there.
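
For readers unfamiliar with the approach, here is a minimal sketch of
bag-of-words retrieval. It is not the code of the tool described above: the
use of cosine similarity, the tokenizer, and the function names are all
assumptions for illustration, and HTML stripping is left out.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, articles, k=5):
    """Return the titles of the k articles most similar to query_text.

    `articles` maps a title to its plain-text extract.
    """
    query = bag_of_words(query_text)
    scored = [
        (cosine(query, bag_of_words(text)), title)
        for title, text in articles.items()
    ]
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:k]]


if __name__ == "__main__":
    # Made-up extracts, standing in for real TextExtracts.
    articles = {
        "Drug policy of Portugal": "portugal drug decriminalization policy",
        "Cat": "cats are small domesticated animals",
    }
    print(most_similar("Portugal decriminalized drug use", articles, k=1))
```

Raw term counts like these tend to be dominated by common words, which is one
reason a weighted scheme such as the one Lucene uses may give better results.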
>>>
>>> I'd be interested in hearing your thoughts on this. Specifically:
>>>
>>> * If such a backend/API were available, would you be interested in using
>>> it for other tools? If so, what functionality would you expect from it?
>>> * I'm thinking of just throwing away the above proof of concept and
>>> using ElasticSearch, though I don't know a lot about it. Is anyone aware of
>>> a similar dataset that already exists there, by any chance? Or any reasons
>>> not to go that way?
>>> * Any other comments on the overall idea or implementation?
>>>
>>> Thanks!
>>>
>>> 1- https://github.com/eggpi/similarity
>>> 2- https://tools.wmflabs.org/similarity/
>>> 3- Example:
>>> curl https://www.nytimes.com/2017/09/22/opinion/sunday/portugal-drug-decriminalization.html |
>>>   curl -X POST http://tools.wmflabs.org/similarity/search --form "text=<-"
>>> --
>>> Guilherme P. Gonçalves
>>> _______________________________________________
>>> Cloud mailing list
>>> Cloud@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>>
>>


-- 
Guilherme P. Gonçalves
