Hi everyone,

Apologies for resurrecting this old thread, but I finally got around to making this (mostly) work, so I thought I'd come back with an update. You can install the extension for either Chrome or Firefox below:
https://chrome.google.com/webstore/detail/wikipedia-needs-reference/michcligfeahibdmakjapmaigojkddmk
https://addons.mozilla.org/en-GB/firefox/addon/wikipedia-needs-references/

The full code for the extension, server and the script that populates ElasticSearch are on GitHub (http://github.com/eggpi/similarity/), and the backend is hosted on Toolforge.

It's definitely experimental and lacking in various ways (there's not even a proper icon yet!), but I've used it for a few weeks and managed to make some edits through it.

If this sounds interesting, please give it a try and feel free to file issues.

Thanks!

2017-10-08 14:34 GMT+02:00 Guilherme Gonçalves <guilherme.p.g...@gmail.com>:

> This is great, thank you all for your input!
>
> It does seem like ElasticSearch (and likely MoreLikeThis) are the way to
> go, and I'm very happy to hear that this could be integrated with other use
> cases relatively easily. I'll definitely keep those in mind and I hope to
> come back to this in a few weeks.
>
> Thanks again!
>
> 2017-10-06 19:17 GMT+01:00 Morten Wang <nett...@gmail.com>:
>
>> In my experience, the problem you're trying to solve boils down to
>> finding articles similar to a given search query that are in the given
>> category. Trying to outsmart Lucene on that kind of a problem is going to
>> be challenging given that it's for example used as a benchmark in
>> research[1], so switching over to ElasticSearch is arguably the way to go.
>>
>> There's a specific feature in Lucene called "MoreLikeThis", and it's also
>> exposed in WP's search API to find articles similar to other articles. The
>> documentation[2] of that feature provides a fairly good explanation of how
>> it works, making it a possible starting point on how to filter a given
>> document to improve the search results.
>>
>> If I remember correctly there are a couple of research papers that study
>> how to recommend sources for articles (or articles for a given source), but
>> I'd have to go look for them to find them. You might want to consider
>> searching the Research Newsletter archives and Google Scholar as that might
>> give you a couple of existing approaches.
>>
>>
>> Footnotes:
>> 1: A paper I reviewed for the Research Newsletter used it:
>> https://meta.wikimedia.org/wiki/Research:Newsletter/2016/May#Evaluating_link-based_recommendations_for_Wikipedia
>> 2: https://lucene.apache.org/core/3_0_3/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html
>>
>>
>> Cheers,
>> Morten
>>
>>
>> On 1 October 2017 at 18:36, Mukunda Modell <mmod...@wikimedia.org> wrote:
>>
>>> I think this is a really cool idea. I don't know of other similar tools
>>> but it does sound like something that should be a good fit for
>>> elasticsearch.
>>>
>>> On Fri, Sep 29, 2017 at 9:34 AM Guilherme Gonçalves <guilherme.p.g...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I've been hacking on a new tool and I thought I'd share what (little) I
>>>> have so far to get some comments and learn of related approaches from the
>>>> community.
>>>>
>>>> The basic idea would be to have a browser extension that tells the user
>>>> if the current page they're viewing looks like a good reference for a
>>>> Wikipedia article, for some whitelisted domains like news websites. This
>>>> would hopefully prompt casual/opportunistic edits, especially for articles
>>>> that may be overlooked normally.
>>>>
>>>> As a proof of concept for a backend, I built a simple bag-of-words
>>>> model of the TextExtracts of enwiki's
>>>> Category:All_articles_needing_additional_references.
>>>> I then set up a tool [1] to receive HTML input and retrieve the 5 most
>>>> similar articles to that input. You can try it out in your browser [2], or
>>>> on the command line [3].
>>>> The results could definitely be better, but having
>>>> tried it on a few different articles over the past few days, I think
>>>> there's some potential there.
>>>>
>>>> I'd be interested in hearing your thoughts on this. Specifically:
>>>>
>>>> * If such a backend/API were available, would you be interested in
>>>> using it for other tools? If so, what functionality would you expect from
>>>> it?
>>>> * I'm thinking of just throwing away the above proof of concept and
>>>> using ElasticSearch, though I don't know a lot about it. Is anyone aware of
>>>> a similar dataset that already exists there, by any chance? Or any reasons
>>>> not to go that way?
>>>> * Any other comments on the overall idea or implementation?
>>>>
>>>> Thanks!
>>>>
>>>> 1- https://github.com/eggpi/similarity
>>>> 2- https://tools.wmflabs.org/similarity/
>>>> 3- Example: curl https://www.nytimes.com/2017/09/22/opinion/sunday/portugal-drug-decriminalization.html | curl -X POST http://tools.wmflabs.org/similarity/search --form "text=<-"
>>>> --
>>>> Guilherme P. Gonçalves
>>>> _______________________________________________
>>>> Cloud mailing list
>>>> Cloud@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>>>
>
> --
> Guilherme P. Gonçalves
>

--
Guilherme P. Gonçalves
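
P.S. For anyone skimming the thread, the bag-of-words proof of concept described in the quoted original post can be sketched roughly as below. This is only an illustration under my own assumptions, not the code in the repository: it uses raw word counts with cosine similarity, and the names `bag_of_words`, `cosine_similarity`, `most_similar` and the `ARTICLES` sample data are all hypothetical. HTML input would need its text extracted before being passed in.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, articles, k=5):
    """Return up to k (title, score) pairs, best match first.

    `articles` maps an article title to its plain-text extract.
    """
    query = bag_of_words(query_text)
    scored = [
        (title, cosine_similarity(query, bag_of_words(extract)))
        for title, extract in articles.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical stand-ins for article extracts, for illustration only.
ARTICLES = {
    "Drug policy of Portugal": "portugal decriminalized drug possession in 2001 drug policy",
    "Lisbon": "lisbon is the capital and largest city of portugal",
    "Apache Lucene": "lucene is a search engine library written in java",
}

if __name__ == "__main__":
    for title, score in most_similar("Portugal drug decriminalization policy", ARTICLES):
        print(f"{score:.3f}  {title}")
```

A real index would likely weight terms (e.g. TF-IDF) and drop stop words, which is part of why Lucene/ElasticSearch was suggested in the replies.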
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud