Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-27 Thread Floeck, Fabian (AIFB)
Hi, as Luca already mentioned, we (my colleagues Maribel Acosta and Felix Keppmann and I) are also working on an algorithm for authorship detection. Our approach is somewhat different from Luca and Michael's in that we rebuild authorship information for words in paragraphs and sentences via MD5
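The preview above cuts off, but the core idea it describes (fingerprinting sentence-level content with MD5 so that unchanged text keeps its original author across revisions) can be sketched roughly as follows. This is a minimal illustration, not the AIFB group's actual algorithm; the function names and the exact normalization are assumptions.

```python
import hashlib

def sentence_hash(sentence):
    """MD5 digest of a normalized sentence, used as a content fingerprint."""
    return hashlib.md5(sentence.strip().lower().encode("utf-8")).hexdigest()

def carry_authorship(prev_sentences, prev_authors, new_sentences, editor):
    """Assign each sentence of the new revision the author of an earlier
    sentence with the same MD5 fingerprint; anything unmatched is credited
    to the current editor."""
    known = {sentence_hash(s): a for s, a in zip(prev_sentences, prev_authors)}
    return [known.get(sentence_hash(s), editor) for s in new_sentences]

# Example: "Alpha." survives unchanged and keeps alice; "Gamma." is new.
authors = carry_authorship(["Alpha.", "Beta."], ["alice", "bob"],
                           ["Alpha.", "Gamma."], "carol")
```

A real system would of course hash at several granularities (paragraph, sentence, word context) and handle moved and partially edited text, which a flat hash lookup cannot.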

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Matthew Flaschen
On 02/26/2013 02:29 AM, Luca de Alfaro wrote: > - We need a way to poll the database for things like what are all > revision_ids of a given page. We could use the API instead, but it's less > efficient. Yes, as others have said, Labs should allow that either now or shortly. You should sig

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Gabriel Wicke
Hi Luca, we are working on somewhat related issues in Parsoid [1][2]. The modified HTML DOM is diffed against the original DOM on the way in, and each modified node is annotated with the base revision. We don't store this information yet; right now we use it to selectively serialize modified parts of the
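The mechanism described here (diff the edited DOM against the original and tag only the changed nodes with the revision they were modified against, so only those subtrees need re-serialization) might be sketched as below. The node shape and attribute name are purely illustrative; Parsoid itself is JavaScript and works on real HTML DOM nodes.

```python
def annotate_modified(orig, edited, base_revid):
    """Walk an edited tree alongside the original and tag nodes that differ,
    recording the base revision they were modified against. Unchanged nodes
    get None and can be re-serialized from stored output.
    Node shape (illustrative): {"tag": ..., "text": ..., "children": [...]}."""
    same = (orig is not None
            and orig["tag"] == edited["tag"]
            and orig["text"] == edited["text"]
            and len(orig["children"]) == len(edited["children"]))
    edited["modified-against"] = None if same else base_revid
    originals = orig["children"] if same else [None] * len(edited["children"])
    for o, e in zip(originals, edited["children"]):
        annotate_modified(o, e, base_revid)
```

Note this toy version marks a node modified whenever its child count changes; a production diff would match children positionally or by identity before recursing.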

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Petr Bena
Your site doesn't work: http://blamemaps.wmflabs.org/mw/index.php/Main_Page -> the connection timed out. On Tue, Feb 26, 2013 at 5:52 PM, Bartosz Dziewoński wrote: > I have briefly toyed with something similar. Unlike yours, it has a (very > simple and rudimentary) interface, but no sophisticated

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Bartosz Dziewoński
I have briefly toyed with something similar. Unlike yours, it has a (very simple and rudimentary) interface, but no sophisticated algorithms inside :) – just a standard LCS diff library. It also works in real time (but is awfully slow). It can be seen at http://wikiblame.heroku.com/ (source at
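A "standard LCS diff library" approach to blame, as described here, can be sketched in a few lines: attribute each line of the latest text to the revision that introduced it, carrying authorship through the `equal` regions of each diff. This is a generic illustration (using Python's `difflib`, which implements an LCS-style matcher), not the code behind wikiblame.heroku.com.

```python
import difflib

def blame(revisions):
    """revisions: list of (editor, text) in chronological order.
    Returns the latest text as (line, editor) pairs, where each line is
    attributed to the editor whose revision introduced it."""
    text, authors = [], []
    for editor, new_text in revisions:
        new_lines = new_text.splitlines()
        new_authors = []
        sm = difflib.SequenceMatcher(a=text, b=new_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep their previous attribution.
                new_authors.extend(authors[i1:i2])
            else:
                # Inserted or replaced lines belong to the current editor.
                new_authors.extend([editor] * (j2 - j1))
        text, authors = new_lines, new_authors
    return list(zip(text, authors))
```

Running this over every revision of a page is exactly the "awfully slow" part the message alludes to: the cost is quadratic-ish per diff and linear in the number of revisions, which is why the production-oriented systems in this thread precompute and store attributions instead.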

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Mark A. Hershberger
On 02/25/2013 09:21 PM, Luca de Alfaro wrote: > The problem is of putting together a bit of effort to get to that first > running version. How big are the wikis that you've tried this on? Would smaller academic wikis be able to use this code? I may have a use for your code since one of the wiki

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Krenair
It sounds like some of those things should be working in labs soon with DB replication. I doubt they'll let you store terabytes though. Alex Monk On 26/02/13 07:29, Luca de Alfaro wrote: What we wrote can work also on labs, but: - We need a way to poll the database for things like what ar

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-26 Thread Sumana Harihareswara
On 02/25/2013 06:21 PM, Luca de Alfaro wrote: > I am writing this message as we hope this might be of interest, and as we > would be quite happy to find people willing to collaborate. Is anybody > interested in developing a GUI for it and talk to us about what API we > should have for retrieving t

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-25 Thread Luca de Alfaro
I agree: in fact we don't do it in the write pipeline. The code we wrote implements a simple queue, where page_ids are queued for processing. The processing job then gets a page_id out of that table and processes all the missing revisions for that page_id. So this is also useful if (say) there i

Re: [Wikitech-l] Blame maps aka authorship detection

2013-02-25 Thread Matthew Flaschen
On 02/25/2013 09:21 PM, Luca de Alfaro wrote: > I am writing this message as we hope this might be of interest, and as we > would be quite happy to find people willing to collaborate. Is anybody > interested in developing a GUI for it and talk to us about what API we > should have for retrieving t

[Wikitech-l] Blame maps aka authorship detection

2013-02-25 Thread Luca de Alfaro
Dear All, Michael Shavlovsky and I have been working on blame maps (authorship detection) for the various Wikipedias. We have code in the WikiMedia repository that has been written with the goal of obtaining a production system capable of attributing all content (not just a research demo). Here are s