If anyone is interested in faster processing of revision differences, you 
could also adapt the strategy we implemented for WikiWho [1]: keep track of 
larger unchanged text chunks via hashes and diff only the remaining text 
(usually a relatively small part of the article). We introduced that 
technique specifically because diffing the full text was too expensive. In 
principle it can produce the same output as a full diff, although we 
currently use it for authorship detection, which is a slightly different 
task. On average it is >100 times faster than pure "traditional" diffing. 
Maybe that is useful for someone. The code is available on GitHub [2].

[1] http://f-squared.org/wikiwho
[2] https://github.com/maribelacosta/wikiwho
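
To give a rough idea of the strategy, here is a minimal sketch. It is not 
the WikiWho code. It assumes paragraphs (split on blank lines) as the "bigger 
chunks", SHA-256 as the hash, and Python's difflib for the remaining tokens. 
The names split_paragraphs, chunk_hash and fast_diff are made up for this 
example.

```python
import difflib
import hashlib
from collections import Counter


def chunk_hash(chunk):
    """Hash one text chunk so unchanged chunks can be matched cheaply."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def split_paragraphs(text):
    """Assumption: a 'chunk' is a paragraph separated by a blank line."""
    return [p for p in text.split("\n\n") if p.strip()]


def fast_diff(old_text, new_text):
    """Diff only the paragraphs whose hashes do not match across revisions."""
    old_pars = split_paragraphs(old_text)
    new_pars = split_paragraphs(new_text)

    # Count old hashes so that duplicated paragraphs are matched one-to-one.
    available = Counter(chunk_hash(p) for p in old_pars)

    # New paragraphs with a matching old hash are unchanged and skipped.
    changed_new = []
    for p in new_pars:
        h = chunk_hash(p)
        if available[h] > 0:
            available[h] -= 1
        else:
            changed_new.append(p)

    # Old paragraphs whose hash was never consumed were edited or removed.
    changed_old = []
    for p in old_pars:
        h = chunk_hash(p)
        if available[h] > 0:
            available[h] -= 1
            changed_old.append(p)

    # Run the expensive token-level diff on the (usually small) remainder.
    old_tokens = " ".join(changed_old).split()
    new_tokens = " ".join(changed_new).split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return [
        (tag, old_tokens[i1:i2], new_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "Intro stays the same.\n\nThe cat sat on the mat.\n\nOutro stays too."
new = "Intro stays the same.\n\nThe cat sat on the rug.\n\nOutro stays too."
print(fast_diff(old, new))
```

In this toy example only the middle paragraph reaches the token-level diff. 
The other two are matched by hash and never compared token by token, which 
is where the saving comes from on long articles with small edits.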


On 14.12.2014, at 07:23, Jeremy Baron <jer...@tuxmachine.com> wrote:


On Dec 13, 2014 12:33 PM, "Aaron Halfaker" <ahalfa...@wikimedia.org> wrote:
> 1. It turns out that generating diffs is computationally complex, so 
> generating them in real time is slow and lame.  I'm working to generate all 
> diffs historically using Hadoop and then have a live system listening to 
> recent changes to keep the data up-to-date[2].

IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all 
enwiki diffs for all time (I don't remember whether this is limited by 
namespace), but it also uses an extraordinary amount of RAM, i.e. hundreds 
of GB.

AIUI, there's no dynamic memory allocation: revisions are loaded into 
fixed-size buffers larger than the largest revision.

https://github.com/makoshark/wikiq
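
As a rough illustration of the fixed-size buffer idea (a Python sketch of 
the general pattern, not taken from wikiq): one buffer is allocated up front 
and every revision is read into it in turn. The names MAX_REVISION_BYTES and 
iter_revisions, and the assumption that revision lengths are known in 
advance, are hypothetical. Python still allocates internally, so this only 
shows the buffer reuse.

```python
import io

# Hypothetical upper bound; it must be at least the size of the largest revision.
MAX_REVISION_BYTES = 16


def iter_revisions(stream, lengths, buf):
    """Read each revision into the same preallocated buffer and yield a view."""
    view = memoryview(buf)
    for n in lengths:
        if n > len(buf):
            raise ValueError("revision larger than the fixed buffer")
        got = stream.readinto(view[:n])  # fills the buffer in place
        if got != n:
            raise EOFError("truncated revision")
        # The view is only valid until the next revision overwrites the buffer.
        yield view[:n]


buf = bytearray(MAX_REVISION_BYTES)  # allocated once, reused for every revision
stream = io.BytesIO(b"first revsecond revision")
texts = [bytes(v) for v in iter_revisions(stream, [9, 15], buf)]
print(texts)
```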

-Jeremy

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




Cheers,
Fabian

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.flo...@gesis.org

www.gesis.org
www.facebook.com/gesis.org