Very interesting research - many thanks for sharing that.

----- "Robert Rohde" <raro...@gmail.com> wrote:

> From: "Robert Rohde" <raro...@gmail.com>
> To: "Wikimedia Foundation Mailing List" <foundation-l@lists.wikimedia.org>
> Sent: Thursday, 27 August, 2009 17:41:29 GMT +00:00 GMT Britain, Ireland, Portugal
> Subject: [Foundation-l] Frequency of Seeing Bad Versions - now with traffic data
>
> Recently, I reported on a simple study of how likely one was to
> encounter recent vandalism in Wikipedia based on selecting articles at
> random and using revert behavior as a proxy for recent vandalism.
>
> http://lists.wikimedia.org/pipermail/foundation-l/2009-August/054171.html
>
> One of the key limitations of that work was that it was looking at
> articles selected at random from the pool of all existing page titles.
> That approach was of the most immediate interest to me, but it didn't
> directly address the likelihood of encountering vandalism based on the
> way that Wikipedia is actually used, because the selection of articles
> that people choose to visit is highly non-random.
>
> I've now redone that analysis with a crude traffic-based weighting.
> For traffic information I used the same data stream used by
> http://stats.grok.se. That data is recorded hourly. For simplicity I
> chose 20 hours at random from the last eight months and averaged those
> together to get a rough picture of the relative prominence of pages.
> I then chose a selection of 30000 articles at random with their
> probability of selection proportional to the traffic they received,
> and repeated the analysis previously described. (Note that this
> has the effect of treating the prominence of each page as a constant
> over time. In practice we know some pages rise to prominence while
> others fall, but I am assuming the average pattern is still a good
> enough approximation to be useful.)
>
> From this sample I found 5,955,236 revert events in 38,096,653 edits.
> This is 29 times the edit frequency and 58 times the number of revert
> events that were found from a uniform sampling of pages. I suspect it
> surprises no one that highly trafficked pages are edited more often
> and subject to more vandalism than the average page, though it might
> not have been obvious that the ratio of reverts to normal edits is
> also increased over more obscure pages.
>
> As before, the revert time distribution has a very long tail, though
> as predicted the times are generally reduced when traffic weighting is
> applied. In the traffic weighted sample, the median time to revert is
> 3.4 minutes and the mean time is 2.2 hours (compared to 6.7 minutes
> and 18.2 hours with uniform weighting). Again, I think it is worth
> acknowledging that having a majority of reverts occur within only a
> few minutes is a strong testament to the efficiency and dedication
> with which new edits are usually reviewed by the community. We could
> be much worse off if most things weren't caught so quickly.
>
> Unfortunately, in comparing the current analysis to the previous one,
> the faster response time is essentially being overwhelmed by the much
> larger number of vandalism occurrences. The net result is that
> averaged over the whole history of Wikipedia a visitor would be
> expected to receive a recently degraded article version during about
> 1.1% of requests (compared to ~0.37% in the uniform weighting
> estimate). The last six months averaged a slightly higher 1.3% (1 in
> 80 requests). As before, most of the degraded content that people are
> likely to actually encounter is coming from the subset of things that
> get by the initial monitors and survive for a long time. Among edits
> that are eventually reverted, the longest lasting 5% of bad content
> (those edits taking more than 7.2 hours to revert) is responsible for
> 78% of the expected encounters with recently degraded material. One
> might speculate that such long-lived material is more likely to
> reflect subtle damage to a page rather than more obvious problems like
> page blanking. I did not try to investigate this.
>
> In my sample, the number of reverts being made to articles has
> declined ~40% since a peak in late 2006. However, the mean and median
> times to revert are little changed over the last two years. What
> little trend exists points in the direction of slightly slower
> responses.
>
> So to summarize, the results here are qualitatively similar to those
> found in the previous work. However, with traffic weighting we find
> quantitative differences such that reverts occur much more often but
> take less time to be executed. The net effect of these competing
> factors is such that the bad content is more likely to be seen than
> suggested by the uniform weighting.
>
> -Robert Rohde
>
> _______________________________________________
> foundation-l mailing list
> foundation-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
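
For anyone who wants to experiment with the traffic-weighted estimate described in the quoted message, here is a rough sketch of the idea. It is not Robert's actual code. The function names, the toy page titles and all of the numbers are made up for illustration, and sampling with replacement is my assumption (the message does not say which was used). Like the quoted analysis, it treats each page's traffic as constant over time.

```python
import random


def sample_by_traffic(hourly_views, k, seed=None):
    """Draw k page titles with probability proportional to their traffic.

    Assumption: sampling is done with replacement.
    """
    rng = random.Random(seed)
    titles = list(hourly_views)
    weights = [hourly_views[t] for t in titles]
    return rng.choices(titles, weights=weights, k=k)


def degraded_fraction(sampled_titles, bad_hours, window_hours):
    """Average share of the window that sampled pages spent in a bad state.

    Because pages were already drawn in proportion to traffic, a plain
    average over the sample approximates the share of requests that
    would have received a later-reverted version.
    """
    total_bad = sum(bad_hours.get(t, 0.0) for t in sampled_titles)
    return total_bad / (len(sampled_titles) * window_hours)


# Toy data (hypothetical): average hourly views per page, and the total
# hours each page spent showing a version that was later reverted.
hourly_views = {"Popular": 900, "Medium": 90, "Obscure": 10}
bad_hours = {"Popular": 2.0, "Medium": 0.5, "Obscure": 0.0}
window_hours = 100.0

sample = sample_by_traffic(hourly_views, k=30000, seed=1)
estimate = degraded_fraction(sample, bad_hours, window_hours)

# Exact traffic-weighted value for the toy data, for comparison.
exact = sum(hourly_views[t] * bad_hours[t] for t in hourly_views) / (
    sum(hourly_views.values()) * window_hours
)
print(f"sampled estimate: {estimate:.4%}, exact: {exact:.4%}")
```

With real data, bad_hours would come from the revert history of each sampled page, and hourly_views from the averaged hourly traffic counts.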