On Fri, Aug 28, 2009 at 10:08 AM, Thomas Dalton <thomas.dal...@gmail.com> wrote:
> 2009/8/28 Anthony <wikim...@inbox.org>:
> > If you're going to do it, maybe we should work on a rough-consensus
> > objective definition of "vandalism" before you release the file,
> > though...
>
> Don't we have a consensus definition already? Vandalism is bad faith
> editing. You may also want to include test edits since they are
> treated in the same way (just with different warning messages). That
> isn't objective, but it should be close enough. We can argue over a
> few borderline cases.

Well, it relies on information (intent) which we can't determine simply from the content of the edit (sometimes intent is implied if you look at the user's overall behavior, but that's too messy). Is a POV edit "vandalism"? I think it has to be treated as such at least some of the time ("Windows is the worst operating system ever"), but there are certainly edits which are clearly POV where the intent is unclear (many people don't know the rules). We need to remove intent from the definition, and perhaps call it "degraded articles" instead. But simply saying that anything POV is vandalism would potentially sweep in just about any large article.

I suppose we can list everything that's arguably vandalism and categorize it later, though. I expect we'll come up with several different final numbers, which I guess is okay (the only part that really needs to be pristinely unbiased is the selection of pageviews), though I do expect some people will adapt their definition of vandalism to fit the data.

I support the request for 5000 random pageviews (uniform distribution by pageview over the last 6 months) from the logs. It seems like it could be reused for a lot of different types of studies, so long as the researcher isn't exposed to the details of the URLs before coming up with his/her methodology. And I think the analysis of those 5000 pageviews in all sorts of ways would "crowdsource" well.
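Incidentally, drawing the 5000 uniformly from a request-log stream doesn't require knowing the log's length in advance: a single-pass reservoir sample does it. A rough sketch (the idea of iterating over one log line per pageview is an assumption about the log layout, not a description of the actual squid logs):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from a stream of unknown length
    (e.g. one request-log line per pageview), in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen reservoir slot.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# e.g.  with open("pageview.log") as f:
#           views = reservoir_sample(f, 5000)
```

The nice property is that every pageview in the stream ends up in the sample with equal probability, so the selection stays pristinely unbiased regardless of how the log is ordered.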
I'd love to see a "Nature study" equivalent, analyzing the more subjective aspects of the articles in addition to plain old vandalized/not-vandalized.

If we can't get the 5000 random pageviews (do the logs even still exist?), I suppose wikistats will do. It has pageviews broken down by hour, so the non-uniformity within a single hour is probably fairly small for the popular pages most likely to be selected. The worst part is that it's a whole lot of data to download, and I'm not sure any shortcuts can be taken without screwing up the uniformity. I considered downloading just the projectcounts, selecting the date-hours weighted accordingly, and then downloading only the date-hour files needed, but that potentially introduces error if the non-article traffic isn't well correlated with the article traffic, so I don't know. It's probably a safe assumption that they're well correlated, but I'd rather not guess. Maybe talk-page traffic is highly correlated with increased vandalism, or with decreased vandalism. It's possible, so I'd rather be safe.

_______________________________________________
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
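The weighting step described above (pick date-hours in proportion to their projectcounts totals, then sample one pageview uniformly within each chosen hour's file) could be sketched roughly like this; the date-hour keys and counts are made up, and it assumes the per-hour totals have already been parsed out of projectcounts:

```python
import random

def pick_date_hours(hour_totals, n, seed=None):
    """Pick n date-hours with probability proportional to their total
    pageviews (hour_totals maps a date-hour key to its view count).
    Choosing an hour this way, then one pageview uniformly within that
    hour's file, is equivalent to sampling uniformly over all
    pageviews in the period."""
    rng = random.Random(seed)
    hours = list(hour_totals)
    weights = [hour_totals[h] for h in hours]
    # Weighted sampling with replacement: a busy hour can be
    # picked more than once, which is what uniformity requires.
    return rng.choices(hours, weights=weights, k=n)

# e.g.  pick_date_hours({"20090801-00": 512341,
#                        "20090801-01": 498802}, 5000)
```

This only reproduces uniformity if the projectcounts totals track article traffic, which is exactly the correlation assumption I'd rather not make.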