Re: [Wiki-research-l] Estimate of vandal population
I think a rough analysis of user / IP talk pages could give you a number pretty quickly. You would probably want to do it by hand first and then write a script that analyses the Wikipedia dump file. It is doable by hand if you just sub-sample a few hundred pages randomly. And if normalized by the total number of user talk pages vs. the total number of users, this would already give a rough estimate. Kind Regards, Dmitry

On Tue, Oct 1, 2013 at 11:00 AM, Ziko van Dijk zvand...@gmail.com wrote: So Piotr, if I understand you well, the question is how many of the people who are our contributors according to the statistics (per 5 edits a month, or 100 edits a month) are actually vandals? I could imagine that some vandals manage to make 5 edits before being blocked, or lose interest before they are blocked, and so appear in the statistics. Kind regards, Ziko

2013/9/29 Piotr Konieczny pio...@post.pl: I know of the categories, but the problem is that they do not seem to be comprehensive. I can estimate, based on them, that there are at least 150k or so editors who were banned for vandalism, but it seems many vandals do not make it into those categories, suggesting this number is an underestimate. Still, we should be able to get some estimates. We know, for example, that something like 5 or 6 million accounts have made 1+ edit on English Wikipedia. How many of them were indefinitely blocked? This should give us some idea. Alternatively, we know how many accounts make an edit to Wikipedia in any given timeframe. About 100,000-120,000 editors make at least one edit to Wikipedia each month. If we knew how many are indef blocked in that period, that would be another useful estimate. -- Piotr Konieczny, PhD http://hanyang.academia.edu/PiotrKonieczny http://scholar.google.com/citations?user=gdV8_AEJ http://en.wikipedia.org/wiki/User:Piotrus

On 9/30/2013 11:44 AM, Stuart Yeates wrote: I guess it depends on whether Piotr is looking for an estimate of accounts used for vandalism or an estimate of the people who operate them. One seems straightforward, the other more challenging. Perhaps combining the categories below with sock puppet investigations and some fancy stats?
Cheers, Stuart

On 29/09/2013, at 12:13 am, h hant...@gmail.com wrote: Hello Piotr, I believe that on Chinese Wikipedia there is a user category for indefinitely blocked users, Wikipedians that are blocked indefinitely (被永久封禁的維基人): http://zh.wikipedia.org/wiki/Category:%E8%A2%AB%E6%B0%B8%E4%B9%85%E5%B0%81%E7%A6%81%E7%9A%84%E7%B6%AD%E5%9F%BA%E4%BA%BA Its equivalent Wikidata item lists the following pages in other language versions (http://www.wikidata.org/wiki/Q4616402#sitelinks-wikipedia):

* English (enwiki): Category:Blocked historical users (http://en.wikipedia.org/wiki/Category:Blocked_historical_users)
* Italian (itwiki): Categoria:Wikipedia:Cloni sospetti (http://it.wikipedia.org/wiki/Categoria:Wikipedia:Cloni_sospetti)
* Latvian (lvwiki): Kategorija:Uz nenoteiktu laiku nobloķētie lietotāji (http://lv.wikipedia.org/wiki/Kategorija:Uz_nenoteiktu_laiku_noblo%C4%B7%C4%93tie_lietot%C4%81ji)
* Slovak (skwiki): Kategória:Wikipédia:Natrvalo zablokovaní používatelia (http://sk.wikipedia.org/wiki/Kateg%C3%B3ria:Wikip%C3%A9dia:Natrvalo_zablokovan%C3%AD_pou%C5%BE%C3%ADvatelia)
* Czech (cswiki): Kategorie:Wikipedie:Natrvalo zablokovaní uživatelé (http://cs.wikipedia.org/wiki/Kategorie:Wikipedie:Natrvalo_zablokovan%C3%AD_u%C5%BEivatel%C3%A9)
* Bulgarian (bgwiki): Категория:Блокирани неприемливи потребителски имена (http://bg.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%91%D0%BB%D0%BE%D0%BA%D0%B8%D1%80%D0%B0%D0%BD%D0%B8_%D0%BD%D0%B5%D0%BF%D1%80%D0%B8%D0%B5%D0%BC%D0%BB%D0%B8%D0%B2%D0%B8_%D0%BF%D0%BE%D1%82%D1%80%D0%B5%D0%B1%D0%B8%D1%82%D0%B5%D0%BB%D1%81%D0%BA%D0%B8_%D0%B8%D0%BC%D0%B5%D0%BD%D0%B0)
* Meadow Mari (mhrwiki): Категорий:Википедий:Йӧн петырыме (http://mhr.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D0%B9:%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D0%B9:%D0%99%D3%A7%D0%BD_%D0%BF%D0%B5%D1%82%D1%8B%D1%80%D1%8B%D0%BC%D0%B5)
* Ukrainian (ukwiki): Категорія:Безстроково заблоковані користувачі (http://uk.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D1%96%D1%8F:%D0%91%D0%B5%D0%B7%D1%81%D1%82%D1%80%D0%BE%D0%BA%D0%BE%D0%B2%D0%BE_%D0%B7%D0%B0%D0%B1%D0%BB%D0%BE%D0%BA%D0%BE%D0%B2%D0%B0%D0%BD%D1%96_%D0%BA%D0%BE%D1%80%D0%B8%D1%81%D1%82%D1%83%D0%B2%D0%B0%D1%87%D1%96)
* Chinese (zhwiki): Category:被永久封禁的維基人 (http://zh.wikipedia.org/wiki/Category:%E8%A2%AB%E6%B0%B8%E4%B9%85%E5%B0%81%E7%A6%81%E7%9A%84%E7%B6%AD%E5%9F%BA%E4%BA%BA)
* Japanese (jawiki): Category:無期限ブロックを受けたユーザー (http://ja.wikipedia.org/wiki/Category:%E7%84%A1%E6%9C%9F%E9%99%90%E3%83%96%E3%83%AD%E3%83%83%E3%82%AF%E3%82%92%E5%8F%97%E3%81%91%E3%81%9F%E3%83%A6%E3%83%BC%E3%82%B6%E3%83%BC)

I hope that it helps. Best, han-teng liao

2013/9/29 Piotr Konieczny pio...@post.pl: Hi everyone,
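A minimal sketch of the talk-page sampling approach Dmitry suggests at the top of this thread. The sample size and the warning markers used to flag a page are illustrative assumptions, not anything specified in the discussion:

    import math
    import random

    # Assumed input: a list of (title, text) pairs for user talk pages drawn
    # from a dump, plus a crude predicate for "this user was warned/blocked
    # for vandalism". The marker strings below are hypothetical.
    WARNING_MARKERS = ("uw-vandalism", "blocked indefinitely")

    def looks_like_vandal_talk_page(text):
        text = text.lower()
        return any(marker in text for marker in WARNING_MARKERS)

    def estimate_vandal_fraction(talk_pages, sample_size=300):
        """Estimate the fraction of user talk pages carrying vandalism
        warnings, with a rough 95% margin of error for the sampled proportion."""
        sample = random.sample(talk_pages, min(sample_size, len(talk_pages)))
        hits = sum(1 for _, text in sample if looks_like_vandal_talk_page(text))
        p = hits / len(sample)
        margin = 1.96 * math.sqrt(p * (1 - p) / len(sample))
        return p, margin

    # To turn this into a per-user estimate, multiply the sampled fraction by
    # the ratio of user talk pages to registered users, as suggested above.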
Re: [Wiki-research-l] Revert detection
Hi Aaron, Neat LimitedQueue class. It looks like this reverts code wouldn't handle some corner cases; for example, I don't see logic that would distinguish between blanking (which produces duplicate checksums) and reverts. -- Best, Dmitry

On Sun, Aug 21, 2011 at 3:15 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote: I've updated my dump processing Python project to include code for quickly detecting identity reverts from XML dumps. See https://bitbucket.org/halfak/wikimedia-utilities for the project and the process() function at the bottom of https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/processors/reverts.py for the algorithm. The actual function with the revert detection logic is about 50 lines long. The resulting dump.map function using this revert processor() will emit reverting revisions and reverted revisions with the following fields:

Revert revision:
- revert - denotes that this row is a reverting edit
- revision_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted-to edit
- for_vandalism - based on the D_LOOSE/D_STRICT regular expressions applied to the reverting comment (see Priedhorsky et al., Creating, Destroying and Restoring Value in Wikipedia, GROUP 2007)
- reverted_revs - the number of revisions that were reverted (i.e. the number of revisions between the reverting edit and the reverted-to edit)

Reverted revision:
- reverted - denotes that this row is a reverted edit
- revision_id - the rev_id of the reverted edit
- reverting_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted-to edit
- for_vandalism - based on the D_LOOSE/D_STRICT regular expressions applied to the reverting comment (see Priedhorsky et al., Creating, Destroying and Restoring Value in Wikipedia, GROUP 2007)
- reverted_revs - the number of revisions that were reverted (i.e. the number of revisions between the reverting edit and the reverted-to edit)

I hope this is helpful. -Aaron

On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote: An identity revert is one which changes the article to an absolutely identical previous state. This is a common operation in the English Wikipedia. There is a Kittur, Kraut (and others) paper, which I can't recall, that found the vast majority of reverts of any sort were identity reverts. Some other types they define are:
- Partial reverts: part of an edit is discarded.
- Effective reverts: looks to be an identity revert, but not *exactly* the same as a previous revision; often a few whitespace characters were out of place.

See http://www.grouplens.org/node/427 for a discussion of the difficulty of detecting reverts in better ways. My code detects identity reverts. For example, suppose the following is the content of a sequence of revisions:
1. foo
2. bar
3. foobar
4. bar
5. barbar

Revision #4 reverts back to revision #2, and revision #3 is reverted. When looking for identity reverts, I have found that limiting the number of revisions that can be reverted to ~15 produces the highest quality of results. This is discussed in http://www.grouplens.org/node/416 (see http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for a quick/dirty summary of the work). This subject deserves a long conversation, but I think the bit you might be interested in is that the identity revert (described above with an example) seems to be the accepted approach for identifying reverts for most types of analyses.
-Aaron

On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian fabian.flo...@kit.edu wrote: Hi Aaron, thanks, that would be awesome :) We built something ourselves, but I'm not quite content with it. Could you also tell me how you defined a revert (and maybe how you determine who the reverter is)? Because this is a crucial issue for me. Is it the complete deletion of all the characters entered by an editor in an edit? What about editors that revert others or delete content: do you treat their edits as being reverted if the deleted content gets reintroduced? Did you take into account the location of the words in the text, or did you use a bag-of-words model? I have read many papers and tool documentations that use reverts, and some mention their method (while many don't), but it seems almost no one describes their definition of what a revert actually is. But maybe I will get the answers to this from your code as well :) Anyway, thanks for the help! Best, Fabian

On 19 Aug 2011, at 18:31, Aaron Halfaker wrote: Fabian, I actually have some software for quickly producing reverts from a database dump. The framework for doing it is available here: https://bitbucket.org/halfak/wikimedia-utilities. I still have to package up the code that actually generates the reverts though. It's just a matter of finding time to sit down with it and figure out the dependencies! I
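A minimal sketch of the identity-revert detection Aaron describes above, i.e. matching a revision's checksum against a bounded window of prior revisions. This is only an illustration of the idea, not the actual code from wikimedia-utilities:

    import hashlib
    from collections import deque

    def detect_identity_reverts(revision_texts, window=15):
        """Yield (reverting_index, reverted_to_index) pairs.
        A revision is an identity revert if its text matches one of the last
        `window` revisions (the ~15-revision cutoff mentioned above)."""
        recent = deque(maxlen=window)  # (index, checksum) of recent revisions
        for i, text in enumerate(revision_texts):
            checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
            match = None
            # Scan the window; keep the most recent matching revision.
            for j, older in recent:
                if older == checksum:
                    match = j
            if match is not None and match != i - 1:
                # A match at i - 1 would be a null edit, not a revert.
                yield i, match
            recent.append((i, checksum))

    # Example from Aaron's message: revision 4 ("bar") reverts back to revision 2.
    texts = ["foo", "bar", "foobar", "bar", "barbar"]
    print(list(detect_identity_reverts(texts)))  # [(3, 1)] with 0-based indices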
Re: [Wiki-research-l] Revert detection
There have been a few publications on the subject:
1. Us vs. Them: Understanding Social Dynamics in Wikipedia with Revert Graph Visualizations, B. Suh, E. H. Chi, B. A. Pendleton.
2. He Says, She Says: Conflict and Coordination in Wikipedia, A. Kittur, B. Suh, B. A. Pendleton.

From my experience I can tell that analyzing MD5s alone is not enough to identify all reverts, and there are some tricks even to those. Generally you need knowledge of user reputations, article content and comment content to identify true reverts. There are several groups of reverts, which can be loosely identified as:
* regular reverts;
* self-reverts;
* revert wars.

You need to take care of these cases when identifying reverts. Some cases can be tricky; for example, marking reverts between duplicate revisions made by other users can be questionable:

# Revision 54 (regular edit)     User0  "Regular edit"
# Revision 55 (regular edit)     User1  "Regular edit"
# Revision 56 (revert to 54)     User2  "Vandalism"
# Revision 57 (vandalism)        User2  "Vandalism"
# Revision 58 (revert to 56/54)  User3  "Correcting vandalism, but not quite"
# Revision 59 (revert to 55)     User4  "Revert to Revision 55"

Note that User2 had tried to hide his 'revert vandalism' with regular vandalism; this misled User3, but it was finally corrected by User4. Blanking also creates duplicate MD5 signatures, and you need to take care of those too. And of course users do reverts manually (and in some cases not exactly). If you're familiar with Python, you may want to take a look at the following code: look up line 444, def analyze_reverts(revisions), in http://code.google.com/p/pymwdat/source/browse/trunk/toolkit.py -- Best, Dmitry

On Thu, Aug 18, 2011 at 2:40 AM, Flöck, Fabian fabian.flo...@kit.edu wrote: Hi, I'm trying to detect reverts in Wikipedia for my research, right now with a self-built script using MD5 hashes and DIFFs between revisions. I always read about people taking reverts into account in their data, but it is seldom described HOW exactly a revert is determined or what tool they use to do that. Can you point me to any research or tools, or tell me what you used in your own research to identify which edits were reverted and/or who reverted them? Best, Fabian

-- Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck, Research Associate
Building 11.40, Room 222, KIT-Campus South, D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Skype: f.floeck_work
E-Mail: fabian.flo...@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
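A rough sketch of the extra bookkeeping Dmitry mentions, distinguishing self-reverts and blanking-induced duplicate checksums from ordinary reverts. The revision dict fields are assumptions; for the real thing see analyze_reverts() in pymwdat's toolkit.py:

    import hashlib

    def classify_reverts(revisions, window=15):
        """Classify identity reverts in a page history.
        Each revision is assumed to be a dict with 'id', 'user' and 'text'.
        Returns (kind, reverting_id, reverted_to_id) tuples, where kind is
        'revert', 'self-revert' or 'blanking-duplicate'."""
        seen = []  # (id, user, md5) for prior revisions, oldest first
        results = []
        for rev in revisions:
            md5 = hashlib.md5(rev['text'].encode('utf-8')).hexdigest()
            # Look back through a bounded window for a matching checksum,
            # newest first.
            for old_id, old_user, old_md5 in reversed(seen[-window:]):
                if old_md5 != md5:
                    continue
                if rev['text'].strip() == '':
                    # Repeated blankings produce duplicate checksums but
                    # should not be counted as reverts.
                    results.append(('blanking-duplicate', rev['id'], old_id))
                elif old_user == rev['user']:
                    results.append(('self-revert', rev['id'], old_id))
                else:
                    results.append(('revert', rev['id'], old_id))
                break
            seen.append((rev['id'], rev['user'], md5))
        return results

A run of alternating 'revert' results that keeps flipping between the same two checksums would indicate a revert war, which can be detected on top of this output.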
Re: [Wiki-research-l] wikistream: displays wikipedia updates in realtime
Just verified: it is back up. And actual changes are also coming through [filtered by negative user ratings (calculated using a pretty old Wikipedia dump)]. -- Best, Dmitry

On Wed, Aug 17, 2011 at 2:33 AM, Dmitry Chichkov dchich...@gmail.com wrote: Hmm... Somebody actually visited the site. Interesting. I've been running it for over a year and I haven't seen the thing used much. Looks like it was only some weird IP change; I've updated the DNS, so it should be back up pretty soon. Anyway, the main point was to show some alternative implementation ideas. And by the way, the source code is available here: http://code.google.com/p/wrdese/ It's a very lightweight Django/JQuery project and can be tweaked fairly easily... -- Best, Dmitry

On Wed, Aug 17, 2011 at 1:19 AM, Federico Leva (Nemo) nemow...@gmail.com wrote: Ed Summers, 22/06/2011 12:14: On Wed, Jun 22, 2011 at 2:25 AM, Dmitry Chichkov wrote: You may want to take a look at wpcvn.com - it also displays a realtime (filtered) stream... Oh wow, maybe I can shut mine off now :-) Looks like the opposite happened. Nemo
Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files
Hello, This is excellent news! Have you tried running it on Amazon EC2? It would be really nice to know how well WikiHadoop scales up with the number of nodes. Also, this timing - '3 x Quad Core / 14 days / full Wikipedia dump' - on what kind of task (XML parsing, diffs, MD5s, etc.) was it obtained? -- Best, Dmitry

On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere dvanli...@gmail.com wrote: Hello! Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin (who are all Summer of Research fellows) [0] have worked hard on a customized stream-based InputFormatReader that allows parsing of both bz2-compressed and uncompressed files of the full Wikipedia dump (the dump file with the complete edit histories) using Hadoop. Prior to WikiHadoop and the accompanying InputFormatReader it was not possible to use Hadoop to analyze the full Wikipedia dump files (see the detailed tutorial / background for an explanation of why that was not possible). This means:

1) We can now harness Hadoop's distributed computing capabilities in analyzing the full dump files.
2) You can send either one or two revisions to a single mapper, so it's possible to diff two revisions and see what content has been added / removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use this InputFormat Reader with different languages such as Java, Python, Ruby and PHP.

The source code is available at: https://github.com/whym/wikihadoop
A more detailed tutorial and installation guide is available at: https://github.com/whym/wikihadoop/wiki

(Apologies for cross-posting to wikitech-l and wiki-research-l) [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/ Best, Diederik
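Because WikiHadoop exposes the dump through Hadoop's Streaming interface, a mapper can be an ordinary script reading records from stdin. A minimal sketch of such a mapper in Python follows; the record layout assumed here (a page title and two revision texts, tab-separated, with tabs/newlines escaped upstream) is purely illustrative and is not WikiHadoop's actual format, which is described in its wiki:

    #!/usr/bin/env python
    # Hypothetical streaming mapper: assumes each input line carries a page
    # title and two consecutive revision texts, tab-separated.
    import difflib
    import sys

    def main():
        for line in sys.stdin:
            parts = line.rstrip('\n').split('\t')
            if len(parts) != 3:
                continue  # skip malformed records
            title, old_text, new_text = parts
            # Count added/removed lines between the two revisions.
            added = removed = 0
            for d in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
                if d.startswith('+ '):
                    added += 1
                elif d.startswith('- '):
                    removed += 1
            # Emit key<TAB>value pairs for a reducer to aggregate.
            sys.stdout.write('%s\t%d\t%d\n' % (title, added, removed))

    if __name__ == '__main__':
        main()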
Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files
Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can see how this could be very useful! Otherwise... well... it seems like Hadoop gives you a lot of overhead, and it is just not practical to do the parsing this way. With a straightforward implementation in Python, on a single Core2 Duo you can parse the dump (7z), compute diffs, MD5s, etc. and store everything in a binary form in about 6-7 days. For example, the implementation here: http://code.google.com/p/pymwdat/ can do exactly that. I imagine that with faster C++ code and a modern i7 box it could be done within a day. And after that this precomputed binary form (diffs + metadata + stats take several times the size of the .7z dump, ~100 GB) can be read through serially very efficiently (just about an hour on a single box). Having said that, I still think using Hadoop/EC2 could be really nice, particularly if the dump can be made available on S3/EC2. -- Best, Dmitry

On Wed, Aug 17, 2011 at 3:07 PM, Diederik van Liere dvanli...@gmail.com wrote: So the 14-day task included XML parsing and creating diffs. We might gain performance improvements by fine-tuning the Hadoop configuration, although that seems to be more of an art than a science. Diederik
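A minimal sketch of the single-box pipeline Dmitry describes: stream-decompress the 7z dump with 7za, parse revisions incrementally, and compute per-revision checksums. The file name and the XML namespace version are assumptions; the diff computation and pickling steps are left as comments:

    import hashlib
    import subprocess
    import xml.etree.ElementTree as ET

    DUMP = 'enwiki-pages-meta-history.xml.7z'  # hypothetical file name
    NS = '{http://www.mediawiki.org/xml/export-0.5/}'  # schema version may differ

    # Stream-decompress with 7za so the full XML never touches the disk.
    proc = subprocess.Popen(['7za', 'e', '-so', DUMP],
                            stdout=subprocess.PIPE, bufsize=1 << 20)

    for event, elem in ET.iterparse(proc.stdout):
        if elem.tag == NS + 'revision':
            text = elem.findtext(NS + 'text') or ''
            md5 = hashlib.md5(text.encode('utf-8')).hexdigest()
            # ... compute a diff against the previous revision of the same
            # page, update per-editor/per-page counters, and pickle the
            # results to the intermediate binary form here ...
            elem.clear()  # keep memory roughly bounded on the huge dump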
Re: [Wiki-research-l] Fraction of reverts
I can recommend searching for "reverts wikipedia" on Google Scholar: http://scholar.google.com/scholar?q=reverts+wikipedia If you want to try running some analysis on the dump yourself, there's revert-analysis Python code available here: http://code.google.com/p/pymwdat/ -- Best, Dmitry

On Mon, Aug 15, 2011 at 6:18 PM, Tilman Bayer tba...@wikimedia.org wrote: I think Ed Chi's group at PARC did some of the earliest studies about revert rates:
http://asc-parc.blogspot.com/2009/08/part-2-more-details-of-changing-editor.html (monthly ratio of reverted edits by editor class)
http://asc-parc.blogspot.com/2009/07/part-1-slowing-growth-of-wikipedia-some.html
http://www.parc.com/content/attachments/singularity-is-not-near.pdf

On Tue, Aug 16, 2011 at 1:53 AM, Denny Vrandecic vrande...@googlemail.com wrote: Hello, does anyone have a rough estimate of how many edits get reverted? Does anyone have a study handy? Cheers, Denny

-- T. Bayer, Movement Communications, Wikimedia Foundation, IRC (Freenode): HaeB
Re: [Wiki-research-l] Web 2.0 recent changes patrol tool demo (WPCVN)
Yes, but as far as I understand, this API cannot provide recent-revision information in real time. :( So it is not directly usable for the WPCVN RC patrol tool, as it continuously requires recent-edits data. It looks like they've published the code though. I'll try to find some time and integrate their ratings into WPCVN. Their approach of fine-grained text origins sounds pretty solid, and their algorithm also performed better than mine in the PAN 10 competition. By the way, if anybody from the WikiTrust team is present here - congrats! I've just skimmed over your paper - excellent work. -- Cheers, Dmitry

On Fri, Aug 20, 2010 at 12:02 AM, Daniel Kinzler dan...@brightbyte.de wrote: Hi Dmitry: Dmitry Chichkov wrote: Some time ago, as a Python/Django/JQuery/pywikipedia exercise, I hacked together a web-based recent changes patrol tool. An alpha version can be seen at: http://www.wpcvn.com It includes a few interesting features that may be useful to the community (and to researchers designing similar tools):

1. The tool uses editor ratings, primarily based on per-user counters (including reverted-revision counters) calculated from the wiki dump;

Perhaps have a look at the WikiTrust API: http://www.wikitrust.net/vandalism-api

WPCVN aggregates the recent changes IRC feed, the IRC feed from MiszaBot, and WPCVN user actions.

I'm currently prototyping an XMPP-based RC feed, which has much more detailed info, and is more reliable, than the IRC feed: http://meta.wikimedia.org/wiki/Recentchanges_via_XMPP#Prototype

It also uses pre-calculated Wikipedia user karma (based on the recent en-wiki dump analysis) to separate edits made by users with clearly good or bad reputation.

Now *this* definitely sounds like WikiTrust, though I'm not sure if they expose this info via the API. -- daniel
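A minimal sketch of the kind of dump-derived editor rating described above, i.e. a reputation score built from per-user edit and reverted-edit counters. The smoothing prior, the threshold and the counter values are illustrative assumptions, not the actual WPCVN formula:

    def editor_karma(edits_total, edits_reverted, prior_edits=10, prior_reverted=5):
        """Crude reputation score in [0, 1]: the smoothed fraction of an
        editor's past edits that were NOT later reverted. The prior keeps
        brand-new accounts near a neutral 0.5."""
        good = edits_total - edits_reverted + (prior_edits - prior_reverted)
        return good / float(edits_total + prior_edits)

    # Hypothetical usage when ranking a recent-changes feed:
    counters = {'SomeUser': (523, 12), 'AnonIP': (3, 2)}  # (total, reverted) from a dump
    for user, (total, reverted) in counters.items():
        score = editor_karma(total, reverted)
        flag = 'review' if score < 0.5 else 'ok'  # threshold is an assumption
        print(user, round(score, 2), flag)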
Re: [Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)
Yes. It is fairly easy to produce the list limited to a time period, or any other custom stats (e.g. 'reverted-edit ratios' for anonymous users, etc.). It's just several hours of processing. But it is limited to the time frame of the most recent database dump; for en-wiki that is 2010/01/30. Send your complaints to xmldatadumps-l (xmldatadump...@lists.wikimedia.org) ;)

By the way, I've posted the (somewhat cleaned-up) Python script that I used to calculate that list. It's available here: http://code.google.com/p/pymwdat/

For the en-wiki dump it requires:
* the 31 GB enwiki-20100130-pages-meta-history.xml.7z download;
* 250 GB of free disk space (for the intermediate data dump);
* ~a week to pre-process the dump (on a modern desktop);
* ~3 hours to do a simple run (e.g. calculate the list like I did).

Dump preprocessing is basically extracting/parsing the .xml.7z, calculating MD5s for page revisions, calculating page diffs, and pickling the results (alongside other metadata) to disk. It uses a custom diff algorithm optimized for Wikipedia (regular diff is way too slow and doesn't handle copy editing well). It needs memory if one wants to calculate/hold stats for every editor/page (4 GB minimal, 8 GB recommended, 24 GB+ preferred), but obviously one can filter out a data subset or even work on a single page.

Required system/libraries:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia/Trunk ( http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/ )
* OrderedDict (available in Python 2.7 or http://pypi.python.org/pypi/ordereddict/)
* 7-Zip (command-line 7za)

-- Dmitry

On Thu, Aug 19, 2010 at 8:46 AM, John Vandenberg jay...@gmail.com wrote: On Sat, Aug 14, 2010 at 6:12 AM, Dmitry Chichkov dchich...@gmail.com wrote: If anybody is interested, I've made a list of 'most reverted pages' in the English Wikipedia based on an analysis of the enwiki-20100130 dump. Here is the list: http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz http://wpcvn.com/enwiki-20100130.most.reverted.txt

Lovely! This could be used to add semi-protection or pending-changes to reduce the amount of unnecessary work. Is it easy to limit this to reverts within a period, such as the last 12 months? It would also be useful to filter out irregular edit wars, or pages which were subject to frequent reverts but have become stable. -- John Vandenberg
[Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)
If anybody is interested, I've made a list of 'most reverted pages' in the English Wikipedia based on an analysis of the enwiki-20100130 dump. Here is the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt

This list was calculated using the following sampling criteria:
* all pages from the enwiki-20100130 dump;
** filtered to pages with more than 1000 revisions;
** filtered to pages with revert ratios above 0.3;
* sorted by descending revert ratio.

A page revision is considered to be a revert if there is a previous revision with a matching MD5 checksum. BTW, if anybody needs it, the Python code that identifies reverts, revert wars, self-reverts, etc. is available (LGPL). -- Regards, Dmitry
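A minimal sketch of how such a list can be computed from per-page revision histories using the criteria above. The MD5 matching rule, the 1000-revision cutoff and the 0.3 ratio come from the message; the optional `since` filter addresses the question, raised earlier in the thread, of limiting the analysis to a recent period; the input layout is an assumption:

    import hashlib

    def revert_ratio(revisions, since=None):
        """revisions: list of (timestamp, text) tuples in chronological order,
        with comparable timestamps (e.g. ISO strings). Returns (n_revisions,
        n_reverts, ratio), where a revision counts as a revert if any earlier
        revision of the page has the same MD5 checksum."""
        seen = set()
        n, reverts = 0, 0
        for ts, text in revisions:
            md5 = hashlib.md5(text.encode('utf-8')).hexdigest()
            if since is not None and ts < since:
                # Record old checksums so reverts to old versions still match,
                # but don't count pre-window revisions in the ratio.
                seen.add(md5)
                continue
            n += 1
            if md5 in seen:
                reverts += 1
            seen.add(md5)
        return n, reverts, (reverts / n if n else 0.0)

    def most_reverted(pages, min_revisions=1000, min_ratio=0.3):
        """pages: iterable of (title, revisions) pairs. Returns (ratio, title)
        rows sorted by descending revert ratio, applying the criteria above."""
        rows = []
        for title, revisions in pages:
            n, _, ratio = revert_ratio(revisions)
            if n > min_revisions and ratio > min_ratio:
                rows.append((ratio, title))
        return sorted(rows, reverse=True)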