Re: [Wiki-research-l] Estimate of vandal population

2013-10-01 Thread Dmitry Chichkov
I think a rough analysis of user / IP talk pages could give you a number
pretty quickly. You probably would want to do it by hand first and then
write a script that analyses the Wikipedia dump file. It is doable by hand
if you just sub-sample a few hundred pages randomly, and if normalized by
the total number of user talk pages versus the total number of users, this
would already give a rough estimate.
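
A minimal sketch of that sub-sampling estimate (purely illustrative: the
iteration over user talk pages and the block-notice patterns below are
assumptions, not part of Dmitry's suggestion):

    import random
    import re

    # Illustrative block-notice patterns; a real run would use the wiki's
    # actual block/vandalism warning templates.
    BLOCK_NOTICE = re.compile(r'\{\{\s*(indefblock|uw-vblock|blocked)', re.I)

    def estimate_vandal_fraction(talk_pages, total_users, sample_size=300):
        """talk_pages: list of (title, wikitext) for user talk pages from a dump.
        Returns a rough vandal fraction among all registered users."""
        sample = random.sample(talk_pages, min(sample_size, len(talk_pages)))
        flagged = sum(1 for _, text in sample if BLOCK_NOTICE.search(text))
        # Normalize: only users who have a talk page were sampled at all.
        return (flagged / float(len(sample))) * (len(talk_pages) / float(total_users))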

Kind Regards,
Dmitry


On Tue, Oct 1, 2013 at 11:00 AM, Ziko van Dijk zvand...@gmail.com wrote:

 So, Piotr, if I understand you correctly, the question is how many of
 the people who count as contributors according to the statistics (5+
 edits a month, or 100 edits a month) are actually vandals? I could imagine
 that some vandals manage to make 5 edits before being blocked, or lose
 interest before they are blocked, and so appear in the statistics.
 Kind regards
 Ziko


 2013/9/29 Piotr Konieczny pio...@post.pl

  I know of the categories, but the problem is that they do not seem to
 be comprehensive. I can estimate, based on them, that there are at least
 150k or so editors who were banned for vandalism, but it seems many vandals
 do not make it into those categories, suggesting this number is an
 underestimate.

  Still, we should be able to get some estimates. We know, for example,
 that something like 5 or 6 million accounts have made 1+ edits on the
 English Wikipedia. How many of them were indefinitely blocked? This should
 give us some idea.

  Alternatively, we know how many accounts make an edit to Wikipedia in any
 given timeframe. About 100,000-120,000 editors make at least one edit to
 Wikipedia each month. If we knew how many are indef blocked in that period,
 that would be another useful estimate.
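
As a back-of-the-envelope illustration using only the figures quoted in this
message (the exact counts are, of course, the open question):

    # Lower bound from the numbers above: ~150k category-tagged vandals out of
    # ~5-6 million accounts with at least one edit on the English Wikipedia.
    accounts_with_edits = 5.5e6
    blocked_for_vandalism = 150e3
    print("lower-bound vandal share: %.1f%%"
          % (100 * blocked_for_vandalism / accounts_with_edits))
    # -> roughly 2-3%, before correcting for vandals who never make it into
    #    the block categories.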


 --
  Piotr Konieczny, PhD
  http://hanyang.academia.edu/PiotrKonieczny
  http://scholar.google.com/citations?user=gdV8_AEJ
  http://en.wikipedia.org/wiki/User:Piotrus



 On 9/30/2013 11:44 AM, Stuart Yeates wrote:

  I guess it depends on whether Piotr is looking for an estimate of
 accounts used for vandalism or an estimate of the people who operate them.
 One seems straightforward, the other more challenging. Perhaps combining
 the categories below with sock puppet investigations and some fancy stats?

  Cheers
 Stuart

 On 29/09/2013, at 12:13 am, h hant...@gmail.com wrote:

   Hello Piotr,
 I believe that on the Chinese Wikipedia, indefinitely blocked users go into a
 user category called "Wikipedians that are blocked indefinitely" (被永久封禁的維基人):
 http://zh.wikipedia.org/wiki/Category:%E8%A2%AB%E6%B0%B8%E4%B9%85%E5%B0%81%E7%A6%81%E7%9A%84%E7%B6%AD%E5%9F%BA%E4%BA%BA
 Its Wikidata item links to the following equivalent pages in other
 language versions:
 http://www.wikidata.org/wiki/Q4616402#sitelinks-wikipedia
  Language / Code / Linked article:
  - English (enwiki): Category:Blocked historical users
    http://en.wikipedia.org/wiki/Category:Blocked_historical_users
  - italiano (itwiki): Categoria:Wikipedia:Cloni sospetti
    http://it.wikipedia.org/wiki/Categoria:Wikipedia:Cloni_sospetti
  - latviešu (lvwiki): Kategorija:Uz nenoteiktu laiku nobloķētie lietotāji
    http://lv.wikipedia.org/wiki/Kategorija:Uz_nenoteiktu_laiku_noblo%C4%B7%C4%93tie_lietot%C4%81ji
  - slovenčina (skwiki): Kategória:Wikipédia:Natrvalo zablokovaní používatelia
    http://sk.wikipedia.org/wiki/Kateg%C3%B3ria:Wikip%C3%A9dia:Natrvalo_zablokovan%C3%AD_pou%C5%BE%C3%ADvatelia
  - česky (cswiki): Kategorie:Wikipedie:Natrvalo zablokovaní uživatelé
    http://cs.wikipedia.org/wiki/Kategorie:Wikipedie:Natrvalo_zablokovan%C3%AD_u%C5%BEivatel%C3%A9
  - български (bgwiki): Категория:Блокирани неприемливи потребителски имена
    http://bg.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D1%8F:%D0%91%D0%BB%D0%BE%D0%BA%D0%B8%D1%80%D0%B0%D0%BD%D0%B8_%D0%BD%D0%B5%D0%BF%D1%80%D0%B8%D0%B5%D0%BC%D0%BB%D0%B8%D0%B2%D0%B8_%D0%BF%D0%BE%D1%82%D1%80%D0%B5%D0%B1%D0%B8%D1%82%D0%B5%D0%BB%D1%81%D0%BA%D0%B8_%D0%B8%D0%BC%D0%B5%D0%BD%D0%B0
  - олык марий (mhrwiki): Категорий:Википедий:Йӧн петырыме
    http://mhr.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D0%B8%D0%B9:%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D0%B9:%D0%99%D3%A7%D0%BD_%D0%BF%D0%B5%D1%82%D1%8B%D1%80%D1%8B%D0%BC%D0%B5
  - українська (ukwiki): Категорія:Безстроково заблоковані користувачі
    http://uk.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%B3%D0%BE%D1%80%D1%96%D1%8F:%D0%91%D0%B5%D0%B7%D1%81%D1%82%D1%80%D0%BE%D0%BA%D0%BE%D0%B2%D0%BE_%D0%B7%D0%B0%D0%B1%D0%BB%D0%BE%D0%BA%D0%BE%D0%B2%D0%B0%D0%BD%D1%96_%D0%BA%D0%BE%D1%80%D0%B8%D1%81%D1%82%D1%83%D0%B2%D0%B0%D1%87%D1%96
  - 中文 (zhwiki): Category:被永久封禁的維基人
    http://zh.wikipedia.org/wiki/Category:%E8%A2%AB%E6%B0%B8%E4%B9%85%E5%B0%81%E7%A6%81%E7%9A%84%E7%B6%AD%E5%9F%BA%E4%BA%BA
  - 日本語 (jawiki): Category:無期限ブロックを受けたユーザー
    http://ja.wikipedia.org/wiki/Category:%E7%84%A1%E6%9C%9F%E9%99%90%E3%83%96%E3%83%AD%E3%83%83%E3%82%AF%E3%82%92%E5%8F%97%E3%81%91%E3%81%9F%E3%83%A6%E3%83%BC%E3%82%B6%E3%83%BC



 I hope that it helps.
 Best,
 han-teng liao



 2013/9/29 Piotr Konieczny pio...@post.pl

 Hi everyone,

Re: [Wiki-research-l] Revert detection

2011-08-22 Thread Dmitry Chichkov
Hi Aaron,

Neat LimitedQueue class. It looks like this revert code wouldn't handle
some corner cases; for example, I don't see logic that would distinguish
blanking (which produces duplicate checksums) from genuine reverts.
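
One hedged way to handle that corner case (not the wikimedia-utilities code,
just a sketch, with the blanking threshold an arbitrary assumption) is to
refuse to treat blank or near-blank revisions as revert targets:

    import hashlib

    def _is_blank(text, min_chars=30):
        # Treat (near-)empty revisions as blanking, not as legitimate states.
        return len(text.strip()) < min_chars

    def identity_revert_target(history, new_text):
        """history: list of (rev_id, text), oldest first.
        Returns the rev_id this edit reverts to, or None."""
        if _is_blank(new_text):
            return None
        new_sum = hashlib.md5(new_text.encode('utf-8')).hexdigest()
        for rev_id, text in reversed(history):
            if _is_blank(text):
                continue  # skip duplicate checksums caused by blanking
            if hashlib.md5(text.encode('utf-8')).hexdigest() == new_sum:
                return rev_id
        return None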

-- Best, Dmitry

On Sun, Aug 21, 2011 at 3:15 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote:

 I've updated my dump processing Python project to include code for quickly
 detecting identity reverts from XML dumps.  See
 https://bitbucket.org/halfak/wikimedia-utilities for the project and the
 process() function at the bottom of
 https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/processors/reverts.py
 for the algorithm.  The actual function with the revert detection logic is
 about 50 lines long.

 The resulting dump.map run, using this revert process() function, will emit
 reverting revisions and reverted revisions with the following fields:

 Revert revision:

- revert - denotes that this row is a reverting edit
- revision_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted to edit
- for_vandalism - set by applying the D_LOOSE/D_STRICT regular expressions to
the reverting comment (see Priedhorsky et al., "Creating, Destroying, and
Restoring Value in Wikipedia", GROUP 2007)
- reverted_revs - number of revisions that were reverted (this is the
number of revisions between the reverting edit and reverted to edit)


 Reverted revision:

- reverted - denotes that this row is a reverted edit
- revision_id - the rev_id of the reverted edit
- reverting_id - the rev_id of the reverting edit
- reverted_to_id - the rev_id of the reverted to edit
- for_vandalism - set by applying the D_LOOSE/D_STRICT regular expressions to
the reverting comment (see Priedhorsky et al., "Creating, Destroying, and
Restoring Value in Wikipedia", GROUP 2007); a rough illustration follows
after this list
- reverted_revs - number of revisions that were reverted (this is the
number of revisions between the reverting edit and reverted to edit)
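
For reference, for_vandalism is a comment-based heuristic; a loose pattern in
the spirit of the Priedhorsky et al. approach could look like the sketch below
(an illustrative approximation, not the actual D_LOOSE/D_STRICT definitions in
the project):

    import re

    # Illustrative only: not the project's real D_LOOSE/D_STRICT expressions.
    LOOSE = re.compile(r'\b(revert(ed)?|rv|undid|undo)\b|vandal', re.I)
    STRICT = re.compile(r'vandal', re.I)

    def looks_like_vandalism_revert(comment, strict=False):
        """Flag a reverting edit as vandalism-related from its edit comment."""
        pattern = STRICT if strict else LOOSE
        return bool(pattern.search(comment or ''))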

 I hope this is helpful.

 -Aaron

 On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker 
  aaron.halfa...@gmail.com wrote:

 An identity revert is one which changes the article to an absolutely
 identical previous state.  This is a common operation in the English
 Wikipedia.

  There is a Kittur & Kraut (and others) paper, whose title I can't recall, that
  found the vast majority of reverts of any sort were identity reverts.  Some
  other types they define are:

- Partial reverts: Part of an edit is discarded
- Effective reverts: Looks to be an identity revert, but not
*exactly* the same as a previous revision.  Often a few white-space
characters were out of place.

 See http://www.grouplens.org/node/427 for a discussion of the difficulty
 of detecting reverts in better ways.

 My code detects identity reverts.  For example, suppose the following is
 the content of a sequence of revisions:


1. foo
2. bar
3. foobar
4. bar
5. barbar

 Revision #4 reverts back to revision #2, and revision #3 is reverted.  When
 looking for identity reverts, I have found that limiting the number of
 revisions that can be reverted to ~15 produces the highest quality of
 results.  This is discussed in http://www.grouplens.org/node/416 (see
 http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for
 a quick/dirty summary of the work).
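
A minimal sketch of that bounded look-back (simplified relative to the
wikimedia-utilities implementation; the helper name here is made up):

    import hashlib
    from collections import deque

    def find_identity_reverts(revisions, window=15):
        """revisions: iterable of (rev_id, text), oldest first.
        Yields (reverting_id, reverted_to_id, n_reverted), looking back at
        most `window` revisions for a matching checksum."""
        recent = deque(maxlen=window)  # (rev_id, md5), newest last
        for rev_id, text in revisions:
            checksum = hashlib.md5(text.encode('utf-8')).hexdigest()
            for back, (old_id, old_sum) in enumerate(reversed(recent)):
                if old_sum == checksum and back > 0:
                    # `back` intermediate revisions were reverted
                    yield rev_id, old_id, back
                    break
            recent.append((rev_id, checksum))

For the five-revision example above, this yields (4, 2, 1): revision #4 reverts
to revision #2, and the one revision in between (#3) is reverted.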

 This subject deserves a long conversation, but I think the bit you might
 be interested in is that the identity revert (described and exemplified above)
 seems to be the accepted approach for identifying reverts for most types of
 analyses.

 -Aaron

 On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian fabian.flo...@kit.edu wrote:

 Hi Aaron,

 thanks, that would be awesome :) We built something ourselves, but I'm
 not quite content with it.

 Could you also tell me how you defined a revert (and maybe how you
 determine who the reverter is)? This is a crucial issue for me.
 Is it the complete deletion of all the characters entered by an editor in
 an edit? What about editors that revert others or delete content? Do you
 treat their edits as being reverted if the deleted content gets
 reintroduced? Did you take into account the location of the words in the
 text, or did you use a bag-of-words model?
 I have read many papers and tool documentations that use reverts; some
 mention their method (while many don't), but it seems almost no one
 describes their definition of what a revert actually is.

 But maybe I will get the answers to this from your code as well :)

 Anyway, thanks for the help!

 Best,
 Fabian


 On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:

 Fabian,

 I actually have some software for quickly producing reverts from a
 database dump.  The framework for doing it is available here:
 https://bitbucket.org/halfak/wikimedia-utilities.  I still have to
 package up the code that actually generates the reverts though.  It's just a
 matter of finding time to sit down with it and figure out the dependencies!
  I 

Re: [Wiki-research-l] Revert detection

2011-08-18 Thread Dmitry Chichkov
There have been a few publications on the subject:
1. "Us vs. them: Understanding social dynamics in Wikipedia with revert
graph visualizations", B. Suh, E. H. Chi, B. A. Pendleton.
2. "He says, she says: Conflict and coordination in Wikipedia", A. Kittur,
B. Suh, B. A. Pendleton.


From my experience I can tell that analyzing MD5s alone is not enough to
identify all reverts, and even those have some tricky cases. Generally you
need knowledge about user reputations, article content, and comment content
to identify true reverts.


There are several groups of reverts which can be loosely identified as:
 * regular reverts;
 * self-reverts;
 * revert wars;

You need to take care of these cases when identifying reverts.
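
A rough sketch of how those groups might be separated once identity reverts
have been detected (this is not pymwdat's logic; the revert-war heuristic in
particular is an arbitrary assumption):

    def classify_reverts(reverts, rev_user, war_window=3):
        """reverts: list of (reverting_id, reverted_ids) in page order;
        rev_user: mapping rev_id -> user name."""
        labels = {}
        recent_reverters = []
        for reverting_id, reverted_ids in reverts:
            reverter = rev_user[reverting_id]
            reverted_users = {rev_user[r] for r in reverted_ids}
            if reverted_users <= {reverter}:
                labels[reverting_id] = 'self-revert'   # undoing only one's own edits
            elif reverter in recent_reverters[-war_window:]:
                labels[reverting_id] = 'revert war'    # crude back-and-forth heuristic
            else:
                labels[reverting_id] = 'regular revert'
            recent_reverters.append(reverter)
        return labels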


Some cases can be tricky, for example (revision / user / edit comment):
 # Marking: between duplicates, by other users (reverted, questionable)
 # Revision 54 (regular edit)     User0   "Regular edit"
 # Revision 55 (regular edit)     User1   "Regular edit"
 # Revision 56 (revert to 54)     User2   "Vandalism"
 # Revision 57 (vandalism)        User2   "Vandalism"
 # Revision 58 (revert to 56/54)  User3   "Correcting vandalism" (but not quite)
 # Revision 59 (revert to 55)     User4   "Revert to Revision 55"

Note that User2 had tried to hide his 'revert' behind regular vandalism;
this misled User3, but it was finally corrected by User4.

Blanking also creates duplicate MD5 signatures; you need to take care of
these as well. And of course users also revert manually (and in some cases
not exactly).

If you are familiar with Python, you may want to take a look at the following
code: look up line 444, def analyze_reverts(revisions), in
 http://code.google.com/p/pymwdat/source/browse/trunk/toolkit.py


-- Best, Dmitry



On Thu, Aug 18, 2011 at 2:40 AM, Flöck, Fabian fabian.flo...@kit.edu wrote:

 Hi,

 I'm trying to detect reverts in Wikipedia for my research, right now with a
 self-built script using MD5 hashes and diffs between revisions. I always read
 about people taking reverts into account in their data, but it is seldom
 described HOW exactly a revert is determined or what tool they use to do
 that. Can you point me to any research or tools, or tell me maybe what you
 used in your own research to identify which edits were reverted and/or who
 reverted them?

 Best,

 Fabian




 --
 Karlsruhe Institute of Technology (KIT)
 Institute of Applied Informatics and Formal Description Methods

 Dipl.-Medwiss. Fabian Flöck
 Research Associate

 Building 11.40, Room 222
 KIT-Campus South
 D-76128 Karlsruhe

 Phone: +49 721 608 4 6584
 Skype: f.floeck_work
 E-Mail: fabian.flo...@kit.edu
 WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

 KIT – University of the State of Baden-Wuerttemberg and
 National Research Center of the Helmholtz Association


 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] wikistream: displays wikipedia updates in realtime

2011-08-17 Thread Dmitry Chichkov
Just verified: it is back up, and actual changes are also coming through,
filtered by negative user ratings (calculated using a pretty old
Wikipedia dump).

-- Best, Dmitry

On Wed, Aug 17, 2011 at 2:33 AM, Dmitry Chichkov dchich...@gmail.com wrote:

 Hmm... Somebody actually visited the site. Interesting. I've been running
 it for over a year and I haven't seen the thing used much.
 Looks like it was only some weird IP change; I've updated the DNS, so it
 should be back up pretty soon.

 Anyway, the main point was to show some alternative implementation ideas.
 And by the way, the source code is available at
 http://code.google.com/p/wrdese/ - it's a very lightweight Django/jQuery
 project and can be tweaked fairly easily...

 -- Best, Dmitry


 On Wed, Aug 17, 2011 at 1:19 AM, Federico Leva (Nemo) 
  nemow...@gmail.com wrote:

 Ed Summers, 22/06/2011 12:14:
  On Wed, Jun 22, 2011 at 2:25 AM, Dmitry Chichkov  wrote:
  You may want to take a look at the wpcvn.com - it also displays
 realtime
  stream (filtered)...
 
  Oh wow, maybe I can shut mine off now :-)

 Looks like the opposite happened.

 Nemo

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Dmitry Chichkov
Hello,

This is excellent news!

Have you tried running it on Amazon EC2? It would be really nice to know how
well WikiHadoop scales up with the number of nodes.
Also, this timing - '3 x Quad Core / 14 days / full Wikipedia dump' - on what
kind of task (XML parsing, diffs, MD5, etc.) was it obtained?

-- Best, Dmitry

On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere dvanli...@gmail.com wrote:

 Hello!

 Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
 Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
 on a customized stream-based InputFormatReader that allows parsing of both
 bz2-compressed and uncompressed files of the full Wikipedia dump (the dump
 file with the complete edit histories) using Hadoop. Prior to WikiHadoop
 and the accompanying InputFormatReader it was not possible to use Hadoop to
 analyze the full Wikipedia dump files (see the detailed tutorial /
 background for an explanation of why that was not possible).

 This means:
 1) We can now harness Hadoop's distributed computing capabilities in
 analyzing the full dump files.
 2) You can send either one or two revisions to a single mapper so it's
 possible to diff two revisions and see what content has been added /
 removed.
 3) You can exclude namespaces by supplying a regular expression.
 4) We are using Hadoop's Streaming interface which means people can use
 this InputFormat Reader using different languages such as Java, Python, Ruby
 and PHP.
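
A rough sketch of what the Streaming interface in point 4 amounts to on the
mapper side (the content of each input record is defined by WikiHadoop's
custom InputFormat and is treated as opaque text here; see the wiki linked
below for the real invocation):

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper skeleton: records arrive on stdin and
    # tab-separated key/value pairs go to stdout. A real mapper would parse
    # the revision text(s) in each record and emit diffs, MD5s, etc.
    import sys

    for record in sys.stdin:
        sys.stdout.write('record_bytes\t%d\n' % len(record))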

 The source code is available at: https://github.com/whym/wikihadoop
 A more detailed tutorial and installation guide is available at:
 https://github.com/whym/wikihadoop/wiki


 (Apologies for cross-posting to wikitech-l and wiki-research-l)

 [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/


 Best,

 Diederik


 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Dmitry Chichkov
Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can
see how that could be very useful! Otherwise... well... it seems like Hadoop
gives you a lot of overhead, and it is just not practical to do the parsing
this way.

With a straightforward implementation in Python, on a single Core2 Duo you
can parse the dump (7z), compute diffs, MD5s, etc., and store everything in a
binary form in about 6-7 days.
For example, the implementation at http://code.google.com/p/pymwdat/ can
do exactly that. I imagine that with faster C++ code and a modern i7 box
it can be done within a day.
And after that, this precomputed binary form (diffs + metadata + stats take
several times the size of the .7z dump, ~100 GB) can be serialized very
efficiently (just about an hour on a single box).
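
A minimal sketch of that single-box approach (streaming the .7z through 7za
and hashing revision texts; this is not the pymwdat implementation, and tag
handling is simplified relative to the real export schema):

    import hashlib
    import subprocess
    from xml.etree import ElementTree

    def revision_checksums(dump_path):
        """Stream a pages-meta-history .xml.7z and yield an MD5 per revision.
        Assumes the 7za command-line tool is on PATH."""
        proc = subprocess.Popen(['7za', 'e', '-so', dump_path],
                                stdout=subprocess.PIPE)
        for _, elem in ElementTree.iterparse(proc.stdout):
            if elem.tag.endswith('text'):
                yield hashlib.md5((elem.text or '').encode('utf-8')).hexdigest()
            elif elem.tag.endswith('page'):
                elem.clear()  # drop finished pages to keep memory in check

Diffs would be computed along the same pass, by keeping the previous revision's
text around per page instead of hashing each revision independently.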

That said, I still think using Hadoop/EC2 could be really nice,
particularly if the dump can be made available on S3/EC2.

-- Best, Dmitry


On Wed, Aug 17, 2011 at 3:07 PM, Diederik van Liere dvanli...@gmail.com wrote:

 So the 14-day task included XML parsing and creating diffs. We might gain
 performance improvements by fine-tuning the Hadoop configuration, although
 that seems to be more of an art than a science.
 Diederik


  On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov dchich...@gmail.com wrote:

 Hello,

 This is excellent news!

 Have you tried running it on Amazon EC2? It would be really nice to know
 how well WikiHadoop scales up with the number of nodes.
 Also, this timing - '3 x Quad Core / 14 days / full Wikipedia dump' - on
 what kind of task (XML parsing, diffs, MD5, etc.) was it obtained?

 -- Best, Dmitry

 On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere 
  dvanli...@gmail.com wrote:

 Hello!

  Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
 and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked
 hard on a customized stream-based InputFormatReader that allows parsing of
 both bz2-compressed and uncompressed files of the full Wikipedia dump (the
 dump file with the complete edit histories) using Hadoop. Prior to
 WikiHadoop and the accompanying InputFormatReader it was not possible to
 use Hadoop to analyze the full Wikipedia dump files (see the detailed
 tutorial / background for an explanation of why that was not possible).

 This means:
 1) We can now harness Hadoop's distributed computing capabilities in
 analyzing the full dump files.
 2) You can send either one or two revisions to a single mapper so it's
 possible to diff two revisions and see what content has been added /
 removed.
 3) You can exclude namespaces by supplying a regular expression.
 4) We are using Hadoop's Streaming interface which means people can use
 this InputFormat Reader using different languages such as Java, Python, Ruby
 and PHP.

 The source code is available at: https://github.com/whym/wikihadoop
 A more detailed tutorial and installation guide is available at:
 https://github.com/whym/wikihadoop/wiki


 (Apologies for cross-posting to wikitech-l and wiki-research-l)

 [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/


 Best,

 Diederik


 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




 --
 Check out my about.me profile! http://about.me/diederik

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Fraction of reverts

2011-08-15 Thread Dmitry Chichkov
I can recommend searching for "reverts wikipedia" on Google Scholar:
http://scholar.google.com/scholar?q=reverts+wikipedia

If you want to try running some analysis on the dump yourself, there's
revert-analysis Python code available here:
http://code.google.com/p/pymwdat/

-- Best, Dmitry


On Mon, Aug 15, 2011 at 6:18 PM, Tilman Bayer tba...@wikimedia.org wrote:

 I think Ed Chi's group at PARC did some of the earliest studies about revert
 rates:


 http://asc-parc.blogspot.com/2009/08/part-2-more-details-of-changing-editor.html
 Monthly ratio of reverted edits by editor class

 http://asc-parc.blogspot.com/2009/07/part-1-slowing-growth-of-wikipedia-some.html
 http://www.parc.com/content/attachments/singularity-is-not-near.pdf

 On Tue, Aug 16, 2011 at 1:53 AM, Denny Vrandecic
 vrande...@googlemail.com wrote:
  Hello,
 
  does anyone have a rough estimate of how many edits get reverted?
  Does anyone have a study handy?
 
  Cheers,
  Denny
 
 
 
  ___
  Wiki-research-l mailing list
  Wiki-research-l@lists.wikimedia.org
  https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
 



 --
 T. Bayer
 Movement Communications
 Wikimedia Foundation
 IRC (Freenode): HaeB

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Web 2.0 recent changes patrol tool demo (WPCVN)

2010-08-20 Thread Dmitry Chichkov
Yes, but as far as I understand, this API cannot provide recent revision
information in real time. :(
So it is not directly usable for the WPCVN RC patrol tool, as that
continuously requires recent-edits data.

It looks like they've published the code, though. I'll try to find some time
and integrate their ratings into WPCVN.
Their approach of fine-grained text origins sounds pretty solid. And their
algorithm also performed better than mine in the PAN'10 competition.
By the way, if anybody from the WikiTrust team is present here - congrats!
And I've just skimmed over your paper - excellent work.

-- Cheers, Dmitry



On Fri, Aug 20, 2010 at 12:02 AM, Daniel Kinzler dan...@brightbyte.de wrote:

 Hi Dmitry:

 Dmitry Chichkov wrote:
  Some time ago, as a Python/Django/jQuery/pywikipedia exercise, I hacked up
  a web-based recent changes patrol tool. An alpha version can be seen at:
  http://www.wpcvn.com

  It includes a few interesting features that may be useful to the
  community (and researchers designing similar tools):
  1. The tool uses editor ratings, primarily based on per-user counters
  (including reverted-revision counters) calculated from the wiki dump;

 Perhaps have a look at the WikiTrust API: 
 http://www.wikitrust.net/vandalism-api

  WPCVN aggregates the recent changes IRC feed, the IRC feed from
  MiszaBot, and WPCVN user actions.

 I'm currently prototyping an XMPP-based RC feed, which has much more
 detailed info and is more reliable than the IRC feed:
 http://meta.wikimedia.org/wiki/Recentchanges_via_XMPP#Prototype

  It also uses pre-calculated Wikipedia user karma (based on the recent
  en-wiki dump analysis) to separate edits made by users with a clearly
  good or bad reputation.

 Now *this* definitely sounds like WikiTrust, though I'm not sure if they
 expose this info via the API.

 -- daniel

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

2010-08-19 Thread Dmitry Chichkov
Yes. It is fairly easy to produce the list limited to a time period, or any
other custom stats (e.g. 'reverted edits ratios' for anonymous users, etc.).
It's just several hours of processing. But it is limited to the time frame
of the most recent database dump; for en-wiki that is 2010/01/30. Send your
complaints to xmldatadumps-l (xmldatadump...@lists.wikimedia.org)  ;) .


By the way, I've posted the (somewhat cleaned-up) Python script that I used
to calculate that list. It's available here:
 http://code.google.com/p/pymwdat/

For the en-wiki dump it requires:
* a 31 GB enwiki-20100130-pages-meta-history.xml.7z download;
* 250 GB of free disk space (for the intermediate data dump);
* ~a week to pre-process the dump (on a modern desktop);
* ~3 hours to do a simple run (e.g. calculate the list like I did).

Dump preprocessing is basically extracting/parsing the .xml.7z, calculating
MD5s for page revisions, calculating page diffs, and pickling the results
(along with other metadata) to disk. It uses a custom diff algorithm
optimized for Wikipedia (regular diff is way too slow and doesn't
handle copy editing well).
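
As an illustration of why hash-based comparison is so much cheaper than a
character-level diff (this is only a crude stand-in, not the custom algorithm
used in pymwdat):

    from collections import Counter

    def fast_line_delta(old_text, new_text):
        """Return (added, removed) line counts using line hashes.
        Much faster than character diffs, at the cost of counting a moved or
        copy-edited line as one removal plus one addition."""
        old = Counter(hash(line) for line in old_text.splitlines())
        new = Counter(hash(line) for line in new_text.splitlines())
        return sum((new - old).values()), sum((old - new).values())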

It needs memory if one wants to calculate/hold stats for every editor/page
(4 GB minimum, 8 GB recommended, 24 GB+ preferred).
But obviously one can filter out a data subset or even work on a single
page.

Required system/libraries:
* Python 2.6+, Linux (I've never tried it on Windows);
* PyWikipedia/Trunk (
http://svn.wikimedia.org/svnroot/pywikipedia/trunk/pywikipedia/ )
* OrderedDict (available in Python 2.7 or
http://pypi.python.org/pypi/ordereddict/)
* 7-Zip (command line 7za)

-- Dmitry




On Thu, Aug 19, 2010 at 8:46 AM, John Vandenberg jay...@gmail.com wrote:

 On Sat, Aug 14, 2010 at 6:12 AM, Dmitry Chichkov dchich...@gmail.com
 wrote:
  If anybody is interested, I've made a list of 'most reverted pages' in
 the
  english wikipedia based on the analysis of the enwiki-20100130 dump. Here
 is
  the list:
  http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
  http://wpcvn.com/enwiki-20100130.most.reverted.txt

 Lovely!

 This could be used to add semi-protection or pending-changes to reduce
 the amount of unnecessary work.

 Is it easy to limit this to reverts within a period, such as the last 12
 months?

 It would also be useful to filter out irregular edit-wars, or pages
 which were subject to frequent reverts, but have become stable.

 --
 John Vandenberg

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

2010-08-13 Thread Dmitry Chichkov
If anybody is interested, I've made a list of 'most reverted pages' in the
English Wikipedia based on an analysis of the enwiki-20100130 dump. Here is
the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt

This list was calculated using the following sampling criteria:
* All pages from the enwiki-20100130 dump;
** Filtered to pages with more than 1000 revisions;
** Filtered to pages with revert ratios > 0.3;
* Sorted by descending revert ratio.

A page revision is considered to be a revert if there is a previous revision
with a matching MD5 checksum.
BTW, if anybody needs it, the Python code that identifies reverts, revert
wars, self-reverts, etc. is available (LGPL).
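
For illustration, the MD5 criterion and the revert ratio behind the list can
be sketched as follows (this is not the LGPL code mentioned above):

    import hashlib

    def revert_ratio(revision_texts):
        """revision_texts: wikitexts of one page's revisions, oldest first.
        A revision counts as a revert if an earlier revision had the same MD5."""
        seen, reverts = set(), 0
        for text in revision_texts:
            digest = hashlib.md5(text.encode('utf-8')).hexdigest()
            if digest in seen:
                reverts += 1
            seen.add(digest)
        return reverts / float(len(revision_texts)) if revision_texts else 0.0

    # Pages would then be filtered to >1000 revisions and revert_ratio > 0.3,
    # and sorted by descending ratio, as described above.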

-- Regards, Dmitry
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l