Re: [Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

2010-08-13 Thread Dmitry Chichkov
Yes, working with large data sets is fun. There are always surprises. For
example, the top of the 'words most reverted by trusted users' list is not the
expected infantile type of thing either. I haven't done the analysis on the
full dump yet, but on a subset (the full histories of the articles in the
PAN 10 LAB test set) the following tokens came out on top (sorted by chi-square;
note that this is very preliminary and the tokenization/regularization might
have been wrong):

token, chi-sq, regular-diff-tok-cnt, revert-diff-tok-cnt
Image:Example.jpg|Ca 87959701.9043 113 7568
[[Media:Example.ogg] 5549492.56549 62 2196
title]][http://www.e 606182.771025 0 363
 305908.640902 0 365
[http://youtube.com/ 253267.5237 189 407
pooo 214921.014803 0 375
you 154597.596739 18655 102007
 129822.419517 1 238
value="transparent"> 129575.702482 1 168
 126503.143626 23 166
language|Macedonian] 123467.452157 121 164
 119613.359035 0 280
 118479.373501 5 686
 114581.582068 2 158
 110263.074451 0 155
i 109590.406785 2620 55971
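
A minimal sketch of how a per-token chi-square score of this kind could be computed, assuming a Pearson chi-square on a 2x2 contingency table (this token vs. all other tokens, revert diffs vs. regular diffs). The exact statistic behind the numbers above is not stated in this thread, and the corpus totals in the example are hypothetical, so this is not expected to reproduce those values.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so the statistic is undefined; report 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def token_chi_square(regular_count, revert_count, regular_total, revert_total):
    """Score one token from its counts in regular diffs and in revert diffs.

    regular_total and revert_total are the total token counts in each kind
    of diff (hypothetical inputs; the thread does not give them).
    """
    return chi_square_2x2(
        revert_count,                   # this token, revert diffs
        regular_count,                  # this token, regular diffs
        revert_total - revert_count,    # other tokens, revert diffs
        regular_total - regular_count,  # other tokens, regular diffs
    )


# Counts for the first token are from the table; both totals are made up.
score = token_chi_square(regular_count=113, revert_count=7568,
                         regular_total=10_000_000, revert_total=1_000_000)
```

A token that is equally frequent in both kinds of diffs scores 0; the more its frequency differs between them, the higher the score.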

-- Cheers, Dmitry




On Fri, Aug 13, 2010 at 4:06 PM, Luca de Alfaro  wrote:

> Thanks, this is great fun!  As an Italian, let me quote:
>
> (0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
> (0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
> (0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
> (0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian
> Renaissance')
> (0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
> (0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
> (0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')
>
> It is absolutely great to see that Italian Renaissance (with Incas) is one
> of the few cultural topics that makes it as high in the list as the usual
> excrement-sex-infantile type of things!!
>
> Luca
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

2010-08-13 Thread Luca de Alfaro
Thanks, this is great fun!  As an Italian, let me quote:

(0.42525520906166969, (7151, 3041, 59, 514, 63, 2519, 955), 'Penis')
(0.42516069788797062, (1089, 463, 29, 27, 16, 470, 84), 'Inner core')
(0.42490272373540855, (1285, 546, 11, 64, 27, 515, 122), 'Stuff')
(0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
(0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
(0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
(0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
(0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
(0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
(0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')

It is absolutely great to see that the Italian Renaissance (with Incas) is one
of the few cultural topics that make it as high in the list as the usual
excrement-sex-infantile type of things!!

Luca



[Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)

2010-08-13 Thread Dmitry Chichkov
If anybody is interested, I've made a list of the 'most reverted pages' in the
English Wikipedia, based on an analysis of the enwiki-20100130 dump. Here is
the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt

This list was calculated using the following sampling criteria:
* All pages from the enwiki-20100130 dump;
** Kept only pages with more than 1000 revisions;
** Kept only pages with a revert ratio > 0.3;
* Sorted by revert ratio, in descending order.
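
A minimal sketch of the selection criteria above, assuming per-page revision and revert counts have already been extracted from the dump. `PageStats` and `most_reverted` are hypothetical names, not part of the released code. The 'Gun' and 'Incas' counts are taken from a reply in this thread, read on the assumption that the first two numbers of each tuple are revisions and reverts (their quotient matches the listed ratio); the other two rows are invented.

```python
from typing import NamedTuple


class PageStats(NamedTuple):
    title: str
    revisions: int
    reverts: int

    @property
    def revert_ratio(self) -> float:
        # Share of revisions that were identified as reverts.
        return self.reverts / self.revisions if self.revisions else 0.0


def most_reverted(pages, min_revisions=1000, min_ratio=0.3):
    """Keep heavily edited, heavily reverted pages; highest ratio first."""
    kept = [
        page for page in pages
        if page.revisions > min_revisions and page.revert_ratio > min_ratio
    ]
    return sorted(kept, key=lambda page: page.revert_ratio, reverse=True)


pages = [
    PageStats("Incas", 1105, 469),
    PageStats("Gun", 2745, 1166),
    PageStats("Hypothetical short page", 900, 500),   # too few revisions
    PageStats("Hypothetical calm page", 5000, 100),   # ratio too low
]
ranking = most_reverted(pages)
```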

A page revision is considered to be a revert if there is a previous revision
with a matching MD5 checksum.
BTW, if anybody needs it, the Python code that identifies reverts, revert
wars, self-reverts, etc. is available (LGPL).
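
The checksum rule above can be sketched roughly as follows. This is an illustration only, not the LGPL code mentioned; `find_reverts` is a hypothetical name, and the input is assumed to be the texts of one page's revisions in chronological order.

```python
import hashlib


def find_reverts(revision_texts):
    """Return one flag per revision: True if an earlier revision of the
    same page had identical text (matching MD5 checksum)."""
    seen = set()
    flags = []
    for text in revision_texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        flags.append(digest in seen)
        seen.add(digest)
    return flags


# Revisions 3 and 5 restore earlier states of the page.
history = ["a", "b", "a", "c", "b"]
flags = find_reverts(history)
revert_ratio = sum(flags) / len(flags)
```

Note that under this rule a revision identical to its immediate predecessor also counts as a revert; whether such null edits should be excluded is a separate design choice.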

-- Regards, Dmitry


[Wiki-research-l] APIs for revision quality

2010-08-13 Thread Luca de Alfaro
Dear All,

We have put together APIs that make the following available for each revision:

   - A 'quality score' for the revision, which is 1 minus our estimated
   probability that the revision is vandalism.  That is, this notion of quality
   only captures "absence of vandalism".
   - A bunch of raw scores, which include the reputation of the author,
   information on how much the text has been revised by a mix of different,
   trusted authors, size of the change from the previous revision to this one,
   and more.

Details of the API calls are at http://www.wikitrust.net/vandalism-api .
The data should be available for all revisions, although occasionally we might
lack information for a recent revision.
Ideally, if a revision is missing, we would put it in a queue of revisions
to be processed. I am not sure whether this mechanism is already in place,
but if not, we will provide it soon.

I hope this is of interest!  We are looking to add more APIs, but this is a
start.

Luca


Re: [Wiki-research-l] New research project

2010-08-13 Thread Steven Walling
Thanks for the feedback and resources, all. The links to relevant research and
the suggestions for editing the draft taxonomy have both been extremely helpful.

To answer emijrp:

While we are interested in getting a better picture of who the most active
Wikimedians are using these metrics, we don't see the results being
something like, say, the current lists of Wikimedians by number of edits.

At this point we haven't seen a need to rank or weight the different
activities. We simply want to make sure that we can get a more
comprehensive map of the roles volunteers can take.

Steven Walling

On Thu, Aug 12, 2010 at 4:18 AM, emijrp  wrote:

> Hi;
>
> It is an interesting topic, but also, I think, a difficult one. It is
> said that edit count is not the real value of a user. That is correct, but
> how do you compare the value of different actions like editing a page,
> fixing a typo, adding a paragraph, converting a PNG diagram to SVG, taking a
> picture, developing a bot, etc.? There is no comparison table.
>
> I can help with the tech side of this task, if help is needed.
>
> Regards,
> emijrp
>
> 2010/8/12 Steven Walling 
>
>> Hi all,
>>
>> I'm Steven Walling, a longtime Wikimedia volunteer. Damian Finol (also a
>> longtime volunteer) and I are working on the beginnings of a new research
>> project in cooperation with the Foundation's Community Department.
>>
>> Everyone knows that editors are publicly listed based on edit count, and
>> some other details are visible related to the type of contributions an
>> individual user makes.
>>
>> The goal of this project is to try and highlight highly active volunteers
>> who may not participate in tasks that produce a high edit count. By creating
>> a detailed taxonomy of sorts for all the different roles users can take in a
>> project, we hope to get a better picture of who the most active contributors
>> are and what they are doing.
>>
>> If anyone has done similar roles-based investigations into volunteer
>> participation or has any suggestions at all, please feel free to contact us
>> at feedb...@wikimedia.org.ve or via the project's page on Meta at
>> http://meta.wikimedia.org/wiki/Contribution_Taxonomy_Project
>>
>> Thanks,
>>
>> Steven Walling
>> http://enwp.org/User:Steven_Walling


Re: [Wiki-research-l] WMF Staff Introductions.

2010-08-13 Thread Giovanni Luca Ciampaglia

Hello everybody,
I usually lurk on the list, so I hope you won't mind a bit of
self-promotion here, but you might be interested in this poster paper
I presented at ICWSM 2010:

http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1517

We were looking at the distribution of user activity periods in
Wikipedia and we had the same problem of defining when a user stops
contributing for good. We also used a hard threshold, which means that
the resulting sample is truncated, and you have to take this into
account when fitting the data to a distribution. We found that a mixture
of two log-normals describes the data very well. This means that there are
two characteristic time scales that describe user participation: short-
and long-time users. Short-time users, for example, stay on average 30
minutes before they stop contributing.
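
For readers unfamiliar with the model, here is a minimal sketch of the density of a two-component log-normal mixture. The function names and parameter names are hypothetical, no fitted values from the paper are used, and the correction for the truncated sample mentioned above is not shown.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Density of a log-normal distribution with log-mean mu and log-std sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weight, mu_short, sigma_short, mu_long, sigma_long):
    """Mixture density: `weight` is the share of short-time users,
    1 - weight the share of long-time users."""
    return (weight * lognormal_pdf(x, mu_short, sigma_short)
            + (1.0 - weight) * lognormal_pdf(x, mu_long, sigma_long))
```

Each component has its own location parameter, which is what yields two characteristic time scales.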


Another paper (by Yang et al.) at ICWSM this year performed a survival 
analysis similar to what Felipe's talking about, but on data from Q&A 
communities:

http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1466

Best,

G

On 10.08.10 23:49, Luca de Alfaro wrote:
It's difficult to tell when a person leaves, because ... you never 
know if a contribution they made is the last one.
A measure would be "how many users have done an edit in the last 
month", and this is actually an incredibly simple DB query to run (how 
fast it runs, is another question).  This can tell you how the number 
of users is evolving.  Hey, I could run this on my wikitrust database, 
if the Foundation does not wish to do this :-)
Another measure, which is slightly harder but not much to compute, is 
the average time a user who does an edit has been registered.
Together these two figures (across time) could give you a pretty good 
picture of what is going on.
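
A minimal sketch of the two measures described above, computed in memory rather than as a DB query. The function names, the 30-day window, and the sample data are all assumptions; "average time registered" is read here as the mean account age of the users active in the window, which is only one possible interpretation.

```python
from datetime import datetime, timedelta


def active_users(edits, as_of, window=timedelta(days=30)):
    """Users with at least one edit in the interval (as_of - window, as_of]."""
    start = as_of - window
    return {user for user, timestamp in edits if start < timestamp <= as_of}


def mean_account_age_days(edits, registered, as_of, window=timedelta(days=30)):
    """Mean time since registration, in days, over the active users."""
    users = active_users(edits, as_of, window)
    if not users:
        return None
    ages = [(as_of - registered[user]).total_seconds() / 86400 for user in users]
    return sum(ages) / len(ages)


# Hypothetical (user, edit timestamp) pairs and registration dates.
edits = [
    ("alice", datetime(2010, 8, 1)),
    ("alice", datetime(2010, 8, 5)),
    ("bob", datetime(2010, 6, 1)),
    ("carol", datetime(2010, 7, 20)),
]
registered = {
    "alice": datetime(2010, 7, 11),
    "bob": datetime(2009, 1, 1),
    "carol": datetime(2010, 5, 12),
}
as_of = datetime(2010, 8, 10)
```

Evaluating both functions for a series of `as_of` dates gives the two curves over time.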


Luca


On Tue, Aug 10, 2010 at 2:38 PM, Felipe Ortega <glimmer_phoe...@yahoo.es> wrote:


Hello guys.

First of all, kudos for this initiative! It's great that all
researchers on this list can get to know the names and interests
of the WMF staff working on the same topics.

Additional context for Piotr: believe me, it's really challenging
to define a set of clear and *exact* conditions for deciding that
a Wikipedian has ceased to contribute.

For our analysis published by WSJ last November, we followed
similar requirements to those in the Former Contributors Survey.
In particular, we established 3 months of inactivity as a
"reasonable" period to consider that an editor took a long break.
The main difference is that in the Survey they focus on editors
who reached at least a reasonable number of lifetime revisions
(20-99), while we included everyone.

I already broke down the net gain curve for different cohorts,
according to number of edits, and there is no significant change
in the trends (I believe that the meaningful info is the slope,
not the numbers).

For what it's worth, I think the best constructive criticism we
received about this approach came from Jimmy Wales. Jimmy
explained a useful twist to the methodology, one that they seem to be
applying for internal metrics at Wikia.

Instead of trying to measure how many people "left", which will
always have methodological drawbacks, we can ask the following
question: what percentage of editors survived up to a certain age?

For instance: what's the percentage of editors who made at least
20 lifetime edits who are still active one month later? Three
months later? And then: is that percentage improving, constant, or
getting worse over time?

Indeed, by limiting the scope to recorded revisions (the only event
we can measure with certainty) we avoid many of these methodological
problems.
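
The survival question above can be sketched roughly as follows. This assumes that an editor's "age" is measured from their first recorded revision and that "surviving to age t" means having at least one revision at or after that age; both are my reading, not a definition from the thread, and `survival_rate` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def survival_rate(edit_times_by_user, age, min_edits=20):
    """Share of editors with at least min_edits lifetime edits whose last
    recorded edit is at least `age` after their first one."""
    cohort = [
        sorted(times) for times in edit_times_by_user.values()
        if len(times) >= min_edits
    ]
    if not cohort:
        return None
    survivors = sum(1 for times in cohort if times[-1] - times[0] >= age)
    return survivors / len(cohort)


# Hypothetical edit histories.
start = datetime(2010, 1, 1)
history = {
    "daily_editor": [start + timedelta(days=i) for i in range(20)],       # active 19 days
    "steady_editor": [start + timedelta(days=5 * i) for i in range(20)],  # active 95 days
    "newcomer": [start + timedelta(days=i) for i in range(5)],            # under 20 edits
}
one_month = survival_rate(history, timedelta(days=30))
three_months = survival_rate(history, timedelta(days=90))
```

Computing this per registration cohort shows whether the percentage is improving or getting worse over time. Editors who started less than `age` before the end of the data have not yet had the chance to survive and should be left out of such a comparison.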

I'm still spending time with flagged-revisions, but in case Howie
or anybody else is interested, it shouldn't be difficult to have a
look at this.

BTW, Howie, thanks for uploading the survey slides. You guys did
terrific work.

Cheers,
Felipe.

--- On Tue, 10/8/10, Piotr Konieczny <pio...@post.pl> wrote:

> From: Piotr Konieczny <pio...@post.pl>
> Subject: Re: [Wiki-research-l] WMF Staff Introductions.
> To: "Research into Wikimedia content and communities" <wiki-research-l@lists.wikimedia.org>
> Date: Tuesday, August 10, 2010 20:21
> Welcome!
>
> I have to say that
> http://strategy.wikimedia.org/wiki/Former_Contributors_Survey_Results
> of which I've just learned from your post is an excellent piece of
> research, one that was needed for a very long time.
>
> One question comes to mind: we know, roughly, how many editors we are
> gaining per month. Are there any estimates of how many we are losing
> (per month, year, total)? I cannot find such numbers in that survey.
>
> --
> Piotr Konieczny
>
> Parul Vora wrote:

[Wiki-research-l] Wikipedia Research Conference: CPOV in Leipzig

2010-08-13 Thread Niesyto, Johanna


[[Wikipedia:Ein kritischer Standpunkt]]
September 25-26, 2010
University Library Leipzig, Germany

On the 25th and 26th of September 2010 the German-language conference 
[[Wikipedia:Ein kritischer Standpunkt]] ([[Wikipedia:Critical Point of View]]) 
will take place at the University Library in Leipzig, Germany. The conference 
will gather Wikipedia researchers, critics, and community members from 
the German-speaking world for an interdisciplinary debate. In particular, the 
significance of Wikipedia for education, politics, culture and society will be 
discussed.

Wikipedia is one of the largest, if not the largest, self-contained general 
knowledge reference of our time. It offers critical insights into the 
contemporary status of knowledge, its organizing principles, function, impact, 
production styles, mechanisms for conflict resolution, and relation to power 
(re-)constitution. New strategic and tactical operations of knowledge and power 
are clearly at work through Wikipedia. Of specific interest is the concept of 
'the open', which is ambiguous within the social formation(s) constituted by 
Wikipedia, serving as both a rallying concept of digital democracy enthusiasts 
and as an ideological nodal point masking new agonistic encounters.

In both material and perceptional ways, every new technology modifies the 
conditions of possibility for knowledge. The logic of technologies bleeds into 
the very structures and organizing principles of knowledge, and today both 
medium and message may reflect the ideas of the (organized) network, multitude, 
or the Deleuzian machine. It is through a selected mix of technological and 
normative conditions – the distributed architecture of the net, the Wiki 
software platform, commons-based property licenses and the FLOSS zeitgeist – 
that Wikipedia as the encyclopedia of the information age emerges, both 
continuing and transforming the Enlightenment encyclopedic impulse or will to 
know.

The main topics of the conference are Wikipedia & The Politics of Open 
Knowledge, Digital Governance, and Wikipedia & Education. These topics derive 
from the significance of the online encyclopedia in the reconfiguration of 
knowledge (re-)production and its consequences for the public, architectures of 
participation, and political education in a media democracy. Alongside 
presentations by established scholars such as Christian Stegbauer, Peter Haber, 
Rainer Hammwöhner, Ramón Reichert, and Ulrich Johannes Schneider, the programme 
of the conference will consist of a panel discussion with Wikipedia 
community members and critics, as well as Wikipedia workshops and a research 
network meeting.

The research network meeting invites Wikipedia researchers to discuss their 
current research and draft new research projects. Aimed especially at young 
academics, the research network meeting is planned as an open space, allowing its 
participants to actively engage in the event as questions and topics are shaped 
and discussed among the group. To participate, please register by 
email no later than August 31, 2010 at i...@cpov.de. Please include a 
one-page description of your research interest or abstract of your research, 
and tell us if you are interested in giving a short presentation.

The Leipzig conference continues the series of international conferences of the 
Wikipedia Research Initiative Critical Point of View from January and March 
2010 in Bangalore (India) and Amsterdam (Netherlands). It is hosted by cultiv – 
Gesellschaft für internationale Kulturprojekte e.V. in cooperation with the 
Research Initiative Critical Point of View and funded by the Bundeszentrale für 
politische Bildung.

The conference will be open to the public. There will be no participation fee. 
Conference language is German.

For further information please visit the conference website: www.cpov.de

Registration deadline for the network meeting: August 31, 2010

Concept and Editorial board: Geert Lovink, Johanna Niesyto and Andreas 
Möllenkamp

Contact
cultiv
Gesellschaft für internationale Kulturprojekte e.V.
Bernhard-Göring-Str. 65
D-04107 Leipzig
Tel. +49-341-2228893
Email: i...@cpov.de
www.cpov.de

