Re: [Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)
Yes, working with large data sets is fun. There are always surprises. For example, the top of the 'words most reverted by trusted users' list is not the expected infantile type of thing either. I haven't done the analysis on the full dump yet, but on the subset built from the full histories of the articles in the PAN 10 LAB test set the following words came out on top (sorted by chi-square; note that this is very preliminary and the tokenization/regularization might have been wrong):

token                  chi-sq          regular-diff-tok-cnt  revert-diff-tok-cnt
Image:Example.jpg|Ca   87959701.9043   113                   7568
[[Media:Example.ogg]   5549492.56549   62                    2196
title]][http://www.e   606182.771025   0                     363
                       305908.640902   0                     365
[http://youtube.com/   253267.5237     189                   407
pooo                   214921.014803   0                     375
you                    154597.596739   18655                 102007
                       129822.419517   1                     238
value="transparent">   129575.702482   1                     168
                       126503.143626   23                    166
language|Macedonian]   123467.452157   121                   164
                       119613.359035   0                     280
                       118479.373501   5                     686
                       114581.582068   2                     158
                       110263.074451   0                     155
i                      109590.406785   2620                  55971

-- Cheers, Dmitry

On Fri, Aug 13, 2010 at 4:06 PM, Luca de Alfaro wrote:
> Thanks, this is great fun! As an Italian, let me quote:
>
> (0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
> (0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
> (0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
> (0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
> (0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
> (0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
> (0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')
>
> It is absolutely great to see that Italian Renaissance (with Incas) is one
> of the few cultural topics that makes it as high in the list as the usual
> excrement-sex-infantile type of things!!
>
> Luca

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
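The chi-square ranking mentioned in Dmitry's reply can be illustrated with a small sketch. It assumes a plain Pearson chi-square on a 2x2 table (this token versus all other tokens, regular diffs versus reverted diffs), which may not be the exact statistic Dmitry used. The function name and the corpus totals are hypothetical; only the three token counts are taken from the table in the message.

```python
def chi_square_2x2(token_regular, token_revert, total_regular, total_revert):
    """Pearson chi-square for one token in a 2x2 table (no continuity correction)."""
    a = token_regular                   # token occurrences in regular diffs
    b = token_revert                    # token occurrences in reverted diffs
    c = total_regular - token_regular   # all other tokens in regular diffs
    d = total_revert - token_revert     # all other tokens in reverted diffs
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0                      # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical corpus totals; the real totals are not given in the thread.
TOTAL_REGULAR = 1_000_000
TOTAL_REVERT = 1_000_000

# (regular-diff-tok-cnt, revert-diff-tok-cnt) for three tokens from the table.
counts = {"pooo": (0, 375), "you": (18655, 102007), "i": (2620, 55971)}

# Rank tokens from most to least strongly associated with one diff class.
ranked = sorted(
    counts,
    key=lambda tok: chi_square_2x2(*counts[tok], TOTAL_REGULAR, TOTAL_REVERT),
    reverse=True,
)
```

Because the totals here are invented, the scores will not match the values in the table; the sketch only shows how the two count columns can be turned into a ranking.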
Re: [Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)
Thanks, this is great fun! As an Italian, let me quote:

(0.42525520906166969, (7151, 3041, 59, 514, 63, 2519, 955), 'Penis')
(0.42516069788797062, (1089, 463, 29, 27, 16, 470, 84), 'Inner core')
(0.42490272373540855, (1285, 546, 11, 64, 27, 515, 122), 'Stuff')
(0.42477231329690346, (2745, 1166, 28, 110, 46, 1054, 341), 'Gun')
(0.42474916387959866, (2990, 1270, 37, 149, 23, 1190, 321), 'Monkey')
(0.42443438914027148, (1105, 469, 20, 21, 2, 427, 166), 'Incas')
(0.42433090024330899, (2055, 872, 39, 45, 15, 825, 259), 'Italian Renaissance')
(0.42375950742484608, (2761, 1170, 34, 94, 24, 978, 461), 'Watermelon')
(0.42362613587191694, (2311, 979, 22, 121, 19, 937, 233), 'Puppy')
(0.4235686492495831, (1799, 762, 20, 83, 34, 669, 231), 'Crap')

It is absolutely great to see that Italian Renaissance (with Incas) is one of the few cultural topics that makes it as high in the list as the usual excrement-sex-infantile type of things!!

Luca

On Fri, Aug 13, 2010 at 1:12 PM, Dmitry Chichkov wrote:
> If anybody is interested, I've made a list of 'most reverted pages' in the
> English Wikipedia based on the analysis of the enwiki-20100130 dump. Here is
> the list:
> http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
> http://wpcvn.com/enwiki-20100130.most.reverted.txt
>
> This list was calculated using the following sampling criteria:
> * All pages from the enwiki-20100130 dump;
> ** Kept only pages with more than 1000 revisions;
> ** Kept only pages with revert ratios > 0.3;
> * Sorted by revert ratio, in descending order.
>
> A page revision is considered to be a revert if there is a previous revision
> with a matching MD5 checksum.
> BTW, if anybody needs it, the Python code that identifies reverts, revert
> wars, self-reverts, etc. is available (LGPL).
>
> -- Regards, Dmitry
[Wiki-research-l] Most reverted pages in the en-wikipedia (enwiki-20100130 dump)
If anybody is interested, I've made a list of 'most reverted pages' in the English Wikipedia based on the analysis of the enwiki-20100130 dump. Here is the list:
http://wpcvn.com/enwiki-20100130.most.reverted.tar.bz
http://wpcvn.com/enwiki-20100130.most.reverted.txt

This list was calculated using the following sampling criteria:
* All pages from the enwiki-20100130 dump;
** Kept only pages with more than 1000 revisions;
** Kept only pages with revert ratios > 0.3;
* Sorted by revert ratio, in descending order.

A page revision is considered to be a revert if there is a previous revision with a matching MD5 checksum.
BTW, if anybody needs it, the Python code that identifies reverts, revert wars, self-reverts, etc. is available (LGPL).

-- Regards, Dmitry
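The revert rule and the sampling criteria above can be sketched in a few lines of Python. This is not Dmitry's LGPL code, only a minimal illustration under stated assumptions: each page is given as a chronological list of revision texts held in memory, and the names count_reverts and most_reverted are hypothetical.

```python
import hashlib


def count_reverts(revision_texts):
    """Count revisions whose MD5 matches any earlier revision of the same page."""
    seen = set()
    reverts = 0
    for text in revision_texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            reverts += 1        # identical content existed before: a revert
        seen.add(digest)
    return reverts


def most_reverted(pages, min_revisions=1000, min_ratio=0.3):
    """Return (ratio, title) pairs passing both filters, highest ratio first.

    `pages` maps a page title to its chronological list of revision texts.
    """
    rows = []
    for title, revision_texts in pages.items():
        total = len(revision_texts)
        if total <= min_revisions:
            continue            # keep only pages with more than min_revisions
        ratio = count_reverts(revision_texts) / total
        if ratio > min_ratio:
            rows.append((ratio, title))
    return sorted(rows, reverse=True)
```

Note that under this rule a revision identical to its immediate predecessor also counts as a revert, and that a real run over a full dump would stream revisions rather than hold every page in memory.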
[Wiki-research-l] APIs for revision quality
Dear All,

we have put together APIs that make available, for each revision:

- A 'quality score' for the revision, which is 1 minus our estimated probability that the revision is vandalism. That is, this notion of quality only captures "absence of vandalism".
- A set of raw scores, which include the reputation of the author, information on how much the text has been revised by a mix of different, trusted authors, the size of the change from the previous revision to this one, and more.

Details of the API calls are at http://www.wikitrust.net/vandalism-api .

The data should be available for all revisions. Occasionally, we might lack information for a recent revision. Ideally, if a revision is missing, we would put it in a queue of revisions to be processed. I am not sure whether this mechanism is already in place; if not, we will provide it soon.

I hope this is of interest! We are looking to add more APIs, but this is a start.

Luca
Re: [Wiki-research-l] New research project
Thanks for the feedback and resources, all. The links to relevant research and the suggestions for editing the draft taxonomy have both been extremely helpful.

To answer emijrp: While we are interested in getting a better picture of who the most active Wikimedians are using these metrics, we don't see the results being something like, say, the current lists of Wikimedians by number of edits. At this point we haven't seen a need to rank or weight different activities. We simply want to make sure that we can get a more comprehensive map of the roles volunteers can take.

Steven Walling

On Thu, Aug 12, 2010 at 4:18 AM, emijrp wrote:
> Hi;
>
> It is an interesting topic, but also, I think, a difficult one. It is
> said that edit count is not the real value of a user. That is correct, but
> how do you compare the value of different actions like editing a page,
> fixing a typo, adding a paragraph, converting a PNG diagram to SVG, taking a
> picture, developing a bot, etc.? There is no comparison table.
>
> I can help with the tech side of this task, if help is needed.
>
> Regards,
> emijrp
>
> 2010/8/12 Steven Walling
>
>> Hi all,
>>
>> I'm Steven Walling, a longtime Wikimedia volunteer. Damian Finol (also a
>> longtime volunteer) and I are working on the beginnings of a new research
>> project in cooperation with the Foundation's Community Department.
>>
>> Everyone knows that editors are publicly listed based on edit count, and
>> some other details are visible related to the type of contributions an
>> individual user makes.
>>
>> The goal of this project is to try to highlight highly active volunteers
>> who may not participate in tasks that produce a high edit count. By creating
>> a detailed taxonomy of sorts for all the different roles users can take in a
>> project, we hope to get a better picture of who the most active contributors
>> are and what they are doing.
>>
>> If anyone has done similar roles-based investigations into volunteer
>> participation or has any suggestions at all, please feel free to contact us
>> at feedb...@wikimedia.org.ve or via the project's page on Meta at
>> http://meta.wikimedia.org/wiki/Contribution_Taxonomy_Project
>>
>> Thanks,
>>
>> Steven Walling
>> http://enwp.org/User:Steven_Walling
Re: [Wiki-research-l] WMF Staff Introductions.
Hello everybody,

I usually lurk on the list, so I hope you won't mind a bit of self-promotion here, but you might be interested in this poster paper I presented at ICWSM 2010:
http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1517

We were looking at the distribution of user activity periods in Wikipedia and we had the same problem of defining when a user stops contributing for good. We also used a hard threshold, which means that the resulting sample is truncated, and you have to take this into account when fitting the data to a distribution. We found that a mixture of two log-normals describes the data very well. This means that there are two characteristic time scales that describe user participation: short- and long-time users. Short-time users, for example, stay on average 30 minutes before they stop contributing.

Another paper (by Yang et al.) at ICWSM this year performed a survival analysis similar to what Felipe is talking about, but on data from Q&A communities:
http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/view/1466

Best,
G

On 10.08.10 23:49, Luca de Alfaro wrote:

It's difficult to tell when a person leaves, because ... you never know if a contribution they made is the last one. A measure would be "how many users have done an edit in the last month", and this is actually an incredibly simple DB query to run (how fast it runs is another question). This can tell you how the number of users is evolving. Hey, I could run this on my wikitrust database, if the Foundation does not wish to do this :-)

Another measure, which is slightly harder to compute but not by much, is the average time a user who does an edit has been registered. Together these two figures (across time) could give you a pretty good picture of what is going on.

Luca

On Tue, Aug 10, 2010 at 2:38 PM, Felipe Ortega <glimmer_phoe...@yahoo.es> wrote:

Hello guys. First of all, kudos for this initiative!
It's great that all researchers on this list can get to know the names and interests of the WMF staff working on the same topics.

Additional context for Piotr. Believe me, it's really challenging to define a set of clear and *exact* conditions for deciding that a Wikipedian has ceased to contribute. For our analysis published by WSJ last November, we followed requirements similar to those in the Former Contributors Survey. In particular, we established 3 months of inactivity as a "reasonable" period to consider that an editor took a long break. The main difference is that in the Survey they focus on editors who reached at least a reasonable number of lifetime revisions (20-99), while we included everyone. I already broke down the net gain curve for different cohorts, according to number of edits, and there is no significant change in the trends (I believe that the meaningful info is the slope, not the numbers).

For what it's worth, I think the best constructive criticism we received about this approach came from Jimmy Wales. Jimmy explained a useful twist to the methodology, which they seem to be applying for internal metrics at Wikia. Instead of trying to measure how many people "left", which will always have methodological drawbacks, we can ask the following question: what percentage of editors survived up to a certain age? For instance: what's the percentage of editors who made at least 20 lifetime edits who are still active one month later? Three months later? And then: is that percentage improving, constant, or getting worse over time?

Indeed, by limiting the scope to recorded revisions (the only event we can measure with certainty) we avoid many of these methodological problems. I'm still spending time on flagged revisions, but in case Howie or anybody else is interested, it shouldn't be difficult to have a look at this.

BTW, Howie, thanks for uploading the survey slides. Terrific work you did, guys.

Cheers,
Felipe.
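The survival question described in Felipe's message can be sketched as follows. This is a minimal illustration, not the method used at Wikia or in the WSJ analysis. It assumes the edit log is available as (user, timestamp) pairs, that the clock for each editor starts at the edit that brings them to the threshold, and that "still active" means at least one edit at or after the horizon; the function name and parameters are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def survival_rate(edits, min_edits=20, horizon_days=30):
    """Share of editors with >= min_edits edits who edit again after the horizon.

    `edits` is an iterable of (user, timestamp) pairs in any order.
    """
    by_user = defaultdict(list)
    for user, ts in edits:
        by_user[user].append(ts)

    qualified = survived = 0
    for timestamps in by_user.values():
        if len(timestamps) < min_edits:
            continue                      # never reached the edit threshold
        timestamps.sort()
        qualified += 1
        # The horizon is measured from the edit that reached the threshold.
        cutoff = timestamps[min_edits - 1] + timedelta(days=horizon_days)
        if timestamps[-1] >= cutoff:
            survived += 1                 # at least one edit at or after the cutoff
    return survived / qualified if qualified else 0.0
```

Running this per registration cohort (for example, by month of first edit) and with horizons of 30 and 90 days would show whether the percentage is improving, constant, or getting worse. Editors who reached the threshold less than one horizon before the end of the data should be excluded first, since they have not yet had the chance to survive.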
--- On Tue, 10/8/10, Piotr Konieczny <pio...@post.pl> wrote:

> From: Piotr Konieczny <pio...@post.pl>
> Subject: Re: [Wiki-research-l] WMF Staff Introductions.
> To: "Research into Wikimedia content and communities" <wiki-research-l@lists.wikimedia.org>
> Date: Tuesday, August 10, 2010 20:21
>
> Welcome!
>
> I have to say that
> http://strategy.wikimedia.org/wiki/Former_Contributors_Survey_Results
> which I've just learned of from your post, is an excellent piece of
> research, one that was needed for a very long time.
>
> One question comes to mind: we know, roughly, how many editors we are
> gaining per month. Are there any estimates of how many we are losing
> (per month, year, total)? I cannot find such numbers in that survey.
>
> --
> Piotr Konieczny
>
> Parul Vora wrote:
[Wiki-research-l] Wikipedia Research Conference: CPOV in Leipzig
[[Wikipedia:Ein kritischer Standpunkt]]
September 25-26, 2010
University Library Leipzig, Germany

On the 25th and 26th of September 2010 the German-speaking conference [[Wikipedia:Ein kritischer Standpunkt]] ([[Wikipedia:Critical Point of View]]) will take place at the University Library in Leipzig, Germany. The conference will gather Wikipedia researchers, critics, and community members from the German-speaking world for an interdisciplinary debate. In particular, the significance of Wikipedia for education, politics, culture and society will be discussed.

Wikipedia is one of the largest, if not the largest, self-contained general knowledge reference of our time. It offers critical insights into the contemporary status of knowledge, its organizing principles, function, impact, production styles, mechanisms for conflict resolution, and relation to power (re-)constitution. New strategic and tactical operations of knowledge and power are clearly at work through Wikipedia. Of specific interest is the concept of 'the open', which is ambiguous within the social formation(s) constituted by Wikipedia, serving both as a rallying concept of digital democracy enthusiasts and as an ideological nodal point masking new agonistic encounters.

In both material and perceptional ways, every new technology modifies the conditions of possibility for knowledge. The logic of technologies bleeds into the very structures and organizing principles of knowledge, and today both medium and message may reflect the ideas of the (organized) network, multitude, or the Deleuzian machine. It is through a selected mix of technological and normative conditions – the distributed architecture of the net, the Wiki software platform, commons-based property licenses and the FLOSS zeitgeist – that Wikipedia as the encyclopedia of the information age emerges, both continuing and transforming the Enlightenment encyclopedic impulse or will to know.
The main topics of the conference are Wikipedia & The Politics of Open Knowledge, Digital Governance, and Wikipedia & Education. These topics derive from the significance of the online encyclopedia in the reconfiguration of knowledge (re-)production and its consequences for the public, architectures of participation, and political education in a media democracy.

Alongside presentations by established scholars such as Christian Stegbauer, Peter Haber, Rainer Hammwöhner, Ramón Reichert, and Ulrich Johannes Schneider, the programme of the conference will consist of a panel discussion with Wikipedia community members and critics, as well as Wikipedia workshops and a research network meeting.

The research network meeting invites Wikipedia researchers to discuss their current research and draft new research projects. Aimed especially at young academics, the research network meeting is planned as an open space, allowing its participants to actively engage in the event as questions and topics are shaped and discussed among the group. To participate, please register by email to i...@cpov.de no later than August 31, 2010. Please include a one-page description of your research interest or an abstract of your research, and tell us if you would like to give a short presentation.

The Leipzig conference continues the series of international conferences of the Wikipedia Research Initiative Critical Point of View from January and March 2010 in Bangalore (India) and Amsterdam (Netherlands). It is hosted by cultiv – Gesellschaft für internationale Kulturprojekte e.V. in cooperation with the Research Initiative Critical Point of View and funded by the Bundeszentrale für politische Bildung.

The conference will be open to the public. There will be no participation fee. The conference language is German.
For further information please visit the conference website: www.cpov.de

Deadline for registration for the network meeting: August 31, 2010

Concept and editorial board: Geert Lovink, Johanna Niesyto and Andreas Möllenkamp

Contact
cultiv Gesellschaft für internationale Kulturprojekte e.V.
Bernhard-Göring-Str. 65
D-04107 Leipzig
Tel. +49-341-2228893
Email: i...@cpov.de
www.cpov.de