Re: [Wiki-research-l] [Analytics] Bots vs. Wikipedians – Who edits more?

2013-10-14 Thread Diederik van Liere
Very cool. If you include Wikidata, then more than 50% of the edits on the
Wikimedia projects are made by bots. One of the dead horses I like to beat
is that bot editors should be treated as first-class citizens of Wikipedia,
and this data nicely illustrates that. I think this is a bigger watershed
moment (we might have reached this threshold a while back) than mobile vs.
non-mobile, and we should have a much more rigorous discussion about the
future of bots on Wikipedia, particularly as all our big features are aimed
at human editors :)
D


On Mon, Oct 14, 2013 at 12:30 PM, Dario Taraborelli <
dtarabore...@wikimedia.org> wrote:

> A new app by Thomas Steiner (@tomayac) counting bot vs human edits in real
> time from the RecentChanges feed:
>
> http://wikipedia-edits.herokuapp.com/
>
> (read more [1]). The application comes with a public API exposing
> Wikipedia and Wikidata edits as Server-Sent Events [2].
>
> Dario
>
> [1]
> http://blog.tomayac.com/index.php?date=2013-10-14&time=16:49:46&perma=Bots+vs.+Wikipedians.html
> [2] https://en.wikipedia.org/wiki/Server-sent_events
>
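Below is a minimal sketch of consuming such a Server-Sent Events stream with Python's requests library. The endpoint path and the payload field names are assumptions for illustration, not the app's documented API; check the blog post and the app itself for the actual event format.

```python
# Minimal SSE consumer sketch (hypothetical endpoint and payload fields).
import json
import requests

STREAM_URL = "http://wikipedia-edits.herokuapp.com/sse"  # assumed path

def stream_events(url):
    """Yield the data payload of each Server-Sent Event on the stream."""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            # SSE data lines start with "data:"; blank lines separate events.
            if raw and raw.startswith("data:"):
                yield raw[len("data:"):].strip()

if __name__ == "__main__":
    bot_edits = human_edits = 0
    for payload in stream_events(STREAM_URL):
        try:
            edit = json.loads(payload)
        except ValueError:
            continue  # skip keep-alives or non-JSON events
        # "bot" is an assumed field name marking bot-flagged edits.
        if edit.get("bot"):
            bot_edits += 1
        else:
            human_edits += 1
        print("bots: %d  humans: %d" % (bot_edits, human_edits), end="\r")
```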
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] diffdb formatted Wikipedia dump

2013-10-11 Thread Diederik van Liere
> *From: *Susan Biancani 
> *Subject: **[Wiki-research-l] diffdb formatted Wikipedia dump*
> *Date: *October 3, 2013 10:06:44 PM PDT
> *To: *wiki-research-l@lists.wikimedia.org
> *Reply-To: *Research into Wikimedia content and communities <
> wiki-research-l@lists.wikimedia.org>
>
> I'm looking for a dump from English Wikipedia in diff format (i.e., each
> entry is the text that was added/deleted since the previous edit, rather than
> the current state of the page).
>
> The Summer of Research folks provided a handy guide to how to create such
> a dataset from the standard complete dumps here:
> http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
> But the time estimate they give is prohibitive for me (20-24 hours for
> each dump file--there are currently 158--running on 24 cores). I'm a grad
> student in a social science department, and don't have access to extensive
> computing power. I've been paying out of pocket for AWS, but this would get
> expensive.
>
> There is a diff-format dataset available, but only through April 2011
> (here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a
> diff-format dataset for January 2010 - March 2013 (or for everything up
> to March 2013).
>
> Does anyone know if such a dataset exists somewhere? Any leads or
> suggestions would be much appreciated!
>
Hi Susan,

The bad news is that there is no newer version of the dataset than the one
you found. The good news is that the dataset was generated on really slow
commodity hardware -- what you could do is run the pipeline on AWS against a
smaller dataset, for example the Dutch Wikipedia, and see how long it takes.
An alternative would be to start thinking (with other researchers and
Wikimedia community members) about setting up a small Hadoop cluster in Labs
with only public data. That way you don't need to pay, but obviously it will
be less performant. The Analytics team has Puppet manifests ready that will
build an entire Hadoop cluster.
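To make the "start small" suggestion concrete, here is a rough sketch that streams a (much smaller) pages-meta-history dump with Python's standard library and counts lines added and removed between consecutive revisions. It is not the WSoR pipeline; the file name is an example, and line-based difflib is a simplification of real diff generation.

```python
# Sketch: line-level diffs between consecutive revisions of each page in a
# (small) pages-meta-history dump. A starting point, not the WSoR pipeline.
import bz2
import difflib
import xml.etree.ElementTree as ET

DUMP = "nlwiki-latest-pages-meta-history.xml.bz2"  # example file name

def local(tag):
    """Strip the MediaWiki export namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]

def revision_diffs(path):
    page_title, prev_text = None, ""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for _, elem in ET.iterparse(fh):
            if local(elem.tag) == "title":
                page_title, prev_text = elem.text, ""
            elif local(elem.tag) == "revision":
                text = next((c.text or "" for c in elem if local(c.tag) == "text"), "")
                added = removed = 0
                for line in difflib.ndiff(prev_text.splitlines(), text.splitlines()):
                    if line.startswith("+ "):
                        added += 1
                    elif line.startswith("- "):
                        removed += 1
                yield page_title, added, removed
                prev_text = text
                elem.clear()  # keep memory bounded

if __name__ == "__main__":
    for title, added, removed in revision_diffs(DUMP):
        print("%s\t+%d\t-%d" % (title, added, removed))
```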

The wikimedia-analytics mailing list is a good place for such a conversation,
and if you need more hands-on help with the diffdb, please come to IRC:
#wikimedia-analytics.

Best,
Diederik
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Announcing availability new dataset diffdb

2011-11-04 Thread Diederik van Liere
Hi Rami,

If I recall correctly, we used the diff library from Google
(http://code.google.com/p/google-diff-match-patch/),
and the total size is about 420 GB after decompression.

But you can also just download a couple of chunks and see if you can handle
those.
Best,
Diederik
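For what it's worth, here is a minimal sketch of computing added/removed content for a pair of revisions with the Python port of that library (the diff-match-patch package); the revision strings below are placeholders.

```python
# Sketch: added/removed content between two revisions via diff-match-patch.
from diff_match_patch import diff_match_patch

old_rev = "The quick brown fox jumps over the lazy dog."        # placeholder text
new_rev = "The quick brown fox leaps over the very lazy dog."   # placeholder text

dmp = diff_match_patch()
diffs = dmp.diff_main(old_rev, new_rev)
dmp.diff_cleanupSemantic(diffs)  # merge trivial edits into readable chunks

added   = "".join(text for op, text in diffs if op == dmp.DIFF_INSERT)
removed = "".join(text for op, text in diffs if op == dmp.DIFF_DELETE)
print("added:  ", added)
print("removed:", removed)
```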


On Fri, Nov 4, 2011 at 5:56 PM, Rami Al-Rfou'  wrote:

> Hi Diederik,
>
> I have two questions:
>
>    1. Which algorithm did you use to get the added/removed content between
>    two revisions of a Wikipedia page?
>    2. What is the size of the diffdb dump after extraction? I do not want
>    to waste Wikipedia bandwidth if I know I cannot handle it ;).
>
> By the way, what you did is exactly what I had just started working on for
> my own project, so thanks a lot :)
>
> Regards.
>
> On Fri, Nov 4, 2011 at 13:19, Diederik van Liere wrote:
>
>> Dear Wiki Researchers,
>>
>>
>> During the summer we worked on WikiHadoop [0], a tool that allows us
>> to create the diffs between two revisions of a Wikipedia article using Hadoop.
>> Now I am happy to announce that the entire diffdb is available for
>> download at http://dumps.wikimedia.org/other/diffdb/
>>
>> This dataset is based on the English Wikipedia April 2011 XML dump files.
>> The advantages of this dataset are that:
>> a) you can search for specific content being added / removed, and
>> b) you can measure more accurately how much text an editor has added or removed.
>>
>> We are currently working on a Lucene-based application [1] that will
>> allow us to quickly search for specific strings being added or removed.
>>
>> If you have any questions, then please let me know!
>>
>> [0] https://github.com/whym/wikihadoop
>> [1] https://github.com/whym/diffindexer
>>
>>
>> Best regards,
>>
>> Diederik van Liere
>>
>
>
> --
> Rami Al-Rfou'
> 631-371-3165
>
>


-- 
Check out my about.me profile: http://about.me/diederik
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Announcing availability new dataset diffdb

2011-11-04 Thread Diederik van Liere
Dear Wiki Researchers,


During the summer we worked on WikiHadoop [0], a tool that allows us
to create the diffs between two revisions of a Wikipedia article using Hadoop.
Now I am happy to announce that the entire diffdb is available for download
at http://dumps.wikimedia.org/other/diffdb/

This dataset is based on the English Wikipedia April 2011 XML dump files.
The advantages of this dataset are that:
a) you can search for specific content being added / removed, and
b) you can measure more accurately how much text an editor has added or removed.

We are currently working on a Lucene-based application [1] that will allow
us to quickly search for specific strings being added or removed.
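As a rough illustration of the same idea (not the Lucene-based tool itself), here is a sketch that indexes the added text of a few invented diff records with the pure-Python Whoosh library and searches it for a string.

```python
# Sketch: index the added text of diff records and search it (Whoosh).
import os
from whoosh.fields import ID, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(rev_id=ID(stored=True), added=TEXT(stored=True))
os.makedirs("diff_index", exist_ok=True)
ix = create_in("diff_index", schema)

# Invented example records standing in for rows of the diffdb.
records = [
    ("1001", "citation needed for the population figure"),
    ("1002", "added an infobox and a population estimate"),
]
writer = ix.writer()
for rev_id, added_text in records:
    writer.add_document(rev_id=rev_id, added=added_text)
writer.commit()

with ix.searcher() as searcher:
    query = QueryParser("added", ix.schema).parse("population")
    for hit in searcher.search(query):
        print(hit["rev_id"], "->", hit["added"])
```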

If you have any questions, then please let me know!

[0] https://github.com/whym/wikihadoop
[1] https://github.com/whym/diffindexer


Best regards,

Diederik van Liere
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Diederik van Liere
So the 14-day task included XML parsing and creating the diffs. We might gain
performance improvements by fine-tuning the Hadoop configuration, although
that seems to be more of an art than a science.
Diederik


On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov wrote:

> Hello,
>
> This is excellent news!
>
> Have you tried running it on Amazon EC2? It would be really nice to know
> how well WikiHadoop scales up with the number of nodes.
> Also, this timing -- '3 x quad core / 14 days / full Wikipedia dump' -- on
> what kind of task (XML parsing, diffs, MD5, etc.) was it obtained?
>
> -- Best, Dmitry
>
> On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere 
> wrote:
>
>> Hello!
>>
>> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
>> and Fabian Kaelin (who are all Summer of Research fellows) [0] have worked
>> hard on a customized stream-based InputFormatReader that allows parsing of
>> both bz2-compressed and uncompressed files of the full Wikipedia dump (the
>> dump file with the complete edit histories) using Hadoop. Prior to WikiHadoop
>> and the accompanying InputFormatReader it was not possible to use Hadoop to
>> analyze the full Wikipedia dump files (see the detailed tutorial / background
>> for an explanation of why that was not possible).
>>
>> This means:
>> 1) We can now harness Hadoop's distributed computing capabilities in
>> analyzing the full dump files.
>> 2) You can send either one or two revisions to a single mapper, so it's
>> possible to diff two revisions and see what content has been added /
>> removed.
>> 3) You can exclude namespaces by supplying a regular expression.
>> 4) We are using Hadoop's Streaming interface which means people can use
>> this InputFormat Reader using different languages such as Java, Python, Ruby
>> and PHP.
>>
>> The source code is available at: https://github.com/whym/wikihadoop
>> A more detailed tutorial and installation guide is available at:
>> https://github.com/whym/wikihadoop/wiki
>>
>>
>> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>>
>> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>>
>>
>> Best,
>>
>> Diederik
>>
>>
>


-- 
Check out my about.me profile: http://about.me/diederik
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Diederik van Liere
Hello!

Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
Fabian Kaelin (who are all Summer of Research fellows) [0] have worked hard
on a customized stream-based InputFormatReader that allows parsing of both
bz2-compressed and uncompressed files of the full Wikipedia dump (the dump file
with the complete edit histories) using Hadoop. Prior to WikiHadoop and the
accompanying InputFormatReader it was not possible to use Hadoop to analyze
the full Wikipedia dump files (see the detailed tutorial / background for an
explanation of why that was not possible).

This means:
1) We can now harness Hadoop's distributed computing capabilities in
analyzing the full dump files.
2) You can send either one or two revisions to a single mapper, so it's
possible to diff two revisions and see what content has been added /
removed.
3) You can exclude namespaces by supplying a regular expression.
4) We are using Hadoop's Streaming interface, which means people can use this
InputFormat Reader with different languages such as Java, Python, Ruby and
PHP (see the sketch below).
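To make point 4 concrete, here is an illustrative Streaming mapper in Python. The record layout it assumes (a <page> fragment containing one or two <revision> elements arriving on stdin) and the element names are guesses for the sketch only; the actual record format and the full hadoop invocation are covered in the tutorial linked below.

```python
#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper: emits "username<TAB>chars_added" for
# each revision pair it receives. The record layout assumed here (a <page>
# fragment with one or two <revision> children arriving on stdin) is a guess;
# see the WikiHadoop wiki for the actual format and hadoop invocation.
import sys
import xml.etree.ElementTree as ET

def local_find(elem, name, want_all=False):
    """Find descendants by local tag name, ignoring any XML namespace."""
    hits = [c for c in elem.iter() if c.tag.rsplit("}", 1)[-1] == name]
    return hits if want_all else (hits[0] if hits else None)

def text_of(revision):
    node = local_find(revision, "text")
    return (node.text or "") if node is not None else ""

def emit(page):
    revisions = local_find(page, "revision", want_all=True)
    if not revisions:
        return
    old = text_of(revisions[0]) if len(revisions) > 1 else ""
    new = text_of(revisions[-1])
    user = local_find(revisions[-1], "username")
    name = user.text if user is not None and user.text else "anonymous"
    sys.stdout.write("%s\t%d\n" % (name, max(len(new) - len(old), 0)))

def main():
    buf = []  # records may span several input lines, so buffer per <page>
    for line in sys.stdin:
        buf.append(line)
        if "</page>" in line:
            try:
                emit(ET.fromstring("".join(buf)))
            except ET.ParseError:
                pass  # skip malformed fragments in this sketch
            buf = []

if __name__ == "__main__":
    main()
```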

The source code is available at: https://github.com/whym/wikihadoop
A more detailed tutorial and installation guide is available at:
https://github.com/whym/wikihadoop/wiki


(Apologies for cross-posting to wikitech-l and wiki-research-l)

[0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/


Best,

Diederik
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Fraction of reverts

2011-08-15 Thread Diederik van Liere
Some more pointers: 
http://meta.wikimedia.org/wiki/Research:Newbie_reverts_and_article_length
http://meta.wikimedia.org/wiki/Research:Newbie_reverts_and_subsequent_editing_behavior

Best,
Diederik

On 2011-08-15, at 9:00 PM, Denny Vrandecic wrote:

> Thank you, Daniel!
> 
> On Aug 15, 2011, at 17:12, Daniel Mietchen wrote:
> 
>> Hi Denny,
>> 
>> just read
>> http://en.wikipedia.org/w/index.php?title=Wikipedia:Wikipedia_Signpost/2011-08-15/Women_and_Wikipedia&oldid=445064196
>> earlier today, which
>> states
>> "Women are more likely to be reverted when they have very few edits
>> (7% vs 5%); however, in accounts with more than eight edits, the
>> effect disappears."
>> 
>> Cheers,
>> 
>> Daniel
>> 
>> On Tue, Aug 16, 2011 at 1:53 AM, Denny Vrandecic
>>  wrote:
>>> Hello,
>>> 
>>> does anyone have a rough estimate of how many edits get reverted?
>>> Does anyone have a study handy?
>>> 
>>> Cheers,
>>> Denny
>>> 
>>> 
>>> 
>> 
> 
> 



___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Editor Trends Study - Improving the tool

2010-11-11 Thread Diederik van Liere
Dear Felipe,

We did investigate other tools before deciding to embark on this new
project; as you rightly point out, we should minimize code overlap.
Pywikipediabot is an editing tool as far as I know, and your tool,
WikiXRay, has definitely proven itself. However, I believe that a
NoSQL solution will give better performance than SQL databases, and
that has been one of the main reasons to write this tool.

I am not sure whether a separate mailing list will be required -- at the
moment it's not -- but thanks for the suggestion, and I have added the SVN link.

Best,

Diederik
> To: Research into Wikimedia content and communities
>        
> Message-ID: <376712.40857...@web27504.mail.ukl.yahoo.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
>
> --- On Wed, 10/11/10, Diederik van Liere wrote:
>
> From: Diederik van Liere
> Subject: [Wiki-research-l] Editor Trends Study - Improving the tool
> To: wiki-research-l@lists.wikimedia.org
> Date: Wednesday, November 10, 2010, 00:02
>
> Hi, Diederik,
>
> I'm also glad to see progress in this project. Some comments inline.
>
> Dear researchers,
>
> Recently, we started the Editor Trends Study 
> (http://strategy.wikimedia.org/wiki/Editor_Trends_Study).
> The goal of this study is to get a better understanding of the community
>
> dynamics within the different Wikipedia projects.
>
> Part of this project consists of developing a tool 
> (http://strategy.wikimedia.org/wiki/Editor_Trends_Study/Software)
>
> that parses a Wikipedia dump file, extracts the required information, stores 
> it
> in a database and exports it to a CSV file. This CSV file can then be used in 
> a
> statistical program such as R, Stata or SAS.
>
> Well, I would have expected that the team would have done some previous 
> search for open source code already available, that implements at least some 
> (if not exactly all or the very same) of the planned functionalities.
>
> Some examples are my own tool, WikiXRay, and Pywikipediabot (which, AFAIK,
> now also includes a fast parser of Wikipedia dump files).
>
> For my tool, now I use git for version control and you can use any of the two 
> repos available (the official at libresoft, or the mirror at Gitorious):
>
> http://git.libresoft.es/WikixRay/
> http://gitorious.org/wikixray/wikixray
>
> Well, they might not be the best possible software available, but I guess 
> they can help to solve some problems, or at least help you to speed up the 
> development and to avoid starting from scratch.
>
>
> We are looking for some volunteers that would enjoy testing the tool. You 
> don't need to be a
> software developer (although it helps :)) to help us; some patience, a bit of 
> time and
> a fairly recent computer is all you need. You should be comfortable 
> installing programs,
>
> working with a command-line interface and have basic Subversion experience.
> Python experience is a real bonus!
>
> The testing will focus on getting the tool to run without any supervision. 
> For more background information, have a look at:
>
> http://strategy.wikimedia.org/wiki/Editor_Trends_Study/Software
>
> Perhaps you're going to provide this info later, but I don't see the links to 
> your SVN repo (only [] ).
>
> We are testing the tool with the largest Wikipedia projects, so if you would 
> like to replicate
>
> the analysis on your own favorite Wikipedia project or help improve the 
> quality of the tool then please contact me off-list.
>
> I think it would be more effective to have another public list to which
> people specifically interested in this tool can subscribe (for example, like
> we have one for XML dumps exclusively).
>
> This should noticeably reduce the number of duplicated bug reports and
> comments, since other people can learn about known issues.
>
> Hope this helps.
>
> Best,
> Felipe.
>
> Best,
>
> Diederik

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Editor Trends Study - Improving the tool

2010-11-11 Thread Diederik van Liere
Dear Piotr,
Thanks for your comments. A GUI is not very likely to be on the roadmap, as
that would require significant time to develop, but I will try my best to
make the online documentation as clear as possible, and you can always
email me if you have any questions.

Best,

Diederik
> To: Research into Wikimedia content and communities
>        
> Message-ID: <4cd9d9e3.4040...@post.pl>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Diederik van Liere wrote:
>
>> We are looking for some volunteers that would enjoy testing the tool.
>> You don't need to be a
>> software developer (although it helps :)) to help us; some patience, a
>> bit of time and
>> a fairly recent computer is all you need. You should be comfortable
>> installing programs,
>> working with a command-line interface and have basic Subversion experience.
>> Python experience is a real bonus!
>
> Quick feedback:
> * glad to see progress!
> * the wiki pages you link seem well designed and how-to's appear to make
> sense :)
> * as long as there is a need for a command-line interface and no
> graphical user interface, many would-be users will not be able to use it
> * ditto for things like Python and Subversion (I never even heard of the
> latter...).
>
> I assume that having a GUI is planned in some foreseeable future?
>
>
> --
> Piotr Konieczny
>
> "To be defeated and not submit, is victory; to be victorious and rest on
> one's laurels, is defeat." --J?zef Pilsudski

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Editor Trends Study - Improving the tool

2010-11-09 Thread Diederik van Liere
Dear researchers,

Recently, we started the Editor Trends Study (
http://strategy.wikimedia.org/wiki/Editor_Trends_Study).
The goal of this study is to get a better understanding of the community
dynamics within the different Wikipedia projects.

Part of this project consists of developing a tool (
http://strategy.wikimedia.org/wiki/Editor_Trends_Study/Software)
that parses a Wikipedia dump file, extracts the required information, stores it
in a database, and exports it to a CSV file. This CSV file can then be used in a
statistical program such as R, Stata or SAS.
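For readers curious about the shape of those last two steps, here is a toy sketch of the store-in-a-database and export-to-CSV stages using only the Python standard library; the table layout and column names are invented for illustration and are not the tool's actual schema.

```python
# Toy sketch of the store-and-export steps (invented schema, not the tool's).
import csv
import sqlite3

edits = [  # (username, timestamp) pairs as they might come out of a dump parser
    ("Alice", "2010-01-03T12:00:00Z"),
    ("Bob",   "2010-02-14T09:30:00Z"),
    ("Alice", "2010-03-01T17:45:00Z"),
]

conn = sqlite3.connect("editor_trends.db")
conn.execute("CREATE TABLE IF NOT EXISTS edits (username TEXT, ts TEXT)")
conn.executemany("INSERT INTO edits (username, ts) VALUES (?, ?)", edits)
conn.commit()

# Export one row per editor: first edit, last edit, total edit count.
rows = conn.execute(
    "SELECT username, MIN(ts), MAX(ts), COUNT(*) FROM edits GROUP BY username"
)
with open("editor_trends.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["username", "first_edit", "last_edit", "edit_count"])
    writer.writerows(rows)
conn.close()
```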

We are looking for some volunteers who would enjoy testing the tool. You
don't need to be a software developer (although it helps :)) to help us; some
patience, a bit of time and a fairly recent computer are all you need. You
should be comfortable installing programs and working with a command-line
interface, and have basic Subversion experience. Python experience is a real
bonus!

The testing will focus on getting the tool to run without any supervision.
For more background information, have a look at:
http://strategy.wikimedia.org/wiki/Editor_Trends_Study/Software

We are testing the tool with the largest Wikipedia projects, so if you would
like to replicate
the analysis on your own favorite Wikipedia project or help improve the
quality of the tool then please contact me off-list.



Best,

Diederik
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Editor Trends Study - Requesting your Input

2010-10-18 Thread Diederik van Liere
Dear Wikipedia Researchers,

We have posted a wiki about the Editor Trends Study on the strategy
wiki, you can find it here:
http://strategy.wikimedia.org/wiki/Editor_Trends_Study

We would like to have your input on our suggested approach and in
particular we are curious about your thoughts concerning the following
topics:

1) Definitions of New Editor and Active Editor: do these definitions
correspond with your experience, and are they clear?
2) We suggest doing two types of analysis (active editor composition
by tenure, and cohort analysis of new Wikipedians). What additional
analyses would you suggest that could reveal crucial information that
these two analyses would not generate?
3) Sample of Wikipedia sites to study: let us know if there are other
Wikipedia projects that may be useful to analyze, and we'll do our best
to include them.



Please leave your ideas / suggestions on the Talk page and thanks for
your input.


Best,

Howie & Diederik

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l