Re: [Wiki-research-l] Wikipedia traffic: selected language versions

2014-05-18 Thread Federico Leva (Nemo)

Thanks for your suggestions. Just some quick pointers below.

h, 18/05/2014 08:26:

(I-A). Tabulate the data points in absolute numbers first, not
percentage numbers [...]
(I-B). Include all language versions for the *editing traffic* report as
well. [...]
(I-C). Provide static data objects in more accessible format (i.e. csv
and/or json). [...]
(II-A).  Putting viewing traffic and editing traffic report on the same
page. [...]
(II-B).  Organizing and archiving the traffic reports for historical
comparison. [...]
(II-C). Provide dynamic data objects in more accessible format (i.e. csv
and/or json).


At least the first four are just changes in the WikiStats reports 
formatting; personally, I encourage you to submit patches: 
https://git.wikimedia.org/summary/analytics%2Fwikistats.git (should be 
the squids directory, but there is some ongoing refactoring of the repos).


On archives and history rewriting/reports regeneration, see also 
https://bugzilla.wikimedia.org/show_bug.cgi?id=46198



[...] (III-B).  Smaller (i.e. more specific) geographic aggregate units.
The country (geographic) information is often based on geo-IP databases,
and sometimes provincial- and city-level data are available.


http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html


[...]

( I know that the Unicode Common Locale Data Repository (CLDR Version 25,
http://cldr.unicode.org/index/downloads/cldr-25) provides
“language-territory”
(http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html)
or “territory-language”
(http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html)
unit-based charts, but I believe that the Wikimedia projects can use them
and build a better one.)  [...]


No, we definitely can't, not alone. I've asked for help; please 
contribute: 
https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand.


Nemo

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Conference venues for Social Media Researchers [Deadlines are fast approaching!] Reminder

2014-05-18 Thread Anatoliy Gruzd
*Apologies for cross-posting*



Calling all Social Media and Online Communities Researchers!

Please consider submitting your research to the following conferences.
Deadlines are fast approaching.

(1) #SMSociety14: SOCIAL MEDIA AND SOCIETY CONFERENCE
Location: Toronto, ON, Canada
When: September 27-28, 2014
Poster Abstracts Due: May 23, 2014  (!!! in 5 days !!!)
More info: http://SocialMediaAndSociety.com/?page_id=549

Conference organizers:
  Anatoliy Gruzd, Dalhousie University
  Barry Wellman, University of Toronto
  Philip Mai, Dalhousie University
  Jenna Jacobson, University of Toronto


(2) Hawaii International Conference on System Sciences (HICSS)
Minitrack: SOCIAL NETWORKING & COMMUNITIES
Location: Kauai, Hawaii, USA
When: January 5-8, 2015
Full Papers Due: June 15, 2014
More info: http://socialmedialab.ca/?page_id=9308

Minitrack co-chairs:
 Anatoliy Gruzd, Dalhousie University
 Caroline Haythornthwaite, University of British Columbia
 Karine Nahon, University of Washington


Please contact Anatoliy Gruzd gr...@dal.ca if you have any questions
about these calls.

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Kill the bots

2014-05-18 Thread Brian Keegan
Is there a way to retrieve a canonical list of bots on enwiki or elsewhere?
I'm interested in omitting automated revisions (sorry Stuart!) for the
purposes of building co-authorship networks.

Grabbing everything under 'Category:All Wikipedia bots' misses some major
ones like SmackBot, Cydebot, VIAFbot, Full-date unlinking bot, etc. because
these bots have changed names but the redirect is not categorized, the
account has been removed/deprecated, or a user appears to have removed the
relevant bot categories from the page.

Can anyone advise me on how to kill all the bots in my data without having
to resort to manual cleaning or hacky regex?


-- 
Brian C. Keegan, Ph.D.
Post-Doctoral Research Fellow, Lazer Lab
College of Social Sciences and Humanities, Northeastern University
Fellow, Institute for Quantitative Social Sciences, Harvard University
Affiliate, Berkman Center for Internet & Society, Harvard Law School

b.kee...@neu.edu
www.brianckeegan.com
M: 617.803.6971
O: 617.373.7200
Skype: bckeegan
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Kill the bots

2014-05-18 Thread Amir E. Aharoni
People whose last name is Abbot will be discriminated against.

And a true story: a prominent human Catalan Wikipedia editor whose username
is PauCabot skewed the results of an actual study.

So don't trust just the user names.
On 18 May 2014 19:34, Andrew G. West west.andre...@gmail.com wrote:

 User name policy states that *bot* names are reserved for bots. Thus,
 such a regex shouldn't be too hacky, but I cannot comment on whether some
 non-automated cases might slip through new user patrol. I do think the
 dumps make the 'users' table available, and I know for sure one could get
 a full list via the API.

 As a check on this, you could verify whether these accounts set the bot
 flag when they edit. -AW
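
 For instance, a minimal sketch of that API route in Python (the standard
 list=allusers query filtered to the 'bot' group; the pagination details
 are worth double-checking against the live API):

 import requests

 # Fetch all accounts currently in the 'bot' group on English Wikipedia.
 URL = "https://en.wikipedia.org/w/api.php"
 params = {"action": "query", "list": "allusers", "augroup": "bot",
           "aulimit": 500, "format": "json"}
 bots, cont = [], {}
 while True:
     data = requests.get(URL, params={**params, **cont}).json()
     bots += [u["name"] for u in data["query"]["allusers"]]
     if "continue" not in data:
         break  # no more pages
     cont = data["continue"]
 print(len(bots), "currently flagged bot accounts")

 Note this only returns accounts that are flagged *right now* -- renamed or
 de-flagged bots still slip through, which is the gap Brian describes.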

 --
 Andrew G. West, PhD
 Research Scientist
 Verisign Labs - Reston, VA
 Website: http://www.andrew-g-west.com


 On 05/18/2014 12:10 PM, Brian Keegan wrote:

 Is there a way to retrieve a canonical list of bots on enwiki or
 elsewhere? I'm interested in omitting automated revisions (sorry
 Stuart!) for the purposes of building co-authorship networks.

 Grabbing everything under 'Category:All Wikipedia bots' misses some
 major ones like SmackBot, Cydebot, VIAFbot, Full-date unlinking bot,
 etc. because these bots have changed names but the redirect is not
 categorized, the account has been removed/deprecated, or a user appears
 to have removed the relevant bot categories from the page.

 Can anyone advise me on how to kill all the bots in my data without
 having to resort to manual cleaning or hacky regex?


 --
 Brian C. Keegan, Ph.D.
 Post-Doctoral Research Fellow, Lazer Lab
 College of Social Sciences and Humanities, Northeastern University
 Fellow, Institute for Quantitative Social Sciences, Harvard University
 Affiliate, Berkman Center for Internet & Society, Harvard Law School

 b.kee...@neu.edu
 www.brianckeegan.com
 M: 617.803.6971
 O: 617.373.7200
 Skype: bckeegan



___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Wikipedia traffic: selected language versions

2014-05-18 Thread h
Dear Nemo,

As I am waiting for a more complete response, I am not sure that I
understand what your last "No", as in "No, we definitely can't", means. To
clarify, take the CLDR supplemental "Language-Territory" information as an
example:
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

One can suggest additions to the data points by submitting sourced
numbers for a geo-linguistic population, like this:
http://unicode.org/cldr/trac/newticket?description=%3Cterritory%2c%20speaker%20population%20in%20territory%2c%20and%20references%3Esummary=Add%20territory%20to%20Traditional%20Chinese%20(zh_Hant)

In Wikipedia articles and Wikidata pages, there are many attempts to
provide more up-to-date and better-sourced data points. I see the potential
in exchanging such data and curating it better in the Wikidata projects, as
a more detailed and dynamic source than the CLDR.

These data points would have extra benefits for curating traffic data.
For one, these geo-linguistic population data points would be useful for
normalizing traffic data in further analyses, such as geographic
comparisons. For another, they provide important reference data for the
development strategies and policies of the Wikipedia projects.
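
For example, a toy sketch of such a normalization in Python (the speaker
populations below are placeholders, not real CLDR or Wikidata figures):

# Normalize monthly page views by speaker population.
speakers = {"zh_Hant": 30_000_000, "pt": 250_000_000}  # hypothetical values
views = {"zh_Hant": 90_000_000, "pt": 500_000_000}     # hypothetical values

for lang in views:
    print(f"{lang}: {views[lang] / speakers[lang]:.1f} views per speaker")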

Best,
han-teng liao





2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) nemow...@gmail.com:

 Thanks for your suggestions. Just some quick pointers below.

 h, 18/05/2014 08:26:

 (I-A). Tabulate the data points in absolute numbers first, not
 percentage numbers [...]

 (I-B). Include all language versions for the *editing traffic* report as
 well. [...]

 (I-C). Provide static data objects in more accessible format (i.e. csv
 and/or json). [...]

 (II-A).  Putting viewing traffic and editing traffic report on the same
 page. [...]

 (II-B).  Organizing and archiving the traffic reports for historical
 comparison. [...]

 (II-C). Provide dynamic data objects in more accessible format (i.e. csv
 and/or json).


 At least the first four are just changes in the WikiStats reports
 formatting; personally, I encourage you to submit patches: 
 https://git.wikimedia.org/summary/analytics%2Fwikistats.git (should be
 the squids directory, but there is some ongoing refactoring of the repos).

 On archives and history rewriting/reports regeneration, see also
 https://bugzilla.wikimedia.org/show_bug.cgi?id=46198

 [...] (III-B).  Smaller (i.e. more specific) geographic aggregate units.

 The country (geographic) information is often based on geo-IP databases,
 and sometimes provincial- and city-level data are available.


 http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html

  [...]


 ( I know that the Unicode Common Locale Data Repository (CLDR Version 25,
 http://cldr.unicode.org/index/downloads/cldr-25) provides
 “language-territory”
 (http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html)
 or “territory-language”
 (http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html)
 unit-based charts, but I believe that the Wikimedia projects can use them
 and build a better one.)  [...]


 No, we definitely can't, not alone. I've asked for help; please
 contribute:
 https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand.


 Nemo


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Kill the bots

2014-05-18 Thread Scott Hale
Very helpful, Lukas; I didn't know about the logging table.

In some recent work [1] I found many users that appeared to be bots but
whose edits did not have the bot flag set. My approach was to exclude users
who didn't have a break of more than 6 hours between edits over the entire
month I was studying. I was interested in the users who had multiple edit
sessions in the month and so went with a straight threshold. A way to keep
users with only one editing session would be to exclude users who have no
break longer than X hours in an edit session lasting at least Y hours
(e.g., a user who doesn't break for more than 6 hours in 5-6 days is
probably not human).
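
A rough sketch of that filter in Python, assuming each user's edit
timestamps are available as datetime objects:

from datetime import timedelta

def looks_automated(timestamps, max_gap=timedelta(hours=6)):
    # An account that never pauses longer than max_gap over the whole
    # period is editing continuously -- likely a bot.
    ts = sorted(timestamps)
    if len(ts) < 2:
        return False  # too few edits to judge
    return max(b - a for a, b in zip(ts, ts[1:])) <= max_gap

# Keep only accounts with at least one human-like break, e.g.:
# humans = {u: ts for u, ts in edits_by_user.items() if not looks_automated(ts)}
# (edits_by_user is a hypothetical {username: [timestamps]} mapping.)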

Cheers,
Scott

[1] Multilinguals and Wikipedia Editing
http://www.scotthale.net/pubs/?websci2014


-- 
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk



On Sun, May 18, 2014 at 5:45 PM, Lukas Benedix lbene...@l3q.de wrote:

 Here is a list of currently flagged bots:

 https://en.wikipedia.org/w/index.php?title=Special:ListUsers&offset=&limit=2000&username=&group=bot

 Another good place to look for bots is here:

 https://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Bots%2FRequests_for_approval&namespace=4

 You should also have a look at these pages to find former bots:
 https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_1
 https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_2

 And last but not least, the logging table, which you can access via Tool
 Labs:
 -- Accounts whose user-rights log entries mention the 'bot' group,
 -- i.e. accounts ever granted (or stripped of) the bot flag.
 SELECT DISTINCT(log_title)
 FROM logging
 WHERE log_action = 'rights'   -- user-rights changes only
 AND log_params LIKE '%bot%';  -- the change involved the bot group

 Lukas

 On Sun 18.05.2014 18:34, Andrew G. West wrote:
  User name policy states that *bot* names are reserved for bots.
  Thus, such a regex shouldn't be too hacky, but I cannot comment on
  whether some non-automated cases might slip through new user patrol. I
  do think the dumps make the 'users' table available, and I know for sure
  one could get a full list via the API.
 
  As a check on this, you could verify whether these accounts set the bot
  flag when they edit. -AW
 






-- 
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Wikipedia traffic: selected language versions

2014-05-18 Thread Oliver Keyes
Could you give an example of what we could do better than CLDR or the
relevant ISO standards?


On 18 May 2014 10:06, h hant...@gmail.com wrote:

 Dear Nemo,

 As I am waiting for a more complete response, I am not sure that I
 understand what your last "No", as in "No, we definitely can't", means. To
 clarify, take the CLDR supplemental "Language-Territory" information as an
 example:

 http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

 One can suggest additions to the data points by submitting sourced
 numbers for a geo-linguistic population, like this:
 http://unicode.org/cldr/trac/newticket?description=%3Cterritory%2c%20speaker%20population%20in%20territory%2c%20and%20references%3Esummary=Add%20territory%20to%20Traditional%20Chinese%20(zh_Hant)

 In Wikipedia articles and Wikidata pages, there are many attempts to
 provide more up-to-date and better-sourced data points. I see the potential
 in exchanging such data and curating it better in the Wikidata projects, as
 a more detailed and dynamic source than the CLDR.

 These data points would have extra benefits for curating traffic data.
 For one, these geo-linguistic population data points would be useful for
 normalizing traffic data in further analyses, such as geographic
 comparisons. For another, they provide important reference data for the
 development strategies and policies of the Wikipedia projects.

 Best,
 han-teng liao





 2014-05-18 16:23 GMT+08:00 Federico Leva (Nemo) nemow...@gmail.com:

 Thanks for your suggestions. Just some quick pointers below.

 h, 18/05/2014 08:26:

 (I-A). Tabulate the data points in absolute numbers first, not
 percentage numbers [...]

 (I-B). Include all language versions for the *editing traffic* report as
 well. [...]

 (I-C). Provide static data objects in more accessible format (i.e. csv
 and/or json). [...]

 (II-A).  Putting viewing traffic and editing traffic report on the same
 page. [...]

 (II-B).  Organizing and archiving the traffic reports for historical
 comparison. [...]

 (II-C). Provide dynamic data objects in more accessible format (i.e. csv
 and/or json).


 At least the first four are just changes in the WikiStats reports
 formatting; personally, I encourage you to submit patches: 
 https://git.wikimedia.org/summary/analytics%2Fwikistats.git (should be
 the squids directory, but there is some ongoing refactoring of the repos).

 On archives and history rewriting/reports regeneration, see also
 https://bugzilla.wikimedia.org/show_bug.cgi?id=46198

 [...] (III-B).  Smaller (i.e. more specific) geographic aggregate units.

 The country (geographic) information is often based on geo-IP databases,
 and sometimes provincial- and city-level data are available.


 http://lists.wikimedia.org/pipermail/wikitech-l/2014-April/075964.html

  [...]


 ( I know that the Unicode Common Locale Data Repository (CLDR Version 25,
 http://cldr.unicode.org/index/downloads/cldr-25) provides
 “language-territory”
 (http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html)
 or “territory-language”
 (http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html)
 unit-based charts, but I believe that the Wikimedia projects can use them
 and build a better one.)  [...]


 No, we definitely can't, not alone. I've asked for help; please
 contribute:
 https://www.mediawiki.org/wiki/Universal_Language_Selector/FAQ#How_does_Universal_Language_Selector_determine_which_languages_I_may_understand.


 Nemo





-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Kill the bots

2014-05-18 Thread Brian Keegan
How does one cite emails in ACM proceedings format? :)

On Sunday, May 18, 2014, R. Stuart Geiger sgei...@gmail.com wrote:

 Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get
 no mercy. :-)

 But seriously, my tl;dr: instead of asking whether an account is or isn't
 a bot, ask whether a set of edits is or is not automated.

 Great responses so far: searching usernames for *bot will exclude non-bot
 users who were registered before the username policy change (although *Bot
 is a bit better), and the logging table is a great way to collect bot
 flags. However, Scott is right -- the bot flag (or *Bot username) doesn't
 signify a bot; it signifies a bureaucrat recognizing that a user account
 successfully went through the Bot Approval Group process. If I see an
 account with a bot flag, I can generally assume the edits that account
 makes are initiated by an automated software agent. This is especially the
 case in the main namespace. The inverse assumption is not nearly as easy: I
 can't assume that every edit made from an account *without* a bot flag was
 *not* an automated edit.

 About unauthorized bots: yes, there are a relatively small number of
 Wikipedians who, on occasion, run fully-automated, continuously-operating
 bots without approval. Complicating this, if someone is going to take the
 time to build and run a bot, but isn't going to create a separate account
 for it, then it is likely that they are also using that account to do
 non-automated edits. Sometimes new bot developers will run an unauthorized
 bot under their own account during the initial stages of development, and
 only later in the process will they create a separate bot account and seek
 formal approval and flagging. It can get tricky when you exclude all the
 edits from an account for being automated based on a single suspicious set
 of edits.

 More commonly, there are many more people who use automated batch tools
 like AutoWikiBrowser to support one-off tasks, like mass find-and-replace
 or category cleanup. Accounts powered by AWB are technically not bots,
 only because a human has to sit there and click save for every batch edit
 that is made. Some people will create a separate bot account for AWB work
 and get it approved and flagged, but many more will not bother. Then
 there are people using semi-automated, human-in-the-loop tools like Huggle
 to do vandal fighting. I find that the really hard question is whether
 you include or exclude these different kinds of 'cyborgs', because it
 really makes you think hard about what exactly you're measuring. Is
 someone who does a mass find-and-replace on all articles in a category a
 co-author of each article they edit? Is a vandal fighter patrolling the
 recent changes feed with Huggle a co-author of all the articles they edit
 when they revert vandalism and then move on to the next diff? What about
 somebody using rollback in the web browser? If so, what is it that makes
 these entities authors and ClueBot NG not an author?

 When you think about it, user accounts are actually pretty remarkable in
 that they allow such a diverse set of uses and agents to be attributed to a
 single entity. So when it comes to identifying automation, I personally
 think it is better to shift the unit of analysis from the user account to
 the individual edit. A bot flag lets you assume all edits from an account
 are automated, but you can use a range of approaches to identifying sets of
 automated edits from non-flagged accounts. For example, I have a set of
 regex SQL queries in the Query Library [1] that parse edit summaries for
 the traces that AWB, Huggle, Twinkle, rollback, etc. automatically leave
 by default. You can also use the edit session approach Scott has
 suggested -- Aaron and I found a few unauthorized bots in our edit session
 study [2], and we were even using a more aggressive cutoff, with no more
 than a 60-minute gap between edits. To catch short bursts of bulk edits,
 you could look at large
 numbers of edits made in a short period of time -- I'd say more than 7 main
 namespace edits a minute for 10 minutes would be a hard rate for even a
 very aggressive vandal fighter to maintain with Huggle.
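
 As an illustration, a rough sketch of the edit-summary approach in Python
 (default summary tags have changed across tool versions, so treat these
 patterns as assumptions to verify against your own data):

 import re

 # Illustrative patterns only: each maps a tool to a trace that its
 # default edit summary tends to leave.
 TOOL_PATTERNS = {
     "AWB": re.compile(r"using \[\[(?:Project|Wikipedia):AWB", re.I),
     "Huggle": re.compile(r"\[\[WP:HG", re.I),
     "Twinkle": re.compile(r"\[\[WP:TW", re.I),
     "rollback": re.compile(
         r"^Reverted edits by .* to last (?:version|revision) by", re.I),
 }

 def detect_tool(summary):
     # Return the first tool whose trace appears in the summary, else None.
     for tool, pattern in TOOL_PATTERNS.items():
         if pattern.search(summary or ""):
             return tool
     return None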

 I'll conclude by saying that different kinds of automated editing
 techniques are different ways of participating in and contributing to
 Wikipedia. To systematically exclude automated edits is to remove a very
 important, meaningful, and heterogeneous kind of activity from view. These
 activities constitute a core part of what Wikipedia is, particularly
 those forms of automation which the community has explicitly authorized and
 recognized. Now, we researchers inevitably have to selectively reveal
 and occlude -- a co-authorship network based on main namespace edits also
 excludes talk page discussions and conflict resolution, and this also
 constitutes a core part of what Wikipedia is. It isn't wrong per se to
 exclude automated edits, and it is certainly much worse to not