Re: [Wiki-research-l] Kill the bots

Oliver Keyes Sun, 18 May 2014 16:35:24 -0700

Personally, I'm a big fan of Scott's method (and the associated paper,
which I've been throwing about internally ;)).  Stu's points are worth
addressing, though, and I think his per-edit approach is probably the way
to go.


TL;DR all the sensible things have already been said, I'm just +1ing them
;p.


On 18 May 2014 12:33, R.Stuart Geiger <sgei...@gmail.com> wrote:

> Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get
> no mercy. :-)
>
> But seriously, my tl;dr: instead of asking if an account is or isn't a
> bot, ask if a set of edits is or is not automated.
>
> Great responses so far: searching usernames for *bot will exclude non-bot
> users who were registered before the username policy change (although *Bot
> is a bit better), and the logging table is a great way to collect bot
> flags. However, Scott is right -- the bot flag (or *Bot username) doesn't
> signify a bot, it signifies a bureaucrat recognizing that a user account
> successfully went through the Bot Approval Group process. If I see an
> account with a bot flag, I can generally assume the edits that account
> makes are initiated by an automated software agent. This is especially the
> case in the main namespace. The inverse assumption is not nearly as easy: I
> can't assume that every edit made from an account *without* a bot flag was
> *not* an automated edit.
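
To make the limits of username matching concrete, here is a minimal Python sketch. It treats the `*bot` and `*Bot` patterns discussed above as shell-style globs. "Talbot" and "SomeEditor" are made-up usernames used purely for illustration; ClueBot NG is the bot named later in this thread.

```python
from fnmatch import fnmatchcase

def looks_like_bot_name(username, pattern):
    """Case-sensitive glob match of a username against a bot-name pattern."""
    return fnmatchcase(username, pattern)

# "Talbot" and "SomeEditor" are hypothetical human usernames;
# "ClueBot NG" is a bot mentioned elsewhere in this thread.
names = ["Talbot", "ClueBot NG", "SomeEditor"]

for name in names:
    print(
        name,
        looks_like_bot_name(name.lower(), "*bot"),  # case-insensitive suffix
        looks_like_bot_name(name, "*Bot"),          # capitalised suffix only
    )
```

Under these patterns the hypothetical human "Talbot" is caught by the case-insensitive suffix match but not by `*Bot`, while ClueBot NG is missed by both because its name does not end in "Bot". Name matching alone therefore errs in both directions.
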
>
> About unauthorized bots: yes, there are a relatively small number of
> Wikipedians who, on occasion, run fully-automated, continuously-operating
> bots without approval. Complicating this, if someone is going to take the
> time to build and run a bot, but isn't going to create a separate account
> for it, then it is likely that they are also using that account to do
> non-automated edits. Sometimes new bot developers will run an unauthorized
> bot under their own account during the initial stages of development, and
> only later in the process will they create a separate bot account and seek
> formal approval and flagging. It can get tricky when you exclude all the
> edits from an account for being automated based on a single suspicious set
> of edits.
>
> More commonly, many people use automated batch tools like
> AutoWikiBrowser to support one-off tasks, like mass find-and-replace
> or category cleanup. Accounts powered by AWB are technically not bots,
> but only because a human has to sit there and click "save" for every batch edit
> that is made. Some people will create a separate bot account for AWB work
> and get it approved and flagged, but many more will not bother. Then
> there are people using semi-automated, human-in-the-loop tools like Huggle
> to do vandal fighting. I find that the really hard question is whether
> you include or exclude these different kinds of 'cyborgs', because it
> really makes you think hard about what exactly you're measuring. Is
> someone who does a mass find-and-replace on all articles in a category a
> co-author of each article they edit? Is a vandal fighter patrolling the
> recent changes feed with Huggle a co-author of all the articles they edit
> when they revert vandalism and then move on to the next diff? What about
> somebody using rollback in the web browser? If so, what is it that makes
> these entities authors and ClueBot NG not an author?
>
> When you think about it, user accounts are actually pretty remarkable in
> that they allow such a diverse set of uses and agents to be attributed to a
> single entity. So when it comes to identifying automation, I personally
> think it is better to shift the unit of analysis from the user account to
> the individual edit. A bot flag lets you assume all edits from an account
> are automated, but you can use a range of approaches to identifying sets of
> automated edits from non-flagged accounts. For example, I have a set of regex
> SQL queries in the Query Library [1] which parse edit summaries for the traces
> that AWB, Huggle, Twinkle, rollback, etc. automatically leave by default.
> You can also use the edit session approach like Scott has suggested -- Aaron
> and I found a few unauthorized bots in our edit session study [2], and we
> were even using a more aggressive break, with no more than a 60 minute gap
> between edits. To catch short bursts of bulk edits, you could look at large
> numbers of edits made in a short period of time -- I'd say more than 7 main
> namespace edits a minute for 10 minutes would be a hard rate for even a
> very aggressive vandal fighter to maintain with Huggle.
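
A rough Python sketch of the two per-edit heuristics in the paragraph above: tagging edits by the traces tools leave in edit summaries, and flagging short bursts of bulk edits. Everything named here is an assumption for illustration. The regexes are not the ones in the Query Library, and real summaries vary by tool version and wiki. The burst rule reads "more than 7 edits a minute for 10 minutes" as "more than 70 edits inside any 10-minute window", which is only one possible interpretation.

```python
import re
from datetime import datetime, timedelta

# Illustrative patterns only (assumed, not taken from the Query Library).
# Order matters: tool tags are checked before the generic rollback wording.
TOOL_PATTERNS = [
    ("AWB", re.compile(r"\bAWB\b|AutoWikiBrowser")),
    ("Huggle", re.compile(r"\bHG\b|Huggle")),
    ("Twinkle", re.compile(r"\bTW\b|Twinkle")),
    ("rollback", re.compile(r"^Reverted edits by .+ to last (?:version|revision) by ")),
]

def tool_from_summary(summary):
    """Return the first tool whose pattern matches the edit summary, else None."""
    for tool, pattern in TOOL_PATTERNS:
        if pattern.search(summary):
            return tool
    return None

def burst_edit_indices(timestamps, window=timedelta(minutes=10), max_edits=70):
    """Return indices of edits inside any window holding more than max_edits edits.

    `timestamps` must be sorted ascending and belong to a single account.
    """
    flagged = set()
    left = 0
    for right, ts in enumerate(timestamps):
        # Advance the left edge until the span fits inside `window`.
        while ts - timestamps[left] > window:
            left += 1
        if right - left + 1 > max_edits:
            flagged.update(range(left, right + 1))
    return flagged

# Usage with made-up data: 80 edits five seconds apart, then 5 hourly edits.
start = datetime(2014, 5, 18, 12, 0, 0)
fast = [start + timedelta(seconds=5 * i) for i in range(80)]
slow = [start + timedelta(hours=i) for i in range(5)]
print(len(burst_edit_indices(fast)), len(burst_edit_indices(slow)))
print(tool_from_summary("typo fixes using [[Project:AWB|AWB]]"))
```

Both functions label individual edits rather than whole accounts, so an account that mixes manual and tool-assisted work keeps its manual edits in the analysis.
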
>
> I'll conclude by saying that different kinds of automated editing
> techniques are different ways of participating in and contributing to
> Wikipedia. To systematically exclude automated edits is to remove a very
> important, meaningful, and heterogeneous kind of activity from view. These
> activities constitute a core part of what Wikipedia is, particularly
> those forms of automation which the community has explicitly authorized and
> recognized. Now, we researchers inevitably have to selectively reveal
> and occlude -- a co-authorship network based on main namespace edits also
> excludes talk page discussions and conflict resolution, and this also
> constitutes a core part of what Wikipedia is. It isn't wrong per se to
> exclude automated edits, and it is certainly much worse to not recognize
> that they exist at all. However, I always appreciate seeing how the
> analysis would be different if bots were not excluded. The fact that
> there are these weird users which absolutely dominate a co-authorship
> network graph if you don't filter them out is pretty amazing, at least to
> me.
>
> Best,
> Stuart
>
> [1]
> https://wiki.toolserver.org/view/MySQL_queries#Automated_tool_and_bot_edits
> [2] http://stuartgeiger.com/cscw13-labor-hours.pdf
>
>
> On Sun, May 18, 2014 at 10:08 AM, Scott Hale
> <computermacgy...@gmail.com> wrote:
>
>> Very helpful, Lukas, I didn't know about the logging table.
>>
>> In some recent work [1] I found many users that appeared to be bots but
>> whose edits did not have the bot flag set. My approach was to exclude users
>> who didn't have a break of more than 6 hours between edits over the entire
>> month I was studying. I was interested in the users who had multiple edit
>> sessions in the month and so went with a straight threshold. A way to keep
>> users with only one editing session would be to exclude users who have no
>> break longer than X hours in an edit session lasting at least Y hours
>> (e.g., a user who doesn't break for more than 6 hours in 5-6 days is
>> probably not human).
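
The X/Y rule above could be sketched in Python roughly as follows. This is an assumption-laden illustration, not the code used in the paper: the function name and defaults are made up, with X = 6 hours and Y = 5 days taken from the example in the message. A session is treated as a run of edits in which no gap between consecutive edits exceeds X.

```python
from datetime import datetime, timedelta

def has_nonstop_session(timestamps, max_break=timedelta(hours=6),
                        min_length=timedelta(days=5)):
    """True if some run of edits with no gap above max_break lasts min_length or more.

    `timestamps` must be sorted ascending and belong to a single account.
    """
    if not timestamps:
        return False
    session_start = timestamps[0]
    previous = timestamps[0]
    for ts in timestamps[1:]:
        if ts - previous > max_break:
            session_start = ts  # a long enough break starts a new session
        elif ts - session_start >= min_length:
            return True
        previous = ts
    return False

# Usage with made-up data.
start = datetime(2014, 5, 1)
# One edit every 5 hours, around the clock, for about 6 days.
nonstop = [start + timedelta(hours=5 * i) for i in range(30)]
# Hourly edits from 09:00 to 17:00 each day for a week, with overnight breaks.
human = [start + timedelta(days=d, hours=h) for d in range(7) for h in range(9, 18)]
print(has_nonstop_session(nonstop), has_nonstop_session(human))
```

Because the check is per session rather than per month, an account with a single very long session is still caught, which is the case the straight monthly threshold would miss.
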
>>
>> Cheers,
>> Scott
>>
>> [1] Multilinguals and Wikipedia Editing
>> http://www.scotthale.net/pubs/?websci2014
>>
>>
>> --
>> Scott Hale
>> Oxford Internet Institute
>> University of Oxford
>> http://www.scotthale.net/
>> scott.h...@oii.ox.ac.uk
>>
>>
>>
>> On Sun, May 18, 2014 at 5:45 PM, Lukas Benedix <lbene...@l3q.de> wrote:
>>
>>> Here is a list of currently flagged bots:
>>>
>>> https://en.wikipedia.org/w/index.php?title=Special:ListUsers&offset=&limit=2000&username=&group=bot
>>>
>>> Another good place to look for bots is here:
>>>
>>> https://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Bots%2FRequests_for_approval&namespace=4
>>>
>>> You should also have a look at these pages to find former bots:
>>> https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_1
>>> https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_2
>>>
>>> And last but not least, there is the logging table, which you can access via Tool Labs:
>>> SELECT DISTINCT(log_title)
>>> FROM logging
>>> WHERE log_action = 'rights'
>>> AND log_params LIKE '%bot%';
>>>
>>> Lukas
>>>
>>> On Sun 18.05.2014 18:34, Andrew G. West wrote:
>>> > User name policy states that "*bot*" names are reserved for bots.
>>> > Thus, such a regex shouldn't be too hacky, but I cannot comment
>>> > whether some non-automated cases might slip through new user patrol. I
>>> > do think dumps make the 'users' table available, and I know for sure
>>> > one could get a full list via the API.
>>> >
>>> > As a check on this, you could see whether these usernames set the
>>> > "bot" flag when they edit. -AW
>>> >
>>>
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
