Personally, I'm a big fan of Scott's method (and the associated paper, which I've been throwing about internally ;)). Stu's points are worth addressing, though, and I think his per-edit approach is probably the way to go.
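
A rough Python sketch of the two timestamp-based heuristics from the thread quoted below: Scott's "no break longer than X hours in an edit session lasting at least Y hours" rule, and Stu's "more than 7 main namespace edits a minute for 10 minutes" burst threshold. The function names, the defaults (6 hours, 5 days, at least 71 edits in a 10-minute window) and the reading of the burst rule as a sliding window are assumptions made for illustration, not code from either paper. The input is just a list of edit timestamps for one account.

```python
from datetime import datetime, timedelta


def sessions(timestamps, max_gap):
    """Split timestamps into sessions wherever a pause exceeds max_gap."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    result = [[ordered[0]]]
    for ts in ordered[1:]:
        if ts - result[-1][-1] > max_gap:
            result.append([ts])      # long pause: start a new session
        else:
            result[-1].append(ts)    # short pause: same session
    return result


def looks_nonhuman(timestamps,
                   max_gap=timedelta(hours=6),
                   min_span=timedelta(days=5)):
    """True if any session lasts at least min_span (assumed X=6h, Y=5d)."""
    return any(s[-1] - s[0] >= min_span for s in sessions(timestamps, max_gap))


def burst_edit_indices(timestamps,
                       window=timedelta(minutes=10),
                       min_edits=71):
    """Indices (into the sorted timestamps) of edits that fall inside some
    window of the given length holding at least min_edits edits.
    71 is an assumed reading of "more than 7 a minute for 10 minutes"."""
    ordered = sorted(timestamps)
    flagged = set()
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it spans at most `window`.
        while ordered[end] - ordered[start] > window:
            start += 1
        if end - start + 1 >= min_edits:
            flagged.update(range(start, end + 1))
    return sorted(flagged)
```

The first check works at the account level, while the second returns individual edits, so a burst can be set aside without discarding everything else the account did.
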
TL;DR all the sensible things have already been said, I'm just +1ing them ;p.

On 18 May 2014 12:33, R.Stuart Geiger <sgei...@gmail.com> wrote:
> Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get
> no mercy. :-)
>
> But seriously, my tl;dr: instead of asking if an account is or isn't a
> bot, ask if a set of edits are or are not automated.
>
> Great responses so far: searching usernames for *bot will exclude non-bot
> users who were registered before the username policy change (although *Bot
> is a bit better), and the logging table is a great way to collect bot
> flags. However, Scott is right -- the bot flag (or *Bot username) doesn't
> signify a bot, it signifies a bureaucrat recognizing that a user account
> successfully went through the Bot Approval Group process. If I see an
> account with a bot flag, I can generally assume the edits that account
> makes are initiated by an automated software agent. This is especially the
> case in the main namespace. The inverse assumption is not nearly as easy: I
> can't assume that every edit made from an account *without* a bot flag was
> *not* an automated edit.
>
> About unauthorized bots: yes, there are a relatively small number of
> Wikipedians who, on occasion, run fully-automated, continuously-operating
> bots without approval. Complicating this, if someone is going to take the
> time to build and run a bot, but isn't going to create a separate account
> for it, then it is likely that they are also using that account to do
> non-automated edits. Sometimes new bot developers will run an unauthorized
> bot under their own account during the initial stages of development, and
> only later in the process will they create a separate bot account and seek
> formal approval and flagging. It can get tricky when you exclude all the
> edits from an account for being automated based on a single suspicious set
> of edits.
>
> More commonly, there are people who use automated batch tools
> like AutoWikiBrowser to support one-off tasks, like mass find-and-replace
> or category cleanup. Accounts powered by AWB are technically not bots,
> only because a human has to sit there and click "save" for every batch edit
> that is made. Some people will create a separate bot account for AWB work
> and get it approved and flagged, but many more will not bother. Then
> there are people using semi-automated, human-in-the-loop tools like Huggle
> to do vandal fighting. I find that the really hard question is whether
> you include or exclude these different kinds of 'cyborgs', because it
> really makes you think hard about what exactly you're measuring. Is
> someone who does a mass find-and-replace on all articles in a category a
> co-author of each article they edit? Is a vandal fighter patrolling the
> recent changes feed with Huggle a co-author of all the articles they edit
> when they revert vandalism and then move on to the next diff? What about
> somebody using rollback in the web browser? If so, what is it that makes
> these entities authors and ClueBot NG not an author?
>
> When you think about it, user accounts are actually pretty remarkable in
> that they allow such a diverse set of uses and agents to be attributed to a
> single entity. So when it comes to identifying automation, I personally
> think it is better to shift the unit of analysis from the user account to
> the individual edit. A bot flag lets you assume all edits from an account
> are automated, but you can use a range of approaches to identify sets of
> automated edits from non-flagged accounts. For example, I have a set of
> regex SQL queries in the Query Library [1] which parse edit summaries for
> the traces that AWB, Huggle, Twinkle, rollback, etc. automatically leave
> by default.
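
A minimal sketch of the edit-summary idea Stuart describes, written in Python rather than as the SQL queries in the Query Library [1] (which are not reproduced here). The patterns are illustrative assumptions about the kind of trace each tool leaves in a summary, and tag_edit is a hypothetical helper; they should be checked against real summaries before being relied on.

```python
import re

# Assumed, illustrative patterns only. They are NOT the Query Library
# regexes, and tool defaults change over time.
TOOL_PATTERNS = {
    "AWB": re.compile(r"\bAWB\b|AutoWikiBrowser", re.IGNORECASE),
    "Huggle": re.compile(r"\[\[WP:HG\|HG\]\]|\bHuggle\b"),
    "Twinkle": re.compile(r"\[\[WP:TW\|TW\]\]|\bTwinkle\b"),
    "rollback": re.compile(r"^Reverted edits by \[\[Special:Contrib"),
}


def tag_edit(summary):
    """Return the sorted names of tools whose trace appears in the summary."""
    text = summary or ""  # deleted or missing summaries count as empty
    return sorted(name for name, pattern in TOOL_PATTERNS.items()
                  if pattern.search(text))


if __name__ == "__main__":
    for s in ["clean up using [[Project:AWB|AWB]]", "fixed a typo"]:
        print(s, "->", tag_edit(s))
```

Tagging each edit this way keeps the unit of analysis at the individual edit: an account's hand-made edits stay in the data even when its tool-assisted ones are filtered out.
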
> You can also use the edit session approach like Scott has suggested -- Aaron
> and I found a few unauthorized bots in our edit session study [2], and we
> were even using a more aggressive break, with no more than a 60 minute gap
> between edits. To catch short bursts of bulk edits, you could look at large
> numbers of edits made in a short period of time -- I'd say more than 7 main
> namespace edits a minute for 10 minutes would be a hard rate for even a
> very aggressive vandal fighter to maintain with Huggle.
>
> I'll conclude by saying that different kinds of automated editing
> techniques are different ways of participating in and contributing to
> Wikipedia. To systematically exclude automated edits is to remove a very
> important, meaningful, and heterogeneous kind of activity from view. These
> activities constitute a core part of what Wikipedia is, particularly
> those forms of automation which the community has explicitly authorized and
> recognized. Now, we researchers inevitably have to selectively reveal
> and occlude -- a co-authorship network based on main namespace edits also
> excludes talk page discussions and conflict resolution, and this also
> constitutes a core part of what Wikipedia is. It isn't wrong per se to
> exclude automated edits, and it is certainly much worse to not recognize
> that they exist at all. However, I always appreciate seeing how the
> analysis would be different if bots were not excluded. The fact that
> there are these weird users which absolutely dominate a co-authorship
> network graph if you don't filter them out is pretty amazing, at least to
> me.
>
> Best,
> Stuart
>
> [1] https://wiki.toolserver.org/view/MySQL_queries#Automated_tool_and_bot_edits
> [2] http://stuartgeiger.com/cscw13-labor-hours.pdf
>
> On Sun, May 18, 2014 at 10:08 AM, Scott Hale
> <computermacgy...@gmail.com> wrote:
>
>> Very helpful, Lukas, I didn't know about the logging table.
>>
>> In some recent work [1] I found many users that appeared to be bots but
>> whose edits did not have the bot flag set. My approach was to exclude users
>> who didn't have a break of more than 6 hours between edits over the entire
>> month I was studying. I was interested in the users who had multiple edit
>> sessions in the month and so went with a straight threshold. A way to keep
>> users with only one editing session would be to exclude users who have no
>> break longer than X hours in an edit session lasting at least Y hours
>> (e.g., a user who doesn't break for more than 6 hours in 5-6 days is
>> probably not human).
>>
>> Cheers,
>> Scott
>>
>> [1] Multilinguals and Wikipedia Editing
>> http://www.scotthale.net/pubs/?websci2014
>>
>> --
>> Scott Hale
>> Oxford Internet Institute
>> University of Oxford
>> http://www.scotthale.net/
>> scott.h...@oii.ox.ac.uk
>>
>> On Sun, May 18, 2014 at 5:45 PM, Lukas Benedix <lbene...@l3q.de> wrote:
>>
>>> Here is a list of currently flagged bots:
>>>
>>> https://en.wikipedia.org/w/index.php?title=Special:ListUsers&offset=&limit=2000&username=&group=bot
>>>
>>> Another good place to look for bots is here:
>>>
>>> https://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Bots%2FRequests_for_approval&namespace=4
>>>
>>> You should also have a look at these pages to find former bots:
>>> https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_1
>>> https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_2
>>>
>>> And last but not least, the logging table, which you can access via tool labs:
>>> SELECT DISTINCT(log_title)
>>> FROM logging
>>> WHERE log_action = 'rights'
>>> AND log_params LIKE '%bot%';
>>>
>>> Lukas
>>>
>>> On Sun, 18.05.2014 18:34, Andrew G. West wrote:
>>> > User name policy states that "*bot*" names are reserved for bots.
>>> > Thus, such a regex shouldn't be too hacky, but I cannot comment
>>> > whether some non-automated cases might slip through new user patrol.
>>> > I do think dumps make the 'users' table available, and I know for sure
>>> > one could get a full list via the API.
>>> >
>>> > As a check on this, you could verify whether these usernames set the
>>> > "bot" flag when they edit. -AW
>>> >
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation