Okay. Methodology:

* Take the last 5 days of request logs;
* Filter them down to text/html requests as a heuristic for non-API requests;
* Run them through the UA parser we use;
* Exclude spiders and anything that reported a valid browser;
* Aggregate the user agents that are left;
* ???
* Profit.
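For anyone who wants to poke at this themselves, here is roughly what those steps look like in Python. Caveats: the filename and column positions are invented (the real sampled logs are not in this exact format), and I'm using the ua-parser package as a stand-in for the parser we actually run, so treat it as a sketch of the approach rather than the code that produced the numbers below.

    # Rough sketch only. "sampled-1000.tsv" and the column positions are
    # hypothetical stand-ins for whatever the sampled request logs look like,
    # and ua-parser stands in for the UA parser we actually use.
    from collections import Counter
    from ua_parser import user_agent_parser

    counts = Counter()
    with open("sampled-1000.tsv") as logs:
        for line in logs:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 14:
                continue
            content_type, ua_string = fields[8], fields[13]  # hypothetical positions
            # text/html as a heuristic for non-API requests
            if not content_type.startswith("text/html"):
                continue
            parsed = user_agent_parser.Parse(ua_string)
            # Exclude spiders...
            if parsed["device"]["family"] == "Spider":
                continue
            # ...and anything the parser recognises as a normal browser
            if parsed["user_agent"]["family"] != "Other":
                continue
            counts[ua_string] += 1  # aggregate what's left

    for ua, n in counts.most_common(20):
        print(n, ua)

The 'family == "Other"' check is just a crude way of saying "the parser didn't recognise this as a normal browser"; anything spoofing a browser UA will obviously slip straight past it.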
It looks like there are a relatively small number of bots that browse/interact via the web. Ones I can identify include WPCleaner[0], which is semi-automated, something I can't find through WP or Google called "DigitalsmithsBot" (could be internal, could be external), and Hoo Bot (run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general framework that could be masking multiple underlying bots and has ~7.4m requests through the web interface in that time period.

Obvious caveat is obvious; the edits from these tools may actually come through the API, and they're just choosing to request content through the web interface for some weird reason. I don't know enough about the software behind each bot to comment on that. I can try explicitly looking for web-based edit attempts, but there would be far fewer observations in which the bots might appear, because the underlying dataset is sampled at a 1:1000 rate.

[0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation

On 20 May 2014 07:50, Oliver Keyes <oke...@wikimedia.org> wrote:

> Actually, belay that, I have a pretty good idea. I'll fire the log parser
> up now.
>
>
> On 20 May 2014 01:21, Oliver Keyes <oke...@wikimedia.org> wrote:
>
>> I think a *lot* of them use the API, but I don't know off the top of my
>> head if it's *all* of them. If only we knew somebody who has spent the
>> last 3 months staring into the Cthulhian nightmare of our request logs
>> and could look this up...
>>
>> More seriously; drop me a note off-list so that I can try to work out
>> precisely what you need me to find out, and I'll write a quick-and-dirty
>> parser of our sampled logs to drag the answer kicking and screaming into
>> the light.
>>
>> (Sorry, it's annual review season. That always gets me blithe.)
>>
>>
>> On 19 May 2014 13:03, Scott Hale <computermacgy...@gmail.com> wrote:
>>
>>> Thanks all for the comments on my paper, and even more thanks to
>>> everyone sharing these super helpful ideas on filtering bots: this is
>>> why I love the Wikipedia research committee.
>>>
>>> I think Oliver is definitely right that
>>>
>>>> this would be a useful topic for some piece of method-comparing
>>>> research, if anyone is looking for paper ideas.
>>>
>>> "Citation goldmine", as one friend called it, I think.
>>>
>>> This won't address edit logs to date, but do we know if most bots and
>>> automated tools use the API to make edits? If so, would it be feasible
>>> to add a flag to each edit indicating whether it came through the API
>>> or not? This won't stop determined users, but it might be a nice way to
>>> separate cyborg edits from those made manually by the same user for
>>> many of the standard tools going forward.
>>>
>>> The closest thing I found in the bug tracker is [1], but it doesn't
>>> address the issue of 'what is a bot', which this thread has clearly
>>> shown is quite complex. An API edit vs. non-API edit flag might be a
>>> way forward unless there are automated tools/bots that don't use the
>>> API.
>>>
>>>
>>> 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181
>>>
>>>
>>> Cheers,
>>> Scott
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l