Re: [Wikitech-l] User agent policy for bots

2014-07-18 Thread Amir Ladsgroup
I made a patch to add any customized user agent (the username would be the
default); if a person doesn't want to add any user agent, they can just
set it to  

https://gerrit.wikimedia.org/r/#/c/147381/

I would be happy to receive any comments regarding this patch.
Best


On Mon, Jul 14, 2014 at 7:00 PM, Brad Jorsch (Anomie) bjor...@wikimedia.org
 wrote:

 Note this reply represents my own views, but does not represent an official
 WMF position.

 On Sun, Jul 13, 2014 at 4:25 PM, John Mark Vandenberg jay...@gmail.com
 wrote:

  It would be good to know the answer to whether the username is logged
  against API requests.  It seems like a very important piece of
  information which should be visible in server ops logging of API
  usage.
 

 The API request log does record usernames. And doesn't contain user agents,
 for that matter.

 But my guess is that at least some of the types of problems Ops would be
 concerned with are in different log files that probably do not contain
 usernames but do contain user agents.


  username is easy, if it is needed.


 I would include username. The only harm is a few extra bytes per request.


  pywiki requiring bot operators to provide an email address is technically
  easy, but I suspect it isn't going to be very successful or
  appreciated, esp. for non-SSL wikis, or understood, as pywiki hasn't put
  this info in the user-agent since the new user-agent policy was
  introduced, so why now?
 

 I don't see any particular need for email addresses if the on-wiki username
 is provided. The key is some method of contact.


  If the main source of problems is the 'large' bots, they usually run
  many tasks, and it is likely to only be a single task causing
  problems.  With these large tasks, ideally they are paused rather than
  blocked, in which case we need to introduce a standardised way to
  pause a bot.  In these cases, the user agent could mention the task
  identifier, and that identifier could be used to pause it until an
  operator has checked their email.  The 'pause' command interface could
  be IRC or user_talk, or something new based on Flow, or an API response
  warning like replag which pywikibot honours.  I appreciate Bináris'
  point that some (most?) wikis, especially smaller wikis, do not have
  'task approval' processes with a task identifier, so this would need
  to be optional.  Large bot operators would use this feature if it
  meant that only a single task is paused rather than the bot account
  blocked.
 
  For the normal usage of pywikibot, i.e. invoking an existing script
  which is maintained by pywikibot, we could include in the user-agent
  which script is running (e.g. move.py).
 

 Including the task name, which for pywikibot could be the script name,
 seems sensible to me. Besides the stated benefit of distinguishing which
 script in a multi-task bot is problematic, it would also help in determining
 that multiple accounts/IPs are running the same problematic script.

 I wouldn't go as far as requiring the task name to correspond to any
 particular on-wiki approval, although bots on wikis with such approval
 processes could well use the title of the approval page as their task name.

 What user agents do the other large editing frameworks use?
 

 I can tell you AnomieBOT uses AnomieBOT 1.0 ($TASKNAME; see
 [[User:$USERNAME]]). Not sure if you consider it a large editing
 framework.

 The task names the bot uses are generally listed on the bot's userpage;
 various one-off scripts I use locally will use some ad-hoc identifier, or
 no task if I forgot to have the script set a task name.

 (I should change that to start with AnomieBOT/1.0 to comply with RFC 2616,
 now that I think of it)

 --
 Brad Jorsch (Anomie)
 Software Engineer
 Wikimedia Foundation
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l




-- 
Amir

Re: [Wikitech-l] User agent policy for bots

2014-07-14 Thread Brad Jorsch (Anomie)
Note this reply represents my own views, but does not represent an official
WMF position.

On Sun, Jul 13, 2014 at 4:25 PM, John Mark Vandenberg jay...@gmail.com
wrote:

 It would be good to know the answer to whether the username is logged
 against API requests.  It seems like a very important piece of
 information which should be visible in server ops logging of API
 usage.


The API request log does record usernames. And doesn't contain user agents,
for that matter.

But my guess is that at least some of the types of problems Ops would be
concerned with are in different log files that probably do not contain
usernames but do contain user agents.


 username is easy, if it is needed.


I would include username. The only harm is a few extra bytes per request.


 pywiki requiring bot operators to provide an email address is technically
 easy, but I suspect it isn't going to be very successful or
 appreciated, esp. for non-SSL wikis, or understood, as pywiki hasn't put
 this info in the user-agent since the new user-agent policy was
 introduced, so why now?


I don't see any particular need for email addresses if the on-wiki username
is provided. The key is some method of contact.


 If the main source of problems is the 'large' bots, they usually run
 many tasks, and it is likely to only be a single task causing
 problems.  With these large tasks, ideally they are paused rather than
 blocked, in which case we need to introduce a standardised way to
 pause a bot.  In these cases, the user agent could mention the task
 identifier, and that identifier could be used to pause it until an
 operator has checked their email.  The 'pause' command interface could
 be IRC or user_talk, or something new based on Flow, or an API response
 warning like replag which pywikibot honours.  I appreciate Bináris'
 point that some (most?) wikis, especially smaller wikis, do not have
 'task approval' processes with a task identifier, so this would need
 to be optional.  Large bot operators would use this feature if it
 meant that only a single task is paused rather than the bot account
 blocked.

 For the normal usage of pywikibot, i.e. invoking an existing script
 which is maintained by pywikibot, we could include in the user-agent
 which script is running (e.g. move.py).


Including the task name, which for pywikibot could be the script name,
seems sensible to me. Besides the stated benefit of distinguishing which
script in a multi-task bot is problematic, it would also help in determining
that multiple accounts/IPs are running the same problematic script.

I wouldn't go as far as requiring the task name to correspond to any
particular on-wiki approval, although bots on wikis with such approval
processes could well use the title of the approval page as their task name.

What user agents do the other large editing frameworks use?


I can tell you AnomieBOT uses AnomieBOT 1.0 ($TASKNAME; see
[[User:$USERNAME]]). Not sure if you consider it a large editing framework.

The task names the bot uses are generally listed on the bot's userpage;
various one-off scripts I use locally will use some ad-hoc identifier, or
no task if I forgot to have the script set a task name.

(I should change that to start with AnomieBOT/1.0 to comply with RFC 2616,
now that I think of it)
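
A minimal sketch of the RFC 2616 shape mentioned above: a product token
joined to its version with a slash, followed by a parenthesised comment
carrying the task name and the user page. The names build_user_agent and
_escape_comment are hypothetical and are not taken from AnomieBOT's code;
this only illustrates the format.

```python
import re

# RFC 2616 "token": one or more characters that are neither control
# characters nor separators such as space, "/", "(", ")" or ";".
_TOKEN_RE = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def _escape_comment(text):
    """Backslash-escape characters that would end or nest an RFC 2616 comment."""
    return re.sub(r"([()\\])", r"\\\1", text)


def build_user_agent(product, version, task=None, username=None):
    """Return 'product/version (task; see [[User:username]])'.

    Hypothetical helper: the comment part is left out when neither a task
    nor a username is supplied.
    """
    if not _TOKEN_RE.fullmatch(product) or not _TOKEN_RE.fullmatch(version):
        raise ValueError("product and version must be RFC 2616 tokens")
    details = []
    if task:
        details.append(_escape_comment(task))
    if username:
        details.append("see [[User:%s]]" % _escape_comment(username))
    user_agent = "%s/%s" % (product, version)
    if details:
        user_agent += " (%s)" % "; ".join(details)
    return user_agent
```

Calling build_user_agent("AnomieBOT", "1.0", "ExampleTask", "AnomieBOT")
would give a string that starts with the product token AnomieBOT/1.0, which
is the part the current "AnomieBOT 1.0" form lacks.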

-- 
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation

Re: [Wikitech-l] User agent policy for bots

2014-07-13 Thread Bináris
John, thank you for writing this letter; I agree almost entirely, except:
 Also, ideally bots should link to the bot task approval page with
 every edit, either in the public edit summary or in the (invisible
 except by ops and check-users) user-agent.

On the one hand, approval rules differ from wiki to wiki even within the WMF
empire, not to mention other MW installations. There is no basis for assuming
that such a page exists for each bot and each task. On the other hand,
edit summaries are too short and they are meant for human reading;
repeating a page title or any permission is not signal but noise for a human,
and prevents the bot owner from filling in the summary with relevant
information. Changing the UA for every task is very difficult and I doubt
it would be worth it.



2014-07-11 15:34 GMT+02:00 John Mark Vandenberg jay...@gmail.com:

 On Fri, Jul 11, 2014 at 6:50 PM, Antoine Musso hashar+...@free.fr wrote:
  On 11/07/2014 01:09, Amir Ladsgroup wrote:
  Hello,
  In discussions on pywikipedia-l, people are not sure whether it is
  necessary to add the username of the bot operator to the user agent.
 
  The user agent policy https://meta.wikimedia.org/wiki/User-agent_policy
  mentions that people need to add contact information, but it's not
  clear whether that means contacting the tool-maker or the tool-user.
 
  Can you clarify it?
 
  Hello,
 
  As K. Peachey said, the aim is for Wikimedia operators to be able to
  identify the user running the bot.  The bot framework might be useful.
 
  A suitable user agent could be:
 
   HasharBot (http://fr.wikipedia.org/wiki/User:Hashar; hashar at free fr)
 
 
  We most probably already have the username in our logs, but it doesn't hurt
  to repeat it in the user-agent.  IRC nickname and email would be nice
  additions and probably save time.

 Could ops confirm they have the username of each logged-in edit at
 their fingertips (i.e. roughly as easy to access as the user-agent)?
 Pywikibot doesn't permit logged-out edits.

 There is some talk that if pywikibot doesn't fix its user-agent string,
 ops may need to block 'the toolserver' - could ops confirm that they
 would usually block a bot account before killing a well-known IP range
 like 'the toolserver' (or 'the wmf labs')?

 IMO it is pretty silly to put the username in the User-Agent for
 logged-in users who are running ad hoc tasks using unmodified pywikibot
 code, as they are the user, not its agent. In that scenario, a
 distinct version of pywikibot is the agent.  And an email address is
 even worse in this scenario.

 I do appreciate the need to uniquely identify different user agents,
 i.e. any customised code.  Pywikibot already detects which (git)
 revision it is running, and includes that in the user agent.  It also
 checks versions of files, but I don't think it accurately detects 'I am
 a customised bot', and it definitely doesn't include that in the user
 agent.  It should.

 Also, ideally bots should link to the bot task approval page with
 every edit, either in the public edit summary or in the (invisible
 except by ops and check-users) user-agent.

 Rather than asking bot operators to put an email address in the user
 agent, is it OK to have special:emailuser enabled on the bot and
 operator?  Or, have a master kill switch on the bot user (task) page?
 There is talk of an RFC for 'standardising' how the community can
 interact with pywikibot bots, such as disabling bot tasks or the bot
 account.
 https://gerrit.wikimedia.org/r/#/c/137980/
 Checking email is enabled, and ensuring the bot can be easily paused
 by 'the community' (inc. ops) strikes me as what is needed, rather
 than putting PII into the user *agent*.

 --
 John Vandenberg





-- 
Bináris

Re: [Wikitech-l] User agent policy for bots

2014-07-13 Thread John Mark Vandenberg
On Sat, Jul 12, 2014 at 12:07 AM, Jeremy Baron jer...@tuxmachine.com wrote:
 On Jul 11, 2014 9:45 AM, Marc A. Pelletier m...@uberbox.org wrote:
 On 07/11/2014 09:34 AM, John Mark Vandenberg wrote:
  Could ops confirm they have the username of each logged-in edit at
  their fingertips (i.e. roughly as easy to access as the user-agent)?
  Pywikibot doesn't permit logged-out edits.

 We do, after the fact, from the same data Checkusers have access to.

 Not if they don't make an edit.

 There are lots of options for bots to cause trouble for ops (including
 things that affect all wikis on the cluster, not just the specific one they
 were accessing).

And it is quite reasonable to assume bots could be causing problems
when 'reading', as that is a large component of what they do. In the
case of pywikibot, it only knows how to use the API, so I inaccurately
said 'edit' when I meant 'API request'; my apologies for that.

 I'm not sure where that talk occurred; I have not been made aware of it
 and it didn't filter through the normal ops channels that I've seen.

 I believe that's referring to the pywikipedia list.

Correct.

 I'm a little surprised by Antoine's suggestion that it is important that
 the bot user's information is in the UA string - it doesn't seem useful
 or necessary to me.  Bots shouldn't be editing while logged out in the
 first place, so the bot account will normally always be plain to see.

 Obviously, having the user account in the UA would help a bit in
 tracking down errant bots when they happen but that should be a rare
 occurrence and we have other methods to use in those cases.

 Varnish has access to the cookies, sure. But we log UA string and not
 cookies. Or maybe analytics is doing extra logging I didn't notice?

It would be good to know the answer to whether the username is logged
against API requests.  It seems like a very important piece of
information which should be visible in server ops logging of API
usage.

 If you're looking at request logs or varnishtop then the UA string is a
 convenient way (and the standard way we've always suggested to bot
 operators) to identify the bot.
 Imagine if you've identified a specific type of bad request in logs and
 they're all from one IP and one UA string. Varnish can easily send an error
 for a certain UA string+IP address before it hits the apaches if you need
 it to. But if that UA string is generic then you may end up blocking
 collateral damage instead of just the one broken bot.

I'd like to tease out what is the most useful data for pywikibot to
include in the UA.  Apparently pywikibot needs to add something to the
user-agent to be 'gold standard' for accessing Wikimedia projects, but
pywikibot also needs to avoid collecting and publishing personal
information for wikis that don't require it, or operators who refuse to
disclose.  Many Wikimedia bot operators run a few commands a week, and
it is not likely to be useful to contact them if their script is
misbehaving - they are probably scratching their heads too.

And ideally pywikibot does this automatically, or makes it really easy
to set up and/or provides a benefit to the bot operator for having
provided additional information.

username is easy, if it is needed. checking email is enabled is also
easy, and is comparable to sysops being expected to enable email on
many wikis.

pywiki requiring bot operators to provide an email address is technically
easy, but I suspect it isn't going to be very successful or
appreciated, esp. for non-SSL wikis, or understood, as pywiki hasn't put
this info in the user-agent since the new user-agent policy was
introduced, so why now?  It also has data privacy issues, as user
agents appear in logs.  Are the user-agents completely deleted?

If the main source of problems is the 'large' bots, they usually run
many tasks, and it is likely to only be a single task causing
problems.  With these large tasks, ideally they are paused rather than
blocked, in which case we need to introduce a standardised way to
pause a bot.  In these cases, the user agent could mention the task
identifier, and that identifier could be used to pause it until an
operator has checked their email.  The 'pause' command interface could
be IRC or user_talk, or something new based on Flow, or an API response
warning like replag which pywikibot honours.  I appreciate Bináris'
point that some (most?) wikis, especially smaller wikis, do not have
'task approval' processes with a task identifier, so this would need
to be optional.  Large bot operators would use this feature if it
meant that only a single task is paused rather than the bot account
blocked.

For the normal usage of pywikibot, i.e. invoking an existing script
which is maintained by pywikibot, we could include in the user-agent
which script is running (e.g. move.py).
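
A rough sketch of the two ideas above, assuming nothing about pywikibot's
real internals: the running script (and an optional task identifier) is
appended to the user agent as a comment, and a task can be checked against
a pause list. The names user_agent_with_task and is_paused, the "task:"
prefix and the "ExampleFramework/1.0" base string are all hypothetical;
where the pause list comes from (IRC, user_talk, an API warning) is left
open, as it is in this discussion.

```python
import os


def user_agent_with_task(base_agent, script_path, task_id=None):
    """Append the running script (and an optional task id) as a UA comment."""
    parts = [os.path.basename(script_path)]
    if task_id:
        parts.append("task:%s" % task_id)
    return "%s (%s)" % (base_agent, "; ".join(parts))


def is_paused(task_id, paused_tasks):
    """Return True when the pause list names this task.

    A bot without a task identifier is never matched, which keeps the
    identifier optional for wikis that have no task approval process.
    """
    return task_id is not None and task_id in paused_tasks


# Hypothetical usage: one task of a multi-task bot is paused, not the account.
agent = user_agent_with_task("ExampleFramework/1.0", "scripts/move.py", "cleanup-42")
paused = is_paused("cleanup-42", {"cleanup-42"})
```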

Determining whether the bot code has been modified by the operator is
a bit of work, but I think that is more feasible than attempting to
convince bot operators to add their 

Re: [Wikitech-l] User agent policy for bots

2014-07-11 Thread K. Peachey
It's designed so that if there is an issue with the bot (e.g. it's
malfunctioning) and it causes problems, the person who is in control can
easily be identified.

As such, the user-agent you choose should reflect that.


On 11 July 2014 09:09, Amir Ladsgroup ladsgr...@gmail.com wrote:

 Hello,
 In discussions on pywikipedia-l, people are not sure whether it is necessary
 to add the username of the bot operator to the user agent.

 The user agent policy https://meta.wikimedia.org/wiki/User-agent_policy
 mentions that people need to add contact information, but it's not
 clear whether that means contacting the tool-maker or the tool-user.

 Can you clarify it?
 Best
 --
 Amir

Re: [Wikitech-l] User agent policy for bots

2014-07-11 Thread Antoine Musso
On 11/07/2014 01:09, Amir Ladsgroup wrote:
 Hello,
 In discussions on pywikipedia-l, people are not sure whether it is necessary
 to add the username of the bot operator to the user agent.
 
 The user agent policy https://meta.wikimedia.org/wiki/User-agent_policy
 mentions that people need to add contact information, but it's not
 clear whether that means contacting the tool-maker or the tool-user.
 
 Can you clarify it?

Hello,

As K. Peachey said, the aim is for Wikimedia operators to be able to
identify the user running the bot.  The bot framework might be useful.

A suitable user agent could be:

 HasharBot (http://fr.wikipedia.org/wiki/User:Hashar; hashar at free fr)


We most probably already have the username in our logs, but it doesn't hurt
to repeat it in the user-agent.  IRC nickname and email would be nice
additions and probably save time.
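
As a sketch only, this is one way a script could attach such a user agent
to an API request with the Python standard library. The function name
make_api_request is hypothetical, and the api.php URL and query parameters
are assumed for illustration; the request is built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def make_api_request(api_url, params, bot_name, user_page, contact):
    """Build (but do not send) a GET request carrying a descriptive User-Agent."""
    user_agent = "%s (%s; %s)" % (bot_name, user_page, contact)
    return Request(api_url + "?" + urlencode(params),
                   headers={"User-Agent": user_agent})


request = make_api_request(
    "https://fr.wikipedia.org/w/api.php",  # assumed endpoint, for illustration
    {"action": "query", "meta": "siteinfo", "format": "json"},
    "HasharBot",
    "http://fr.wikipedia.org/wiki/User:Hashar",
    "hashar at free fr",
)
```

Passing the request to urllib.request.urlopen would send it with that
header; every request the bot makes would need the same treatment for the
user agent to show up consistently in the logs.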

cheers,

-- 
Antoine hashar Musso



Re: [Wikitech-l] User agent policy for bots

2014-07-11 Thread John Mark Vandenberg
On Fri, Jul 11, 2014 at 6:50 PM, Antoine Musso hashar+...@free.fr wrote:
 On 11/07/2014 01:09, Amir Ladsgroup wrote:
 Hello,
 In discussions on pywikipedia-l, people are not sure whether it is necessary
 to add the username of the bot operator to the user agent.

 The user agent policy https://meta.wikimedia.org/wiki/User-agent_policy
 mentions that people need to add contact information, but it's not
 clear whether that means contacting the tool-maker or the tool-user.

 Can you clarify it?

 Hello,

 As K. Peachey said, the aim is for Wikimedia operators to be able to
 identify the user running the bot.  The bot framework might be useful.

 A suitable user agent could be:

  HasharBot (http://fr.wikipedia.org/wiki/User:Hashar; hashar at free fr)


 We most probably already have the username in our logs, but it doesn't hurt
 to repeat it in the user-agent.  IRC nickname and email would be nice
 additions and probably save time.

Could ops confirm they have the username of each logged-in edit at
their fingertips (i.e. roughly as easy to access as the user-agent)?
Pywikibot doesn't permit logged-out edits.

There is some talk that if pywikibot doesn't fix its user-agent string,
ops may need to block 'the toolserver' - could ops confirm that they
would usually block a bot account before killing a well-known IP range
like 'the toolserver' (or 'the wmf labs')?

IMO it is pretty silly to put the username in the User-Agent for
logged-in users who are running ad hoc tasks using unmodified pywikibot
code, as they are the user, not its agent. In that scenario, a
distinct version of pywikibot is the agent.  And an email address is
even worse in this scenario.

I do appreciate the need to uniquely identify different user agents,
i.e. any customised code.  Pywikibot already detects which (git)
revision it is running, and includes that in the user agent.  It also
checks versions of files, but I don't think it accurately detects 'I am
a customised bot', and it definitely doesn't include that in the user
agent.  It should.
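
One possible way to detect customised code, sketched under the assumption
that a list of digests for the released files is available: compare each
file's digest with the released one and flag any difference in the user
agent. This is not how pywikibot does its version check; the function
names, the "r=" prefix and the "customised" marker are all hypothetical.

```python
import hashlib


def digest(data):
    """Return the SHA-1 hex digest of a file's bytes."""
    return hashlib.sha1(data).hexdigest()


def modified_files(current_files, released_digests):
    """List file names whose contents differ from the released digests.

    current_files maps file name -> bytes as found on disk;
    released_digests maps file name -> expected hex digest.
    Files that are not part of the release count as modifications too.
    """
    return sorted(
        name for name, data in current_files.items()
        if released_digests.get(name) != digest(data)
    )


def revision_component(revision, current_files, released_digests):
    """Return 'r=<short revision>', plus a marker when any file was changed."""
    component = "r=%s" % revision[:7]
    if modified_files(current_files, released_digests):
        component += "; customised"
    return component
```

The returned component could then be placed in the comment part of the
user agent, so that an unmodified checkout and a locally patched one are
distinguishable in the logs without disclosing who the operator is.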

Also, ideally bots should link to the bot task approval page with
every edit, either in the public edit summary or in the (invisible
except by ops and check-users) user-agent.

Rather than asking bot operators to put an email address in the user
agent, is it OK to have special:emailuser enabled on the bot and
operator?  Or, have a master kill switch on the bot user (task) page?
There is talk of an RFC for 'standardising' how the community can
interact with pywikibot bots, such as disabling bot tasks or the bot
account.
https://gerrit.wikimedia.org/r/#/c/137980/
Checking email is enabled, and ensuring the bot can be easily paused
by 'the community' (inc. ops) strikes me as what is needed, rather
than putting PII into the user *agent*.

-- 
John Vandenberg


Re: [Wikitech-l] User agent policy for bots

2014-07-11 Thread Marc A. Pelletier
On 07/11/2014 09:34 AM, John Mark Vandenberg wrote:
 Could ops confirm they have the username of each logged-in edit at
 their fingertips (i.e. roughly as easy to access as the user-agent)?
 Pywikibot doesn't permit logged-out edits.

We do, after the fact, from the same data Checkusers have access to.

 There is some talk that if pywikibot doesn't fix its user-agent string,
 ops may need to block 'the toolserver' - could ops confirm that they
 would usually block a bot account before killing a well-known IP range
 like 'the toolserver' (or 'the wmf labs')?

That's certainly what *I* would do, and the same applies at least to the
English Wikipedia (where the blocking page clearly points out
sensitive ranges which should not be blocked except in cases of dire
emergencies).

I'm not sure where that talk occurred; I have not been made aware of it
and it didn't filter through the normal ops channels that I've seen.
I'm a little surprised by Antoine's suggestion that it is important that
the bot user's information is in the UA string - it doesn't seem useful
or necessary to me.  Bots shouldn't be editing while logged out in the
first place, so the bot account will normally always be plain to see.

Obviously, having the user account in the UA would help a bit in
tracking down errant bots when they happen but that should be a rare
occurrence and we have other methods to use in those cases.

-- Marc



Re: [Wikitech-l] User agent policy for bots

2014-07-11 Thread Jeremy Baron
On Jul 11, 2014 9:45 AM, Marc A. Pelletier m...@uberbox.org wrote:
 On 07/11/2014 09:34 AM, John Mark Vandenberg wrote:
  Could ops confirm they have the username of each logged-in edit at
  their fingertips (i.e. roughly as easy to access as the user-agent)?
  Pywikibot doesn't permit logged-out edits.

 We do, after the fact, from the same data Checkusers have access to.

Not if they don't make an edit.

There are lots of options for bots to cause trouble for ops (including
things that affect all wikis on the cluster, not just the specific one they
were accessing).

 I'm not sure where that talk occurred; I have not been made aware of it
 and it didn't filter through the normal ops channels that I've seen.

I believe that's referring to the pywikipedia list.

 I'm a little surprised by Antoine's suggestion that it is important that
 the bot user's information is in the UA string - it doesn't seem useful
 or necessary to me.  Bots shouldn't be editing while logged out in the
 first place, so the bot account will normally always be plain to see.

 Obviously, having the user account in the UA would help a bit in
 tracking down errant bots when they happen but that should be a rare
 occurrence and we have other methods to use in those cases.

Varnish has access to the cookies, sure. But we log UA string and not
cookies. Or maybe analytics is doing extra logging I didn't notice? If
you're looking at request logs or varnishtop then the UA string is a
convenient way (and the standard way we've always suggested to bot
operators) to identify the bot.

Imagine if you've identified a specific type of bad request in logs and
they're all from one IP and one UA string. Varnish can easily send an error
for a certain UA string+IP address before it hits the apaches if you need
it to. But if that UA string is generic then you may end up blocking
collateral damage instead of just the one broken bot.
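
The collateral-damage point can be sketched in Python (Varnish itself would
express such a rule in its own configuration language, which is not shown
here). The log entries, the "GenericFramework/1.0" and "ExampleBot/1.0"
strings and the function names are all hypothetical; the addresses are taken
from the ranges reserved for documentation.

```python
def matches_agent(entry, user_agent):
    """Match on the user agent alone."""
    return entry["user_agent"] == user_agent


def matches_agent_and_ip(entry, user_agent, ip):
    """Match only when both the user agent and the client address agree."""
    return entry["user_agent"] == user_agent and entry["ip"] == ip


# Hypothetical request log: three clients share one generic user agent.
LOG = [
    {"ip": "192.0.2.10", "user_agent": "GenericFramework/1.0"},
    {"ip": "192.0.2.10", "user_agent": "GenericFramework/1.0"},
    {"ip": "198.51.100.7", "user_agent": "GenericFramework/1.0"},
    {"ip": "203.0.113.5", "user_agent": "ExampleBot/1.0 (move.py; User:Example)"},
]

# Requests hit by a rule on user agent + IP versus user agent only.
narrow = [e for e in LOG if matches_agent_and_ip(e, "GenericFramework/1.0", "192.0.2.10")]
broad = [e for e in LOG if matches_agent(e, "GenericFramework/1.0")]

# Requests from other clients that the broader rule would also reject.
collateral = len(broad) - len(narrow)
```

With a generic string the broad rule also catches the unrelated client at
198.51.100.7, while a distinctive user agent such as the last entry can be
matched on its own without touching anyone else.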

Coren, what you say above is a change from past statements:
I can recall more than a few past conversations about this with mzmcbride,
Tim Starling and others. (Usually comes up when someone comes asking for
whatever app they're writing for a small number of operators. not a big
framework like pwb)

-Jeremy

Re: [Wikitech-l] User agent policy for bots

2014-07-11 Thread Marc A. Pelletier
On 07/11/2014 10:07 AM, Jeremy Baron wrote:
 I can recall more than a few past conversations about this with mzmcbride,
 Tim Starling and others.

Keep in mind I've only been around for ~18 months, so I am going to be
unaware of some previous discussion on the subject.

But yeah, /clearly/ the more specific a UA is, the more precise any
intervention can be.  I'm not sure how reasonable it is to extrapolate
from that to "must have email/IRC nick/etc."  You want your UA to be as
unique as possible, certainly, and the more info you give the more
likely it is that we are able to talk to you before we take drastic
measures.

-- Marc



[Wikitech-l] User agent policy for bots

2014-07-10 Thread Amir Ladsgroup
Hello,
In discussions on pywikipedia-l, people are not sure whether it is necessary
to add the username of the bot operator to the user agent.

The user agent policy https://meta.wikimedia.org/wiki/User-agent_policy
mentions that people need to add contact information, but it's not
clear whether that means contacting the tool-maker or the tool-user.

Can you clarify it?
Best
-- 
Amir