On Sat, Jul 12, 2014 at 12:07 AM, Jeremy Baron <jer...@tuxmachine.com> wrote:
> On Jul 11, 2014 9:45 AM, "Marc A. Pelletier" <m...@uberbox.org> wrote:
>> On 07/11/2014 09:34 AM, John Mark Vandenberg wrote:
>> > Could ops confirm they have the username of each logged in edit at
>> > their finger tips (i.e. roughly as easy to access as the user-agent)?
>> > Pywikibot doesn't permit logged out edits.
>>
>> We do, after the fact, from the same data Checkusers have access to.
>
> Not if they don't make an edit.
>
> There's lots of options for bots to cause trouble for ops. (including
> things that affect all wikis on the cluster, not just the specific one they
> were accessing)

And it is quite reasonable to assume bots could be causing problems
when 'reading', as that is a large component of what they do. In the
case of pywikibot, it only knows how to use the API, so I inaccurately
said 'edit' when I meant 'API request'; my apologies for that.

>> I'm not sure where that talk occurred; I have not been made aware of it
>> and it didn't filter through the normal ops channels that I've seen.
>
> I believe that's referring to the pywikipedia list.

Correct.

>> I'm a little surprised by Antoine's suggestion that it is important that
>> the bot user's information is in the UA string - it doesn't seem useful
>> or necessary to me.  Bots shouldn't be editing while logged out in the
>> first place, so the bot account will normally always be plain to see.
>>
>> Obviously, having the user account in the UA would help a bit in
>> tracking down errant bots when they happen but that should be a rare
>> occurrence and we have other methods to use in those cases.
>
> Varnish has access to the cookies, sure. But we log UA string and not
> cookies. Or maybe analytics is doing extra logging I didn't notice?

It would be good to know the answer to whether the username is logged
against API requests.  It seems like a very important piece of
information which should be visible in server ops logging of API
usage.

> If
> you're looking at request logs or varnishtop then UA string is a convenient
> way (and the standard way we've always suggested to bot operators) to
> identify the bot.
> Imagine if you've identified a specific type of bad request in logs and
> they're all from one IP and one UA string. Varnish can easily send an error
> for a certain UA string+IP address before it hits the apaches if you need
> it to. But if that UA string is generic then you may end up blocking
> collateral damage instead of just the one broken bot.

I'd like to tease out what is the most useful data for pywikibot to
include in the UA.  Apparently pywikibot needs to add something to the
user-agent to be 'gold standard' for accessing Wikimedia projects, but
pywikibot also needs to avoid collecting and publishing personal
information for wikis that don't require it, or for operators who refuse
to disclose it.  Many Wikimedia bot operators run a few commands a week,
and it is not likely to be useful to contact them if their script is
misbehaving - they are probably scratching their heads too.

And ideally pywikibot does this automatically, or makes it really easy
to set up and/or provides a benefit to the bot operator for having
provided additional information.

Username is easy, if it is needed. Checking that email is enabled is
also easy, and is comparable to sysops being expected to enable email
on many wikis.

pywiki requiring bot operators to provide an email address is
technically easy, but I suspect it isn't going to be very successful or
appreciated, esp. for non-SSL wikis.  Nor is it likely to be understood:
pywiki hasn't put this info in the user-agent since the new user-agent
policy was introduced, so why now?  It also has data privacy issues, as
user agents appear in logs.  Are the user-agents completely deleted?

If the main source of problems is the 'large' bots, they usually run
many tasks, and it is likely to only be a single task causing
problems.  With these large tasks, ideally they are paused rather than
blocked, in which case we need to introduce a standardised way to
pause a bot.  In these cases, the user agent could mention the task
identifier, and that identifier could be used to pause it until an
operator has checked their email.  The 'pause' command interface could
be IRC or user_talk, or something new based on Flow, or an API response
warning like replag which pywikibot honours.  I appreciate Bináris'
point that some (most?) wikis, especially smaller wikis, do not have
'task approval' processes with a task identifier, so this would need
to be optional.  Large bot operators would use this feature if it
meant that only a single task is paused rather than the bot account
blocked.
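
To make the 'API response warning' variant concrete, here is a rough
sketch of what the client side could look like.  Everything about the
warning is an assumption for illustration: no 'pause' warning exists in
the API today, and the names paused_tasks and may_continue are
hypothetical, not pywikibot functions.

```python
def paused_tasks(api_response):
    """Return the task identifiers named by a HYPOTHETICAL 'pause' warning.

    Assumed response shape (not a real MediaWiki API feature):
        {"warnings": {"pause": {"tasks": ["7", "12"]}}}
    """
    warning = api_response.get("warnings", {}).get("pause", {})
    return set(warning.get("tasks", []))


def may_continue(task_id, api_response):
    """Decide whether the current task may keep sending requests.

    task_id is optional: wikis without a task approval process have no
    identifier to pause, so such bots are never paused by this check.
    """
    if task_id is None:
        return True
    return task_id not in paused_tasks(api_response)
```

A bot would call may_continue() after each API response and stop
working on that one task when it returns False, leaving its other
tasks running.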

For the normal usage of pywikibot, i.e. invoking an existing script
that is maintained by pywikibot, we could include in the user-agent
which script is running (e.g. move.py).
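
As a minimal sketch of how the script name, and optionally the username
and task identifier, could be combined: the format below is only an
assumption for discussion, not the user-agent pywikibot currently sends,
and build_user_agent and the version string are made up for the example.

```python
from pathlib import Path
from urllib.parse import quote


def build_user_agent(script_path, version, username=None, task_id=None):
    """Compose a user-agent from the running script and optional details.

    The layout "<script> Pywikibot/<version> (User:<name>; task:<id>)" is a
    hypothetical format, chosen only to illustrate the idea.
    """
    script = Path(script_path).stem  # "scripts/move.py" -> "move"
    details = []
    if username:
        # Percent-encode so non-ASCII usernames stay header-safe.
        details.append("User:" + quote(username, safe=""))
    if task_id:
        details.append("task:" + str(task_id))
    user_agent = f"{script} Pywikibot/{version}"
    if details:
        user_agent += " (" + "; ".join(details) + ")"
    return user_agent
```

Both the username and the task identifier are optional here, so an
operator who declines to disclose anything still gets a user-agent that
names the script.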

Determining whether the bot code has been modified by the operator is
a bit of work, but I think that is more feasible than attempting to
convince bot operators to add their email address to the user-agent
when they haven't modified the code.  They would quite rightly expect
the pywikibot dev team to be responsible if the package code has bugs.
OTOH, it would be more reasonable to ask for a bot operator's email
address if pywikibot detects its code has been modified.
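
One possible way to detect modification, sketched under the assumption
that each release ships a manifest mapping file paths to SHA-256
digests.  No such manifest exists in pywikibot today, and the function
names are made up for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def modified_files(root, manifest):
    """List files that are missing or differ from the release manifest.

    manifest is a HYPOTHETICAL dict of {relative_path: sha256_hex} that
    would have to be generated when a release is packaged.
    """
    root = Path(root)
    changed = []
    for relative_path, expected in sorted(manifest.items()):
        candidate = root / relative_path
        if not candidate.is_file() or file_digest(candidate) != expected:
            changed.append(relative_path)
    return changed
```

An empty result would mean the operator is running unmodified package
code; a non-empty one could trigger the request for contact details.
Note that this does not notice extra scripts the operator has added,
only changes to the files listed in the manifest.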

> Coren, what you say above is a change from past statements:
> I can recall more than a few past conversations about this with mzmcbride,
> Tim Starling and others. (Usually comes up when someone comes asking for
> whatever app they're writing for a small number of operators. not a big
> framework like pwb)

What user agents do the other large editing frameworks use?

And are there examples of unmodified pywikibot causing headaches in
the ops team?  Maybe prior problems may help us work out what
information the ops teams 'needed' in the past when these problems
occurred.

--
John Vandenberg

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
