Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Brion Vibber
I'm not 100% convinced that the UA requirement is helpful, for two reasons:

1) Lots of requests will have a default like "PHP" or "Python/urllib" or
whatever from the tool they used to build their bot. These aren't helpful
either, as they contain no indication of how to get in touch.

2) It's trivial to work around the requirement for a non-blank UA by
setting one of the above, or worse -- cut-n-pasting the UA string from a
browser. If someone hacks this up real quick while testing, they may never
bother putting in contact information when their bot moves from a handful
of requests to gazillions.

Auto-throttling super-high-rate API clients (by IP/IP group) and giving
them an explicit "You really should contact us and, better yet, make it
possible for us to contact you" message might be nice.
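
A rough sketch of what such a throttle response could look like (the status
code, error shape and wording below are invented for illustration; this is not
an existing MediaWiki or Varnish behaviour):

  <?php
  // Sketch only: what the throttled path could send back to an anonymous,
  // high-rate client. Everything here is illustrative.
  function sendThrottleResponse(): void {
      http_response_code(429);
      header('Retry-After: 60');
      header('Content-Type: application/json; charset=utf-8');
      echo json_encode([
          'error' => [
              'code' => 'ratelimited',
              'info' => 'You are sending requests too quickly. Please contact us '
                  . 'and, better yet, make it possible for us to contact you: put '
                  . 'a URL or email address in your User-Agent, or register a key '
                  . 'so we can raise your limit.',
          ],
      ]);
  }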


We may want to seriously think about some sort of API key system... not
necessarily as mandatory for access (we love freedom and convenience!) but
perhaps as the way you get around being throttled for too many accesses.
This would give us a structured way of storing their contact information,
which might be better than unstructured names or addresses in the UA.

Does it make sense to tell people "log in to your bot's account with OAuth"
or is that too much of a pain in the ass versus "add this one parameter to
your requests with your key"? :)

-- brion


On Tue, Sep 1, 2015 at 10:23 AM, Oliver Keyes  wrote:

> Awesome; thanks for the analysis, Krinkle.
>
> Do we want to change this behaviour? From my point of view the answer
> is 'yes, not setting any kind of user agent is a violation of our API
> etiquette and we should be taking steps to alert people that it is'
> but if other people have different perspectives on this I'd love to
> hear them.
>
> On 1 September 2015 at 13:18, Krinkle  wrote:
> > I've confirmed just now that whatever requirement there was, it doesn't
> seem to be in effect.
> >
> > Omitting the header entirely, sending it as an empty string, and sending it
> > as "-" – all three result in a response from the MediaWiki API.
> >
> > $ curl -A '' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> >> GET /w/api.php?action=query&format=json HTTP/1.1
> >> Host: en.wikipedia.org
> >> Accept: */*
> > < HTTP/1.1 200 OK
> > ..
> > {"batchcomplete":""}
> >
> >
> > $ curl -A '-' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> >> GET /w/api.php?action=query&format=json HTTP/1.1
> >> User-Agent: -
> >> Host: en.wikipedia.org 
> >> Accept: */*
> > < HTTP/1.1 200 OK
> > ..
> > {"batchcomplete":""}
> >
> > In the past (2012?) these were definitely being blocked. (Ran into it
> from time to time on Toolserver)
> > It seems PHP's file_get_contents('http://...api..') is also working fine
> > now, without having to ini_set a user_agent value first.
> >
> > -- Krinkle
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Oliver Keyes
Specifically, the hypothesis that people are sending "-"?

On 1 September 2015 at 12:58, Tomasz Finc  wrote:
> Let's get a task in phab for this so that we can triage next steps.
> I'm curious about this as well.
>
> --tomasz
>
> On Tue, Sep 1, 2015 at 9:46 AM, Oliver Keyes  wrote:
>> On 1 September 2015 at 12:42, John  wrote:
>>> Could they be sending a non-standard header of "-"
>>
>> Perfectly possible although also impossible to detect :(
>>
>>>
>>> On Tuesday, September 1, 2015, Chad  wrote:
>>>
 On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes > wrote:

 > Is the
 > blocking of requests absent a user agent simply happening at a
 > 'higher' stage (in mediawiki itself?) and so not registering with the
 > varnishes,


 No, it's not done at the application level.


 > or is sending an /empty/ header simply A-OK?
 >
 >
 Shouldn't be, unless the policy changed...

 -Chad
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org 
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>> ___
>>> Wikitech-l mailing list
>>> Wikitech-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>>
>>
>> --
>> Oliver Keyes
>> Count Logula
>> Wikimedia Foundation
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Tomasz Finc
Tracking the overall issue

On Tue, Sep 1, 2015 at 9:59 AM, Oliver Keyes  wrote:
> Specifically, the hypothesis that people are sending "-"?
>
> On 1 September 2015 at 12:58, Tomasz Finc  wrote:
>> Let's get a task in phab for this so that we can triage next steps.
>> I'm curious about this as well.
>>
>> --tomasz
>>
>> On Tue, Sep 1, 2015 at 9:46 AM, Oliver Keyes  wrote:
>>> On 1 September 2015 at 12:42, John  wrote:
 Could they be sending a non-standard header of "-"
>>>
>>> Perfectly possible although also impossible to detect :(
>>>

 On Tuesday, September 1, 2015, Chad  wrote:

> On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes  > wrote:
>
> > Is the
> > blocking of requests absent a user agent simply happening at a
> > 'higher' stage (in mediawiki itself?) and so not registering with the
> > varnishes,
>
>
> No, it's not done at the application level.
>
>
> > or is sending an /empty/ header simply A-OK?
> >
> >
> Shouldn't be, unless the policy changed...
>
> -Chad
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Count Logula
>>> Wikimedia Foundation
>>>
>>> ___
>>> Wikitech-l mailing list
>>> Wikitech-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Tomasz Finc
Let's get a task in phab for this so that we can triage next steps.
I'm curious about this as well.

--tomasz

On Tue, Sep 1, 2015 at 9:46 AM, Oliver Keyes  wrote:
> On 1 September 2015 at 12:42, John  wrote:
>> Could they be sending a non-standard header of "-"
>
> Perfectly possible although also impossible to detect :(
>
>>
>> On Tuesday, September 1, 2015, Chad  wrote:
>>
>>> On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes >> > wrote:
>>>
>>> > Is the
>>> > blocking of requests absent a user agent simply happening at a
>>> > 'higher' stage (in mediawiki itself?) and so not registering with the
>>> > varnishes,
>>>
>>>
>>> No, it's not done at the application level.
>>>
>>>
>>> > or is sending an /empty/ header simply A-OK?
>>> >
>>> >
>>> Shouldn't be, unless the policy changed...
>>>
>>> -Chad
>>> ___
>>> Wikitech-l mailing list
>>> Wikitech-l@lists.wikimedia.org 
>>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Trey Jones
I agree with rate-limiting those without some sort of ID (login or API key).

As Oliver said, big (ab)users can massively skew our stats, often by
themselves. But hordes of upper middle volume bots (way too high for a
human, nowhere near the max for a superstar bot) can have a large
cumulative effect, too. We can't track them down individually, or even
detect that they are there because they are "only" involved in a fraction
of a percent of traffic—but a hundred such bots add up to a significant
skew, and reasonable rate limits could knock them down to manageable levels.

While enforcing UA requirements is inherently reasonable, anyone who doesn't
know to set up a valid UA string may also not know better than to just copy
one from a browser, which makes things worse. (I've done that myself in the past
when using curl with an uncooperative site. The shame.) Maybe rate limiting
will be the 80 in the 80/20 solution, and enforcing UA reqs won't be
necessary to control traffic, leaving them as a silly but effective way of
identifying certain kinds of traffic. The flip-side case would be
bajillions of very low volume bots—mimicking roughly human levels of
traffic and so sailing under rate limits—all with blank UAs. But we could spot
that once rate limiting slows down the ridiculously heavy hitters, and take
action as needed.


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Tue, Sep 1, 2015 at 1:44 PM, Oliver Keyes  wrote:

> If people aren't capable of following UA guidelines I doubt they're
> going to follow voluntary login.
>
> For what it's worth I absolutely support both rate-limiting and login
> to get around this. In fact, I would argue that from an analytics
> point of view rate limiting is probably the most high-profile problem
> we have with incoming data at the moment. It's far, far too common for
> random pieces of automata to set themselves up and massively skew our
> datasets; identifying this in advance is impossible (we don't always
> have IP data) and identifying them post-hoc on an individual basis is
> massively time consuming.
>
> Why don't we have rate limiting + login? Who would work on this? Why
> /should/ we not have rate limiting?
>
> On 1 September 2015 at 13:37, Brion Vibber  wrote:
> > I'm not 100% convinced that the UA requirement is helpful, for two
> reasons:
> >
> > 1) Lots of requests will have a default like "PHP" or "Python/urllib" or
> > whatever from the tool they used to build their bot. These aren't helpful
> > either, as they contain no indication of how to get in touch.
> >
> > 2) It's trivial to work around the requirement for a non-blank UA by
> > setting one of the above, or worse -- cut-n-pasting the UA string from a
> > browser. If someone hacks this up real quick while testing, they may
> never
> > bother putting in contact information when their bot moves from a handful
> > of requests to gazillions.
> >
> > Auto-throttling super-high-rate API clients (by IP/IP group) and giving
> > them an explicit "You really should contact us and, better yet, make it
> > possible for us to contact you" message might be nice.
> >
> >
> > We may want to seriously think about some sort of API key system... not
> > necessarily as mandatory for access (we love freedom and convenience!)
> but
> > perhaps as the way you get around being throttled for too many accesses.
> > This would give us a structured way of storing their contact information,
> > which might be better than unstructured names or addresses in the UA.
> >
> > Does it make sense to tell people "log in to your bot's account with
> OAuth"
> > or is that too much of a pain in the ass versus "add this one parameter
> to
> > your requests with your key"? :)
> >
> > -- brion
> >
> >
> > On Tue, Sep 1, 2015 at 10:23 AM, Oliver Keyes 
> wrote:
> >
> >> Awesome; thanks for the analysis, Krinkle.
> >>
> >> Do we want to change this behaviour? From my point of view the answer
> >> is 'yes, not setting any kind of user agent is a violation of our API
> >> etiquette and we should be taking steps to alert people that it is'
> >> but if other people have different perspectives on this I'd love to
> >> hear them.
> >>
> >> On 1 September 2015 at 13:18, Krinkle  wrote:
> >> > I've confirmed just now that whatever requirement there was, it
> doesn't
> >> seem to be in effect.
> >> >
> >> > Omitting the header entirely, sending it as an empty string, and sending
> >> > it as "-" – all three result in a response from the MediaWiki API.
> >> >
> >> > $ curl -A '' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> >> >> GET /w/api.php?action=query&format=json HTTP/1.1
> >> >> Host: en.wikipedia.org
> >> >> Accept: */*
> >> > < HTTP/1.1 200 OK
> >> > ..
> >> > {"batchcomplete":""}
> >> >
> >> >
> >> > $ curl -A '-' --include -v '
> >> 

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Oliver Keyes
Awesome; thanks for the analysis, Krinkle.

Do we want to change this behaviour? From my point of view the answer
is 'yes, not setting any kind of user agent is a violation of our API
etiquette and we should be taking steps to alert people that it is'
but if other people have different perspectives on this I'd love to
hear them.

On 1 September 2015 at 13:18, Krinkle  wrote:
> I've confirmed just now that whatever requirement there was, it doesn't seem 
> to be in effect.
>
> Omitting the header entirely, sending it as an empty string, and sending it
> as "-" – all three result in a response from the MediaWiki API.
>
> $ curl -A '' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
>> GET /w/api.php?action=query&format=json HTTP/1.1
>> Host: en.wikipedia.org
>> Accept: */*
> < HTTP/1.1 200 OK
> ..
> {"batchcomplete":""}
>
>
> $ curl -A '-' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
>> GET /w/api.php?action=query&format=json HTTP/1.1
>> User-Agent: -
>> Host: en.wikipedia.org 
>> Accept: */*
> < HTTP/1.1 200 OK
> ..
> {"batchcomplete":""}
>
> In the past (2012?) these were definitely being blocked. (Ran into it from 
> time to time on Toolserver)
> It seems PHP's file_get_contents('http://...api..') is also working fine now,
> without having to ini_set a user_agent value first.
>
> -- Krinkle
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Oliver Keyes
If people aren't capable of following UA guidelines I doubt they're
going to follow voluntary login.

For what it's worth I absolutely support both rate-limiting and login
to get around this. In fact, I would argue that from an analytics
point of view rate limiting is probably the most high-profile problem
we have with incoming data at the moment. It's far, far too common for
random pieces of automata to set themselves up and massively skew our
datasets; identifying this in advance is impossible (we don't always
have IP data) and identifying them post-hoc on an individual basis is
massively time consuming.

Why don't we have rate limiting + login? Who would work on this? Why
/should/ we not have rate limiting?

On 1 September 2015 at 13:37, Brion Vibber  wrote:
> I'm not 100% convinced that the UA requirement is helpful, for two reasons:
>
> 1) Lots of requests will have a default like "PHP" or "Python/urllib" or
> whatever from the tool they used to build their bot. These aren't helpful
> either, as they contain no indication of how to get in touch.
>
> 2) It's trivial to work around the requirement for a non-blank UA by
> setting one of the above, or worse -- cut-n-pasting the UA string from a
> browser. If someone hacks this up real quick while testing, they may never
> bother putting in contact information when their bot moves from a handful
> of requests to gazillions.
>
> Auto-throttling super-high-rate API clients (by IP/IP group) and giving
> them an explicit "You really should contact us and, better yet, make it
> possible for us to contact you" message might be nice.
>
>
> We may want to seriously think about some sort of API key system... not
> necessarily as mandatory for access (we love freedom and convenience!) but
> perhaps as the way you get around being throttled for too many accesses.
> This would give us a structured way of storing their contact information,
> which might be better than unstructured names or addresses in the UA.
>
> Does it make sense to tell people "log in to your bot's account with OAuth"
> or is that too much of a pain in the ass versus "add this one parameter to
> your requests with your key"? :)
>
> -- brion
>
>
> On Tue, Sep 1, 2015 at 10:23 AM, Oliver Keyes  wrote:
>
>> Awesome; thanks for the analysis, Krinkle.
>>
>> Do we want to change this behaviour? From my point of view the answer
>> is 'yes, not setting any kind of user agent is a violation of our API
>> etiquette and we should be taking steps to alert people that it is'
>> but if other people have different perspectives on this I'd love to
>> hear them.
>>
>> On 1 September 2015 at 13:18, Krinkle  wrote:
>> > I've confirmed just now that whatever requirement there was, it doesn't
>> seem to be in effect.
>> >
>> > Omitting the header entirely, sending it as an empty string, and sending it
>> > as "-" – all three result in a response from the MediaWiki API.
>> >
>> > $ curl -A '' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
>> >> GET /w/api.php?action=query&format=json HTTP/1.1
>> >> Host: en.wikipedia.org
>> >> Accept: */*
>> > < HTTP/1.1 200 OK
>> > ..
>> > {"batchcomplete":""}
>> >
>> >
>> > $ curl -A '-' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
>> >> GET /w/api.php?action=query&format=json HTTP/1.1
>> >> User-Agent: -
>> >> Host: en.wikipedia.org 
>> >> Accept: */*
>> > < HTTP/1.1 200 OK
>> > ..
>> > {"batchcomplete":""}
>> >
>> > In the past (2012?) these were definitely being blocked. (Ran into it
>> from time to time on Toolserver)
>> > It seems PHP's file_get_contents('http://...api..') is also working fine
>> > now, without having to ini_set a user_agent value first.
>> >
>> > -- Krinkle
>> > ___
>> > Wikitech-l mailing list
>> > Wikitech-l@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>>
>>
>> --
>> Oliver Keyes
>> Count Logula
>> Wikimedia Foundation
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Brad Jorsch (Anomie)
On Tue, Sep 1, 2015 at 1:18 PM, Krinkle  wrote:

> In the past (2012?) these were definitely being blocked. (Ran into it from
> time to time on Toolserver)
> It seems PHP's file_get_contents('http://...api..') is also working fine now,
> without having to ini_set a user_agent value first.
>

I wonder if it got lost in the move from Squid to Varnish, or something
along those lines.


-- 
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Krinkle
I've confirmed just now that whatever requirement there was, it doesn't seem to 
be in effect.

Omitting the header entirely, sending it as an empty string, and sending it as
"-" – all three result in a response from the MediaWiki API.

$ curl -A '' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> GET /w/api.php?action=query&format=json HTTP/1.1
> Host: en.wikipedia.org
> Accept: */*
< HTTP/1.1 200 OK
..
{"batchcomplete":""}


$ curl -A '-' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> GET /w/api.php?action=query&format=json HTTP/1.1
> User-Agent: -
> Host: en.wikipedia.org 
> Accept: */*
< HTTP/1.1 200 OK
..
{"batchcomplete":""}

In the past (2012?) these were definitely being blocked. (Ran into it from time 
to time on Toolserver)
It seems PHP's file_get_contents('http://...api..') is also working fine now,
without having to ini_set a user_agent value first.
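
For comparison, here is a minimal sketch of the polite version in PHP, sending
an explicit, contact-bearing User-Agent (the tool name, version and contact
details are placeholders, not a prescribed format):

  <?php
  // Process-wide default User-Agent for PHP's HTTP stream wrappers.
  ini_set('user_agent',
      'ExampleBot/0.1 (https://example.org/examplebot; examplebot@example.org)');
  $json = file_get_contents(
      'https://en.wikipedia.org/w/api.php?action=query&format=json');

  // Or per request, via a stream context instead of ini_set():
  $context = stream_context_create([
      'http' => [
          'user_agent' =>
              'ExampleBot/0.1 (https://example.org/examplebot; examplebot@example.org)',
      ],
  ]);
  $json = file_get_contents(
      'https://en.wikipedia.org/w/api.php?action=query&format=json', false, $context);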

-- Krinkle
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Brandon Black
On Tue, Sep 1, 2015 at 10:42 PM, Platonides  wrote:
> Brad Jorsch (Anomie) wrote:
>> I wonder if it got lost in the move from Squid to Varnish, or something
>> along those lines.
> That's likely, given that it was enforced by squid.

We could easily add it back in Varnish, too, but I tend to agree with
Brion's points that it's not ultimately helpful.

I really do like the idea of moving towards smarter ratelimiting of
APIs by default, though (and have brought this up in several contexts
recently, but I'm not really aware of whatever past work we've done in
that direction).  From that relatively-ignorant perspective, I tend to
envision an architecture where the front edge ratelimits API requests
(or even possibly, all requests, but we'd probably have to exclude a
lot of common spiders...) via a simple token-bucket-filter if they're
anonymous, but lets them run free if they superficially appear to have
a legitimate cookie or API access token.  Then it's up to the app
layer to enforce limits for the seemingly-identifiable traffic and be
configurable to raise them for legitimate remote clients we've had
contact with, and to reject legitimate-looking tokens/logins that the
edge chooses not to ratelimit but which aren't actually legitimate.
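
As a toy sketch of that outer check (the data shapes, rates and the bypass
condition are all illustrative; a real edge implementation would live in the
caching layer rather than in PHP):

  <?php
  // Toy per-IP token-bucket filter. Requests that superficially look
  // identified (session cookie or API token present) skip the edge check and
  // are left to the app layer's own, configurable limits. Numbers are made up.
  function edgeAllows(string $ip, bool $looksIdentified, array &$buckets,
                      float $ratePerSec = 5.0, float $burst = 50.0): bool {
      if ($looksIdentified) {
          return true; // app layer enforces (and can raise) limits for these
      }
      $now = microtime(true);
      $b = $buckets[$ip] ?? ['tokens' => $burst, 'ts' => $now];
      $b['tokens'] = min($burst, $b['tokens'] + ($now - $b['ts']) * $ratePerSec);
      $b['ts'] = $now;
      $allowed = $b['tokens'] >= 1.0;
      if ($allowed) {
          $b['tokens'] -= 1.0;
      }
      $buckets[$ip] = $b;
      return $allowed; // false: reject, e.g. with a 429 "please contact us"
  }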

-- Brandon

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Gergo Tisza
On Tue, Sep 1, 2015 at 4:54 PM, Brandon Black  wrote:

> I really do like the idea of moving towards smarter ratelimiting of
> APIs by default, though (and have brought this up in several contexts
> recently, but I'm not really aware of whatever past work we've done in
> that direction).  From that relatively-ignorant perspective, I tend to
> envision an architecture where the front edge ratelimits API requests
> (or even possibly, all requests, but we'd probably have to exclude a
> lot of common spiders...) via a simple token-bucket-filter if they're
> anonymous, but lets them run free if they superficially appear to have
> a legitimate cookie or API access token.  Then it's up to the app
> layer to enforce limits for the seemingly-identifiable traffic and be
> configurable to raise them for legitimate remote clients we've had
> contact with, and to reject legitimate-looking tokens/logins that the
> edge chooses not to ratelimit but which aren't actually legitimate.
>

Rate limiting / UA policy enforcement has to be done in Varnish, since API
responses can be cached there and so the requests don't necessarily reach
higher layers (and we wouldn't want to vary on user agent).
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Gabriel Wicke
On Tue, Sep 1, 2015 at 5:54 PM, Gergo Tisza  wrote:
>
>
> Rate limiting / UA policy enforcement has to be done in Varnish, since API
> responses can be cached there and so the requests don't necessarily reach
> higher layers (and we wouldn't want to vary on user agent).



The cost / benefit trade-offs for Varnish cache hits are fairly different
from those of cache misses. Especially for in-memory (frontend) hits it
might overall be cheaper to send a regular response, rather than adding
rate limit overheads to each cache hit.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Gabriel Wicke
We recently revisited rate limiting in
https://phabricator.wikimedia.org/T107934, but came to similar conclusions
as reached in this thread:


   - Limits for weak identifiers like IPs or user agents would (at least
   initially) need to be high enough to render the limiting borderline useless
   against DDOS attacks.
   - Stronger authentication requirements have significant costs to users,
   and will require non-trivial backend work to keep things efficient on our
   end. I believe we should tackle this backend work in any case, but it will
   take some time.
   - In our benchmarks, most off-the-shelf rate limiting libraries use
   per-request network requests to a central service like Redis, which costs
   latency and throughput, and has some scaling challenges. There are
   algorithms [1] that trade some precision for performance, but we aren't
   aware of any open source implementations we could use.

The dual of rate limiting is making each API request cheaper. We have
recently made some progress towards limiting the cost of individual API
requests, and are working towards making most API end points cacheable &
backed by storage.

Gabriel

[1]:
http://yahooeng.tumblr.com/post/111288877956/cloud-bouncer-distributed-rate-limiting-at-yahoo


On Tue, Sep 1, 2015 at 4:54 PM, Brandon Black  wrote:

> On Tue, Sep 1, 2015 at 10:42 PM, Platonides  wrote:
> > Brad Jorsch (Anomie) wrote:
> >> I wonder if it got lost in the move from Squid to Varnish, or something
> >> along those lines.
> > That's likely, given that it was enforced by squid.
>
> We could easily add it back in Varnish, too, but I tend to agree with
> Brion's points that it's not ultimately helpful.
>
> I really do like the idea of moving towards smarter ratelimiting of
> APIs by default, though (and have brought this up in several contexts
> recently, but I'm not really aware of whatever past work we've done in
> that direction).  From that relatively-ignorant perspective, I tend to
> envision an architecture where the front edge ratelimits API requests
> (or even possibly, all requests, but we'd probably have to exclude a
> lot of common spiders...) via a simple token-bucket-filter if they're
> anonymous, but lets them run free if they superficially appear to have
> a legitimate cookie or API access token.  Then it's up to the app
> layer to enforce limits for the seemingly-identifiable traffic and be
> configurable to raise them for legitimate remote clients we've had
> contact with, and to reject legitimate-looking tokens/logins that the
> edge chooses not to ratelimit but which aren't actually legitimate.
>
> -- Brandon
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Brandon Black
On Wed, Sep 2, 2015 at 1:21 AM, Gabriel Wicke  wrote:
> On Tue, Sep 1, 2015 at 5:54 PM, Gergo Tisza  wrote:
>>
>>
>> Rate limiting / UA policy enforcement has to be done in Varnish, since API
>> responses can be cached there and so the requests don't necessarily reach
>> higher layers (and we wouldn't want to vary on user agent).
>
>
>
> The cost / benefit trade-offs for Varnish cache hits are fairly different
> from those of cache misses. Especially for in-memory (frontend) hits it
> might overall be cheaper to send a regular response, rather than adding
> rate limit overheads to each cache hit.

Yeah I was mostly thinking of uncacheable API accesses.  If we can
cache it, we don't mind (as much) in terms of load/abuse.  By having
the simpler outer check in varnish, though, it takes the big load from
anonymous spikes away from being handled at the applayer for those
uncacheable hits.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Christian Aistleitner
On Tue, Sep 01, 2015 at 12:42:35PM -0400, John wrote:
> Could they be sending a non-standard header of "-"

They could.

But if a request comes in without a User-Agent header, the logging
pipeline silently translates it into "-".

Have fun,
Christian



P.S.: The relevant configuration (for webrequests) is at

https://github.com/wikimedia/operations-puppet/blob/production/modules/role/manifests/cache/kafka/webrequest.pp#L26

That long line contains '%{User-Agent@user_agent}i', which means

  log the request's User-Agent header

but no default value is provided. As no default value is provided,
varnishkafka uses the pre-set default value, which is "-":

https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L246

This conversion from the empty string to "-" does not kill relevant
information and is useful for some researchers when manually
inspecting TSVs, or manually browsing Hive output.
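
For anyone poking at those TSVs by hand, a quick sketch of counting the "-"
rows (the user_agent column index below is a placeholder; check the actual
varnishkafka format string before relying on it):

  <?php
  // Sketch: count requests whose logged user_agent field is "-" in a
  // webrequest TSV dump. Column position is an assumption.
  $uaColumn = 14;          // hypothetical index of the user_agent field
  $total = $dash = 0;
  $fh = fopen('webrequest.tsv', 'r');
  while (($fields = fgetcsv($fh, 0, "\t")) !== false) {
      $total++;
      if (($fields[$uaColumn] ?? '') === '-') {
          $dash++;
      }
  }
  fclose($fh);
  printf("%d of %d requests logged with user_agent \"-\"\n", $dash, $total);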



-- 
 quelltextlich e.U.  \\  Christian Aistleitner 
   Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstraße 6a/3 Email:  christ...@quelltextlich.at
4293 Gutau, Austria  Phone:  +43 7946 / 20 5 81
 Fax:+43 7946 / 20 5 81
 Homepage: http://quelltextlich.at/
---


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Platonides

Brad Jorsch (Anomie) wrote:

I wonder if it got lost in the move from Squid to Varnish, or something
along those lines.


That's likely, given that it was enforced by squid.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Legoktm
On 09/01/2015 10:37 AM, Brion Vibber wrote:
> I'm not 100% convinced that the UA requirement is helpful, for two reasons:

For those of us who looked for the initial rationale behind the UA
requirement, the announcement and the resulting discussion are at [1].

[1] http://www.gossamer-threads.com/lists/wiki/wikitech/189275

-- Legoktm

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Chad
On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes  wrote:

> Is the
> blocking of requests absent a user agent simply happening at a
> 'higher' stage (in mediawiki itself?) and so not registering with the
> varnishes,


No, it's not done at the application level.


> or is sending an /empty/ header simply A-OK?
>
>
Shouldn't be, unless the policy changed...

-Chad
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Oliver Keyes
On 1 September 2015 at 12:41, Chad  wrote:
> On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes  wrote:
>
>> Is the
>> blocking of requests absent a user agent simply happening at a
>> 'higher' stage (in mediawiki itself?) and so not registering with the
>> varnishes,
>
>
> No, it's not done at the application level.
>
>
>> or is sending an /empty/ header simply A-OK?
>>
>>
> Shouldn't be, unless the policy changed...

Well I'm looking at millions of requests from
API-users-who-I-am-not-big-fans-of[0] with a blank UA sooo..

[0] actual term far more expletive-laden than this

>
> -Chad
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread Oliver Keyes
On 1 September 2015 at 12:42, John  wrote:
> Could they be sending a non-standard header of "-"

Perfectly possible although also impossible to detect :(

>
> On Tuesday, September 1, 2015, Chad  wrote:
>
>> On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes > > wrote:
>>
>> > Is the
>> > blocking of requests absent a user agent simply happening at a
>> > 'higher' stage (in mediawiki itself?) and so not registering with the
>> > varnishes,
>>
>>
>> No, it's not done at the application level.
>>
>>
>> > or is sending an /empty/ header simply A-OK?
>> >
>> >
>> Shouldn't be, unless the policy changed...
>>
>> -Chad
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org 
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] What happened to our user agent requirements?

2015-09-01 Thread John
Could they be sending a non-standard header of "-"

On Tuesday, September 1, 2015, Chad  wrote:

> On Tue, Sep 1, 2015 at 9:24 AM Oliver Keyes  > wrote:
>
> > Is the
> > blocking of requests absent a user agent simply happening at a
> > 'higher' stage (in mediawiki itself?) and so not registering with the
> > varnishes,
>
>
> No, it's not done at the application level.
>
>
> > or is sending an /empty/ header simply A-OK?
> >
> >
> Shouldn't be, unless the policy changed...
>
> -Chad
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org 
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l