[Wikitech-l] 403 with content to Python?

2009-01-23 Thread Andre Engels
Through a message on another list, I found that when one tries to
reach Wikipedia (or at least wikipedia-en) with the User-Agent set
to "Python-urllib/1.17", the server gives a "403 Forbidden" response,
together with the content of the page.

Two questions:
1. Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot,
when Brion wanted to block it because it had a programming error
causing it to fetch each page twice (sometimes even more?). If that is
the actual reason, I see no reason why it should still be active years
afterward...
2. If this User Agent is really to be blocked, why do we still provide
the content of the page that is forbidden?
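
For anyone who wants to see how a 403 can arrive "together with content", here is a minimal sketch in Python 3 (whose urllib sends "Python-urllib/3.x" by default, so the old string is passed explicitly). It does not contact Wikipedia: DemoHandler is a hypothetical local stand-in that answers 403 plus an HTML body to any UA starting with "Python-urllib". The names fetch and run_demo are illustrative only. The point is that urllib raises HTTPError on a 403, but that exception still carries a readable body.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: 403 plus an HTML body for default urllib UAs."""

    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        blocked = ua.startswith("Python-urllib")
        body = b"<html>error page</html>" if blocked else b"<html>article</html>"
        self.send_response(403 if blocked else 200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(url, user_agent):
    """Return (status, body), keeping the body even when the status is an error."""
    # An empty ProxyHandler makes the demo ignore any proxy environment variables.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError is also a response object, so the error body can be read.
        return err.code, err.read()


def run_demo():
    """Start the stand-in server on a free local port and fetch with two UAs."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/wiki/Foo" % server.server_port
    try:
        return fetch(url, "Python-urllib/1.17"), fetch(url, "ExampleBot/0.1")
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns one (status, body) pair per User-Agent, so the body that accompanies the 403 can be inspected directly instead of being assumed to be the article.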

-- 
André Engels, andreeng...@gmail.com

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] 403 with content to Python?

2009-01-23 Thread Daniel Kinzler
Andre Engels schrieb:
> 1. Why is this User Agent getting this response? If I remember
> correctly, this was installed in the early days of the pywikipediabot,
> when Brion wanted to block it because it had a programming error
> causing it to fetch each page twice (sometimes even more?). If that is
> the actual reason, I see no reason why it should still be active years
> afterward...

The default UA strings of many popular libraries (Python, Perl, Java, PHP...)
are blocked from accessing Wikipedia.

The idea is to force people to provide a descriptive UA string for their
particular tool, so it can be blocked selectively when it breaks. Ideally, the
UA string should give some way of contacting the operator, or at least the 
author.

Good netizenship dictates: don't use default UA strings; use something unique
and descriptive. Always, not only when accessing Wikipedia.
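
As a rough illustration of that advice, the sketch below builds a UA string that names the tool and gives a way to reach its operator, then attaches it to a request without sending it. The tool name, URL and address are placeholders, and the "name/version (contact)" layout is just one common convention, not a format required by anything in this thread.

```python
import urllib.request


def build_user_agent(tool, version, contact):
    """Compose a UA string naming the tool and a way to reach its operator."""
    return "%s/%s (%s)" % (tool, version, contact)


def make_request(url, user_agent):
    """Build (but do not send) a request carrying the custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Placeholder values: substitute the real tool name and operator contact.
ua = build_user_agent(
    "ExampleBot", "0.1", "https://example.org/bot; bot-operator@example.org"
)
req = make_request("http://en.wikipedia.org/wiki/Foo", ua)
```

Passing req to urllib.request.urlopen would send the request with this header in place of the library default.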

As to why the content is served anyway: I don't know. It may even be a bug, or
it may be intentional. It would be interesting to hear about this.

-- daniel



Re: [Wikitech-l] 403 with content to Python?

2009-01-23 Thread Brion Vibber
On 1/23/09 2:36 AM, Andre Engels wrote:
> Two questions:
> 1. Why is this User Agent getting this response? If I remember
> correctly, this was installed in the early days of the pywikipediabot,
> when Brion wanted to block it because it had a programming error
> causing it to fetch each page twice (sometimes even more?). If that is
> the actual reason, I see no reason why it should still be active years
> afterward...

This has nothing to do with pywikipediabot.

We too frequently encountered poorly-written bots and site-scrapers 
which slammed the servers too hard and caused problems. Blocking default 
UAs of common libraries cut these incidents down dramatically, and helps 
encourage thoughtful bot writers to put specific information into their 
user-agent string, making it possible to track them down more easily if 
they are problematic.
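
To make the idea concrete, here is a minimal sketch of a prefix-based filter. It is not the actual Squid configuration, and the prefix list is an assumption for illustration: only "Python-urllib" is confirmed as blocked in this thread, while "libwww-perl" and "Java/" are well-known library defaults added as examples.

```python
# Hypothetical prefixes for illustration; the real list is not given in this thread.
DEFAULT_UA_PREFIXES = (
    "python-urllib",
    "libwww-perl",
    "java/",
)


def is_library_default(user_agent):
    """True when the UA starts with one of the known library default strings."""
    ua = (user_agent or "").strip().lower()
    # str.startswith accepts a tuple and matches any of its entries.
    return ua.startswith(DEFAULT_UA_PREFIXES)
```

Matching on prefixes rather than substrings keeps a descriptive UA that merely mentions a language or library from being caught by accident.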


> 2. If this User Agent is really to be blocked, why do we still provide
> the content of the page that is forbidden?

We don't; you get a big fat Wikimedia-customized error page with a 
generic multilingual message, and this bit somewhere in the middle:



 
  Request: GET http://en.wikipedia.org/wiki/Foo, from 69.17.48.227 via sq24.wikimedia.org (squid/2.6.STABLE21) to  ()
  Error: ERR_ACCESS_DENIED, errno [No Error] at Fri, 23 Jan 2009 17:59:46 GMT
 
 


-- brion



Re: [Wikitech-l] 403 with content to Python?

2009-01-24 Thread Marco Schuster
On Fri, Jan 23, 2009 at 7:03 PM, Brion Vibber wrote:
> On 1/23/09 2:36 AM, Andre Engels wrote:
>> Two questions:
>> 1. Why is this User Agent getting this response? If I remember
>> correctly, this was installed in the early days of the pywikipediabot,
>> when Brion wanted to block it because it had a programming error
>> causing it to fetch each page twice (sometimes even more?). If that is
>> the actual reason, I see no reason why it should still be active years
>> afterward...
>
> This has nothing to do with pywikipediabot.
>
> We too frequently encountered poorly-written bots and site-scrapers
> which slammed the servers too hard and caused problems. Blocking default
> UAs of common libraries cut these incidents down dramatically, and helps
> encourage thoughtful bot writers to put specific information into their
> user-agent string, making it possible to track them down more easily if
> they are problematic.
>
Is there a list of those UAs or UA parts available?
I had this problem some time ago with my bot, which used a custom UA
string and got access denied, so I changed its UA to Firefox as I didn't
have the patience to track down WHICH part of the UA triggered the filter.

Marco



Re: [Wikitech-l] 403 with content to Python?

2009-01-24 Thread Platonides
Marco Schuster wrote:
> Is there any list of those UAs or UA parts available?
> I had this problem some time ago with my bot which used a custom UA
> string and got access denied, so I changed its UA to Firefox as I had
> no nerves to track down WHICH part of the UA triggered the filter.
> 
> Marco

Perhaps they were blocking *your* bot?
Faking your user agent to match a browser makes sysadmins assume bad faith...




Re: [Wikitech-l] 403 with content to Python?

2009-01-24 Thread Marco Schuster
On Sat, Jan 24, 2009 at 3:48 PM, Platonides wrote:
> Marco Schuster wrote:
>> Is there any list of those UAs or UA parts available?
>> I had this problem some time ago with my bot which used a custom UA
>> string and got access denied, so I changed its UA to Firefox as I had
>> no nerves to track down WHICH part of the UA triggered the filter.
>>
>> Marco
>
> Perhaps they were blocking *your* bot?
> Faking your user agent to match a browser make sysadmins assume bad faith...
No, the bot had not been active before (and I'm pretty sure the UA hadn't been used either).

Marco



Re: [Wikitech-l] 403 with content to Python?

2009-01-24 Thread Aryeh Gregor
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster wrote:
> Is there any list of those UAs or UA parts available?
> I had this problem some time ago with my bot which used a custom UA
> string and got access denied, so I changed its UA to Firefox as I had
> no nerves to track down WHICH part of the UA triggered the filter.

Just change it to something like "YourBotName, run by Marco Schuster
".  That will certainly avoid any filters, and
provide the desired info.

I don't know why the error page doesn't give this info already.  The
current message only confuses people and -- if they can figure out
it's UA-based -- tempts them to mimic browser UA strings.  That stands
a good chance of getting your IP address blocked if it's noticed (and
it's pretty easy to tell when a script is pretending to be a browser,
if you look at the whole HTTP request).
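
To illustrate why a faked browser UA tends to stand out, here is a rough heuristic sketch. The thread does not say what the sysadmins actually check, so the token list and header list below are assumptions: the function only flags a request whose UA claims to be a browser while headers that browsers normally send are missing.

```python
# Assumed markers for illustration only; not taken from any real filter.
BROWSER_TOKENS = ("mozilla/", "firefox/")
USUAL_BROWSER_HEADERS = ("accept", "accept-language", "accept-encoding")


def looks_like_fake_browser(headers):
    """Flag a request that claims a browser UA but lacks usual browser headers."""
    names = {name.lower(): value for name, value in headers.items()}
    ua = names.get("user-agent", "").lower()
    claims_browser = any(token in ua for token in BROWSER_TOKENS)
    missing = [h for h in USUAL_BROWSER_HEADERS if h not in names]
    return claims_browser and bool(missing)
```

A script that only overrides User-Agent usually leaves the other headers at library defaults, which is the kind of mismatch this check looks for.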

The error message is in SVN, but it's the same message provided for
all errors.  I don't know what sort of config would need to be done
to get a custom message for this error.



Re: [Wikitech-l] 403 with content to Python?

2009-01-24 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Sun, Jan 25, 2009 at 1:11 AM, Aryeh Gregor wrote:
> On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster
>  wrote:
>> Is there any list of those UAs or UA parts available?
>> I had this problem some time ago with my bot which used a custom UA
>> string and got access denied, so I changed its UA to Firefox as I had
>> no nerves to track down WHICH part of the UA triggered the filter.
>
> Just change it to something like "YourBotName, run by Marco Schuster
> ".  That will certainly avoid any filters, and
> provide the desired info.
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
the filters.

> I don't know why the error page doesn't give this info already.  The
> current message only confuses people and -- if they can figure out
> it's UA-based -- tempts them to mimic browser UA strings.
Anyone skilled enough to write a bot is skilled enough to find that out, IMO.
Anyway, the error message should also say which part of the UA is forbidden.

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: http://getfiregpg.org

iD8DBQFJe7C4W6S2GapJUuQRAvcgAJ9YY1N0ckE9DzqG21K45teAiG1QVQCfcGBJ
hFtOQisDPnYlLyXjTwKaTTI=
=iuTY
-END PGP SIGNATURE-



Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Platonides
Simetrical wrote:
> Just change it to something like "YourBotName, run by Marco Schuster
> ".  That will certainly avoid any filters, and
> provide the desired info.

The email should be in a From: header, although I don't know whether it's
logged or not.
In general, anyone responsible enough to set a From: header (with their
valid email) shouldn't get automatically blocked.


Marco Schuster wrote:
> I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
> the filters.

Perhaps the mention of "php", although I'm not being blocked when using
that UA, so I can't test.




Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Aryeh Gregor
On Sun, Jan 25, 2009 at 8:50 AM, Platonides wrote:
> The email should be at a From: header. Although I don't know if it's
> logged or not.
> In general, anyone responsible enough to set a From: header (with their
> valid email) shouldn't get automatically blocked.

A From: header?  In HTTP?  What standard specifies that header's
existence and semantics?  It's not at [[List of HTTP headers]].



Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Marco Schuster
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Sun, Jan 25, 2009 at 2:50 PM, Platonides wrote:
> Marco Schuster wrote:
>> I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
>> the filters.
>
> Perhaps the mention to "php", although I'm not being blocked when using
> that UA, so can't test.

Yeah, I'm also not blocked anymore...nice to hear that. But again,
it'd be nice to see in an error message what part of the UA triggered
the filter and why this part is blocked.
Brion, do you have a list of blocked UA (parts)?

Marco
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (MingW32)
Comment: http://getfiregpg.org

iD4DBQFJfJOQW6S2GapJUuQRAiwgAJdXucmjZ4d9BToMAnK3uKuzq3ooAJ4mFGFZ
AeFuiPnC+cSzTuseHDtAUg==
=OwNP
-END PGP SIGNATURE-



Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Platonides
Aryeh Gregor wrote:
> On Sun, Jan 25, 2009 at 8:50 AM, Platonides  wrote:
>> The email should be at a From: header. Although I don't know if it's
>> logged or not.
>> In general, anyone responsible enough to set a From: header (with their
>> valid email) shouldn't get automatically blocked.
> 
> A From: header?  In HTTP?  What standard specifies that header's
> existence and semantics?  It's not at [[List of HTTP headers]].

I also thought it was a mistake when I first saw it in the HTTP
article on Wikipedia.

RFC 2616 (HTTP/1.1) section 14.22

   The From request-header field, if given, SHOULD contain an Internet
   e-mail address for the human user who controls the requesting user
   agent. The address SHOULD be machine-usable, as defined by "mailbox"
   in RFC 822 [9] as updated by RFC 1123 [8]:

   From   = "From" ":" mailbox

   An example is:

   From: webmas...@w3.org

   This header field MAY be used for logging purposes and as a means for
   identifying the source of invalid or unwanted requests. It SHOULD NOT
   be used as an insecure form of access protection. The interpretation
   of this field is that the request is being performed on behalf of the
   person given, who accepts responsibility for the method performed. In
   particular, robot agents SHOULD include this header so that the
   person responsible for running the robot can be contacted if problems
   occur on the receiving end.

   The Internet e-mail address in this field MAY be separate from the
   Internet host which issued the request. For example, when a request
   is passed through a proxy the original issuer's address SHOULD be
   used.

   The client SHOULD NOT send the From header field without the user's
   approval, as it might conflict with the user's privacy interests or
   their site's security policy. It is strongly recommended that the
   user be able to disable, enable, and modify the value of this field
   at any time prior to a request.
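
For illustration, a minimal sketch of a client setting this header with Python's urllib. The function name, bot name and address are placeholders, and whether any server logs or honours the From field is not established in this thread.

```python
import urllib.request


def build_request(url, user_agent, operator_email):
    """Build (but do not send) a request carrying User-Agent and From headers."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            # RFC 2616 section 14.22: address of the human running the agent.
            "From": operator_email,
        },
    )


# Placeholder values for illustration.
req = build_request(
    "http://en.wikipedia.org/wiki/Foo",
    "ExampleBot/0.1",
    "bot-operator@example.org",
)
```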




Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Andrew Garrett
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster wrote:

> Brion, do you have a list of blocked UA (parts)?

Squid configuration files are available at
http://noc.wikimedia.org/conf. It should be in there.

-- 
Andrew Garrett



Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Aryeh Gregor
On Sun, Jan 25, 2009 at 4:42 PM, Platonides wrote:
> I also thought that it was a confusion when I first saw it on HTTP
> article at wikipedia.
>
> RFC 2616 (HTTP/1.1) section 14.22
>
>   The From request-header field, if given, SHOULD contain an Internet
>   e-mail address for the human user who controls the requesting user
>   agent. The address SHOULD be machine-usable, as defined by "mailbox"
>   in RFC 822 [9] as updated by RFC 1123 [8]:
> ...

Well, since I doubt most people have ever heard of that, it's probably
not logged.



Re: [Wikitech-l] 403 with content to Python?

2009-01-25 Thread Platonides
Andrew Garrett wrote:
> On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster
>  wrote:
> 
>> Brion, do you have a list of blocked UA (parts)?
> 
> Squid configuration files are available at
> http://noc.wikimedia.org/conf. It should be in there.

Which of them are for the Squids? I think the server config there is
just for the Apaches.

