[Wikitech-l] 403 with content to Python?
Through a message on another list, I found that when one tries to reach wikipedia (or at least wikipedia-en) specifying the User Agent as "Python-urllib/1.17", the server gives a "403 Forbidden" response, together with the content of the page.

Two questions:

1. Why is this User Agent getting this response? If I remember correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...

2. If this User Agent is really to be blocked, why do we still provide the content of the page that is forbidden?

-- André Engels, andreeng...@gmail.com
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: [Wikitech-l] 403 with content to Python?
Andre Engels wrote:
> 1. Why is this User Agent getting this response? If I remember
> correctly, this was installed in the early days of the pywikipediabot,
> when Brion wanted to block it because it had a programming error
> causing it to fetch each page twice (sometimes even more?). If that is
> the actual reason, I see no reason why it should still be active years
> afterward...

The default UA strings of many popular libraries (Python, Perl, Java, PHP...) are blocked from accessing Wikipedia. The idea is to force people to provide a descriptive UA string for their particular tool, so it can be blocked selectively when it breaks. Ideally, the UA string should give some way of contacting the operator, or at least the author.

Good netizenship dictates: don't use default UA strings, use something unique and descriptive. Always, not only when accessing Wikipedia.

As to why the content is served anyway: I don't know. It may even be a bug, or it may be intentional. It would be interesting to hear about this.

-- daniel
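The advice above (replace the library's default UA string with something unique that identifies the tool and a way to reach its operator) might look roughly like this with Python's standard `urllib`. This is only a sketch: the tool name, version, URL and e-mail address are invented placeholders, and no request is actually sent.

```python
import urllib.request

# Hypothetical tool name, version and contact details; substitute your own.
USER_AGENT = (
    "ExampleWikiTool/0.1 "
    "(https://example.org/examplewikitool; operator@example.org)"
)


def build_request(url):
    """Return a Request that carries a descriptive User-Agent
    instead of the library default ("Python-urllib/x.y")."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("http://en.wikipedia.org/wiki/Foo")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen()` would send the request with that header; whether a given string passes the server-side filters is up to the site's configuration.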
Re: [Wikitech-l] 403 with content to Python?
On 1/23/09 2:36 AM, Andre Engels wrote:
> Two questions:
> 1. Why is this User Agent getting this response? If I remember
> correctly, this was installed in the early days of the pywikipediabot,
> when Brion wanted to block it because it had a programming error
> causing it to fetch each page twice (sometimes even more?). If that is
> the actual reason, I see no reason why it should still be active years
> afterward...

This has nothing to do with pywikipediabot.

We too frequently encountered poorly-written bots and site-scrapers which slammed the servers too hard and caused problems. Blocking default UAs of common libraries cut these incidents down dramatically, and helps encourage thoughtful bot writers to put specific information into their user-agent string, making it possible to track them down more easily if they are problematic.

> 2. If this User Agent is really to be blocked, why do we still provide
> the content of the page that is forbidden?

We don't; you get a big fat Wikimedia-customized error page with a generic multilingual message, and this bit somewhere in the middle:

Request: GET http://en.wikipedia.org/wiki/Foo, from 69.17.48.227 via sq24.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Fri, 23 Jan 2009 17:59:46 GMT

-- brion
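For readers puzzled by a "403 with content": in Python 3's `urllib`, a 403 surfaces as an `urllib.error.HTTPError`, and that exception object carries whatever body the server sent, which in the case described above would be the error page rather than the article. A minimal offline sketch, assuming a simulated error object in place of a real request (the body text is a stand-in taken from the excerpt above, not the full page):

```python
import io
import urllib.error
from email.message import Message


def read_error_page(error):
    """Return the status code and the decoded body carried by an HTTPError."""
    body = error.read().decode("utf-8", errors="replace")
    return error.code, body


# Simulated 403 so that the sketch runs without network access; a real one
# would be raised by urllib.request.urlopen(). The body is a placeholder.
simulated_error = urllib.error.HTTPError(
    url="http://en.wikipedia.org/wiki/Foo",
    code=403,
    msg="Forbidden",
    hdrs=Message(),
    fp=io.BytesIO(b"Error: ERR_ACCESS_DENIED"),
)

status, page = read_error_page(simulated_error)
print(status, page)
```

In a real script the same helper would be called from an `except urllib.error.HTTPError as error:` block around `urlopen()`, which makes it easy to see what the server actually returned.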
Re: [Wikitech-l] 403 with content to Python?
On Fri, Jan 23, 2009 at 7:03 PM, Brion Vibber wrote:
> On 1/23/09 2:36 AM, Andre Engels wrote:
>> Two questions:
>> 1. Why is this User Agent getting this response? If I remember
>> correctly, this was installed in the early days of the pywikipediabot,
>> when Brion wanted to block it because it had a programming error
>> causing it to fetch each page twice (sometimes even more?). If that is
>> the actual reason, I see no reason why it should still be active years
>> afterward...
>
> This has nothing to do with pywikipediabot.
>
> We too frequently encountered poorly-written bots and site-scrapers
> which slammed the servers too hard and caused problems. Blocking default
> UAs of common libraries cut these incidents down dramatically, and helps
> encourage thoughtful bot writers to put specific information into their
> user-agent string, making it possible to track them down more easily if
> they are problematic.
>

Is there any list of those UAs or UA parts available?
I had this problem some time ago with my bot which used a custom UA string and got access denied, so I changed its UA to Firefox as I had no nerves to track down WHICH part of the UA triggered the filter.

Marco
Re: [Wikitech-l] 403 with content to Python?
Marco Schuster wrote:
> Is there any list of those UAs or UA parts available?
> I had this problem some time ago with my bot which used a custom UA
> string and got access denied, so I changed its UA to Firefox as I had
> no nerves to track down WHICH part of the UA triggered the filter.
>
> Marco

Perhaps they were blocking *your* bot?
Faking your user agent to match a browser makes sysadmins assume bad faith...
Re: [Wikitech-l] 403 with content to Python?
On Sat, Jan 24, 2009 at 3:48 PM, Platonides wrote:
> Marco Schuster wrote:
>> Is there any list of those UAs or UA parts available?
>> I had this problem some time ago with my bot which used a custom UA
>> string and got access denied, so I changed its UA to Firefox as I had
>> no nerves to track down WHICH part of the UA triggered the filter.
>>
>> Marco
>
> Perhaps they were blocking *your* bot?
> Faking your user agent to match a browser makes sysadmins assume bad faith...

No, as the bot was not active before (and I'm pretty sure the UA also).

Marco
Re: [Wikitech-l] 403 with content to Python?
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster wrote:
> Is there any list of those UAs or UA parts available?
> I had this problem some time ago with my bot which used a custom UA
> string and got access denied, so I changed its UA to Firefox as I had
> no nerves to track down WHICH part of the UA triggered the filter.

Just change it to something like "YourBotName, run by Marco Schuster ". That will certainly avoid any filters, and provide the desired info.

I don't know why the error page doesn't give this info already. The current message only confuses people and, if they can figure out it's UA-based, tempts them to mimic browser UA strings. That stands a good chance of getting your IP address blocked if it's noticed (and it's pretty easy to tell when a script is pretending to be a browser, if you look at the whole HTTP request).

The error message is in SVN, but it's the same message provided for all errors. I don't know what sort of config would need to be done to get a custom message for this error.
Re: [Wikitech-l] 403 with content to Python?
On Sun, Jan 25, 2009 at 1:11 AM, Aryeh Gregor wrote:
> On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster wrote:
>> Is there any list of those UAs or UA parts available?
>> I had this problem some time ago with my bot which used a custom UA
>> string and got access denied, so I changed its UA to Firefox as I had
>> no nerves to track down WHICH part of the UA triggered the filter.
>
> Just change it to something like "YourBotName, run by Marco Schuster
> ". That will certainly avoid any filters, and
> provide the desired info.

I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.

> I don't know why the error page doesn't give this info already. The
> current message only confuses people and -- if they can figure out
> it's UA-based -- tempts them to mimic browser UA strings.

Anyone skilled enough to write a bot is skilled enough to find that out, IMO. Anyway, it should also be in the error message what part of the UA is forbidden.

Marco
Re: [Wikitech-l] 403 with content to Python?
Simetrical wrote:
> Just change it to something like "YourBotName, run by Marco Schuster
> ". That will certainly avoid any filters, and
> provide the desired info.

The email should be in a From: header. Although I don't know if it's logged or not.
In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.

Marco Schuster wrote:
> I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
> the filters.

Perhaps the mention of "php", although I'm not being blocked when using that UA, so I can't test.
Re: [Wikitech-l] 403 with content to Python?
On Sun, Jan 25, 2009 at 8:50 AM, Platonides wrote:
> The email should be in a From: header. Although I don't know if it's
> logged or not.
> In general, anyone responsible enough to set a From: header (with their
> valid email) shouldn't get automatically blocked.

A From: header? In HTTP? What standard specifies that header's existence and semantics? It's not at [[List of HTTP headers]].
Re: [Wikitech-l] 403 with content to Python?
On Sun, Jan 25, 2009 at 2:50 PM, Platonides wrote:
> Marco Schuster wrote:
>> I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered
>> the filters.
>
> Perhaps the mention of "php", although I'm not being blocked when using
> that UA, so I can't test.

Yeah, I'm also not blocked anymore... nice to hear that. But again, it'd be nice to see in an error message what part of the UA triggered the filter and why this part is blocked.

Brion, do you have a list of blocked UA (parts)?

Marco
Re: [Wikitech-l] 403 with content to Python?
Aryeh Gregor wrote:
> On Sun, Jan 25, 2009 at 8:50 AM, Platonides wrote:
>> The email should be in a From: header. Although I don't know if it's
>> logged or not.
>> In general, anyone responsible enough to set a From: header (with their
>> valid email) shouldn't get automatically blocked.
>
> A From: header? In HTTP? What standard specifies that header's
> existence and semantics? It's not at [[List of HTTP headers]].

I also thought it was a mistake when I first saw it in the HTTP article on Wikipedia.

RFC 2616 (HTTP/1.1) section 14.22

   The From request-header field, if given, SHOULD contain an Internet e-mail address for the human user who controls the requesting user agent. The address SHOULD be machine-usable, as defined by "mailbox" in RFC 822 [9] as updated by RFC 1123 [8]:

       From = "From" ":" mailbox

   An example is:

       From: webmas...@w3.org

   This header field MAY be used for logging purposes and as a means for identifying the source of invalid or unwanted requests. It SHOULD NOT be used as an insecure form of access protection. The interpretation of this field is that the request is being performed on behalf of the person given, who accepts responsibility for the method performed. In particular, robot agents SHOULD include this header so that the person responsible for running the robot can be contacted if problems occur on the receiving end.

   The Internet e-mail address in this field MAY be separate from the Internet host which issued the request. For example, when a request is passed through a proxy the original issuer's address SHOULD be used.

   The client SHOULD NOT send the From header field without the user's approval, as it might conflict with the user's privacy interests or their site's security policy. It is strongly recommended that the user be able to disable, enable, and modify the value of this field at any time prior to a request.
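As an illustration of the RFC text quoted above, a bot written with Python's standard `urllib` could send a `From` request header alongside its User-Agent roughly as follows. This is a sketch only: the bot name and address are invented placeholders, nothing here says the Wikimedia servers log or act on the header, and no request is sent.

```python
import urllib.request


def build_identified_request(url, user_agent, contact_email):
    """Return a Request carrying both a descriptive User-Agent and a
    From header naming the person responsible for the robot."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    request.add_header("From", contact_email)
    return request


# Hypothetical bot name and operator address, for illustration only.
request = build_identified_request(
    "http://en.wikipedia.org/wiki/Foo",
    "ExampleBot/0.1",
    "operator@example.org",
)
print(request.header_items())
```

Since the RFC says a client should not send `From` without the user's approval, a real tool would presumably make the address a configuration option rather than hard-coding it.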
Re: [Wikitech-l] 403 with content to Python?
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster wrote:
> Brion, do you have a list of blocked UA (parts)?

Squid configuration files are available at http://noc.wikimedia.org/conf. It should be in there.

-- Andrew Garrett
Re: [Wikitech-l] 403 with content to Python?
On Sun, Jan 25, 2009 at 4:42 PM, Platonides wrote:
> I also thought it was a mistake when I first saw it in the HTTP
> article on Wikipedia.
>
> RFC 2616 (HTTP/1.1) section 14.22
>
> The From request-header field, if given, SHOULD contain an Internet
> e-mail address for the human user who controls the requesting user
> agent. The address SHOULD be machine-usable, as defined by "mailbox"
> in RFC 822 [9] as updated by RFC 1123 [8]:
> ...

Well, since I doubt most people have ever heard of that, it's probably not logged.
Re: [Wikitech-l] 403 with content to Python?
Andrew Garrett wrote:
> On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster wrote:
>
>> Brion, do you have a list of blocked UA (parts)?
>
> Squid configuration files are available at
> http://noc.wikimedia.org/conf. It should be in there.

Which of them are for the squids? I think the server config there is just for the Apaches.