Impersonating other browsers...
So I wrote a quick Python program (my first ever) that needs to download
pages off the web. I'm using urlopen, and it works fine. But I'd like to be
able to change my browser string from Python-urllib/1.15 to instead
impersonate Internet Explorer.

I know this can be done very easily with Perl, so I'm assuming it's also
easy in Python. How do I do it?

--
http://mail.python.org/mailman/listinfo/python-list
Re: Impersonating other browsers...
[EMAIL PROTECTED] wrote:
> So I wrote a quick Python program (my first ever) that needs to download
> pages off the web. I'm using urlopen, and it works fine. But I'd like to
> be able to change my browser string from Python-urllib/1.15 to instead
> impersonate Internet Explorer. I know this can be done very easily with
> Perl, so I'm assuming it's also easy in Python. How do I do it?

From the urllib docs:

'''
class URLopener([proxies[, **x509]])

Base class for opening and reading URLs. Unless you need to support opening
objects using schemes other than http:, ftp:, gopher: or file:, you probably
want to use FancyURLopener.

By default, the URLopener class sends a User-Agent: header of urllib/VVV,
where VVV is the urllib version number. Applications can define their own
User-Agent: header by subclassing URLopener or FancyURLopener and setting
the instance attribute version to an appropriate string value before the
open() method is called.

The optional proxies parameter should be a dictionary mapping scheme names
to proxy URLs, where an empty dictionary turns proxies off completely. Its
default value is None, in which case environmental proxy settings will be
used if present, as discussed in the definition of urlopen(), above.

Additional keyword parameters, collected in x509, are used for
authentication with the https: scheme. The keywords key_file and cert_file
are supported; both are needed to actually retrieve a resource at an
https: URL.
'''

--
Regards,
Diez B. Roggisch
--
http://mail.python.org/mailman/listinfo/python-list
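In concrete terms, the docs above boil down to something like the following
(an untested sketch, assuming Python 2.x urllib; the user-agent string is
just an example of an IE-style value):

    import urllib

    class MSIEopener(urllib.FancyURLopener):
        # any string works here; this one mimics IE 6 on Windows XP
        version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

    opener = MSIEopener()
    f = opener.open("http://www.example.com/")
    print f.read()
    f.close()

The key point is that version is set on the class (or on the instance)
before open() is called, since the default headers are built from it.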
Re: Impersonating other browsers...
sboyle> I'm using urlopen, and it works fine. But I'd like to be able
sboyle> to change my browser string from Python-urllib/1.15 to instead
sboyle> impersonate Internet Explorer.

sboyle> I know this can be done very easily with Perl, so I'm assuming
sboyle> it's also easy in Python. How do I do it?

Easy is in the eye of the beholder, I suppose. It doesn't look as
straightforward as I would have thought. You can subclass the
FancyURLopener class like so:

    import urllib

    class MSIEURLopener(urllib.FancyURLopener):
        version = "Internet Exploder"

then set urllib._urlopener to an instance of it:

    urllib._urlopener = MSIEURLopener()

After that, urllib.urlopen() should send your user-agent string.

Seems like FancyURLopener should support setting the user agent string
directly. You can accomplish that with something like this:

    class FlexibleUAopener(urllib.FancyURLopener):
        def set_user_agent(self, user_agent):
            # drop any existing User-agent headers (case-insensitively),
            # then add the new one
            ua = [(hdr, val) for (hdr, val) in self.addheaders
                  if hdr.lower() == "user-agent"]
            while ua:
                self.addheaders.remove(ua[0])
                ua.pop(0)
            self.addheader("User-agent", user_agent)

You'd then be able to set the user agent, but you'd have to use your new
opener class directly:

    opener = FlexibleUAopener()
    opener.set_user_agent("Internet Exploder")
    f = opener.open(url)
    print f.read()

It doesn't look any easier to do this using urllib2. Seems like a
semi-obvious oversight for both modules. That suggests few people have ever
desired this capability.

Skip

--
http://mail.python.org/mailman/listinfo/python-list
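For reference, one way to set the header per-request with urllib2 (a sketch,
assuming Python 2.x; the URL and user-agent value are just placeholders) is
to pass it on the Request object:

    import urllib2

    # hypothetical example URL and user-agent string
    req = urllib2.Request("http://www.example.com/",
                          headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0)"})
    f = urllib2.urlopen(req)
    print f.read()
    f.close()

The downside is that it has to be repeated for every request, which is
presumably why a module-level or opener-level setting would be nicer.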
Re: Impersonating other browsers...
Skip Montanaro [EMAIL PROTECTED] wrote:
> It doesn't look any easier to do this using urllib2. Seems like a
> semi-obvious oversight for both modules. That suggests few people have
> ever desired this capability.

My $.02: I have trouble believing that few people have desired this, for two
reasons:

(1) Some web sites will shut out user agents they do not recognize, to
preserve bandwidth or for other reasons; the right User-Agent ID can be
required to get the data one wants.

(2) It seems like a worthwhile courtesy to identify oneself when spidering
or data scraping, and the User-Agent ID seems like the obvious way to do
that. I'd guess (and like to think) that Python users are generally a little
more concerned with such courtesies than the user population of some other
languages.

For example, your website might get a hit from:

    Mozilla/5.0 (Songzilla MP3 Blog, http://songzilla.blogspot.com) Gecko/20041107 Firefox/1.0

You'll still get to decide whether to shut them out or not, but at least it
won't seem like the black hats are attacking.

Eric Pederson
http://www.songzilla.blogspot.com
:::
domainNot = "@something.com"
domainIs = domainNot.replace("s", "z")
ePrefix = "".join([chr(ord(x)+1) for x in "do"])
mailMeAt = ePrefix + domainIs
:::

--
http://mail.python.org/mailman/listinfo/python-list
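Tying the two points together, a polite scraper might do something like this
(a sketch, assuming Python 2.x urllib; the spider name and contact URL are
placeholders, not anything from the thread):

    import urllib

    class PoliteOpener(urllib.FancyURLopener):
        # identify the bot and give the site operator a way to reach you
        version = "MySpider/0.1 (+http://example.com/about-my-spider)"

    urllib._urlopener = PoliteOpener()
    page = urllib.urlopen("http://example.com/").read()

The same mechanism that lets you impersonate IE also lets you identify
yourself honestly, which is the courteous option when you aren't being
filtered on user agent.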