Brett Cannon <br...@python.org> wrote:
> On Tue, Feb 3, 2009 at 15:50, Tres Seaver <tsea...@palladion.com> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Brett Cannon wrote:
> >> On Tue, Feb 3, 2009 at 11:08, Brad Miller <millb...@luther.edu> wrote:
> >>> I'm just getting ready to start the semester using my new book (Python
> >>> Programming in Context) and noticed that I somehow missed all the changes
> >>> to urllib in python 3.0. ARGH to say the least. I like using urllib in the
> >>> intro class because we can get data from places that are more
> >>> interesting/motivating/relevant to the students.
> >>> Here are some of my observations on trying to do very basic stuff with
> >>> urllib:
> >>> 1. urllib.urlopen is now urllib.request.urlopen
> >>
> >> Technically urllib2.urlopen became urllib.request.urlopen. See PEP
> >> 3108 for the details of the reorganization.
> >>
> >>> 2. The object returned by urlopen is no longer iterable! no more for
> >>> line in url.
> >>
> >> That is probably a difference between urllib2 and urllib.
> >>
> >>> 3. read, readline, readlines now return bytes objects or arrays of bytes
> >>> instead of a str and array of str
> >>
> >> Correct.
> >>
> >>> 4. Taking the naive approach to converting a bytes object to a str does
> >>> not work as you would expect.
> >>>
> >>> >>> import urllib.request
> >>> >>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
> >>> >>> page
> >>> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
> >>> >>> line = page.readline()
> >>> >>> line
> >>> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
> >>> >>> str(line)
> >>> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
> >>> As you can see from the example the 'b' becomes part of the string! It
> >>> seems like this should be a bug, is it?
> >>>
> >>
> >> No because you are getting back the repr for the bytes object. Str
> >> does not know what the encoding is for the bytes so it has no way of
> >> performing the decoding.
> >
> > The encoding information *is* available in the response headers, e.g.:
> >
> > - ---------------------- %< ---------------------------------
> > $ wget -S --spider http://knuth.luther.edu/test.html
> > - --18:46:24-- http://knuth.luther.edu/test.html
> > => `test.html'
> > Resolving knuth.luther.edu... 192.203.196.71
> > Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
> > HTTP request sent, awaiting response...
> > HTTP/1.1 200 OK
> > Date: Tue, 03 Feb 2009 23:46:28 GMT
> > Server: Apache/2.0.50 (Linux/SUSE)
> > Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
> > ETag: "2fcd8-1d8-43b2bf40"
> > Accept-Ranges: bytes
> > Content-Length: 472
> > Keep-Alive: timeout=15, max=100
> > Connection: Keep-Alive
> > Content-Type: text/html; charset=ISO-8859-1
> > Length: 472 [text/html]
> > 200 OK
> > - ---------------------- %< ---------------------------------
> >
>
> Right, but he was asking about why passing bytes to str() led to it
> returning the repr.
>
> > So, the OP's use case *could* be satisfied, assuming that the Py3K
> > version of urllib sprouted a means of leveraging that header. In this
> > sense, fetching the resource over HTTP is *better* than loading it from
> > a file: information about the character set is explicit, and highly
> > likely to be correct, at least for any resource people expect to render
> > cleanly in a browser.
>
> Right. And even if the header lacks the info as Content-Type is not
> guaranteed to contain the charset there is also the chance for the
> HTML or DOCTYPE declaration to say.
>
> But as Bill pointed out, urllib just fetches data via HTTP, so a
> character encoding will not always be valuable. Best solution would be
> to provide something in html that can take what urllib.request.urlopen
> returns and handle the decoding.

Yes, that sounds like the right solution to me, too.

Bill
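
For readers following the thread, here is a minimal sketch of the decoding step discussed above: read the charset out of a Content-Type value and use it to turn bytes into str, rather than calling str() on the bytes. The helper names charset_from_content_type and decode_body, and the UTF-8 fallback, are assumptions made for illustration; they are not part of urllib or of anything proposed in this thread. The sketch works on literal values so it does not need network access.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper: the UTF-8 fallback is an assumption, not
    something the thread or the HTTP response guarantees.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header or the parameter is missing.
    return msg.get_content_charset() or default


def decode_body(raw, content_type, default="utf-8"):
    """Decode `raw` bytes using the charset from `content_type`."""
    return raw.decode(charset_from_content_type(content_type, default))


# The bytes line and the header value shown earlier in the thread.
line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
text = decode_body(line, "text/html; charset=ISO-8859-1")

# str(line) gives the repr of the bytes object; decoding gives the text.
as_repr = str(line)
```

With a live response from urllib.request.urlopen, the headers object is itself an email-style message, so page.headers.get_content_charset() should give the same answer without the helper. As noted above, the header may omit the charset, in which case some other fallback is needed.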