Brett Cannon wrote:
> On Tue, Feb 3, 2009 at 11:08, Brad Miller <millb...@luther.edu> wrote:
>> I'm just getting ready to start the semester using my new book (Python
>> Programming in Context) and noticed that I somehow missed all the changes to
>> urllib in python 3.0.  ARGH to say the least.  I like using urllib in the
>> intro class because we can get data from places that are more
>> interesting/motivating/relevant to the students.
>> Here are some of my observations on trying to do very basic stuff with
>> urllib:
>> 1.  urllib.urlopen is now urllib.request.urlopen
>
> Technically urllib2.urlopen became urllib.request.urlopen. See PEP
> 3108 for the details of the reorganization.
>
>> 2.  The object returned by urlopen is no longer iterable!  no more for line
>> in url.
>
> That is probably a difference between urllib2 and urllib.
>
>> 3.  read, readline, readlines now return bytes objects or arrays of bytes
>> instead of a str and array of str
>
> Correct.
>
>> 4.  Taking the naive approach to converting a bytes object to a str does not
>> work as you would expect.
>>
>> >>> import urllib.request
>> >>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>> >>> page
>> <addinfourl at 16419792 whose fp = <socket.SocketIO object at 0xfa8570>>
>> >>> line = page.readline()
>> >>> line
>> b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'
>> >>> str(line)
>> 'b\'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\\n\''
>> As you can see from the example the 'b' becomes part of the string!  It
>> seems like this should be a bug, is it?
>>
>
> No because you are getting back the repr for the bytes object. Str
> does not know what the encoding is for the bytes so it has no way of
> performing the decoding.

The encoding information *is* available in the response headers, e.g.:

---------------------- %< ---------------------------------
$ wget -S --spider http://knuth.luther.edu/test.html
--18:46:24--  http://knuth.luther.edu/test.html
           => `test.html'
Resolving knuth.luther.edu... 192.203.196.71
Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Tue, 03 Feb 2009 23:46:28 GMT
  Server: Apache/2.0.50 (Linux/SUSE)
  Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
  ETag: "2fcd8-1d8-43b2bf40"
  Accept-Ranges: bytes
  Content-Length: 472
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Content-Type: text/html; charset=ISO-8859-1
Length: 472 [text/html]
200 OK
---------------------- %< ---------------------------------

So, the OP's use case *could* be satisfied, assuming that the Py3K
version of urllib sprouted a means of leveraging that header.  In this
sense, fetching the resource over HTTP is *better* than loading it from
a file:  information about the character set is explicit, and highly
likely to be correct, at least for any resource people expect to render
cleanly in a browser.


Tres.
-- 
===================================================================
Tres Seaver          +1 540-429-0999          tsea...@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
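
As a rough sketch of what "leveraging that header" could look like: the
snippet below pulls the charset out of a Content-Type value and uses it
to decode the bytes, rather than calling str() on them.  The helper
names (charset_from_content_type, decode_body) are hypothetical and not
part of urllib, and the fallback charset for responses that declare none
is an assumption.  The header value and the sample line are the ones
shown earlier in this message, so no network access is needed.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type header value.

    `default` is an assumption, used only when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter, lower-cased,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode raw response bytes using the charset from the header."""
    return raw.decode(charset_from_content_type(content_type))


# Values taken from the wget output and the interpreter session above.
header = "text/html; charset=ISO-8859-1"
line = b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"\n'

text = decode_body(line, header)  # a real str, with no b'...' wrapper
```

Parsing the header with email.message.Message avoids hand-rolled string
splitting and copes with quoting and parameter order in the header value.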