Re: [Python-Dev] teaching the new urllib
On Tue, Feb 03, 2009 at 06:50:44PM -0500, Tres Seaver wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> The encoding information *is* available in the response headers, e.g.:
>
> $ wget -S --spider http://knuth.luther.edu/test.html
> --18:46:24-- http://knuth.luther.edu/test.html => `test.html'
> Resolving knuth.luther.edu... 192.203.196.71
> Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
> HTTP request sent, awaiting response...
> HTTP/1.1 200 OK
> Date: Tue, 03 Feb 2009 23:46:28 GMT
> Server: Apache/2.0.50 (Linux/SUSE)
> Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
> ETag: "2fcd8-1d8-43b2bf40"
> Accept-Ranges: bytes
> Content-Length: 472
> Keep-Alive: timeout=15, max=100
> Connection: Keep-Alive
> Content-Type: text/html; charset=ISO-8859-1
> Length: 472 [text/html]
> 200 OK
>
> So, the OP's use case *could* be satisfied, assuming that the Py3K
> version of urllib sprouted a means of leveraging that header. In this
> sense, fetching the resource over HTTP is *better* than loading it from
> a file: information about the character set is explicit, and highly
> likely to be correct, at least for any resource people expect to render
> cleanly in a browser.

First of all, as was noted, Content-Type may carry no charset parameter, or may be missing altogether. More important, and worse, the charset in Content-Type may bear no relation to the charset declared inside the document. Worse still, the charset declared inside the document may bear no relation to the charset actually used to encode it. :( Remember that the headers are supplied by the HTTP server, which just reads the document from a file like anyone else; there is no magic in being an HTTP server. Of course it is good practice to give the web server hints about the charset of byte-encoded text documents, but the server keeps working with no charset specified, or with an incorrect one.

This use case really matters for those international segments of the Internet that have two or more conflicting character sets for a single alphabet. As an example, every Russian Internet user can tell you that a browser with no menu option to select explicitly which encoding to use for the current document is completely unusable.

--
Alexey Shpagin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
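Alexey's point is easy to see from Python 3 itself: the declared charset, when the server sends one at all, is just a parameter on one header field. A minimal sketch, using `email.message.Message` to stand in for a response's `.headers` attribute (which exposes the same interface); `declared_charset` is an illustrative helper, not a stdlib function:

```python
from email.message import Message

def declared_charset(headers, default=None):
    """Charset from the Content-Type header, or ``default`` if the
    server declared none.  Nothing guarantees it matches the bytes."""
    return headers.get_content_charset() or default

hdrs = Message()
hdrs['Content-Type'] = 'text/html; charset=ISO-8859-1'
print(declared_charset(hdrs))             # 'iso-8859-1' (lowercased)

bare = Message()
bare['Content-Type'] = 'text/html'        # no charset parameter at all
print(declared_charset(bare, 'latin-1'))  # falls back to the caller's guess
```

The fallback path is the interesting one: any real client has to decide what to do when the parameter simply is not there.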
Re: [Python-Dev] teaching the new urllib
Brett Cannon wrote: > On Tue, Feb 3, 2009 at 15:50, Tres Seaver wrote: > > -BEGIN PGP SIGNED MESSAGE- > > Hash: SHA1 > > > > Brett Cannon wrote: > >> On Tue, Feb 3, 2009 at 11:08, Brad Miller wrote: > >>> I'm just getting ready to start the semester using my new book (Python > >>> Programming in Context) and noticed that I somehow missed all the changes > >>> to > >>> urllib in python 3.0. ARGH to say the least. I like using urllib in the > >>> intro class because we can get data from places that are more > >>> interesting/motivating/relevant to the students. > >>> Here are some of my observations on trying to do very basic stuff with > >>> urllib: > >>> 1. urllib.urlopen is now urllib.request.urlopen > >> > >> Technically urllib2.urlopen became urllib.request.urlopen. See PEP > >> 3108 for the details of the reorganization. > >> > >>> 2. The object returned by urlopen is no longer iterable! no more for > >>> line > >>> in url. > >> > >> That is probably a difference between urllib2 and urllib. > >> > >>> 3. read, readline, readlines now return bytes objects or arrays of bytes > >>> instead of a str and array of str > >> > >> Correct. > >> > >>> 4. Taking the naive approach to converting a bytes object to a str does > >>> not > >>> work as you would expect. > >>> > >> import urllib.request > >> page = urllib.request.urlopen('http://knuth.luther.edu/test.html') > >> page > >>> > > >> line = page.readline() > >> line > >>> b' >> str(line) > >>> 'b\' >>> As you can see from the example the 'b' becomes part of the string! It > >>> seems like this should be a bug, is it? > >>> > >> > >> No because you are getting back the repr for the bytes object. Str > >> does not know what the encoding is for the bytes so it has no way of > >> performing the decoding. 
> > > > The encoding information *is* available in the response headers, e.g.: > > > > - -- %< - > > $ wget -S --spider http://knuth.luther.edu/test.html > > - --18:46:24-- http://knuth.luther.edu/test.html > > => `test.html' > > Resolving knuth.luther.edu... 192.203.196.71 > > Connecting to knuth.luther.edu|192.203.196.71|:80... connected. > > HTTP request sent, awaiting response... > > HTTP/1.1 200 OK > > Date: Tue, 03 Feb 2009 23:46:28 GMT > > Server: Apache/2.0.50 (Linux/SUSE) > > Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT > > ETag: "2fcd8-1d8-43b2bf40" > > Accept-Ranges: bytes > > Content-Length: 472 > > Keep-Alive: timeout=15, max=100 > > Connection: Keep-Alive > > Content-Type: text/html; charset=ISO-8859-1 > > Length: 472 [text/html] > > 200 OK > > - -- %< - > > > > Right, but he was asking about why passing bytes to str() led to it > returning the repr. > > > So, the OP's use case *could* be satisfied, assuming that the Py3K > > version of urllib sprouted a means of leveraging that header. In this > > sense, fetching the resource over HTTP is *better* than loading it from > > a file: information about the character set is explicit, and highly > > likely to be correct, at least for any resource people expect to render > > cleanly in a browser. > > Right. And even if the header lacks the info as Content-Type is not > guaranteed to contain the charset there is also the chance for the > HTML or DOCTYPE declaration to say. > > But as Bill pointed out, urllib just fetches data via HTTP, so a > character encoding will not always be valuable. Best solution would be > to provide something in html that can take what urllib.request.urlopen > returns and handle the decoding. Yes, that sounds like the right solution to me, too. Bill ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] teaching the new urllib
On Tue, Feb 3, 2009 at 15:50, Tres Seaver wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Brett Cannon wrote: >> On Tue, Feb 3, 2009 at 11:08, Brad Miller wrote: >>> I'm just getting ready to start the semester using my new book (Python >>> Programming in Context) and noticed that I somehow missed all the changes to >>> urllib in python 3.0. ARGH to say the least. I like using urllib in the >>> intro class because we can get data from places that are more >>> interesting/motivating/relevant to the students. >>> Here are some of my observations on trying to do very basic stuff with >>> urllib: >>> 1. urllib.urlopen is now urllib.request.urlopen >> >> Technically urllib2.urlopen became urllib.request.urlopen. See PEP >> 3108 for the details of the reorganization. >> >>> 2. The object returned by urlopen is no longer iterable! no more for line >>> in url. >> >> That is probably a difference between urllib2 and urllib. >> >>> 3. read, readline, readlines now return bytes objects or arrays of bytes >>> instead of a str and array of str >> >> Correct. >> >>> 4. Taking the naive approach to converting a bytes object to a str does not >>> work as you would expect. >>> >> import urllib.request >> page = urllib.request.urlopen('http://knuth.luther.edu/test.html') >> page >>> > >> line = page.readline() >> line >>> b'> str(line) >>> 'b\'>> As you can see from the example the 'b' becomes part of the string! It >>> seems like this should be a bug, is it? >>> >> >> No because you are getting back the repr for the bytes object. Str >> does not know what the encoding is for the bytes so it has no way of >> performing the decoding. > > The encoding information *is* available in the response headers, e.g.: > > - -- %< - > $ wget -S --spider http://knuth.luther.edu/test.html > - --18:46:24-- http://knuth.luther.edu/test.html > => `test.html' > Resolving knuth.luther.edu... 192.203.196.71 > Connecting to knuth.luther.edu|192.203.196.71|:80... connected. 
> HTTP request sent, awaiting response... > HTTP/1.1 200 OK > Date: Tue, 03 Feb 2009 23:46:28 GMT > Server: Apache/2.0.50 (Linux/SUSE) > Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT > ETag: "2fcd8-1d8-43b2bf40" > Accept-Ranges: bytes > Content-Length: 472 > Keep-Alive: timeout=15, max=100 > Connection: Keep-Alive > Content-Type: text/html; charset=ISO-8859-1 > Length: 472 [text/html] > 200 OK > - -- %< - > Right, but he was asking about why passing bytes to str() led to it returning the repr. > So, the OP's use case *could* be satisfied, assuming that the Py3K > version of urllib sprouted a means of leveraging that header. In this > sense, fetching the resource over HTTP is *better* than loading it from > a file: information about the character set is explicit, and highly > likely to be correct, at least for any resource people expect to render > cleanly in a browser. Right. And even if the header lacks the info as Content-Type is not guaranteed to contain the charset there is also the chance for the HTML or DOCTYPE declaration to say. But as Bill pointed out, urllib just fetches data via HTTP, so a character encoding will not always be valuable. Best solution would be to provide something in html that can take what urllib.request.urlopen returns and handle the decoding. -Brett ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
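The "chance for the HTML or DOCTYPE declaration to say" that Brett mentions could be sketched as a byte-level scan of the start of the document; `sniff_charset` and its regex are an illustrative heuristic under stated assumptions, not how any stdlib or patch code actually does it:

```python
import re

# Crude heuristic: look for charset=... inside a <meta> tag.
META_RE = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.I)

def sniff_charset(head_bytes, default='iso-8859-1'):
    """Scan the first bytes of an HTML document for a declared charset;
    fall back to ``default`` when no declaration is found."""
    m = META_RE.search(head_bytes)
    return m.group(1).decode('ascii') if m else default

html = b'<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
print(sniff_charset(html))            # utf-8
print(sniff_charset(b'<html></html>'))  # falls back to iso-8859-1
```

As the thread notes, even this declaration can disagree with the bytes actually on the wire, so it is a hint, not a guarantee.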
Re: [Python-Dev] teaching the new urllib
Tres Seaver wrote:
> Brett Cannon wrote:
>> No because you are getting back the repr for the bytes object. Str
>> does not know what the encoding is for the bytes so it has no way of
>> performing the decoding.
>
> The encoding information *is* available in the response headers, e.g.:
[snip]

That's the target of http://bugs.python.org/issue4733 cited by Benjamin: 'Add a "decode to declared encoding" version of urlopen to urllib'. I think it's an important use case, but the current patch is pretty awful. Improvements/feedback welcome :)

Daniel
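For flavor, the idea behind issue4733 might look something like the sketch below. The names `decode_stream` and `urlopen_text` are made up for illustration and are not the names used in the actual patch:

```python
import io
import urllib.request

def decode_stream(resp, fallback='iso-8859-1'):
    """Wrap a binary response object in a text stream, decoding with
    the charset the server declared in Content-Type (or a fallback)."""
    charset = resp.headers.get_content_charset() or fallback
    return io.TextIOWrapper(resp, encoding=charset)

def urlopen_text(url, fallback='iso-8859-1'):
    """Hypothetical 'decode to declared encoding' flavor of urlopen."""
    return decode_stream(urllib.request.urlopen(url), fallback)
```

Anything file-like that carries a `.headers` attribute works with `decode_stream`, so the decoding half can be exercised without touching the network.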
Re: [Python-Dev] teaching the new urllib
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Brett Cannon wrote: > On Tue, Feb 3, 2009 at 11:08, Brad Miller wrote: >> I'm just getting ready to start the semester using my new book (Python >> Programming in Context) and noticed that I somehow missed all the changes to >> urllib in python 3.0. ARGH to say the least. I like using urllib in the >> intro class because we can get data from places that are more >> interesting/motivating/relevant to the students. >> Here are some of my observations on trying to do very basic stuff with >> urllib: >> 1. urllib.urlopen is now urllib.request.urlopen > > Technically urllib2.urlopen became urllib.request.urlopen. See PEP > 3108 for the details of the reorganization. > >> 2. The object returned by urlopen is no longer iterable! no more for line >> in url. > > That is probably a difference between urllib2 and urllib. > >> 3. read, readline, readlines now return bytes objects or arrays of bytes >> instead of a str and array of str > > Correct. > >> 4. Taking the naive approach to converting a bytes object to a str does not >> work as you would expect. >> > import urllib.request > page = urllib.request.urlopen('http://knuth.luther.edu/test.html') > page >> > > line = page.readline() > line >> b' str(line) >> 'b\'> As you can see from the example the 'b' becomes part of the string! It >> seems like this should be a bug, is it? >> > > No because you are getting back the repr for the bytes object. Str > does not know what the encoding is for the bytes so it has no way of > performing the decoding. The encoding information *is* available in the response headers, e.g.: - -- %< - $ wget -S --spider http://knuth.luther.edu/test.html - --18:46:24-- http://knuth.luther.edu/test.html => `test.html' Resolving knuth.luther.edu... 192.203.196.71 Connecting to knuth.luther.edu|192.203.196.71|:80... connected. HTTP request sent, awaiting response... 
HTTP/1.1 200 OK Date: Tue, 03 Feb 2009 23:46:28 GMT Server: Apache/2.0.50 (Linux/SUSE) Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT ETag: "2fcd8-1d8-43b2bf40" Accept-Ranges: bytes Content-Length: 472 Keep-Alive: timeout=15, max=100 Connection: Keep-Alive Content-Type: text/html; charset=ISO-8859-1 Length: 472 [text/html] 200 OK - -- %< - So, the OP's use case *could* be satisfied, assuming that the Py3K version of urllib sprouted a means of leveraging that header. In this sense, fetching the resource over HTTP is *better* than loading it from a file: information about the character set is explicit, and highly likely to be correct, at least for any resource people expect to render cleanly in a browser. Tres. - -- === Tres Seaver +1 540-429-0999 tsea...@palladion.com Palladion Software "Excellence by Design"http://palladion.com -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFJiNhU+gerLs4ltQ4RAjalAKC6BcbTIFjUIBg51IbVtSd8dZsoDACggw1O +1Zlt7RlzdieQjoAw8AeScE= =lvtX -END PGP SIGNATURE- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] teaching the new urllib
İsmail Dönmez wrote:
> Hi,
>
> On Tue, Feb 3, 2009 at 21:56, Brett Cannon wrote:
> > Probably the biggest issue will be having to explain string encoding.
> > Obviously you can gloss over it or provide students with a simple
> > library that just automatically converts the strings. Or even better,
> > provide some code for the standard library that can take the HTML,
> > figure out the encoding, and then return the decoded strings (might
> > actually already be something for that that I am not aware of).
>
> http://chardet.feedparser.org/ should work fine for most auto-encoding
> detection needs.

Remember that the return value from urlopen() need not be HTML or XML. It could be, say, an image or PDF or Word, or pretty much anything.

Bill
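Bill's caveat suggests that any auto-decoding should first check the major content type; a small sketch, again simulating response headers with `email.message.Message` (`is_text` is an illustrative helper):

```python
from email.message import Message

def is_text(headers):
    """Only text/* responses are candidates for decoding to str;
    images, PDFs, Word files and so on must stay as bytes."""
    return headers.get_content_maintype() == 'text'

html = Message()
html['Content-Type'] = 'text/html; charset=utf-8'
png = Message()
png['Content-Type'] = 'image/png'

print(is_text(html), is_text(png))  # True False
```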
Re: [Python-Dev] teaching the new urllib
On Tue, Feb 3, 2009 at 2:08 PM, Brad Miller wrote: > Here's the iteration problem: > 'b\'>>> for line in page: > print(line) > Traceback (most recent call last): > File "", line 1, in > for line in page: > TypeError: 'addinfourl' object is not iterable > Why is this not iterable anymore? Is this too a bug? What the heck is an > addinfourl object? See http://bugs.python.org/issue4608. > > 5. Finally, I see that a bytes object has some of the same methods as > strings. But the error messages are confusing. line > b' "http://www.w3.org/TR/html4/loose.dtd";>\n' line.find('www') > Traceback (most recent call last): > File "", line 1, in > line.find('www') > TypeError: expected an object with the buffer interface line.find(b'www') > 11 > Why couldn't find take string as a parameter? See http://bugs.python.org/issue4733 -- Regards, Benjamin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
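The find() behaviour Benjamin points at reproduces without any network access; the sample line below is illustrative, not the one from Brad's session:

```python
line = b'<a href="http://www.w3.org/">W3C</a>\n'

# In Python 3, bytes.find() refuses a str pattern outright ...
try:
    line.find('www')
except TypeError as exc:
    print('str pattern rejected:', exc)

# ... so the pattern must itself be bytes: a literal, or an explicit encode.
print(line.find(b'www'))
print(line.find('www'.encode('ascii')))
```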
Re: [Python-Dev] teaching the new urllib
Hi,

On Tue, Feb 3, 2009 at 21:56, Brett Cannon wrote:
> Probably the biggest issue will be having to explain string encoding.
> Obviously you can gloss over it or provide students with a simple
> library that just automatically converts the strings. Or even better,
> provide some code for the standard library that can take the HTML,
> figure out the encoding, and then return the decoded strings (might
> actually already be something for that that I am not aware of).

http://chardet.feedparser.org/ should work fine for most auto-encoding detection needs.

Regards,
ismail
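When nothing is declared at all, a statistical detector like the chardet package linked above is one answer; a much cruder stdlib-only fallback is to try candidate encodings in order. This sketch is illustrative (the candidate list is arbitrary) and is no substitute for real detection:

```python
def guess_decode(raw, candidates=('utf-8', 'koi8-r', 'iso-8859-1')):
    """Try candidate encodings in order; return (text, encoding) for
    the first clean decode.  iso-8859-1 never raises, so it ends the
    list as a decode-anything last resort."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode('iso-8859-1', 'replace'), 'iso-8859-1'

# KOI8-R Cyrillic bytes are not valid UTF-8, so utf-8 is skipped here.
text, enc = guess_decode('Привет'.encode('koi8-r'))
print(text, enc)
```

A scheme like this illustrates exactly the Russian-user problem raised earlier in the thread: several encodings may "succeed" on the same bytes, and only one yields readable text.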
Re: [Python-Dev] teaching the new urllib
On Tue, Feb 3, 2009 at 11:08, Brad Miller wrote: > I'm just getting ready to start the semester using my new book (Python > Programming in Context) and noticed that I somehow missed all the changes to > urllib in python 3.0. ARGH to say the least. I like using urllib in the > intro class because we can get data from places that are more > interesting/motivating/relevant to the students. > Here are some of my observations on trying to do very basic stuff with > urllib: > 1. urllib.urlopen is now urllib.request.urlopen Technically urllib2.urlopen became urllib.request.urlopen. See PEP 3108 for the details of the reorganization. > 2. The object returned by urlopen is no longer iterable! no more for line > in url. That is probably a difference between urllib2 and urllib. > 3. read, readline, readlines now return bytes objects or arrays of bytes > instead of a str and array of str Correct. > 4. Taking the naive approach to converting a bytes object to a str does not > work as you would expect. > import urllib.request page = urllib.request.urlopen('http://knuth.luther.edu/test.html') page > > line = page.readline() line > b'>>> str(line) > 'b\'>>> > As you can see from the example the 'b' becomes part of the string! It > seems like this should be a bug, is it? > No because you are getting back the repr for the bytes object. Str does not know what the encoding is for the bytes so it has no way of performing the decoding. > Here's the iteration problem: > 'b\'>>> for line in page: > print(line) > Traceback (most recent call last): > File "", line 1, in > for line in page: > TypeError: 'addinfourl' object is not iterable > Why is this not iterable anymore? Is this too a bug? What the heck is an > addinfourl object? > > 5. Finally, I see that a bytes object has some of the same methods as > strings. But the error messages are confusing. 
line > b' "http://www.w3.org/TR/html4/loose.dtd";>\n' line.find('www') > Traceback (most recent call last): > File "", line 1, in > line.find('www') > TypeError: expected an object with the buffer interface line.find(b'www') > 11 > Why couldn't find take string as a parameter? Once again, encoding. The bytes object doesn't know what to encode the string to in order to do an apples-to-apples search of bytes. > If folks have advice on which, if any, of these are bugs please let me know > and I'll file them, and if possible work on fixes for them too. While not a bug, adding iterator support wouldn't hurt. And for the better TypeError messages, you could try submitting a patch to change to tack on something like "(e.g. bytes)", although I am not sure if anyone else would agree on that decision. > If you have advice on how I should better be teaching this new urllib that > would be great to hear as well. Probably the biggest issue will be having to explain string encoding. Obviously you can gloss over it or provide students with a simple library that just automatically converts the strings. Or even better, provide some code for the standard library that can take the HTML, figure out the encoding, and then return the decoded strings (might actually already be something for that that I am not aware of). -Brett ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
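On the iterator point: once the bytes are wrapped in a decoding layer, line iteration comes back for free. A sketch with `io.BytesIO` standing in for the object `urlopen()` returns:

```python
import io

raw = io.BytesIO(b'line one\nline two\n')   # stand-in for urlopen()'s return
page = io.TextIOWrapper(raw, encoding='ascii')

for line in page:                           # the 2.x-style loop works again
    print(line, end='')
```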
[Python-Dev] teaching the new urllib
I'm just getting ready to start the semester using my new book (Python Programming in Context) and noticed that I somehow missed all the changes to urllib in Python 3.0. ARGH, to say the least. I like using urllib in the intro class because we can get data from places that are more interesting/motivating/relevant to the students.

Here are some of my observations on trying to do very basic stuff with urllib:

1. urllib.urlopen is now urllib.request.urlopen
2. The object returned by urlopen is no longer iterable! No more "for line in url".
3. read, readline, readlines now return bytes objects or arrays of bytes instead of a str and array of str.
4. Taking the naive approach to converting a bytes object to a str does not work as you would expect.

>>> import urllib.request
>>> page = urllib.request.urlopen('http://knuth.luther.edu/test.html')
>>> page
>
>>> line = page.readline()
>>> line
b'
>>> str(line)
'b\'

As you can see from the example the 'b' becomes part of the string! It seems like this should be a bug, is it?

Here's the iteration problem:

>>> for line in page:
...     print(line)
Traceback (most recent call last):
  File "", line 1, in
    for line in page:
TypeError: 'addinfourl' object is not iterable

Why is this not iterable anymore? Is this too a bug? What the heck is an addinfourl object?

5. Finally, I see that a bytes object has some of the same methods as strings. But the error messages are confusing.

>>> line
b' "http://www.w3.org/TR/html4/loose.dtd">\n'
>>> line.find('www')
Traceback (most recent call last):
  File "", line 1, in
    line.find('www')
TypeError: expected an object with the buffer interface
>>> line.find(b'www')
11

Why couldn't find take string as a parameter?

If folks have advice on which, if any, of these are bugs please let me know and I'll file them, and if possible work on fixes for them too. If you have advice on how I should better be teaching this new urllib that would be great to hear as well.
Thanks,
Brad

--
Brad Miller
Assistant Professor, Computer Science
Luther College
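Brad's point 4 reproduces without any network access: str() on a bytes object yields the repr, b prefix and all, while decode() gives the actual text. The sample literal here is illustrative:

```python
line = b'<!DOCTYPE HTML>\n'

print(str(line))                  # the repr: starts with b' and shows \n escaped
print(line.decode('iso-8859-1'))  # the actual text, with a real newline
```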