Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Friday 12 December 2008, Adam Olsen wrote:

Only pages like this, which indicate the underlying API is an array of WCHAR: http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx

Hmm, true. So even there, the encoding isn't known...

char * is just fine. You need only pass a length along with it. All internal APIs *must* already do this, as they support nul bytes. Also note that the underlying POSIX APIs prohibit nul bytes in filenames, so it's irrelevant for them.

Hmmm, I see things like Py_GetPath() in the 2.7 source code, which returns a plain char*. I really need to check if 3.0 is better.

Thanks for the info.

Uli

--
Sator Laser GmbH
Managing director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

Visit our website at http://www.satorlaser.de/

This e-mail, including all attachments, is intended only for the addressee and may contain confidential information. Please notify the sender immediately if you are not the intended recipient. In that case the e-mail must be deleted and may not be read, forwarded, published or otherwise used. E-mails can be read by third parties and may contain viruses and unauthorized changes. Sator Laser GmbH is not responsible for these consequences.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, 12 Dec 2008 06:33:28 pm Toshio Kuratomi wrote:

Also interesting, if you point your browser at: http://toshio.fedorapeople.org/u/ You should see two other test files. They're both (one-half)(enyei).html but one's encoded in utf-8 and the other in latin-1.

For what it's worth, Konqueror 3.5 displays the two files as

(1/2)(n+tilde).html
(A+caret)(1/2)(A+tilde)(plusminus).html

It doesn't seem to have any trouble opening either of them.

--
Steven
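The two displayed names are what one would expect if both byte sequences were read as Latin-1. That reading is an assumption about Konqueror, not something the message states. A minimal Python sketch of the effect, with the file name written as code points:

```python
# The name "(one-half)(n+tilde).html" written as code points.
name = "\u00bd\u00f1.html"

utf8_bytes = name.encode("utf-8")      # bytes of the UTF-8 encoded name
latin1_bytes = name.encode("latin-1")  # bytes of the Latin-1 encoded name

# Reading the UTF-8 bytes as Latin-1 yields four characters where two
# were intended: (A+caret)(1/2)(A+tilde)(plusminus).
misread = utf8_bytes.decode("latin-1")
print(ascii(misread))

# The reverse mistake does not even decode: the Latin-1 bytes are not
# valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

Both names open fine because the file system only compares bytes; only the display depends on the guessed encoding.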
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thursday 11 December 2008, Steve Holden wrote:

Ulrich Eckhardt wrote:

If readdir() returned Unicode text, people would start taking that for granted. If it returned bytes, just the same. Returning a completely unrelated type will give them enough hint that for this thing they have to rethink their assumptions. This runs along the lines of "In the face of ambiguity, refuse the temptation to guess.", as it makes guessing rather impossible.

So you are suggesting this special object be used only to represent files to users? Now I understand.

Not only files, the same problem crops up when handling sys.argv and os.environ. I just don't see a case where using a separate path class would break things. Further, the special handling that is required would be made even clearer by using such a class.

But it does have to be implemented ...

Well, it isn't really terribly difficult to do so; after all, it's just a container for either a byte string or Unicode string plus some helper code to convert it to/from Unicode.

Uli
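To make the "container plus helper code" idea concrete, here is a rough Python sketch. The class name EnvString, its method names and the opaque repr are all made up for illustration; nothing like this exists in the standard library.

```python
class EnvString:
    """Hypothetical container for an OS-supplied name (not a real Python type)."""

    def __init__(self, raw):
        if not isinstance(raw, (bytes, str)):
            raise TypeError("raw must be bytes or str")
        self._raw = raw

    def to_unicode(self, encoding="utf-8", errors="strict"):
        """Explicit conversion; the caller picks encoding and error policy."""
        if isinstance(self._raw, str):
            return self._raw
        return self._raw.decode(encoding, errors)

    def to_bytes(self, encoding="utf-8"):
        """Return the original bytes, or encode a native Unicode name."""
        if isinstance(self._raw, bytes):
            return self._raw
        return self._raw.encode(encoding)

    def __repr__(self):
        # Deliberately opaque so that nobody mistakes the object for text.
        return "<EnvString at 0x%x>" % id(self)


name = EnvString(b"\xff\xff")
approximation = name.to_unicode(errors="replace")  # lossy, but asked for explicitly
original = name.to_bytes()                         # always recoverable
```

The point of the design is that every conversion is a visible call, so the choice between an exception, an approximation or skipping the file is made by the application and not by the library.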
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thursday 11 December 2008, Adam Olsen wrote:

The simplest solution there is to have windows bytes APIs that return raw UTF-16 bytes (note that windows does NOT guarantee them to be valid unicode, despite being much more likely than on linux).

Actually, I'm not aware of this case. I only know that the OS refuses to mount media it can't decode, but that is on the OS-level. Can you give me a hint?

The only real issue I see is that UTF-16 isn't an ASCII superset, so it won't print nicely.

True, but I personally couldn't care less. Actually, I would even prefer if printing a byte string always produced \x escaped byte values, that way it would at least be consistent.

In other words, bytes can be your special type.

That would actually be a lot of work to do, but I do agree that it would be a way. The problem though is that I have seen quite a few places in Python where such a byte string is passed as 'char*' and treated with the assumption that strlen() would yield a meaningful value there, so this calls at least for a distinct 'Py_Byte' type.

Also, this still doesn't even remotely handle the problem that you do have two valid encodings on win32, even though the MBCS one could be called deprecated. People will try to interface to other libraries that use win32 CHAR strings and that will be much harder or even impossible. Further, and that is IMHO the worst part of it, things will fail too silently and programmers aren't encouraged to write portable code, but maybe I'm just too pessimistic.

Uli
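Two of the points above can be checked from Python without touching a file system. This is only a sketch of the byte-level facts (embedded zero bytes, and 16-bit units that are not valid Unicode); it does not call any Windows API.

```python
# ASCII text encoded as UTF-16 has a zero byte after every character,
# so a C strlen() over these bytes would stop after the first one.
raw = "abc".encode("utf-16-le")
first_nul = raw.index(b"\x00")   # offset where a NUL-terminated scan would stop

# A lone surrogate is a legal 16-bit unit but not valid Unicode text.
# "surrogatepass" is used here only to build such bytes for the demo.
lone = "\ud800".encode("utf-16-le", "surrogatepass")
try:
    lone.decode("utf-16-le")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False
```

So raw UTF-16 bytes need an explicit length, and a strict decoder can still reject them.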
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Toshio Kuratomi writes:

Adam Olsen wrote:

On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org wrote:

Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-law <wink> like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation.

With all the focus on backup tools and file managers I think we've lost perspective. They're an important use case, but hardly the dominant one.

True.

Please, as a user, if your app is creating new files, do NOT use bytes! You have no excuse for creating garbage, and garbage doesn't help the user any. Get the encoding right, use the unicode APIs, and don't pass the buck on to everything else.

Uhmmm That's good advice but doesn't solve any problems :-(.

Exactly. Furthermore, the problems *already exist*. My current locale is UTF-8 and all files dated since about 2002 have UTF-8 names, *except* in my MIME-bodies garbage can, where only recently have I got around to coercing my MUA into doing the right thing. And of course there are still legacy file names in EUC-JP, which I suppose I could search for, but since I only access a directory containing one once in a pale blue moon, I'm not gonna bother.

It's just not reasonable to expect users or even sysadmins to go around cleaning up legacy data.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
* Adam Olsen wrote:

UTF-8 in percent encodings is becoming a de facto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text.

Duh! The address bar should contain the URL, which *is* the intended text. The escapes are there for a reason. If I pass some octets using percent escapes via the query string or request body, it's not text, not even intended. It's still a collection of octets. Translating them back (and forth when I press enter in the address bar) is a pretty ambiguous operation and therefore pretty wrong.

The de facto standard does not exist. There's a real one instead: RFC 2396.

nd
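The ambiguity being argued about can be shown in a few lines of Python. The escape sequence below is an arbitrary example chosen for illustration, not one taken from the thread.

```python
from urllib.parse import unquote_to_bytes

# Percent escapes decode to octets, not to characters.
octets = unquote_to_bytes("%C3%A9")

# The same two octets are valid in more than one encoding,
# and each reading yields different text.
as_utf8 = octets.decode("utf-8")      # one character
as_latin1 = octets.decode("latin-1")  # two characters
```

Nothing in the URL itself says which reading the server intended, which is the sense in which the escapes are opaque to the client.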
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Friday 12 December 2008, Stephen J. Turnbull wrote:

I gather that the BDFL's line on this thread of discussion is that forcing programmers to think about encodings every time they call out to the OS is unacceptable

Exactly that is not necessary.

    for n in os.readdir('.'):
        f = open(n)
        if grep('foo', f):
            print('found foo!')

Now, if you actually wanted to output the filename, you could never do so reliably anyway, because even though it is supposed to be text, the encoding isn't known. So, an archiving program will probably do something like this:

    try:
        for n in os.readdir():
            b = n.encode('UTF-8')
            f = open(n)
            archive.write_file_header(b)
            archive.write_file(f)
    except ...:
        print("oops, couldn't decode file '%s'" % n.unicode(error='replace'))

If you're writing a filemanager, you would store the path alongside an approximated Unicode representation.

when most programs will work acceptably almost all of the time with a rather naive approach. This means that almost all Python programs will be technically broken for the foreseeable future, sorry, Ulrich.

Actually, they are already broken, only that few people notice it. :|

And for the same pragmatic reasons, these functions are going to return strings (ie, Unicode), not bytes, I expect. Sorry, Steve. What needs to be determined here is the best way to provide reliability to those who will go to the effort of asking for it if it's available. I don't think "just return bytes" fits the bill for the reason above. What I would like to see is a type that is derived from string (so if you present it to an API expecting string, it is silently treated as string), but from which the original bytes can always be extracted on request.

I like that idea, this type would behave pretty much like the env_string I proposed. The main difference is that it does several implicit conversions where I personally would rather see explicit conversions. Other than that, I'm all for it.

If the original bytes cannot be sensibly decoded to a string, then the string field in the object would either contain something that should normally cause an error in a string API, or some made-up string (presumably it would attempt to be a more or less faithful representation of the bytes) at the caller's option. Probably they'd also contain some metadata useful in guessing encodings (the read time locale in particular).

Well, I wouldn't provide an approximation. Considering the archiving software above, you would end up with an undecodable file name in an archive. For that kind of software, it would be fatal.

But, and that is much more important than my preference, at least your approach would allow writing reliable software that properly handles such environment strings. Further, and that is where it differs from just returning bytes, it even makes it easy by using a distinct type.

Uli
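The string-derived type described in the quoted proposal could look roughly like the following Python sketch. The name OSString and its attributes (raw, encoding, decoded_ok) are invented for illustration, and the fallback here is the "made-up string" option, using the replacement character.

```python
class OSString(str):
    """Hypothetical str subclass that remembers the bytes it was built from."""

    def __new__(cls, raw, encoding="utf-8"):
        try:
            text = raw.decode(encoding)
            decoded_ok = True
        except UnicodeDecodeError:
            # Made-up approximation; the original bytes stay available below.
            text = raw.decode(encoding, "replace")
            decoded_ok = False
        self = super().__new__(cls, text)
        self.raw = raw                # original bytes, always extractable
        self.encoding = encoding      # metadata: the encoding that was tried
        self.decoded_ok = decoded_ok  # False when text is only an approximation
        return self


good = OSString(b"caf\xc3\xa9")
bad = OSString(b"\xff\xff")
```

One limitation of this sketch: ordinary string methods such as upper() or slicing return a plain str, so the original bytes are lost as soon as the value is transformed. An archiver would have to check decoded_ok itself before trusting the text, which is the implicit-versus-explicit trade-off discussed above.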
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:

* Adam Olsen wrote:

UTF-8 in percent encodings is becoming a de facto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text.

Duh! The address bar should contain the URL, which *is* the intended text. The escapes are there for a reason. If I pass some octets using percent escapes via the query string or request body, it's not text, not even intended. It's still a collection of octets. Translating them back (and forth when I press enter in the address bar) is a pretty ambiguous operation and therefore pretty wrong. The de facto standard does not exist. There's a real one instead: RFC 2396.

All the heaps of people using non-English Wikipedia sites might disagree with you. There's only, what, a few *million* pages that would be affected?

It'd be very interesting if someone at Google could provide some statistics on URL encodings.

--
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Curt Hagenlocher curt at hagenlocher.org writes:

On Thu, Dec 11, 2008 at 10:19 PM, Adam Olsen rhamph at gmail.com wrote:

I doubt that UTF-16 is used very much (other than on windows).

There's this other obscure platform called Java... ;)

Does it have a filesystem?
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 12, 2008 at 5:06 AM, Antoine Pitrou solip...@pitrou.net wrote:

Curt Hagenlocher curt at hagenlocher.org writes:

There's this other obscure platform called Java... ;)

Does it have a filesystem?

No, but it also has to interact with filesystems of possibly invalid or indeterminate encodings. What does java.io do?

--
Curt Hagenlocher c...@hagenlocher.org
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Curt Hagenlocher curt at hagenlocher.org writes:

No, but it also has to interact with filesystems of possibly invalid or indeterminate encodings. What does java.io do?

My point was that Python doesn't have to interact with the Java IO libraries, while it has to interact with the Unix and Windows IO APIs.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 12, 2008 at 6:19 AM, Antoine Pitrou solip...@pitrou.net wrote:

Curt Hagenlocher curt at hagenlocher.org writes:

No, but it also has to interact with filesystems of possibly invalid or indeterminate encodings. What does java.io do?

My point was that Python doesn't have to interact with the Java IO libraries, while it has to interact with the Unix and Windows IO APIs.

Of course. But the Java IO libraries have to interact with the Unix and Windows IO APIs as well. It might be interesting to know how they handle similar situations.

--
Curt Hagenlocher c...@hagenlocher.org
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Curt Hagenlocher wrote:

On Fri, Dec 12, 2008 at 6:19 AM, Antoine Pitrou solip...@pitrou.net wrote:

Curt Hagenlocher curt at hagenlocher.org writes:

No, but it also has to interact with filesystems of possibly invalid or indeterminate encodings. What does java.io do?

My point was that Python doesn't have to interact with the Java IO libraries, while it has to interact with the Unix and Windows IO APIs.

Of course. But the Java IO libraries have to interact with the Unix and Windows IO APIs as well. It might be interesting to know how they handle similar situations.

See the following email for a summary of existing practice (as of 2004): http://www.mail-archive.com/unic...@unicode.org/msg27352.html

-Scott

--
Scott Dial sc...@scottdial.com scod...@cs.indiana.edu
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote:

UTF-8 in percent encodings is becoming a de facto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text. IOW, inconsistent behaviour is a bug, but translating into UTF-8 is not. ;)

I think we should let this tangent drop because it's about bugs in firefox, not in python :-)

-Toshio
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 12, 2008 at 16:21, Scott Dial scott+python-...@scottdial.com wrote:

See the following email for a summary of existing practice (as of 2004): http://www.mail-archive.com/unic...@unicode.org/msg27352.html

Interesting. Quite a lot of them do just drop the undecodable filenames. The Java solution of replacing the undecodable parts seems to be a better idea at first glance, but what if you then end up with two filenames that are the same?

Possibly replacing with the ? character is a good idea to notify that the file is there, but then fail to open it.

--
Lennart Regebro: Zope and Plone consulting. http://www.colliberty.com/ +33 661 58 14 64
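The collision worry is easy to reproduce. A small Python sketch with two made-up byte names, using the codec's "replace" handler as a stand-in for whatever substitution a runtime might apply:

```python
# Two distinct byte names, neither of which is valid UTF-8.
names = [b"\xff.txt", b"\xfe.txt"]

# Decoding with replacement maps every bad byte to U+FFFD,
# so both names collapse to the same string.
shown = [n.decode("utf-8", "replace") for n in names]
collision = shown[0] == shown[1]

# Encoding the replaced string does not give back either original name,
# so the file cannot be opened through it.
round_trip = shown[0].encode("utf-8")
```

After the substitution the string is only good for display; it can no longer identify the file.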
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 02:23 pm, c...@hagenlocher.org wrote:

On Fri, Dec 12, 2008 at 6:19 AM, Antoine Pitrou solip...@pitrou.net wrote:

Curt Hagenlocher curt at hagenlocher.org writes:

No, but it also has to interact with filesystems of possibly invalid or indeterminate encodings. What does java.io do?

My point was that Python doesn't have to interact with the Java IO libraries, while it has to interact with the Unix and Windows IO APIs.

Of course. But the Java IO libraries have to interact with the Unix and Windows IO APIs as well. It might be interesting to know how they handle similar situations.

Apparently Java has the facilities to do the right thing, but actually it's just broken.

My locale says UTF-8. However, if I create a non-decodable file with Python (2), there are three ways I can tell Java to open it: I can ask for it with a string (that won't work, because no valid UTF-8 string maps to an undecodable string, pretty much by definition). I can list the directory that it's in (presuming that *that's* a directory) and get a java.io.File, which could be retaining all the interesting information, or I can use a URI, which is a string that resolves to octets before it resolves to characters again.

However, it looks like Java screws up in every case. Here's a transcript from the ever-helpful jython:

    gl...@nhuvasarim:~/tmp$ python
    Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
    [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> file("\xff\xff", "wb").write("lolz\n")
    gl...@nhuvasarim:~/tmp$ jython
    Jython 2.2.1 on java1.6.0_07
    Type "copyright", "credits" or "license" for more information.
    >>> from java.io import File
    >>> fileList = File(".").listFiles()
    >>> fileList
    array(java.io.File,[./
    >>> fileList[0].__class__
    jclass java.io.File 1
    >>> from java.io import FileReader
    >>> FileReader(fileList[0])
    Traceback (innermost last):
      File "<console>", line 1, in ?
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileReader.<init>(FileReader.java:55)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    java.io.FileNotFoundException: java.io.FileNotFoundException: ./ÿFDÿFD (No such file or directory)
    >>> from java.net import URI
    >>> u = URI("file:///home/glyph/tmp/%ff%ff")
    >>> FileReader(File(u))
    Traceback (innermost last):
      File "<console>", line 1, in ?
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at java.io.FileReader.<init>(FileReader.java:55)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    java.io.FileNotFoundException: java.io.FileNotFoundException: /home/glyph/tmp/ÿFDÿFD (No such file or directory)
Re: [Python-Dev] Python-3.0, unicode, and os.environ
* Adam Olsen wrote:

On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:

* Adam Olsen wrote:

UTF-8 in percent encodings is becoming a de facto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text.

Duh! The address bar should contain the URL, which *is* the intended text. The escapes are there for a reason. If I pass some octets using percent escapes via the query string or request body, it's not text, not even intended. It's still a collection of octets. Translating them back (and forth when I press enter in the address bar) is a pretty ambiguous operation and therefore pretty wrong. The de facto standard does not exist. There's a real one instead: RFC 2396.

All the heaps of people using non-English Wikipedia sites might disagree with you. There's only, what, a few *million* pages that would be affected?

I'm not sure what you're trying to pull here. Is that supposed to be an argument? There's no page affected at all. It's a browser UI issue, not a page issue. And even if it were interesting at all how the URL escapes are displayed in the address bar, those millions of people would favour KOI8-R or Big 5 over UTF-8 if you asked them.

Which leads to the exact point: The browser cannot know, nor should it even. It's opaque. The only entity which needs to understand the encoding of URL percent escapes in query or request body is the *server* selecting the resource. But I'm sure I'm not telling you any news here.

nd

--
Gates's behaviour had proven to me that I did not need to count on him and his two companions -- Karl May, Winnetou III

Something new in the West: http://pub.perlig.de/books.html#apache2
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 12, 2008 at 9:47 PM, André Malo n...@perlig.de wrote:

* Adam Olsen wrote:

On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:

* Adam Olsen wrote:

UTF-8 in percent encodings is becoming a de facto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text.

Duh! The address bar should contain the URL, which *is* the intended text. The escapes are there for a reason. If I pass some octets using percent escapes via the query string or request body, it's not text, not even intended. It's still a collection of octets. Translating them back (and forth when I press enter in the address bar) is a pretty ambiguous operation and therefore pretty wrong. The de facto standard does not exist. There's a real one instead: RFC 2396.

All the heaps of people using non-English Wikipedia sites might disagree with you. There's only, what, a few *million* pages that would be affected?

I'm not sure what you're trying to pull here. Is that supposed to be an argument? There's no page affected at all. It's a browser UI issue, not a page issue. And even if it were interesting at all how the URL escapes are displayed in the address bar, those millions of people would favour KOI8-R or Big 5 over UTF-8 if you asked them.

Which leads to the exact point: The browser cannot know, nor should it even. It's opaque. The only entity which needs to understand the encoding of URL percent escapes in query or request body is the *server* selecting the resource. But I'm sure I'm not telling you any news here.

You're arguing that text should be an opaque entity.

We've wasted enough of everybody's time on this already, I'm not going to continue on this thread. Send me a private email if you think it's really important.

--
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
* Adam Olsen wrote:

On Fri, Dec 12, 2008 at 9:47 PM, André Malo n...@perlig.de wrote:

* Adam Olsen wrote:

On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:

* Adam Olsen wrote:

UTF-8 in percent encodings is becoming a de facto standard. Otherwise the browser has to display the percent escapes in the address bar, rather than the intended text.

Duh! The address bar should contain the URL, which *is* the intended text. The escapes are there for a reason. If I pass some octets using percent escapes via the query string or request body, it's not text, not even intended. It's still a collection of octets. Translating them back (and forth when I press enter in the address bar) is a pretty ambiguous operation and therefore pretty wrong. The de facto standard does not exist. There's a real one instead: RFC 2396.

All the heaps of people using non-English Wikipedia sites might disagree with you. There's only, what, a few *million* pages that would be affected?

I'm not sure what you're trying to pull here. Is that supposed to be an argument? There's no page affected at all. It's a browser UI issue, not a page issue. And even if it were interesting at all how the URL escapes are displayed in the address bar, those millions of people would favour KOI8-R or Big 5 over UTF-8 if you asked them.

Which leads to the exact point: The browser cannot know, nor should it even. It's opaque. The only entity which needs to understand the encoding of URL percent escapes in query or request body is the *server* selecting the resource. But I'm sure I'm not telling you any news here.

You're arguing that text should be an opaque entity.

No, actually I'm not. I'm arguing that escapes are opaque.

We've wasted enough of everybody's time on this already, I'm not going to continue on this thread.

Agreed.

nd

--
That reminds me, why is there actually no i with a little heart as its dot in Unicode? That would be sooo sweet! -- Björn Höhrmann in darw
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Wednesday 10 December 2008, Adam Olsen wrote:

On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote:

On Tuesday 09 December 2008, Adam Olsen wrote:

The only thing separating this from a bikeshed discussion is that a bikeshed has many equally good solutions, while we have no good solutions. Instead we're trying to find the least-bad one. The unicode/bytes separation is pretty close to that. Adding a warning gets even closer. Adding magic makes it worse.

Well, I see two cases:
1. Converting from an uncertain representation to a known one.
2. Converting from a known representation to a known one.

Not quite:
1. Using a garbage file name locally (within a single process, not talking to any libs)
2. Using a unicode filename everywhere (libs, saved to config files, displayed to the user, etc.)

I think there is some misunderstanding. I was referring to conversions and whether it is good to perform them implicitly. For that, I saw the above two cases.

On linux the bytes/unicode separation is perfect for this. You decide which approach you're using and use it consistently. If you mess up (mixing bytes and unicode) you'll consistently get an error. We currently don't follow this model on windows, so a garbage file name gets passed around as if it was unicode, but fails when passed to a lib, saved to a config file, is displayed to a user, etc.

I'm not sure I agree with this. Facts I know are:
1. On POSIX systems, there is no reliable encoding for filenames while the system APIs use char/byte strings.
2. On MS Windows, the encoding for filenames is Unicode/UTF-16.

Returning Unicode strings from readdir() is wrong because it can't handle case 1 above. Returning byte strings is wrong because it can't handle case 2 above: it gives you useless roundtrips from UTF-16 to either UTF-8 or, worst case, to the locale-dependent MBCS. Returning something different depending on the system is also broken because that would make Python code that uses this function and assumes a certain type unportable.

Note that this doesn't get much better if you provide a separate readdirb() API or one that simply returns a byte string or Unicode string depending on its argument. It just shifts the brokenness from readdir() to the code that uses it, unless this code makes a distinction between the target systems. Since way too many programmers are not aware of the problem, they will not handle these systems differently, so code will become non-portable.

What I'd just like some feedback on is the approach of returning a distinct type (neither a byte string nor a Unicode string) from readdir(). In order to use this, a programmer will have to convert it explicitly, otherwise e.g. printing it will just produce "env_string at 0x01234567". This will immediately confront each programmer with the issue of unknown encodings and they will have to make the application-specific choice whether an approximation of the filename, an exception or ignoring the file is the right choice. Also, it presents the options for doing this conversion in a single class, which I personally find much better than providing overloads for hundreds of functions.

Sorry for ranting, but I'm a bit confused and desperate, because either I'm unable to explain what I mean or I'm really not understanding something that everybody else here seems to agree upon. I just know that using a distinct path type has helped me in C++ in the past, and I don't see why it shouldn't in Python.

Uli
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Ulrich Eckhardt wrote: On Wednesday 10 December 2008, Adam Olsen wrote: On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote: On Tuesday 09 December 2008, Adam Olsen wrote: The only thing separating this from a bikeshed discussion is that a bikeshed has many equally good solutions, while we have no good solutions. Instead we're trying to find the least-bad one. The unicode/bytes separation is pretty close to that. Adding a warning gets even closer. Adding magic makes it worse. Well, I see two cases: 1. Converting from an uncertain representation to a known one. 2. Converting from a known representation to a known one. Not quite: 1. Using a garbage file name locally (within a single process, not talking to any libs) 2. Using a unicode filename everywhere (libs, saved to config files, displayed to the user, etc.) I think there is some misunderstanding. I was referring to conversions and whether it is good to perform them implicitly. For that, I saw the above two cases. On linux the bytes/unicode separation is perfect for this. You decide which approach you're using and use it consistently. If you mess up (mixing bytes and unicode) you'll consistently get an error. We currently don't follow this model on windows, so a garbage file name gets passed around as if it was unicode, but fails when passed to a lib, saved to a config file, is displayed to a user, etc. I'm not sure I agree with this. Facts I know are: 1. On POSIX systems, there is no reliable encoding for filenames while the system APIs use char/byte strings. 2. On MS Windows, the encoding for filenames is Unicode/UTF-16. Returning Unicode strings from readdir() is wrong because it can't handle the case 1 above. Returning byte strings is wrong because it can't handle case 2 above because it gives you useless roundtrips from UTF-16 to either UTF-8 or, worst case, to the locale-dependent MBCS. 
Returning something different depending on the system us also broken because that would make Python code that uses this function and assumes a certain type unportable. Note that this doesn't get much better if you provide a separate readdirb() API or one that simply returns a byte string or Unicode string depending on its argument. It just shifts the brokenness from readdir() to the code that uses it, unless this code makes a distinction between the target systems. Since way too many programmers are not aware of the problem, they will not handle these systems differently, so code will become non-portable. What I'd just like some feedback on is the approach to return a distinct type (neither a byte string nor a Unicode string) from readdir(). In order to use this, a programmer will have to convert it explicitly, otherwise e.g. printing it will just produce env_string at 0x01234567. This will immediately bump each programmer with their heads on the issue of unknown encodings and they will have to make the application-specific choice whether an approximation of the filename, an exception or ignoring the file is the right choice. Also, it presents the options for doing this conversion in a single class, which I personally find much better than providing overloads for hundreds of functions. Sorry for ranting, but I'm a bit confused and desperate, because either I'm unable to explain what I mean or I'm really not understanding something that everybody else here seems to agree upon. I just know that using a distinct path type has helped me in C++ in the past, and I don't see why it shouldn't in Python. Seems to me this just threatens to add to the confusion. If you know what your filesystem produces, you can take the appropriate action to convert it into a type that makes sense to the user. If you don't, then at least if you have the string in its bytes form you can re-present it to the filesystem to manipulate the file. What are we supposed to do with the special type? 
regards Steve -- Steve Holden, Holden Web LLC http://www.holdenweb.com/
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thursday 11 December 2008, Steve Holden wrote: Ulrich Eckhardt wrote: What I'd just like some feedback on is the approach to return a distinct type (neither a byte string nor a Unicode string) from readdir(). In order to use this, a programmer will have to convert it explicitly, otherwise e.g. printing it will just produce env_string at 0x01234567. This will immediately bump each programmer with their heads on the issue of unknown encodings and they will have to make the application-specific choice whether an approximation of the filename, an exception or ignoring the file is the right choice. Also, it presents the options for doing this conversion in a single class, which I personally find much better than providing overloads for hundreds of functions. [...] Seems to me this just threatens to add to the confusion. If you know what your filesystem produces, you can take the appropriate action to convert it into a type that makes sense to the user. If you don't, then at least if you have the string in its bytes form you can ^^^ There are operating systems that don't use bytes to represent a file path, namely all the MS Windows variants. Even worse, when you use a byte string there, it typically means that you want to use the obsolete encoding that is based on codepages. Why can we not preserve the representation of a path as it is? Why do we _have_ to convert it to anything at all, without even knowing if this conversion is needed? I just want to do something to a file's content, why does its path have to be converted to something and then be converted back in order for the system to digest it? re-present it to the filesystem to manipulate the file. What are we supposed to do with the special type? You receive from readdir() and pass it to stat(), simple as that. No conversions from the native representation needed. 
If you need a textual representation, then you have to convert it and you have to do so explicitly according to whatever logic your application requires. If readdir() returned Unicode text, people would start taking that for granted. If it returned bytes, just the same. Returning a completely unrelated type will give them enough of a hint that for this thing they have to rethink their assumptions. This runs along the lines of "In the face of ambiguity, refuse the temptation to guess.", as it makes guessing rather impossible. I just don't see a case where using a separate path class would break things. Further, the special handling that is required would be made even clearer by using such a class. Uli
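The "receive it from readdir() and pass it to stat()" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `NativePath` and `listdir_native` are hypothetical names, not anything proposed verbatim in the thread, and the sketch leans on the `__fspath__` protocol that Python only gained later (3.6), so that `os.stat()` accepts the object without any decoding.

```python
import os


class NativePath:
    """Hypothetical opaque wrapper around a name exactly as the OS returned it."""

    def __init__(self, raw):
        # raw is the untouched bytes name; it is never decoded implicitly.
        self._raw = raw

    def __fspath__(self):
        # Lets os.stat(), open(), os.path.join() consume the native form directly.
        return self._raw

    def to_text(self, encoding, errors="strict"):
        # The only road to text: the caller must name an encoding and error policy.
        return self._raw.decode(encoding, errors)


def listdir_native(directory):
    """Hypothetical readdir(): return NativePath objects instead of str or bytes."""
    return [NativePath(raw) for raw in os.listdir(os.fsencode(directory))]


if __name__ == "__main__":
    for entry in listdir_native("."):
        full = os.path.join(os.fsencode("."), entry)
        # No conversion needed to talk to the OS again.
        size = os.stat(full).st_size
        # Printing the object itself shows an opaque "<... object at 0x...>" repr.
        print(entry, size, entry.to_text("utf-8", errors="replace"))
```

Because no `__str__` or `__repr__` is defined, printing an entry gives the default opaque representation, which is the nudge towards explicit conversion that the message above argues for.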
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thu, 11 Dec 2008, Ulrich Eckhardt wrote: On Thursday 11 December 2008, Steve Holden wrote: Ulrich Eckhardt wrote: Seems to me this just threatens to add to the confusion. If you know what your filesystem produces, you can take the appropriate action to convert it into a type that makes sense to the user. If you don't, then at least if you have the string in its bytes form you can ^^^ There are operating systems that don't use bytes to represent a file path, namely all the MS Windows variants. Even worse, when you use a byte string there, it typically means that you want to use the obsolete encoding that is based on codepages. Why can we not preserve the representation of a path as it is? Why do we _have_ to convert it to anything at all, without even knowing if this conversion is needed? I just want to do something to a file's content, why does its path have to be converted to something and then be converted back in order for the system to digest it? re-present it to the filesystem to manipulate the file. What are we supposed to do with the special type? You receive from readdir() and pass it to stat(), simple as that. No conversions from the native representation needed. If you need a textual representation, then you have to convert it and you have to do so explicitly according to whatever logic your application requires. Not only would this address the issue with the local filesystem, it would also provide a principled way to deal with remote filesystems. For example, an FTP interface library for Python could use this type to returns paths of the sort actually supported by the raw FTP protocol. Thinking of the filesystem is actually a misconception - always referring to a filesystem opens up all sorts of possibilities. 
There is a lot of coding to do to allow this, but allowing programs to work with paths and files in the local filesystem, remote filesystems, and filesystems constructed from others (e.g., by expanding symlinks, changing the root similar to chroot, or encoding/unencoding pathnames) would open up lots of possibilities, including better test environments. This is an interesting case of separating byte strings from character strings. As long as the two are conflated, everything appears simple. But when they are separated, not only are there two types where before there was only one, it turns out that which type is correct in some circumstances depends on the platform. Also, many objects which are byte strings at the protocol level are usually or always meant to be character strings of some sort, but how to translate them simply cannot be nailed down once and for all. Isaac Morland CSCF Web Guru DC 2554C, x36650WWW Software Specialist ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thu, Dec 11, 2008 at 6:41 AM, Ulrich Eckhardt eckha...@satorlaser.com wrote: On Thursday 11 December 2008, Steve Holden wrote: re-present it to the filesystem to manipulate the file. What are we supposed to do with the special type? You receive from readdir() and pass it to stat(), simple as that. No conversions from the native representation needed. If you need a textual representation, then you have to convert it and you have to do so explicitly according to whatever logic your application requires. The simplest solution there is to have Windows bytes APIs that return raw UTF-16 bytes (note that Windows filenames are NOT guaranteed to be valid Unicode, despite that being much more likely than on Linux). The only real issue I see is that UTF-16 isn't an ASCII superset, so it won't print nicely. In other words, bytes can be your special type. -- Adam Olsen, aka Rhamphoryncus
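Both points in the message above (UTF-16 is not an ASCII superset, and a sequence of 16-bit units need not be valid Unicode) can be checked with the standard codecs. The sample name and the lone-surrogate byte string below are made up for illustration.

```python
name = "a.txt"

utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")

# UTF-8 keeps ASCII bytes as they are; UTF-16 interleaves NUL bytes.
print(utf8)    # b'a.txt'
print(utf16)   # b'a\x00.\x00t\x00x\x00t\x00'

# A lone high surrogate (0xD800) followed by 'a': representable as 16-bit
# units, but not valid UTF-16, so a strict decode refuses it.
lone_surrogate = b"\x00\xd8" + "a".encode("utf-16-le")
try:
    lone_surrogate.decode("utf-16-le")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False
```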
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Steve Holden writes: Ulrich Eckhardt writes: What I'd just like some feedback on is the approach to return a distinct type (neither a byte string nor a Unicode string) from readdir(). This is presumably unacceptable on the grounds that it will break existing code that does something more or less useful more or less some of the time. <wink> If you know what your filesystem produces, you can take the appropriate action to convert it into a type that makes sense to the user. Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-law <wink> like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation. The implementation issue is why you want bytes, but I don't think it is going to overcome the tide of (semantically-oriented) pragmatism. If you don't, then at least if you have the string in its bytes form you can re-present it to the filesystem to manipulate the file. What are we supposed to do with the special type? Trivially convert it back to bytes and re-present it to the filesystem, of course. I gather that the BDFL's line on this thread of discussion is that forcing programmers to think about encodings every time they call out to the OS is unacceptable when most programs will work acceptably almost all of the time with a rather naive approach. This means that almost all Python programs will be technically broken for the foreseeable future, sorry, Ulrich. And for the same pragmatic reasons, these functions are going to return strings (i.e., Unicode), not bytes, I expect. Sorry, Steve. What needs to be determined here is the best way to provide reliability to those who will go to the effort of asking for it if it's available. I don't think "just return bytes" fits the bill for the reason above.
What I would like to see is a type that is derived from string (so if you present it to an API expecting string, it is silently treated as string), but from which the original bytes can always be extracted on request. If the original bytes cannot be sensibly decoded to a string, then the string field in the object would either contain something that should normally cause an error in a string API, or some made-up string (presumably it would attempt to be a more or less faithful representation of the bytes) at the caller's option. Probably they'd also contain some metadata useful in guessing encodings (the read-time locale in particular). These objects probably shouldn't support string-like operations in a general way (i.e., maintaining both the string representation and the bytes correctly). Rather, using proper string operations on them would use the string content and produce strings. People who really want to handle mixed-encoding pathnames and the like would have to keep collections of these objects and handle them in an ad-hoc way. Unfortunately, implementing this is way beyond my skills and time availability.
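A rough sketch of the string-derived type described above might look like this. `OSName` and its attributes are hypothetical names chosen for illustration; the sketch covers only the "made-up string" option (via the `errors` argument) and the rule that ordinary string operations yield plain strings.

```python
class OSName(str):
    """Hypothetical str subclass that remembers the bytes it was decoded from."""

    def __new__(cls, raw, encoding="utf-8", errors="replace"):
        # errors="replace" gives the "made-up string" option;
        # errors="strict" would raise instead, at the caller's choice.
        text = raw.decode(encoding, errors)
        self = super().__new__(cls, text)
        self.raw = raw            # original bytes, always recoverable
        self.encoding = encoding  # metadata that may help guess encodings later
        return self


if __name__ == "__main__":
    name = OSName(b"caf\xe9.txt")       # Latin-1 bytes read in a UTF-8 locale
    print(isinstance(name, str))        # accepted by any API expecting str
    print(name.raw)                     # the bytes to hand back to the OS
    backup = name + ".backup"
    # String operations use the text content and return a plain str,
    # so the derived value no longer carries the original bytes.
    print(type(backup) is str)
```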
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-law <wink> like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation. With all the focus on backup tools and file managers I think we've lost perspective. They're an important use case, but hardly the dominant one. Please, as a user, if your app is creating new files, do NOT use bytes! You have no excuse for creating garbage, and garbage doesn't help the user any. Get the encoding right, use the unicode APIs, and don't pass the buck on to everything else. The fact that unicode is easier is a bonus for doing the right thing. -- Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-law <wink> like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation. With all the focus on backup tools and file managers I think we've lost perspective. They're an important use case, but hardly the dominant one. Please, as a user, if your app is creating new files, do NOT use bytes! You have no excuse for creating garbage, and garbage doesn't help the user any. Getting the encoding right, use the unicode APIs, and don't pass the buck on to everything else. Uhmmm... That's good advice but doesn't solve any problems :-(. No matter what I create, the filenames will be bytes when the next person reads them in. If my locale is shift-jis and the person I'm sharing the file with uses utf-8, things won't work. Even if my locale is utf-8 (since I come from a European nation) and their locale is utf-16 (because they're from an Asian nation) the Unicode API won't work. -Toshio
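The locale mismatch described above is easy to reproduce with the standard codecs. The filename below is an invented example; the point is only that bytes written under one locale encoding may be undecodable under another.

```python
name = "日本語.txt"

# What a program running in a Shift-JIS locale would write to disk.
on_disk = name.encode("shift_jis")

# What a reader in a UTF-8 locale gets when it tries to treat those bytes as text.
try:
    on_disk.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False

print(readable_as_utf8)                      # False
print(on_disk.decode("shift_jis") == name)   # True: only the right codec round-trips
```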
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thu, Dec 11, 2008 at 10:41 PM, Toshio Kuratomi a.bad...@gmail.com wrote: Adam Olsen wrote: On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org wrote: Unfortunately, even programmers experienced in I18N like Martin, and those with intuition-that-has-the-force-of-lawwink like Guido, express deliberate disbelief on this point. They say that filesystem names and environment variable values are text, which is true from the semantic viewpoint but can't be fully supported by any implementation. With all the focus on backup tools and file managers I think we've lost perspective. They're an important use case, but hardly the dominant one. Please, as a user, if your app is creating new files, do NOT use bytes! You have no excuse for creating garbage, and garbage doesn't help the user any. Getting the encoding right, use the unicode APIs, and don't pass the buck on to everything else. Uhmmm That's good advice but doesn't solve any problems :-(. No matter what I create, the filenames will be bytes when the next person reads them in. If my locale is shift-js and the person I'm sharing the file with uses utf-8 things won't work. Even if my locale is utf-8 (since I come from a European nation) and their locale is utf-16 (because they're from an Asian nation) the Unicode API won't work. So you'll open up the dir and find this collection: ??.txt .png ???.html .html ???.png ??.txt ??.txt ??.txt A half-broken setup is still a broken setup. Eventually you have to tell people to stop screwing around and pick one encoding. I doubt that UTF-16 is used very much (other than on windows). I haven't found any statistics on what distros use, but did find this one of the web itself: http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html I can't wait for next year's statistics. 
-- Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Thu, Dec 11, 2008 at 11:25 PM, Curt Hagenlocher c...@hagenlocher.org wrote: On Thu, Dec 11, 2008 at 10:19 PM, Adam Olsen rha...@gmail.com wrote: I doubt that UTF-16 is used very much (other than on windows). There's this other obscure platform called Java... ;) Sorry, I should have said "for interchange". :) (CPython doesn't use UTF-8 internally either. It uses UTF-16 or UTF-32.) -- Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: A half-broken setup is still a broken setup. Eventually you have to tell people to stop screwing around and pick one encoding. But it's not a broken setup. It's the way the world is because people share things with each other. I doubt that UTF-16 is used very much (other than on windows). I haven't found any statistics on what distros use, but did find this one of the web itself: http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html UTF-16 is popular in Asian locales for the same reason that shift-jis and big-5 are hanging in there: utf-8 takes many more bytes to encode Asian Unicode characters than utf-16. -Toshio
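The size argument above can be made concrete by counting encoded bytes. The sample strings are arbitrary; the sketch only compares lengths for CJK text against plain ASCII.

```python
cjk = "日本語"
ascii_name = "readme.txt"

for label, text in (("CJK", cjk), ("ASCII", ascii_name)):
    sizes = {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }
    print(label, sizes)

# Each of these CJK characters takes 3 bytes in UTF-8 but 2 in UTF-16 or Shift-JIS;
# for ASCII text the relation is reversed (1 byte versus 2).
cjk_utf8 = len(cjk.encode("utf-8"))         # 9
cjk_utf16 = len(cjk.encode("utf-16-le"))    # 6
cjk_sjis = len(cjk.encode("shift_jis"))     # 6
```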
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen wrote: As a data point, firefox (when pointed at my home dir) DOES skip over garbage files. That's not true. However, it looks like Firefox is actually broken. Take a look at this screenshot: firefox.png That shows a directory with a folder that's not decodable in my utf-8 locale. What's interesting to note is that I actually have two nondecodable folders there but only one of them showed up. So firefox is inconsistent with its treatment, rendering some non-decodable files and ignoring others. Also interesting, if you point your browser at: http://toshio.fedorapeople.org/u/ You should see two other test files. They're both (one-half)(enyei).html but one's encoded in utf-8 and the other in latin-1. Firefox has some bugs in it related to this. For instance, if you mouseover the two links you'll see that firefox displays the same symbolic names for each of the files (even though they're in two different encodings). Sometimes firefox is able to load both files and sometimes it only loads one of them. Firefox seems to be translating the characters from ASCII percent encoding of bytes into their unicode symbols and back to utf-8 in some circumstances related to whether it has the pages in its cache or not. In this case, it should be leaving things as percent encoded bytes as it's the only way that apache is going to know what to retrieve. -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
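The two test files mentioned above differ only in the bytes behind the same visible name, and percent-encoding is what keeps them apart in a URL. A small sketch with the standard library, assuming the name is one-half followed by n-tilde as described:

```python
from urllib.parse import quote, unquote_to_bytes

name = "½ñ.html"

# The same characters, stored under two different encodings.
url_utf8 = quote(name.encode("utf-8"))
url_latin1 = quote(name.encode("latin-1"))

print(url_utf8)     # %C2%BD%C3%B1.html
print(url_latin1)   # %BD%F1.html

# Only the percent-encoded bytes tell the server which file is meant.
print(unquote_to_bytes(url_latin1))   # b'\xbd\xf1.html'

# Reading the UTF-8 bytes as Latin-1 gives the four-character mojibake form.
mojibake = name.encode("utf-8").decode("latin-1")
print(mojibake)     # Â½Ã±.html
```

Re-encoding a decoded name as UTF-8 before requesting it would turn the second URL into the first, which is the failure mode the message attributes to the browser.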
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Tuesday 09 December 2008, Adam Olsen wrote: On Tue, Dec 9, 2008 at 11:31 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote: On Monday 08 December 2008, Adam Olsen wrote: At this point someone suggests we have a type that can store an arbitrary mix of unicode and bytes, so the undecodable portions stay in their original form. :P Well, not an arbitrary mix, but a type that just stores whatever comes from the system without further specifying it as either bytes or Unicode: * If you want a string for displaying it, you first have to extract a string from that thing and there you optionally specify the encoding and error behaviour. * If you want to append a string to it, it is automatically encoded in the default encoding, which obviously can fail. So the 2.x str, but with a more interesting default encoding than ASCII. It'll work fine on the developer's system, but one day a user will present it with strange input, and boom. If the system's representation of filenames can not represent a Unicode codepoint that the user entered, trying to open such a file must fail. If it can be represented, for convenience I would allow an implicit conversion. for i in readdir(): copy( i, i+.backup) ... You have to be pessimistic here. The default operations should either always work or never work. Using unicode internally and skipping garbage input means the operations always work. Using a bytes API means mixing with unicode never works, unless the programmer explicitly converts, in which case the onus is on them to use proper error handling. So, if I understand you correctly, you would prefer an explicit conversion to the system's representation: for i in readdir(): copy( i, i+path(.backup)) ... The only thing separating this from a bikeshed discussion is that a bikeshed has many equally good solutions, while we have no good solutions. Instead we're trying to find the least-bad one. The unicode/bytes separation is pretty close to that. Adding a warning gets even closer. 
Adding magic makes it worse. Well, I see two cases: 1. Converting from an uncertain representation to a known one. 2. Converting from a known representation to a known one. The uncertain one is the one used by the filesystem or environment. The known representations are the expected(!) encoding for filesystem and environment and the internal text in Unicode. For case 1, I would require an explicit conversion to make the programmer really aware of the fact that it can fail. For the second case, I would allow an implicit conversion even though it can fail. Anyhow, that is a matter of taste, and I can actually live with your point of view. However, one question still remains: What about the approach in general, i.e. that these texts with an uncertain representation are handled as a separate type? I find this much more appealing than duplicating APIs like readdir() using either overloading on the arguments or a separate readdirb(). Uli
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote: On Tuesday 09 December 2008, Adam Olsen wrote: The only thing separating this from a bikeshed discussion is that a bikeshed has many equally good solutions, while we have no good solutions. Instead we're trying to find the least-bad one. The unicode/bytes separation is pretty close to that. Adding a warning gets even closer. Adding magic makes it worse. Well, I see two cases: 1. Converting from an uncertain representation to a known one. 2. Converting from a known representation to a known one. Not quite: 1. Using a garbage file name locally (within a single process, not talking to any libs) 2. Using a unicode filename everywhere (libs, saved to config files, displayed to the user, etc.) Note that if you have a GUI doing the former, all you technically need is a placeholder like undecodable filename. You might try to extract some ASCII out of it, but that's just a minor bonus. On linux the bytes/unicode separation is perfect for this. You decide which approach you're using and use it consistently. If you mess up (mixing bytes and unicode) you'll consistently get an error. We currently don't follow this model on windows, so a garbage file name gets passed around as if it was unicode, but fails when passed to a lib, saved to a config file, is displayed to a user, etc. (Depending on the API, as many won't validate either.) -- Adam Olsen, aka Rhamphoryncus ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
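The "mix bytes and unicode and you consistently get an error" behaviour mentioned above can be observed directly in Python 3. The helper name `mixing_fails` is made up for this illustration.

```python
import os


def mixing_fails(operation):
    """Return True if the operation rejects a bytes/str mix with TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False


# Concatenating a bytes filename with a str suffix is refused...
print(mixing_fails(lambda: b"report" + ".backup"))            # True
# ...and so is joining path components of different types.
print(mixing_fails(lambda: os.path.join(b"/tmp", "report")))  # True
# Staying within one type works.
print(mixing_fails(lambda: b"report" + b".backup"))           # False
```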
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:

    try:
        files = os.listdir(somedir, errors='strict')
    except OSError as e:
        log("verbose error message that includes somedir and e")
        files = os.listdir(somedir)

Instead of a codecs error handler name, how about a callback for converting bytes to str?

    os.listdir(somedir, decoder=bytes.decode)
    os.listdir(somedir, decoder=lambda b: b.decode(preferredencoding, errors='xmlcharrefreplace'))
    os.listdir(somedir, decoder=repr)

ISTM that would be simpler and more flexible than going over the codecs registry. One caveat though is that there's no obvious way of telling listdir to skip a name. But if the default behaviour for decoder=None is to skip with a warning, then the need to explicitly ask for files to be skipped would be small. Terry's example would then be:

    try:
        files = os.listdir(somedir, decoder=bytes.decode)
    except UnicodeDecodeError as e:
        log("verbose error message that includes somedir and e")
        files = os.listdir(somedir)

- Anders
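The decoder-callback idea can be tried out today as a wrapper around the bytes form of os.listdir(). This is only a sketch: `listdir_decoded` is a hypothetical helper, os.listdir() itself has no `decoder` parameter, and the skip-with-warning default follows the suggestion in the message rather than any real API.

```python
import os
import sys
import warnings


def listdir_decoded(directory, decoder=None):
    """Hypothetical wrapper: list raw names, then apply a caller-supplied decoder."""
    names = []
    for raw in os.listdir(os.fsencode(directory)):
        if decoder is not None:
            # Any exception raised by the callback propagates to the caller.
            names.append(decoder(raw))
            continue
        try:
            names.append(raw.decode(sys.getfilesystemencoding()))
        except UnicodeDecodeError:
            # Suggested default: skip the name, but say so.
            warnings.warn("skipping undecodable name %r" % (raw,), UnicodeWarning)
    return names


if __name__ == "__main__":
    print(listdir_decoded("."))                        # skip-with-warning default
    print(listdir_decoded(".", decoder=repr))          # never fails, shows raw bytes
    print(listdir_decoded(".", decoder=bytes.decode))  # strict UTF-8, may raise
```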
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Glenn Linderman wrote: On approximately 12/8/2008 9:30 AM, came the following characters from the keyboard of [EMAIL PROTECTED]: PS: I'd like to see a similar warning issued when an access attempt is made through os.environ to a variable that cannot be decoded. And argv ? Seems like the warning technique could be useful for _any_ interface that has been traditionally bytes, because that's the kind of characters that were, but now should move to (Unicode) characters. The warnings could be the same, or very similar. The question is if one global control should handle all types of bytes problems, or if there should be individual controls for each bytes problem, or both. I tend to believe in both; the paranoid can set exactly the ones they've coded for, the aggressive can set the global one. In this manner, new cases can be added to the global settings over time, if more are discovered -- it should be documented to handle future similar issues in a similar manner. The warnings system provides that level of granularity for 'free' (so long as we set the stack level appropriately in the C-API warnings call). Cheers, Nick. -- Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia --- ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
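The granularity Nick refers to comes from warning categories forming a class hierarchy: a filter on a base class acts as the global switch, a filter on a subclass as the individual one. The category names below are hypothetical, chosen only to illustrate the mechanism.

```python
import warnings


class UndecodableNameWarning(UnicodeWarning):
    """Hypothetical base category: the one 'global' control."""


class UndecodableEnvironWarning(UndecodableNameWarning):
    """Hypothetical category for os.environ problems."""


class UndecodableFilenameWarning(UndecodableNameWarning):
    """Hypothetical category for listdir() problems."""


def individual_control():
    """Silence only the environ warnings; return the categories still reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        warnings.simplefilter("ignore", UndecodableEnvironWarning)
        warnings.warn("bad environment variable", UndecodableEnvironWarning)
        warnings.warn("bad filename", UndecodableFilenameWarning)
    return [w.category.__name__ for w in caught]


def global_control():
    """Turn every warning of the family into an exception via the base class."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", UndecodableNameWarning)
        try:
            warnings.warn("bad filename", UndecodableFilenameWarning)
        except UndecodableNameWarning:
            return True
    return False


if __name__ == "__main__":
    print(individual_control())   # ['UndecodableFilenameWarning']
    print(global_control())       # True
```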
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 2008-12-09 09:41, Anders J. Munch wrote:
> On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
>> try:
>>     files = os.listdir(somedir, errors='strict')
>> except OSError as e:
>>     log(<verbose error message that includes somedir and e>)
>>     files = os.listdir(somedir)
>
> Instead of a codecs error handler name, how about a callback for converting bytes to str?
>
>     os.listdir(somedir, decoder=bytes.decode)
>     os.listdir(somedir, decoder=lambda b: b.decode(preferredencoding, errors='xmlcharrefreplace'))
>     os.listdir(somedir, decoder=repr)
>
> ISTM that would be simpler and more flexible than going through the codecs registry.
>
> One caveat, though, is that there's no obvious way of telling listdir to skip a name. But if the default behaviour for decoder=None is to skip with a warning, then the need to explicitly ask for files to be skipped would be small.
>
> Terry's example would then be:
>
>     try:
>         files = os.listdir(somedir, decoder=bytes.decode)
>     except UnicodeDecodeError as e:
>         log(<verbose error message that includes somedir and e>)
>         files = os.listdir(somedir)

Well, this is not too far away from just putting the whole decoding logic into the application directly:

    files = [filename.decode(filesystemencoding, errors='warnreplace')
             for filename in os.listdir(dir)]

(or os.listdirb() if that's where the discussion is heading)

... and that also tells us something about this discussion: we're trying to come up with some magic to work around writing two lines of Python code.

I'd just have all the os APIs return bytes and leave whatever conversion to Unicode might be necessary to a higher-level API.

Think of it: you really only need the Unicode values if you ever want to output those values in text form somewhere. In those cases, it's usually a human reading a log file or screen output. Most other cases just care about getting some form of file identifier in order to open the file and don't really care about the encoding of the file name at all.

It's probably better to have two helper functions in the os module that take care of the conversion on demand, rather than trying to force this conversion even in cases where the application never really needs to write the filename somewhere, e.g. os.decodefilename() and os.encodefilename(). These should then provide some reasonable default logic, e.g. use a 'warnreplace' error handler. Applications are then free to use these converters or implement their own.

--
Marc-Andre Lemburg
eGenix.com
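The two helpers and the 'warnreplace' handler could look roughly like the sketch below. None of these names exist in the os or codecs modules; they are taken from the message above and implemented here under the assumption that 'warnreplace' means "emit a warning, then substitute U+FFFD". Only codecs.register_error, sys.getfilesystemencoding and warnings.warn are real library calls.

```python
import codecs
import sys
import warnings


def _warnreplace(exc):
    """Decoding error handler: warn, then replace the bad bytes with U+FFFD."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]
    warnings.warn("undecodable bytes %r in %r" % (bad, exc.object),
                  UnicodeWarning)
    # Return the replacement text and the position at which to resume.
    return ("\ufffd", exc.end)


# Register under the handler name used in the message (an assumed name).
codecs.register_error("warnreplace", _warnreplace)


def decodefilename(raw, encoding=None):
    """bytes -> str, warning about and replacing undecodable bytes."""
    return raw.decode(encoding or sys.getfilesystemencoding(),
                      errors="warnreplace")


def encodefilename(name, encoding=None):
    """str -> bytes, strict: an unencodable name raises UnicodeEncodeError."""
    return name.encode(encoding or sys.getfilesystemencoding())
```

Note that a name decoded this way no longer round-trips: once U+FFFD has been substituted, encodefilename cannot recover the original bytes, so the result is suitable for display and logging but not for reopening the file.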
Re: [Python-Dev] Python-3.0, unicode, and os.environ
* M.-A. Lemburg wrote:
> On 2008-12-09 09:41, Anders J. Munch wrote:
>> On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
>>> try:
>>>     files = os.listdir(somedir, errors = strict)
>>> except OSError as e:
>>>     log(verbose error message that includes somedir and e)
>>>     files = os.listdir(somedir)
>>
>> Instead of a codecs error handler name, how about a callback for converting bytes to str?
>>
>>     os.listdir(somedir, decoder=bytes.decode)
>>     os.listdir(somedir, decoder=lambda b: b.decode(preferredencoding, errors='xmlcharrefreplace'))
>>     os.listdir(somedir, decoder=repr)
>>
>> ISTM that would be simpler and more flexible than going over the codecs registry.
>>
>> One caveat though is that there's no obvious way of telling listdir to skip a name. But if the default behaviour for decoder=None is to skip with a warning, then the need to explicitly ask for files to be skipped would be small.
>>
>> Terry's example would then be:
>>
>>     try:
>>         files = os.listdir(somedir, decoder=bytes.decode)
>>     except UnicodeDecodeError as e:
>>         log(verbose error message that includes somedir and e)
>>         files = os.listdir(somedir)
>
> Well, this is not too far away from just putting the whole decoding logic into the application directly:
>
>     files = [filename.decode(filesystemencoding, errors='warnreplace')
>              for filename in os.listdir(dir)]
>
> (or os.listdirb() if that's where the discussion is heading)
>
> ... and that also tells us something about this discussion: we're trying to come up with some magic to work around writing two lines of Python code.
>
> I'd just have all the os APIs return bytes and leave whatever conversion to Unicode might be necessary to a higher level API.
> [...]

What I'm saying ;-) +1.

nd
Re: [Python-Dev] Python-3.0, unicode, and os.environ
M.-A. Lemburg wrote:
> Well, this is not too far away from just putting the whole decoding logic into the application directly:
>
>     files = [filename.decode(filesystemencoding, errors='warnreplace')
>              for filename in os.listdir(dir)]
>
> (or os.listdirb() if that's where the discussion is heading)

I see what you mean, and yes, I think os.listdirb will do just as well. There is no need for any extra parameters to os.listdir.

The typical application will just obliviously use os.listdir(dir) and get the default elide-and-warn behaviour for un-decodable names. That rare special application that needs more control can use os.listdirb and handle decoding itself.

Using a global registry of error handlers would just get in the way of an application that needs more control.

- Anders
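The elide-and-warn default described above can be sketched in a few lines. Both function names are hypothetical, and the choice of UnicodeWarning as the category is an assumption; the raw listing comes from calling os.listdir with a bytes path, which stands in for the os.listdirb discussed in the thread.

```python
import os
import sys
import warnings


def elide_undecodable(raw_names, encoding):
    """Decode each name; drop the ones that fail and warn about each of them."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            warnings.warn("skipping undecodable file name %r" % (raw,),
                          UnicodeWarning, stacklevel=2)
    return names


def listdir_elide(path):
    """str listing of `path` with undecodable names elided (hypothetical)."""
    raw_names = os.listdir(os.fsencode(path))  # bytes in, bytes out
    return elide_undecodable(raw_names, sys.getfilesystemencoding())
```

An application that prefers a hard failure can turn the warning into an exception once, with warnings.simplefilter("error", UnicodeWarning), instead of passing an argument on every call.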
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Monday 08 December 2008, Adam Olsen wrote:
> At this point someone suggests we have a type that can store an arbitrary mix of unicode and bytes, so the undecodable portions stay in their original form. :P

Well, not an arbitrary mix, but a type that just stores whatever comes from the system without further specifying it as either bytes or Unicode:

* If you want a string for displaying it, you first have to extract a string from that thing, and there you optionally specify the encoding and error behaviour.
* If you want to append a string to it, it is automatically encoded in the default encoding, which obviously can fail.
* Similarly, e.g. globbing is done on the underlying representation's level, so *.py will first have to be converted according to the default encoding.
* If you just print it, you will get something from which you can make out the decodable parts, but it will probably look like {Unicode:u'abcde'} or {bytes:b'ab\xf0\x0fcd'}.
* If you don't want to display it, but just want to pass it to the system, just use it as is.

Yes, this puts an inconvenience on application programmers who up to now always assumed that they received a list of strings from os.listdir(), but that's the way with false assumptions. In any case, they will be aware (from reading the docs) of what the problem is and why there is no way to return a text. Further, they will get tools to convert these paths or environment vars to texts, so it will simply be a matter of replacing os.listdir() with map(to_unicode, os.listdir()).

Uli
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Tue, Dec 9, 2008 at 11:31 AM, Ulrich Eckhardt [EMAIL PROTECTED] wrote:
> On Monday 08 December 2008, Adam Olsen wrote:
>> At this point someone suggests we have a type that can store an arbitrary mix of unicode and bytes, so the undecodable portions stay in their original form. :P
>
> Well, not an arbitrary mix, but a type that just stores whatever comes from the system without further specifying it as either bytes or Unicode:
> * If you want a string for displaying it, you first have to extract a string from that thing and there you optionally specify the encoding and error behaviour.
> * If you want to append a string to it, it is automatically encoded in the default encoding, which obviously can fail.

So the 2.x str, but with a more interesting default encoding than ASCII. It'll work fine on the developer's system, but one day a user will present it with strange input, and boom.

You have to be pessimistic here. The default operations should either always work or never work. Using unicode internally and skipping garbage input means the operations always work. Using a bytes API means mixing with unicode never works, unless the programmer explicitly converts, in which case the onus is on them to use proper error handling.

The only thing separating this from a bikeshed discussion is that a bikeshed has many equally good solutions, while we have no good solutions. Instead we're trying to find the least-bad one. The unicode/bytes separation is pretty close to that. Adding a warning gets even closer. Adding magic makes it worse.

--
Adam Olsen, aka Rhamphoryncus
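The "never works unless the programmer explicitly converts" half of that argument is easy to check in Python 3: bytes and str refuse to mix on every input, not only on the unusual ones. The variable names below are illustrative only.

```python
raw_name = b"report\xff"  # a file name as the OS delivered it

# Mixing bytes with str fails every time, on the developer's machine too,
# so the problem cannot lie dormant until a user supplies strange input.
try:
    raw_name + ".bak"
except TypeError as exc:
    mixing_error = exc

# An explicit conversion works; choosing the encoding and the error
# handling is then the programmer's responsibility.
explicit = raw_name + ".bak".encode("ascii")
```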
Re: [Python-Dev] Python-3.0, unicode, and os.environ
James Y Knight wrote:
> On Dec 9, 2008, at 6:04 AM, Anders J. Munch wrote:
>> The typical application will just obliviously use os.listdir(dir) and get the default elide-and-warn behaviour for un-decodable names. That rare special application
>
> I guess this is a new definition of "rare special application": an application which deals with user-specified files.
>
> This is the problem I see in having two parallel APIs: people keep saying "most applications can just go ahead and use the [broken] unicode string API". If there was a unicode API and a bytes API, but everyone was clear that "always use the bytes API" is the right thing to do, that'd be okay... But, since even python-dev members are saying that only a rare special app needs to care about working with users' existing files, I'm rather worried this API design will cause most programs written in python to be broken. Which seems a shame.

I agree with you, which was part of why I raised this subject, but I also think that using the warnings module to issue a warning and ignore the entire problematic entry is a reasonable compromise. Hopefully it will become obvious to people that it's a python3 wart at some point in the future and we'll re-examine the default. But until then, having a printed warning that individual apps can turn into an exception seems like it is less broken than the other alternatives the "rare special application" people can live with :-)

-Toshio
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Glenn Linderman writes:
> "significantly" seems to be the only word at question; it seems that there are a fair number of validation checks that could be performed; the numeric part of UTF-8 decoding is just a sequence of shifts, masks, and ORs, so can be coded pretty tightly in C or assembly language. Anything extra would be slower; how much slower is hard to predict prior to the implementation.

Not much, see my previous response.

> This also seems to be supported by Stephen's comment "That's a lot to ask, as it turns out."

Not what I meant. Inefficiency is not an objection to checking for validity at the level a codec can handle. The objection is that we don't want *any* exceptions thrown that we didn't explicitly ask for, and adding validation certainly will violate that.

> So I don't understand how this is responsive to the "decoding removes many insecurities" issue?

Because you have to recheck every time the data crosses from Python into your code. To the extent that Python codecs promise validation and keep that promise, internal code *never* has to make those checks. That is a significant savings in programmer effort, because auditing a large body of code for *any* I/O from Python is going to be costly.

> So when you examine a library for potential use, you have documentation or code to help you set your expectations about what it does, and whether or not it may have vulnerabilities, and whether or not those vulnerabilities are likely or unlikely, whether you can reduce the likelihood or prevent the vulnerabilities by wrapping the API, etc. And so you choose to use the library, or not.

Python is precisely such a component that people will choose to use, or not, based on whether they can expect that when Python hands them a Unicode object freshly input from the outside world, it won't contain lone surrogates, or invalid UTF-8 characters that got through a 3rd-party spam filter, or whatever.

> This whole discussion about libraries seems somewhat irrelevant to the question at hand,

No, it's the *only* point that matters. IMO, speed is not relevant here. The question is whether throwing a Unicode exception on invalid encoding by default generally does more good than harm. Guido seems to think "not!", which gives me pause. <wink> I still disagree, though.
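As an illustration of what "validation at the level a codec can handle" buys, the strict UTF-8 codec in Python 3 already rejects the malformed inputs mentioned here, so code downstream of a successful decode does not have to re-check for them. The helper name is hypothetical; the behaviour of bytes.decode is not.

```python
def is_valid_utf8(raw):
    """Return True if `raw` decodes under the strict UTF-8 codec."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# An encoded lone surrogate (U+D800) is rejected by the strict decoder.
lone_surrogate = b"\xed\xa0\x80"
# An overlong two-byte encoding of "/" is rejected as well.
overlong_slash = b"\xc0\xaf"
# Ordinary non-ASCII text passes.
plain_text = "Föcking".encode("utf-8")
```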
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Friday 05 December 2008, James Y Knight wrote:
> On Dec 5, 2008, at 5:27 AM, Ulrich Eckhardt wrote:
>> Using the byte variant is equally fubar, because e.g. on MS Windows it is not supported, except through a very lossy roundtrip through the locale's codepage, limiting your functionality.
>
> Yeah, IMO whole mess could have been avoided by keeping the filename/args/environ simply *bytes*, like it really is, on unix. Then, make the Windows version of python use (always! not dependent upon locale!) utf-8 to decode the utf-8 bytestring to the UTF-16 that the Windows platform APIs expect (and vice versa).

If possible, I would try to avoid this useless roundtrip from UTF-16 to UTF-8 and back.

> And never use the ASCII variant of the windows APIs.

That's okay, but I'm afraid it's not possible. The problem is not so much doing it, but finding all those places where it is currently done. Those could be outside of Python itself. So, even to Python code, there could still be APIs that would need the MBCS-encoded strings.

Uli
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On approximately 12/8/2008 12:57 AM, came the following characters from the keyboard of Stephen J. Turnbull:
> "Internal decoding" is (or should be) an oxymoron. Why would your software be passing around text in any format other than internal? So decoding will happen (a) on I/O, which is itself almost certainly slower than making a few checks for Unicode hygiene, or (b) on receipt of data from other software whose sanitation you shouldn't trust more than you trust the Internet. Encoding isn't a problem, AFAICS.

So I can see validating user-supplied data, which always comes in via I/O. But during manipulation of internal data, including file and database I/O, there is a need for encoding and decoding also. If all the data has already been validated, then there would be no need to revalidate on every conversion.

I hear you when you say that clever coding can make the validation nearly free, and I applaud that: the UTF-8 coder that I wrote predated most of the rules that have been created since, so I didn't attempt to be clever in that regard.

Thanks to you and Adam for your explanations; I see your points, and if it is nearly free, I withdraw most of my negativity on this topic.

--
Glenn
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Terry Reedy wrote:
> This to me is an argument for keeping the default the current behavior, but not for rejecting flexibility. The computing world seems to be messier than we would like and worse than I realized until this week. As you say below, people need to better anticipate the future, and an errors parameter would help do that.

It just occurred to me that this seems like a perfect situation to address via the warning system.

The normal warnings mechanics can then be used to turn it into an exception if so desired, and this can be done once per application rather than having to pass a separate argument every time the affected APIs are called.

And the decoding problems don't pass silently either - they just get emitted as a warning by default instead of causing the application to crash.

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sunday 07 December 2008, Guido van Rossum wrote:
> My problem with raising exceptions *by default* when an undecodable name exists is that it may render an app completely useless in a situation where the developer is no longer around. This happened all the time with the 2.x Unicode API, where the developer hadn't anticipated a particular input potentially containing non-ASCII bytes, and the user fed the application non-ASCII text.
>
> Making os.listdir raise an exception when a directory contains a single undecodable file means that the entire directory can't be read, and most likely the entire app crashes at that point. Most likely the developer never anticipated this situation (since in most places it is either impossible or very unlikely) -- after all, if they had anticipated it they would have used the bytes API in the first place.

There is another way to handle this that noisily signals errors but doesn't cause programs to suddenly fail. Using os.listdir as an example, the problem there is that the OS actually returns a list of strings that cannot be reliably decoded, so I would propose to simply not decode them.

Now, the idea is: what if this function returned neither a byte string nor a Unicode string, but e.g. an environment string type (called env_str)? os.listdir would only fail if it really failed to read the dir. If a user wants to display an element from the returned list, they would get something akin to what repr() returns, i.e. a recognisable string that can be written to a logfile. However, this thing will also include additional markup that makes it clear that it is not just a piece of text and not suitable for display to the end user.

This type distinction is important, because it means that any developer will immediately see that something unexpected is going on here. They will invoke type(lst[0]) and see the unexpected type env_str, which will (via documentation) point them to the issue with different encodings and tell them that all they have to do is 'map(unicode, lst)' in order to get a list of real text strings. They will also read that this operation might fail, forcing an informed decision.

If they don't care about a textual representation at all but only want to invoke os.popen with arguments received from the commandline, then everything is fine, too, because that function will take the strings as they are and just give them back to the OS. This allows roundtripping from the OS through Python and back to the OS without any conversions, and thus without any conversions that could fail. In the case of e.g. a backup program, this is exactly what is needed.

Now, if you have any hard-coded strings in your program but a function like os.popen needs an env_str object, this string is converted via a default encoding, i.e. the same one that is used when converting an env_str object to Unicode. In this case, I would go so far as to say that os.popen should accept normal str strings, too, and perform that conversion itself. An alternative would be to reject the string because it is the wrong type, but since this internal string's encoding is known, there is no reason to force users to convert explicitly; it is just that the conversion might fail.

The same applies when modifying such an env_str object, e.g. bak = sys.argv[1]+'.backup'. In this case, the string '.backup' is converted according to the default encoding and then appended to the commandline argument; the result would again be an env_str object.

Note: There is an option in this design, and that is to make the default behaviour in case of non-convertible env_str objects configurable. A filemanager would then replace the undecodable bytes by an approximation, a backup program would use strict mode, and a music player would perhaps simply skip and ignore such strings. The problem there is that changing this option would possibly affect other library code that one doesn't even know about, because it is only used indirectly and its implementation is unknown. For that reason, I would rather not make this policy a configurable element. If you want that, you can easily code it yourself.

BTW: there was a PEP that proposed a new path class, which was rejected. This class was actually pretty similar, except that it also included several other features (globbing, path handling, opening files and the kitchen sink) which eventually made it too bloated. Otherwise, the idea of creating a separate type for these strings is the same.

Uli
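A rough sketch of what such an env_str type might look like in Python 3 terms. Everything here is an assumption layered on the description above: the class name EnvStr, the to_text method, the repr markup and the use of the filesystem encoding as the "default encoding" are illustrative choices, not an existing API.

```python
import sys


class EnvStr:
    """Hypothetical 'environment string': raw OS bytes, decoded only on demand."""

    def __init__(self, raw):
        self._raw = bytes(raw)

    def __bytes__(self):
        # Passing the value back to the OS needs no conversion at all.
        return self._raw

    def __repr__(self):
        # Markup that makes it obvious this is not ordinary text.
        return "{env_str:%r}" % (self._raw,)

    def to_text(self, encoding=None, errors="strict"):
        # Explicit conversion to str; may raise UnicodeDecodeError.
        return self._raw.decode(encoding or sys.getfilesystemencoding(), errors)

    def __add__(self, other):
        if isinstance(other, EnvStr):
            return EnvStr(self._raw + other._raw)
        if isinstance(other, str):
            # Hard-coded text is encoded with the default encoding; this can fail.
            return EnvStr(self._raw + other.encode(sys.getfilesystemencoding()))
        return NotImplemented
```

The backup example from the message then reads as follows; the undecodable byte survives untouched, and text conversion is a separate, explicit step:

```python
arg = EnvStr(b"data\xff")            # e.g. what sys.argv[1] might hold
bak = arg + ".backup"                # still an EnvStr
print(repr(bak))                     # {env_str:b'data\xff.backup'}
print(arg.to_text("utf-8", "replace"))
```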
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 2008-12-06 01:48, Nick Coghlan wrote:
> You can't display a non-decodable filename to the user, hence the user will have no idea what they're working on. Non-filesystem related apps have no business trying to deal with insane filenames.

This is not entirely true: OSes, shells, and applications will typically represent the file names using either ?-replacements or some form of hex or decimal escapes for the characters they can't decode.

Since humans are usually very good at pattern recognition, this goes a long way.

Of course, how the application maps that partially converted file name back to the real thing is another issue, and that's something that Python should not make harder than it should be.

> Linux is moving towards a standard of UTF-8 for filenames, and once we get to the point where the idea of encoding filenames and environment variables any other way is seen as crazy, then the Python 3 approach will work seamlessly.

It's going to take a long time before file names, environment variables and command line parameters are all encoded using UTF-8, so "practicality beats purity" will have to get more attention in this thread. Python APIs should work out of the box most of the time.

Currently, if you live in a non-ASCII and non-pure-UTF-8 environment, you have to deal with different and mixed encodings on a regular basis. Whether that's a USB stick you're trying to read, a ZIP file you're trying to open, a mounted network drive, etc., the problem pops up in many different kinds of areas.

If I write

    do_something.py *

I expect Python to indeed work on all the files in my directory, not just the ones that happen to fit a particular encoding.

If I hook up a CGI script written in Python with a web server, I expect all data to be received by the script, not just data that happens to be UTF-8 encoded.

> In the meantime, raw bytes APIs will provide an alternative for those that disagree with that philosophy.

I think that's a wrong way to put it: the problems are not made up by people who disagree with the one-encoding-for-everything strategy.

The problems occur in real-life IT processing all the time - maybe not so much in places where English scripts dominate, but certainly in most other places with non-English scripts.

--
Marc-Andre Lemburg
eGenix.com
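The two display conventions mentioned above ("?"-style replacement and hex escapes) map onto existing codec error handlers, sketched here. The function names are hypothetical. Two caveats: "backslashreplace" is only accepted for decoding from Python 3.5 on, and "surrogateescape" (PEP 383, Python 3.1) postdates this thread; it is included only to show one way the mapping back to the real name can be made lossless.

```python
def display_name(raw, encoding="utf-8", style="escape"):
    """Return a printable approximation of a possibly undecodable file name."""
    # "escape" keeps the offending bytes visible as \xNN; anything else
    # substitutes U+FFFD, the Unicode counterpart of a "?" placeholder.
    errors = "backslashreplace" if style == "escape" else "replace"
    return raw.decode(encoding, errors=errors)


def lossless_text(raw, encoding="utf-8"):
    """Decode so that encoding with the same handler restores the exact bytes."""
    return raw.decode(encoding, errors="surrogateescape")


latin1_name = b"r\xe9sum\xe9.txt"  # Latin-1 bytes, not valid UTF-8
```

Here display_name(latin1_name) keeps each bad byte readable as an escape, while lossless_text(latin1_name).encode("utf-8", "surrogateescape") gives back the original bytes, so the file can still be opened.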
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, 7 Dec 2008 at 13:33, Guido van Rossum wrote:
> My problem with raising exceptions *by default* when an undecodable name exists is that it may render an app completely useless in a situation where the developer is no longer around. This happened all

I think Nick Coghlan's suggestion of emitting warnings would be an excellent solution that addresses both your concerns and the concerns Toshio has expressed (and with which I agree 100%).

The above is the only use case I've heard in this thread for ignoring files with names that can't be decoded: so that a user can use the program on those files whose names can be decoded even when the user does not have the resources to get the program fixed to handle undecodable filenames. I agree that that is a worthwhile goal. If warnings were emitted, then files would not be silently ignored, yet the program could still be used.

--RDM

PS: I'd like to see a similar warning issued when an access attempt is made through os.environ to a variable that cannot be decoded.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan [EMAIL PROTECTED] wrote:
> - I think the binary and Unicode APIs should be available (and fully functional) on all platforms (including Windows) so that app developers don't create portability problems for themselves when they make the decision as to which API to use

+1

I'm perhaps biased here; most of my Python programs don't have user interfaces, because they don't talk to people, they talk to other programs. The binary APIs for the OS are essential. I use and deeply appreciate all the string handling features in Python, particularly its firm grip on Unicode issues, but that's *useful* instead of *essential*.

Bill
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan wrote:
> Terry Reedy wrote:
>> This to me is an argument for keeping the default the current behavior, but not for rejecting flexibility. The computing world seems to be messier than we would like and worse than I realized until this week. As you say below, people need to better anticipate the future, and an errors parameter would help do that.
>
> It just occurred to me that this seems like a perfect situation to address via the warning system.

I disagree.

> The normal warnings mechanics can then be used to turn it into an exception if so desired, and this can be done once per application rather than having to pass a separate argument every time the affected APIs are called.

The warning mechanism, as far as I know (I have never dealt with it and do not want to), is for version issues. In any case, the snippet that you clipped

    try:
        files = os.listdir(somedir, errors='strict')
    except OSError as e:
        log(<verbose error message that includes somedir and e>)
        files = os.listdir(somedir)

specifically requires a per-call parameter.

> And the decoding problems don't pass silently either - they just get emitted as a warning by default instead of causing the application to crash.

Do they get automatically logged? In any case, the errors parameter has an in-between option to neither ignore nor raise but to replace and give *something* printable.

This situation seems like an ideal one for a parameter which gives the application programmer who uses Python a range of options for working with an un-ideal world. I am really flabbergasted that there is so much opposition to doing so, in favor of more difficult or less functional alternatives.

Terry Jan Reedy
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
> Guido van Rossum wrote:
>> On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy [EMAIL PROTECTED] wrote:
>>> Toshio Kuratomi wrote:
>>>> - If this is true, a definition of os.listdir(type 'str') that would
>>>> better meet programmer expectation would be: Give me all files in a
>>>> directory with the output as str type. The definition of
>>>> os.listdir(type 'bytes') would be Give me all files in a directory
>>>> with the output as bytes type. Raising an exception when the
>>>> filenames are undecodable is perfectly reasonable in this situation.
>>>
>>> Your examples (snipped) pretty well convince me that there is a use
>>> case for raising exceptions. We should move beyond arguing over which
>>> one way is right. I think there should be a second argument
>>> 'ignorebad=False' to ignore undecodable files rather than raise the
>>> exception (or 'strict=True' to stop and raise exception on
>>> non-decodable names -- then code is 'if strict: raise ...'). I believe
>>> other functions have a similar parameter.
>
> I was thinking of the normal Unicode 'errors' parameter, as described by
> Nick.
>
>> If you want the exceptions, just use the bytes API and try to decode
>> the byte strings using the system encoding.
>
> If it was a matter of adding a new method, I might agree. But:
>
> 1. We already have a method that does exactly what you describe. It is
>    only a matter of adding flexibility to the response to problems, for
>    which there is already precedent.
> 2. Suggesting that people who want strings and not bytes should have to
>    deal with bytes, just to get an error notification, seems to negate
>    the point of moving to 3.0.
> 3. A builtin would probably do so better than most programmers would,
>    with little touches such as the one suggested below.
> 4. An error parameter would ALERT programmers to the possibility of a
>    PROBLEM, both in the present and future. As you say below, people
>    need to better anticipate the future.
>
>> My problem with raising exceptions *by default* when an undecodable
>> name exists is that it may render an app completely useless in a
>> situation where the developer is no longer around. This happened all
>> the time with the 2.x Unicode API, where the developer hadn't
>> anticipated a particular input potentially containing non-ASCII bytes,
>> and the user fed the application non-ASCII text. Making os.listdir
>> raise an exception when a directory contains a single undecodable file
>> means that the entire directory can't be read, and most likely the
>> entire app crashes at that point. Most likely the developer never
>> anticipated this situation (since in most places it is either
>> impossible or very unlikely) -- after all, if they had anticipated it
>> they would have used the bytes API in the first place. (It's worse
>> because the exception being raised would be UnicodeError -- most
>> people expect os.listdir to raise OSError, not other errors.)
>
> This to me is an argument for keeping the default the current behavior,
> but not for rejecting flexibility. The computing world seems to be
> messier than we would like and worse than I realized until this week.
> As you say below, people need to better anticipate the future, and an
> errors parameter would help do that.

I'm fine with whatever API enhancements you can come up with (assuming
others like them too :-) as long as the default remains the current
behavior.

> Is Windows really immune? What about when it reads the directory of
> possibly old removable media with whatever byte name encodings? Is this
> a possible source of 'unanticipated' problems?
>
> As to your last sentence, os.listdir() with an errors parameter could
> convert a decoding UnicodeError to OSError: undecodable file name
> ascii+hex repr, thereby supplying the expected exception as well as an
> extractable representation of the problematic raw bytes.
>
> Here is a possible use case: I want filenames as 3.0 strings and I
> anticipate no problems at present but, as you say above, something might
> happen years in the future. I am using 3.0 *because* of the strings ==
> unicode feature. I would like to write
>
>     try:
>         files = os.listdir(somedir, errors='strict')
>     except OSError as e:
>         log(...)  # verbose error message that includes somedir and e
>         files = os.listdir(somedir)
>
> and go on without the problem file but not without logging the problem,
> so a future maintainer can consider what to do about it, but only when
> there is an actual need to think about it.
>
> Terry Jan Reedy

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
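The per-call behaviour asked for in the use case above can be approximated on top of the bytes API. The following is only a sketch under stated assumptions: `decode_names` and `listdir_text` are hypothetical helpers invented for illustration, not part of `os`, and the choice to re-raise as OSError follows the suggestion in the message rather than any existing behaviour.

```python
import os
import sys


def decode_names(raw_names, encoding=None, errors="strict"):
    """Decode raw directory entries (hypothetical helper, not an os API).

    With errors="strict", an undecodable name is reported as OSError
    carrying the repr of the raw bytes, instead of as UnicodeError.
    Any other value of errors is handed to bytes.decode unchanged.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding, errors))
        except UnicodeDecodeError as exc:
            raise OSError("undecodable file name %r" % raw) from exc
    return names


def listdir_text(path, errors="strict"):
    """List path as str names by going through the bytes API."""
    return decode_names(os.listdir(os.fsencode(path)), errors=errors)
```

A caller could then wrap `listdir_text(somedir)` in `try`/`except OSError`, log the problem, and retry with `errors="replace"` or `errors="ignore"`.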
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, Dec 8, 2008 at 10:34 AM, [EMAIL PROTECTED] wrote: On Mon, 8 Dec 2008 at 13:16, Terry Reedy wrote: And the decoding problems don't pass silently either - they just get emitted as a warning by default instead of causing the application to crash. Do they get automatically logged? In any case, the errors parameter has an in between option to neither ignore or raise but to replace and give *something* printable. This situation seems like an ideal situation for a parameter which gives the application program who uses Python a range of options to working with an un-ideal world. I am really flabbergasted why there is so much opposition to doing so in favor of more difficult or less functional alternatives. I'm in favor of an option to control what happens. I just really really don't want the _default_ to be ignore. Defaulting to a warning is fine with me, as would be defaulting to a traceback. But defaulting to silently ignore, as we have now, is just asking for user confusion and debugging headaches, as detailed by Toshio. A _worse_ user experience, IMO, than having a program fail when undecodable filenames match the selection criteria. Do you really not care about the risk where apps that weren't written to be prepared to handle this will be rendered completely useless if a single file in a directory has an unencodable name? This is similar to an issue that Python had for a long time where it wouldn't start up if the current directory contained non-ASCII characters. Given that most developers will not have this issue in their own environment, most apps will not be prepared for this issue, and that makes it worse for the app's user! -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote: On Mon, Dec 8, 2008 at 10:34 AM, [EMAIL PROTECTED] wrote: On Mon, 8 Dec 2008 at 13:16, Terry Reedy wrote: And the decoding problems don't pass silently either - they just get emitted as a warning by default instead of causing the application to crash. Do they get automatically logged? In any case, the errors parameter has an in between option to neither ignore or raise but to replace and give *something* printable. I just really really don't want the _default_ to be ignore. Defaulting to a warning is fine with me, as would be defaulting to a traceback. Do you really not care about the risk where apps that weren't written to be prepared to handle this will be rendered completely useless if a single file in a directory has an unencodable name? Since when do warnings cause apps to be rendered completely useless? I think it's easy to agree that defaulting to an exception is not good for the reason you give, but I don't see how that applies to a warning. And, it seems like a warning covers the issues that the other people want as well. If there is a warning, then there is at least a record of the fact that some filenames were ignored. Presumably if I was responsible for the correctness of some piece of code, I would see the warning in a log of some sort and could investigate it further (if I cared), otherwise I could choose to ignore it. I don't see os.listdir(name) to be one of those situations that emitting a warning is a nuisance at all. -Scott -- Scott Dial [EMAIL PROTECTED] [EMAIL PROTECTED] ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
> I'm perhaps biased here; most of my Python programs don't have user
> interfaces, because they don't talk to people, they talk to other
> programs. The binary APIs for the OS are essential. I use and deeply
> appreciate all the string handling features in Python, particularly its
> firm grip on Unicode issues, but that's *useful* instead of *essential*.

Exactly! Another +1.

Larry
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:
> On Mon, Dec 8, 2008 at 10:34 AM, [EMAIL PROTECTED] wrote:
>> I'm in favor of an option to control what happens. I just really really
>> don't want the _default_ to be ignore. Defaulting to a warning is fine
>> with me, as would be defaulting to a traceback. But defaulting to
>> silently ignore, as we have now, is just asking for user confusion and
>> debugging headaches, as detailed by Toshio. A _worse_ user experience,
>> IMO, than having a program fail when undecodable filenames match the
>> selection criteria.
>
> Do you really not care about the risk where apps that weren't written to
> be prepared to handle this will be rendered completely useless if a
> single file in a directory has an unencodable name? This is similar to
> an issue that Python had for a long time where it wouldn't start up if
> the current directory contained non-ASCII characters.

No, I do care. In another message I agreed with you that having the app
not fail was a reasonable goal. What I'm saying is that having it fail
_silently_ by ignoring the undecodable files is bad. And not picking up a
file that matches some selection criteria (ex: *.py) because it is
undecodable is a _failure_, in my opinion, that is _worse_ than getting a
traceback because there's an undecodable file in the directory.

But I'm happy with just issuing a warning by default. That would mean it
doesn't fail silently, but neither does it crash. Seems like the best
compromise with the broken nature of the real world IT environment.

> Given that most developers will not have this issue in their own
> environment, most apps will not be prepared for this issue, and that
> makes it worse for the app's user!

It is exactly because most developers won't have the issue in their own
environment that ignoring files silently is a problem. If they did, they'd
fix their code before it went out the door. Since they don't, when their
code is used by somebody in a mixed encoding environment, the programs
_will_ fail by ignoring files that they should process. The question, it
seems to me, is do they fail silently and mysteriously by failing to
process files they are supposed to, or do they fail with at least a little
bit of noise?

--RDM
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, Dec 8, 2008 at 12:07 PM, [EMAIL PROTECTED] wrote:
> On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:
>> On Mon, Dec 8, 2008 at 10:34 AM, [EMAIL PROTECTED] wrote:
>>> I'm in favor of an option to control what happens. I just really
>>> really don't want the _default_ to be ignore. Defaulting to a warning
>>> is fine with me, as would be defaulting to a traceback. But defaulting
>>> to silently ignore, as we have now, is just asking for user confusion
>>> and debugging headaches, as detailed by Toshio. A _worse_ user
>>> experience, IMO, than having a program fail when undecodable filenames
>>> match the selection criteria.
>>
>> Do you really not care about the risk where apps that weren't written
>> to be prepared to handle this will be rendered completely useless if a
>> single file in a directory has an unencodable name? This is similar to
>> an issue that Python had for a long time where it wouldn't start up if
>> the current directory contained non-ASCII characters.
>
> No, I do care. In another message I agreed with you that having the app
> not fail was a reasonable goal. What I'm saying is that having it fail
> _silently_ by ignoring the undecodable files is bad. And not picking up
> a file that matches some selection criteria (ex: *.py) because it is
> undecodable is a _failure_, in my opinion, that is _worse_ than getting
> a traceback because there's an undecodable file in the directory.
>
> But I'm happy with just issuing a warning by default. That would mean it
> doesn't fail silently, but neither does it crash. Seems like the best
> compromise with the broken nature of the real world IT environment.

OK, I can live with that too.

>> Given that most developers will not have this issue in their own
>> environment, most apps will not be prepared for this issue, and that
>> makes it worse for the app's user!
>
> It is exactly because most developers won't have the issue in their own
> environment that ignoring files silently is a problem. If they did,
> they'd fix their code before it went out the door. Since they don't,
> when their code is used by somebody in a mixed encoding environment, the
> programs _will_ fail by ignoring files that they should process. The
> question, it seems to me, is do they fail silently and mysteriously by
> failing to process files they are supposed to, or do they fail with at
> least a little bit of noise?

A warning is fine. Whether the app *fails* or *succeeds* when the warning
is issued depends on what the app is trying to do and what the user
expects. There certainly are valid use cases for both, but I expect that
succeeding noisily is going to be at least as common as failing (in the
sense of not doing the right thing, not necessarily crashing) noisily.
This is an improvement over always crashing.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 2008-12-08 19:26, Guido van Rossum wrote:
> On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
>> Here is a possible use case: I want filenames as 3.0 strings and I
>> anticipate no problems at present but, as you say above, something
>> might happen years in the future. I am using 3.0 *because* of the
>> strings == unicode feature. I would like to write
>>
>>     try:
>>         files = os.listdir(somedir, errors='strict')
>>     except OSError as e:
>>         log(...)  # verbose error message that includes somedir and e
>>         files = os.listdir(somedir)
>>
>> and go on without the problem file but not without logging the problem,
>> so a future maintainer can consider what to do about it, but only when
>> there is an actual need to think about it.

If that error parameter is the same as in unicode(value, errors), then
this would be a useful feature:

People could then choose among the already existing error handlers
('strict', 'ignore', 'replace', 'xmlcharrefreplace') or register their own
ones via the codecs module.

Such application specific error handlers could then also apply whatever
fancy round-trip safe encoding of non-decodable bytes to Unicode escapes,
private code points, etc. as seen fit by the application.

Perhaps we should also add an 'encoding' parameter that can be set on a
per-directory basis (if necessary) and defaults to the global file system
encoding.

If an application hits a directory that is known to cause problems, it
could then choose to receive the file names in a different, more suitable
encoding. This allows implementing fallback mechanisms with a list of
common encodings for a locale.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Dec 08 2008)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

2008-12-02: Released mxODBC.Connect 1.0.0      http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free !

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
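The fallback mechanism over a list of candidate encodings, as suggested in the message above, could look roughly like this. It is a sketch only: `decode_with_fallbacks` is a hypothetical name, the default encoding list is an arbitrary illustration, and the function works on a raw byte name rather than on a real `os.listdir` parameter, which does not exist.

```python
def decode_with_fallbacks(raw_name, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order (hypothetical helper).

    Returns (text, encoding_used) for the first encoding that decodes
    the name strictly. Note that latin-1 accepts every byte value, so
    placing it last turns it into a catch-all.
    """
    for encoding in encodings:
        try:
            return raw_name.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError("no candidate encoding fits %r" % raw_name)


# A UTF-8 name matches the first candidate, a Latin-1 name the second.
print(decode_with_fallbacks(b"Andr\xc3\xa9"))
print(decode_with_fallbacks(b"Andr\xe9"))
```

The order of the list matters: a strict multi-byte encoding should come before permissive single-byte ones, otherwise the permissive codec silently wins.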
Re: [Python-Dev] Python-3.0, unicode, and os.environ
M.-A. Lemburg mal at egenix.com writes:
> Such application specific error handlers could then also apply whatever
> fancy round-trip safe encoding of non-decodable bytes to Unicode escapes,
> private code points, etc. as seen fit by the application.

I'd argue that such a fancy round-trip safe error handler should be
provided by Python. It's not reasonable to expect application coders to
come up with their own codec variation based on subtle details of the
unicode spec.

Regards

Antoine.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 2008-12-08 21:45, Antoine Pitrou wrote:
> M.-A. Lemburg mal at egenix.com writes:
>> Such application specific error handlers could then also apply whatever
>> fancy round-trip safe encoding of non-decodable bytes to Unicode
>> escapes, private code points, etc. as seen fit by the application.
>
> I'd argue that such fancy round-trip safe error handler should be
> provided by Python. It's not reasonable to expect application coders to
> come up with their own codec variation based on subtle details of the
> unicode spec.

Fair enough. We could add some, e.g.

 * a round-trip safe escape error handler that uses a Unicode private
   code point area which we officially reserve for the Python interpreter

 * a human readable escape error handler that encodes the problem bytes
   to say hex escapes, e.g. gives Andr\xe9 for a Latin-1 encoded
   directory name instead of failing

 * a warning error handler that replaces the problem cases with a
   question mark and issues a warning through the warning framework

-- 
Marc-Andre Lemburg
eGenix.com
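Two of the handlers listed above can be sketched with `codecs.register_error`, which is the existing hook for custom error handlers. The handler names `hexescape` and `warnreplace` are assumptions made for illustration (the second name is borrowed from a later reply in this thread); neither is a built-in handler, and the use of `UnicodeWarning` as the category is likewise just one possible choice.

```python
import codecs
import warnings


def hexescape(exc):
    """Decode-error handler: render each undecodable byte as \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]
    return "".join("\\x%02x" % byte for byte in bad), exc.end


def warnreplace(exc):
    """Decode-error handler: substitute '?' and emit a warning."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    warnings.warn("undecodable bytes %r replaced" % bad, UnicodeWarning)
    return "?" * len(bad), exc.end


# Hypothetical handler names, registered for this process only.
codecs.register_error("hexescape", hexescape)
codecs.register_error("warnreplace", warnreplace)

# A Latin-1 encoded name decoded as UTF-8 keeps its bytes visible.
print(b"Andr\xe9".decode("utf-8", "hexescape"))
```

A handler receives the exception object and returns the replacement text together with the position at which decoding should resume, so the rest of the name is still decoded normally.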
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Terry Reedy wrote:
> Nick Coghlan wrote:
>> Terry Reedy wrote:
>>> This to me is an argument for keeping the default the current
>>> behavior, but not for rejecting flexibility. The computing world seems
>>> to be messier than we would like and worse than I realized until this
>>> week. As you say below, people need to better anticipate the future,
>>> and an errors parameter would help do that.
>>
>> It just occurred to me that this seems like a perfect situation to
>> address via the warning system.
>
> I disagree.
>
>> The normal warnings mechanics can then be used to turn it into an
>> exception if so desired, and this can be done once per application
>> rather than having to pass a separate argument every time the affected
>> APIs are called.
>
> The warning mechanism, as far as I know, because I have never dealt with
> it (and do not want to) is for version issues.

No, it's just DeprecationWarning in particular that is specific to
versioning issues. That's obviously the one that comes up most often for
core development, but there are other warnings as well (e.g. the
off-by-default ImportWarning when potential packages are skipped because
__init__.py is missing).

For this particular case, I would suggest adding something like
EnvironmentWarning (to parallel the EnvironmentError that is the common
parent of OSError and IOError).

> In any case, the snippet that you clipped
>
>     try:
>         files = os.listdir(somedir, errors='strict')
>     except OSError as e:
>         log(...)  # verbose error message that includes somedir and e
>         files = os.listdir(somedir)
>
> specifically requires a per call parameter.

True, but the decision to have errors=warn as the default behaviour is
independent of the decision of whether or not to allow the behaviour to be
changed on a case-by-case basis. There is nothing stopping us from doing
both.

>> And the decoding problems don't pass silently either - they just get
>> emitted as a warning by default instead of causing the application to
>> crash.
>
> Do they get automatically logged?

By default warnings are written to sys.stderr. Whether that gets logged or
not will depend on the nature of the application. There are also
mechanisms in warnings that allow an application to override the handling
of warnings (and for 2.7/3.1, there are mechanisms in logging to make it
easy to hook the warning system and the logging system together, so that
warnings are automatically logged).

> In any case, the errors parameter has an in between option to neither
> ignore or raise but to replace and give *something* printable.

That's true, and why I would actually support doing both. Adding the
warning is a more pressing need though, since it is what will prevent the
errors from passing silently in the default case.

> This situation seems like an ideal situation for a parameter which gives
> the application program who uses Python a range of options to working
> with an un-ideal world. I am really flabbergasted why there is so much
> opposition to doing so in favor of more difficult or less functional
> alternatives.

A warning will stop the failure from passing silently in the default
case - that's solving a different problem to the one that the error
handling argument will solve. I do agree that being able to override the
handling on a per-call basis could be a useful feature.

Cheers,
Nick.

-- 
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
---
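The "warn by default, escalate once per application" behaviour discussed above can be illustrated with the standard `warnings` filters. This is a sketch under assumptions: `UndecodableNameWarning` and `decode_or_warn` are hypothetical names (the message floats `EnvironmentWarning`, which is not used here because it is only a proposal), and the function decodes a list of raw names rather than reading a directory.

```python
import warnings


class UndecodableNameWarning(UserWarning):
    """Hypothetical warning category, defined only for this sketch."""


def decode_or_warn(raw_names, encoding="utf-8"):
    """Decode names; skip undecodable ones, warning about each."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            warnings.warn("skipping undecodable file name %r" % raw,
                          UndecodableNameWarning, stacklevel=2)
    return names


# Application-level choice: turn the warning into an exception.
# catch_warnings() restores the previous filters on exit.
with warnings.catch_warnings():
    warnings.simplefilter("error", UndecodableNameWarning)
    try:
        decode_or_warn([b"Andr\xe9"])
    except UndecodableNameWarning as exc:
        print("escalated to an exception:", exc)
```

Without the "error" filter the same call returns the decodable names and reports the skipped one on sys.stderr; `logging.captureWarnings(True)` can route such warnings into the logging system instead.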
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, Dec 8, 2008 at 1:45 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:
> M.-A. Lemburg mal at egenix.com writes:
>> Such application specific error handlers could then also apply whatever
>> fancy round-trip safe encoding of non-decodable bytes to Unicode
>> escapes, private code points, etc. as seen fit by the application.
>
> I'd argue that such fancy round-trip safe error handler should be
> provided by Python. It's not reasonable to expect application coders to
> come up with their own codec variation based on subtle details of the
> unicode spec.

Except they're clearly NOT part of the unicode spec. Moreover, whatever
tricks you use vary depending on if your garbage input is from UTF-8,
UTF-16, or UTF-32 (or any other arbitrary encoding, like CP-1252 or
Shift-JIS.)

At this point someone suggests we have a type that can store an arbitrary
mix of unicode and bytes, so the undecodable portions stay in their
original form. :P

-- 
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Adam Olsen rhamph at gmail.com writes:
> Except they're clearly NOT part of the unicode spec.

This is always the same discussion going in circles. I know they're not
part of the unicode spec, but practicality beats purity and if the said
error handler comes with an appropriate warning in the official doc, then
why not?

In any case, +1 to Marc-André's proposal.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, Dec 8, 2008 at 2:01 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
> On 2008-12-08 21:45, Antoine Pitrou wrote:
>> M.-A. Lemburg mal at egenix.com writes:
>>> Such application specific error handlers could then also apply
>>> whatever fancy round-trip safe encoding of non-decodable bytes to
>>> Unicode escapes, private code points, etc. as seen fit by the
>>> application.
>>
>> I'd argue that such fancy round-trip safe error handler should be
>> provided by Python. It's not reasonable to expect application coders to
>> come up with their own codec variation based on subtle details of the
>> unicode spec.
>
> Fair enough. We could add some e.g.
>
>  * a round-trip safe escape error handler that uses a Unicode private
>    code point area which we officially reserve for the Python
>    interpreter

This would of course alter the behaviour of those private code points,
preventing them from round-tripping properly.

I don't think round-tripping can be done from an error handler. You need a
full codec to do it. A simple option is 8859-1. Or, ya know, bytes. This
has long since gotten repetitive.

>  * a human readable escape error handler that encodes the problem bytes
>    to say hex escapes, e.g. gives Andr\xe9 for a Latin-1 encoded
>    directory name instead of failing

Similar to 'ö'.encode('ascii', 'backslashreplace')? I'm +1 on making that
work.

>  * a warning error handler that replaces the problem cases with a
>    question mark and issues a warning through the warning framework

I dub thee errors='warnreplace'.

-- 
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote:
> On Mon, Dec 8, 2008 at 12:07 PM, [EMAIL PROTECTED] wrote:
>> On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:
>> But I'm happy with just issuing a warning by default. That would mean
>> it doesn't fail silently, but neither does it crash. Seems like the
>> best compromise with the broken nature of the real world IT
>> environment.
>
> OK, I can live with that too.

Same here. This lets the application specify globally what should happen
(exception, warning, ignore via the warnings filters) and should give
enough context that it doesn't become a mysterious error in the program.

The per-method addition of an errors argument, so that this is overridable
locally as well as globally, is also a nice touch but can be done
separately from this step.

-Toshio
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 2008-12-08 22:32, Adam Olsen wrote:
> On Mon, Dec 8, 2008 at 2:01 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
>> On 2008-12-08 21:45, Antoine Pitrou wrote:
>>> M.-A. Lemburg mal at egenix.com writes:
>>>> Such application specific error handlers could then also apply
>>>> whatever fancy round-trip safe encoding of non-decodable bytes to
>>>> Unicode escapes, private code points, etc. as seen fit by the
>>>> application.
>>>
>>> I'd argue that such fancy round-trip safe error handler should be
>>> provided by Python. It's not reasonable to expect application coders
>>> to come up with their own codec variation based on subtle details of
>>> the unicode spec.
>>
>> Fair enough. We could add some e.g.
>>
>>  * a round-trip safe escape error handler that uses a Unicode private
>>    code point area which we officially reserve for the Python
>>    interpreter
>
> This would of course alter the behaviour of those private code points,
> preventing them from round-tripping properly.
>
> I don't think round-tripping can be done from an error handler. You
> need a full codec to do it. A simple option is 8859-1. Or, ya know,
> bytes. This has long since gotten repetitive.

The error handler would just map the problem bytes to the private area.
The application would then have to decide what to do with them, i.e. the
error handler only provides one half of the round-tripping.

And that's on purpose: I don't believe we can come up with some magic
solution for the encodings problem. This is essentially something that
applications will have to solve on a case-by-case basis.

>>  * a human readable escape error handler that encodes the problem
>>    bytes to say hex escapes, e.g. gives Andr\xe9 for a Latin-1 encoded
>>    directory name instead of failing
>
> Similar to 'ö'.encode('ascii', 'backslashreplace')? I'm +1 on making
> that work.

Yes.

>>  * a warning error handler that replaces the problem cases with a
>>    question mark and issues a warning through the warning framework
>
> I dub thee errors='warnreplace'.

Yep, something along those lines.

Perhaps there are more and better alternatives. These suggestions are just
to show how the idea could be put to some real-life use.

-- 
Marc-Andre Lemburg
eGenix.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, Dec 8, 2008 at 1:12 PM, Guido van Rossum [EMAIL PROTECTED] wrote:
> On Mon, Dec 8, 2008 at 12:07 PM, [EMAIL PROTECTED] wrote:
>> But I'm happy with just issuing a warning by default. That would mean
>> it doesn't fail silently, but neither does it crash. Seems like the
>> best compromise with the broken nature of the real world IT
>> environment.
>
> OK, I can live with that too.

+1

-- 
Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 2008-12-08 22:39, Victor Stinner wrote:
>> ('strict', 'ignore', 'replace', 'xmlcharrefreplace')
>
> replace (or xmlcharrefreplace) is just useless because you will not be
> able to open or rename the file... You just know that there is a
> strange file in the directory.

Right, but that's already a lot better than not knowing of the file's
existence at all :-)

Note that the above are standard error handlers for Unicode conversions.
The rest of the email you cut away has more useful error handlers for the
purpose in question.

-- 
Marc-Andre Lemburg
eGenix.com
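The objection quoted above, that a name decoded with 'replace' is visible but no longer usable to reach the file, can be checked directly. A small sketch; the helper name `lossy_roundtrip` and the sample file name are made up for illustration.

```python
def lossy_roundtrip(raw_name, encoding="utf-8"):
    """Decode with 'replace', then re-encode (hypothetical helper).

    Returns (displayed_text, reencoded_bytes) so the caller can compare
    the re-encoded bytes with the original raw name.
    """
    shown = raw_name.decode(encoding, "replace")
    return shown, shown.encode(encoding)


raw = b"Andr\xe9.txt"  # Latin-1 bytes, not valid UTF-8
shown, back = lossy_roundtrip(raw)

# The undecodable byte became U+FFFD, so the re-encoded bytes differ
# from the bytes actually stored in the directory entry.
print(ascii(shown))
print(back == raw)
```

The file shows up in the listing, but passing the displayed name back to the filesystem would look for a different byte sequence.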
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Mon, Dec 8, 2008 at 2:44 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
> On 2008-12-08 22:32, Adam Olsen wrote:
>> On Mon, Dec 8, 2008 at 2:01 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
>>> On 2008-12-08 21:45, Antoine Pitrou wrote:
>>>> M.-A. Lemburg mal at egenix.com writes:
>>>>> Such application specific error handlers could then also apply
>>>>> whatever fancy round-trip safe encoding of non-decodable bytes to
>>>>> Unicode escapes, private code points, etc. as seen fit by the
>>>>> application.
>>>>
>>>> I'd argue that such fancy round-trip safe error handler should be
>>>> provided by Python. It's not reasonable to expect application coders
>>>> to come up with their own codec variation based on subtle details of
>>>> the unicode spec.
>>>
>>> Fair enough. We could add some e.g.
>>>
>>>  * a round-trip safe escape error handler that uses a Unicode private
>>>    code point area which we officially reserve for the Python
>>>    interpreter
>>
>> This would of course alter the behaviour of those private code points,
>> preventing them from round-tripping properly.
>>
>> I don't think round-tripping can be done from an error handler. You
>> need a full codec to do it. A simple option is 8859-1. Or, ya know,
>> bytes. This has long since gotten repetitive.
>
> The error handler would just map the problem bytes to the private area.
> The application would then have to decide what to do with them, ie. the
> error handler only provides one half of the round-tripping.

By that point it's already too late. You've already conflated garbage PUA
with legitimate PUA. To make it work you need to treat those legitimate
PUA scalars as errors too, transforming them. A common example is how
escaping replaces a single '\' with '\\'.

Hrm. nul-escaping should work. Obviously it can't be used outside the
filesystem though, as they may introduce a legitimate nul.

-- 
Adam Olsen, aka Rhamphoryncus
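The conflation described above can be reproduced with a toy handler. This is a sketch under assumptions: the handler name `puaescape` and the offset `PUA_BASE` are arbitrary choices for illustration, not a proposed or existing scheme.

```python
import codecs

# Arbitrary offset inside the Basic Multilingual Plane private use
# area (U+E000..U+F8FF); chosen only for this illustration.
PUA_BASE = 0xF700


def puaescape(exc):
    """Decode-error handler: map each bad byte to PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + byte) for byte in bad), exc.end


codecs.register_error("puaescape", puaescape)

garbage = b"\xe9"                    # not valid UTF-8 on its own
legit = "\uf7e9".encode("utf-8")     # a genuine U+F7E9, validly encoded

# Two different byte strings decode to the same text, so the original
# bytes cannot be recovered from the decoded string alone.
same = garbage.decode("utf-8", "puaescape") == legit.decode("utf-8", "puaescape")
print(garbage != legit and same)
```

Because valid input never reaches the error handler, the handler alone cannot escape the legitimate private-use character, which is the reason given above for needing a full codec.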
Re: [Python-Dev] Python-3.0, unicode, and os.environ
M.-A. Lemburg wrote:
On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:

    try:
        files = os.listdir(somedir, errors='strict')
    except OSError as e:
        log(verbose error message that includes somedir and e)
        files = os.listdir(somedir)

If that error parameter is the same as in unicode(value, errors), then this would be a useful feature:

Except that unicode becomes str in 3.0, that is exactly my intention.

People could then choose among the already existing error handlers ('strict', 'ignore', 'replace', 'xmlcharrefreplace') or register their own ones via the codecs module.

These could be passed through from listdir or getenv to str.

[Side questions:
1. 'xmlcharrefreplace' is not in the 3.0 LibRef doc or doc string. Should it be, or is 'xmlcharrefreplace' an addition for a later version?
2. A garbage value for errors (such as 'blah') is silently ignored (so I cannot test the above). Intended or a bug?]

Someone else proposed a new option 'warn', which Guido has accepted to be the default instead of the current 'ignore'. It could not be passed through (unless str were changed or something registered). I believe the implementation of that would be to call str with 'strict' but catch errors and warn instead. Whether there should be 1 warning for each problematic bytes value encountered or 1 for each listdir (or whatever) call, possibly with the number of problems, I leave to others to decide.

Such application specific error handlers could then also apply whatever fancy round-trip safe encoding of non-decodable bytes to Unicode escapes, private code points, etc. as seen fit by the application.

Perhaps we should also add an 'encoding' parameter that can be set on a per directory basis (if necessary) and defaults to the global file system encoding.

That could also be passed through, but I will let others make the argument for it.

If an application hits a directory that is known to cause problems, it could then choose to receive the file names in a different, more suitable encoding. This allows implementing fallback mechanisms with a list of common encodings for a locale.

Terry Jan Reedy
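The fallback mechanism described above (try a list of encodings until one works) can be approximated today on top of the bytes API. This is a minimal sketch, not the proposed 'encoding' parameter: the function names and the default encoding list are assumptions chosen for the example.

```python
import os

# Sketch of an application-level fallback chain. The default list of
# encodings is an assumption for this example; a real application would
# pick encodings that are common for its users' locales.

def decode_with_fallbacks(raw_name, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes
    raw_name strictly, or (None, None) if none of them does."""
    for encoding in encodings:
        try:
            return raw_name.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return None, None


def listdir_with_fallbacks(path, encodings=("utf-8", "latin-1")):
    """List a directory through the bytes API and decode each name,
    keeping the raw bytes alongside so no file is lost."""
    results = []
    for raw_name in os.listdir(os.fsencode(path)):
        text, encoding = decode_with_fallbacks(raw_name, encodings)
        results.append((raw_name, text, encoding))
    return results
```

Order matters: single-byte encodings such as Latin-1 accept almost any byte string, so they belong at the end of the list, and a successful decode is not proof that the guess was right.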
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On approximately 12/8/2008 9:30 AM, came the following characters from the keyboard of [EMAIL PROTECTED]: If warnings were emitted, then files would not be silently ignored, yet the program could still be used. Yep, this is sounding useful. PS: I'd like to see a similar warning issued when an access attempt is made through os.environ to a variable that cannot be decoded. And argv ? Seems like the warning technique could be useful for _any_ interface that has been traditionally bytes, because that's the kind of characters that were, but now should move to (Unicode) characters. The warnings could be the same, or very similar. The question is if one global control should handle all types of bytes problems, or if there should be individual controls for each bytes problem, or both. I tend to believe in both; the paranoid can set exactly the ones they've coded for, the aggressive can set the global one. In this manner, new cases can be added to the global settings over time, if more are discovered -- it should be documented to handle future similar issues in a similar manner. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
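The warning idea above can be sketched with the standard warnings module. The helper below is hypothetical (neither os.environ nor sys.argv behaves this way by itself); it only shows how an undecodable value could be skipped while still leaving a trace, and how the usual warning filters would give a global control.

```python
import warnings

# Hypothetical helper: decode a bytes value, warn instead of failing.
# UnicodeWarning is a built-in warning category; the label argument is
# an assumption used to say where the bytes came from.

def decode_or_warn(raw, label, encoding="utf-8"):
    """Return the decoded text, or None after emitting a warning."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        warnings.warn(
            "%s: cannot decode %r as %s" % (label, raw, encoding),
            UnicodeWarning,
            stacklevel=2,
        )
        return None
```

Because the category is an ordinary warning class, a cautious deployment could run with warnings.simplefilter("error", UnicodeWarning) to turn every such case into an exception, while others leave the default filter in place.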
Re: [Python-Dev] Python-3.0, unicode, and os.environ
If the Unicode APIs only have correct unicode, sure. If not you'll get errors translating to UTF-8 (and the byte APIs are supposed to pass bad names through unaltered.) Kinda ironic, no?

As far as I can see all Python Unicode strings can be encoded to UTF-8, even things like lone surrogates because Python doesn't care about them. So both the Unicode API and the binary API would be fail-safe on Windows.

- Hagen
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 2:35 AM, Hagen Fürstenau [EMAIL PROTECTED] wrote: As far as I can see all Python Unicode strings can be encoded to UTF-8, even things like lone surrogates because Python doesn't care about them. So both the Unicode API and the binary API would be fail-safe on Windows. Python is broken and needs to be fixed. http://bugs.python.org/issue3672 http://bugs.python.org/issue3297 But the question of whether Python should care about lone surrogates or not is at best tangential to the issue at hand. If you have lone surrogates in the Unicode API (and didn't raise an exception on the way getting there), then the sensible thing is to encode them into lone UTF-8 surrogates. Even if you wanted to prevent lone surrogates, encoding to UTF-8 for the binary API would not be the place to enforce it. No. Unicode *requires* them to be treated as errors. If you want to pass them through then you're creating a custom encoding... which you might argue for in this case, but it needs to be clearly separate from the real UTF-8. -- Adam Olsen, aka Rhamphoryncus ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
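For readers checking the lone-surrogate point against a current interpreter: the sketch below reflects the behaviour of present-day CPython 3 releases, which postdates this thread. The strict UTF-8 codec refuses a lone surrogate, and the pass-through variant is a separately named error handler ('surrogatepass'), which matches the "clearly separate from the real UTF-8" position above.

```python
# Sketch: lone surrogates and UTF-8 in current CPython 3.
lone = "\ud800"  # a lone high surrogate, not a valid Unicode scalar value

try:
    lone.encode("utf-8")
    strict_accepts_lone_surrogate = True
except UnicodeEncodeError:
    strict_accepts_lone_surrogate = False

# Opting in explicitly produces the three-byte form that strict
# UTF-8 forbids.
relaxed = lone.encode("utf-8", "surrogatepass")
restored = relaxed.decode("utf-8", "surrogatepass")
```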
Re: [Python-Dev] Python-3.0, unicode, and os.environ
[EMAIL PROTECTED] wrote: On 06:07 am, [EMAIL PROTECTED] wrote: Most apps aren't file managers or ftp clients but when they interact with files (for instance, a file selection dialog) they need to be able to show the user all the relevant files. So on an app-by-app basis the need for this is high. While I tend to agree emphatically with this, the *real* solution here is a path-abstraction library. Why don't you send me some information offlist. I'm not sure I agree that a path-abstraction library can work correctly but if it can it would be nice to have that at a level higher than the file-dialog libraries that I was envisioning. [snip] ... but that still doesn't help me identify when someone would expect that asking python for a list of all files in a directory or a specific set of files in a directory should, without warning, return only a subset of them. In what situations is this appropriate behaviour? If you say listdir(unicode) on a POSIX OS, your program is saying I only know how to deal with unicode results from this function, so please only give me those.. No. (explained below) If your program is smart enough to deal with bytes, then you would have asked for bytes, no? Yes (explained below) Returning only filenames which can be properly decoded makes sense. Otherwise everyone needs to learn about this highly confusing issue, even for the simplest scripts. os.listdir(unicode) (currently) means that the *programmer* is asking that the stdlib return the decodable filenames from this directory. The question is whether the programmer understood that this is what they were asking for and whether it is what they most likely want. I would make the following statements WRT to this: 1) The programmer most likely does not want decodable filenames and only decodable filename. If they were, we'd see a lot of python2.x code that turns pathnames into unicode and discards everything that wasn't decodable. 
No one has given a use case for finding only the *decodable* subset of files. If I request to see all *.py files in a directory, I want to see all of the *.py files in the directory, decodable or not. If you can show how programmers intend 90% of their calls to os.listdir()/glob.glob('*.txt') to show only the decodable subset of the results, then the foundation of my arguments is gone. So please, give examples to prove this wrong. - If this is true, a definition of os.listdir(type 'str') that would better meet programmer expectation would be: Give me all files in a directory with the output as str type. The definition of os.listdir(type 'bytes') would be Give me all files in a directory with the output as bytes type. Raising an exception when the filenames are undecodable is perfectly reasonable in this situation. 2) For the programmer to understand the difference between os.listdir(type 'bytes') and os.listdir(type 'str') they have to understand the highly confusing issue and what it means for their code. So the current method is forcing programmers to understand it even for the simplest scripts if their environment is not uniform with no clue from the interpreter that there is an issue. - Similarly, raising an exception on undecodable values means that the programmer can ignore the issue in any scripts in sane environments and will be told that they need to deal with it (via an exception) when their script runs in a non-sane environment. 3) The usage of unicode vs bytes is easy to miss for someone starting with py2.x or windows and moving to a multi-platform or unix project. Even simple testing won't reveal the problem unless the programmer knows that they have to test what happens when encodings are mixed. Once again, this is requiring the programmer to understand the encoding issue without help from the interpreter. Skipping undecodable values is good enough that it will work 90% of the time. 
You and Guido have now made this claim to defend not raising an exception but I still don't have a use case. Here are use cases that I see:

* Bill is coding an application for use inside his company. His company only uses utf-8. His code naively uses os.listdir(type 'str').
  - The code does not throw an exception whether we use the current os.listdir() or one that could throw an exception because the system admins have sanitised the environment. Bill did not need to understand the implications of encoding for his code to work in this script whether simple or complex.

* Mary is coding an application for use inside her company. It finds all html files on a system and updates her company's copyright, privacy policy, and other legal boilerplate. Her expectation is that after her program runs every file will have been updated. Her environment is a mixture of different filename encodings due to having many legacy documents for users in different locales. Mary's code also naively uses os.listdir(type 'str'). Her test case checks that the code does
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 11:35, Adam Olsen [EMAIL PROTECTED] wrote: http://bugs.python.org/issue3672 http://bugs.python.org/issue3297 No. Unicode *requires* them to be treated as errors. If you want to pass them through then you're creating a custom encoding... which you might argue for in this case, but it needs to be clearly separate from the real UTF-8. I suspect it is a common and convenient but (according to what you say) misconceived expectation that using UTF-8 to encode any Unicode string will not raise an exception. This behavior is not something which should be discarded lightly. I see little reason that this couldn't be a new codec or error handler that allowed people to choose between correct pure UTF-8 behavior or the technically incorrect but very practical behavior it currently has. [My apologies, Adam, for sending this only to you the first time] -- Michael Urman ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman [EMAIL PROTECTED] wrote: On Sun, Dec 7, 2008 at 11:35, Adam Olsen [EMAIL PROTECTED] wrote: http://bugs.python.org/issue3672 http://bugs.python.org/issue3297 No. Unicode *requires* them to be treated as errors. If you want to pass them through then you're creating a custom encoding... which you might argue for in this case, but it needs to be clearly separate from the real UTF-8. I suspect it is a common and convenient but (according to what you say) misconceived expectation that using UTF-8 to encode any Unicode string will not raise an exception. This behavior is not something which should be discarded lightly. It is *not* a valid Unicode string in the first place. Therein lies the problem. I see little reason that this couldn't be a new codec or error handler that allowed people to choose between correct pure UTF-8 behavior or the technically incorrect but very practical behavior it currently has. Note that many of the restrictions were added for security reasons. You might receive a UTF-8 encoded file name from a malicious user, check if it contains something dangerous (like ../../../../../etc/password), then decode it. If your decoder isn't compliant (ie doesn't check for overly long sequences) then a b'\xC0\xAF' gets translated into u'/', bypassing your previous check. However, in this context we only need to allow lone surrogates. CESU-8 comes to mind. (It is a perverse world we live in.) -- Adam Olsen, aka Rhamphoryncus ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
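The overlong-sequence attack described above is easy to reproduce as a check-then-decode sketch. The helper names are hypothetical; the only library behaviour relied on is that CPython's strict UTF-8 decoder rejects the byte 0xC0, which can only start an overlong encoding.

```python
# Sketch: why a compliant decoder matters for check-then-decode code.

def contains_slash(raw):
    """A byte-level check of the kind an ASCII-era library might do."""
    return b"/" in raw


def strict_utf8(raw):
    """Return decoded text, or None when raw is not well-formed UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# b"\xc0\xaf" is an overlong (two-byte) encoding of "/".
suspicious = b"..\xc0\xafetc"

passes_byte_check = not contains_slash(suspicious)
decoded = strict_utf8(suspicious)
```

The byte-level check sees no slash, so a lenient decoder that turned the overlong pair into "/" would reintroduce exactly what the check was meant to exclude. The strict decoder refuses the input instead.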
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Toshio Kuratomi wrote: - If this is true, a definition of os.listdir(type 'str') that would better meet programmer expectation would be: Give me all files in a directory with the output as str type. The definition of os.listdir(type 'bytes') would be Give me all files in a directory with the output as bytes type. Raising an exception when the filenames are undecodable is perfectly reasonable in this situation. Your examples (snipped) pretty well convince me that there is a use case for raising exceptions. We should move beyond arguing over which one way is right. I think there should be a second argument 'ignorebad=False' to ignore undecodable files rather than raise the exception (or 'strict=True' to stop and raise exception on non-decodable names -- then code is 'if strict: raise ...'). I believe other functions have a similar parameter. tjr ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy [EMAIL PROTECTED] wrote: Toshio Kuratomi wrote: - If this is true, a definition of os.listdir(type 'str') that would better meet programmer expectation would be: Give me all files in a directory with the output as str type. The definition of os.listdir(type 'bytes') would be Give me all files in a directory with the output as bytes type. Raising an exception when the filenames are undecodable is perfectly reasonable in this situation. Your examples (snipped) pretty well convince me that there is a use case for raising exceptions. We should move beyond arguing over which one way is right. I think there should be a second argument 'ignorebad=False' to ignore undecodable files rather than raise the exception (or 'strict=True' to stop and raise exception on non-decodable names -- then code is 'if strict: raise ...'). I believe other functions have a similar parameter. If you want the exceptions, just use the bytes API and try to decode the byte strings using the system encoding. My problem with raising exceptions *by default* when an undecodable name exists is that it may render an app completely useless in a situation where the developer is no longer around. This happened all the time with the 2.x Unicode API, where the developer hadn't anticipated a particular input potentially containing non-ASCII bytes, and the user fed the application non-ASCII text. Making os.listdir raise an exception when a directory contains a single undecodable file means that the entire directory can't be read, and most likely the entire app crashes at that point. Most likely the developer never anticipated this situation (since in most places it is either impossible or very unlikely) -- after all, if they had anticipated it they would have used the bytes API in the first place. (It's worse because the exception being raised would be UnicodeError -- most people expect os.listdir to raise OSError, not other errors.) 
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
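The suggestion above ("use the bytes API and try to decode the byte strings using the system encoding") might look like the following sketch. The function name is made up; os.fsencode() and sys.getfilesystemencoding() are existing standard library calls in current Python 3, though not all of them existed in 3.0.

```python
import os
import sys

# Sketch: opt in to exceptions by going through the bytes API.
# listdir_strict is a hypothetical name, not a stdlib function.

def listdir_strict(path):
    """Return all names in path as str, raising UnicodeDecodeError on
    the first name that is not valid in the filesystem encoding."""
    encoding = sys.getfilesystemencoding()
    names = []
    for raw_name in os.listdir(os.fsencode(path)):
        # No errors argument: decoding is strict by default.
        names.append(raw_name.decode(encoding))
    return names
```

A caller who wants the exception writes listdir_strict(somedir); everyone else keeps calling os.listdir(somedir) and never sees a UnicodeError.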
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Terry Reedy wrote:
Toshio Kuratomi wrote:

- If this is true, a definition of os.listdir(type 'str') that would better meet programmer expectation would be: Give me all files in a directory with the output as str type. The definition of os.listdir(type 'bytes') would be Give me all files in a directory with the output as bytes type. Raising an exception when the filenames are undecodable is perfectly reasonable in this situation.

Your examples (snipped) pretty well convince me that there is a use case for raising exceptions. We should move beyond arguing over which one way is right. I think there should be a second argument 'ignorebad=False' to ignore undecodable files rather than raise the exception (or 'strict=True' to stop and raise exception on non-decodable names -- then code is 'if strict: raise ...'). I believe other functions have a similar parameter.

If we were going to do anything like that for os.listdir() and other filesystem APIs (like glob) that return multiple paths, we'd probably be best advised to just have a normal Unicode 'errors' parameter which allowed:

  'strict' - raise an Exception for malformed binary data
  'replace' - insert '?' or some other symbol in place of malformed binary data
  'ignore' - simply leave out the malformed binary data
  'skip' - run the underlying codec in strict mode, but skip over any items which raise UnicodeDecodeError (default/current Py3k behaviour)

Obviously, 'skip' doesn't make any sense for APIs like getcwd() that return a single value - a case could be made for those defaulting to either replace or strict.

Cheers, Nick.

-- Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
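The four modes listed above can be sketched over a plain list of byte strings. Note that 'strict', 'replace' and 'ignore' are real codec error handlers, while 'skip' is only the proposal from this message and is implemented by hand here; decode_names is a hypothetical helper, not an os function.

```python
# Sketch of the proposed 'errors' modes applied to a list of raw names.

def decode_names(raw_names, encoding="utf-8", errors="skip"):
    """Decode each name; 'skip' drops whole names that fail strict
    decoding, other values are handed to the codec unchanged."""
    decoded = []
    for raw in raw_names:
        if errors == "skip":
            try:
                decoded.append(raw.decode(encoding, "strict"))
            except UnicodeDecodeError:
                continue  # leave the whole item out
        else:
            decoded.append(raw.decode(encoding, errors))
    return decoded


names = [b"ok.txt", b"bad\xff.txt"]
skipped = decode_names(names, errors="skip")
replaced = decode_names(names, errors="replace")
ignored = decode_names(names, errors="ignore")
```

The difference between 'ignore' and 'skip' is the important one: 'ignore' yields a name ("bad.txt") that does not exist on disk, whereas 'skip' omits the entry entirely.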
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan wrote:

For binary wrappers around the Windows Unicode APIs, I was thinking specifically of using UTF-8, since that should be able to encode anything the Unicode APIs can handle.

Why shouldn't the binary interface just expose the raw utf16 as bytes?

-- Greg
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Guido van Rossum wrote: On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy [EMAIL PROTECTED] wrote: Toshio Kuratomi wrote: - If this is true, a definition of os.listdir(type 'str') that would better meet programmer expectation would be: Give me all files in a directory with the output as str type. The definition of os.listdir(type 'bytes') would be Give me all files in a directory with the output as bytes type. Raising an exception when the filenames are undecodable is perfectly reasonable in this situation. Your examples (snipped) pretty well convince me that there is a use case for raising exceptions. We should move beyond arguing over which one way is right. I think there should be a second argument 'ignorebad=False' to ignore undecodable files rather than raise the exception (or 'strict=True' to stop and raise exception on non-decodable names -- then code is 'if strict: raise ...'). I believe other functions have a similar parameter. I was thinking of the normal Unicode 'errors' parameter, as described by Nick. If you want the exceptions, just use the bytes API and try to decode the byte strings using the system encoding. If it was a matter of adding a new method, I might agree. But: 1. We already have a method that does exactly what you describe. It is only a matter of adding flexibility to the response to problems, for which there is already precedent. 2. Suggesting that people who want strings and not bytes should have to deal with bytes, just to get an error notification, seems to negate that point of moving to 3.0 3. A builtin would probably do so better than most programmers would, with little touches such as the one suggested below. 4. An error parameter would ALERT programmers to the possibility of a PROBLEM, both in the present and future. As you say below, people need to better anticipate the future. 
My problem with raising exceptions *by default* when an undecodable name exists is that it may render an app completely useless in a situation where the developer is no longer around. This happened all the time with the 2.x Unicode API, where the developer hadn't anticipated a particular input potentially containing non-ASCII bytes, and the user fed the application non-ASCII text. Making os.listdir raise an exception when a directory contains a single undecodable file means that the entire directory can't be read, and most likely the entire app crashes at that point. Most likely the developer never anticipated this situation (since in most places it is either impossible or very unlikely) -- after all, if they had anticipated it they would have used the bytes API in the first place. (It's worse because the exception being raised would be UnicodeError -- most people expect os.listdir to raise OSError, not other errors.)

This to me is an argument for keeping the default the current behavior, but not for rejecting flexibility. The computing world seems to be messier than we would like and worse than I realized until this week. As you say below, people need to better anticipate the future, and an errors parameter would help do that.

Is Windows really immune? What about when it reads the directory of possibly old removable media with whatever byte name encodings? Is this a possible source of 'unanticipated' problems?

As to your last sentence, os.listdir() with an errors parameter could convert a decoding UnicodeError to OSError: undecodable file name ascii+hex repr, thereby supplying the expected exception as well as an extractable representation of the problematic raw bytes.

Here is a possible use case: I want filenames as 3.0 strings and I anticipate no problems at present but, as you say above, something might happen years in the future. I am using 3.0 *because* of the strings == unicode feature.
I would like to write

    try:
        files = os.listdir(somedir, errors='strict')
    except OSError as e:
        log(verbose error message that includes somedir and e)
        files = os.listdir(somedir)

and go on without the problem file but not without logging the problem so a future maintainer can consider what to do about it, but only when there is an actual need to think about it.

Terry Jan Reedy
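Since os.listdir() has no errors parameter, the "convert UnicodeError to OSError" idea in this message can only be sketched with a stand-in. The helper below is hypothetical and works on one raw name at a time; it shows the two properties asked for: the caller sees the expected exception type, and the message carries a repr of the raw bytes.

```python
# Hypothetical helper illustrating the UnicodeError -> OSError idea.

def decode_name_or_oserror(raw, encoding="utf-8"):
    """Decode a raw file name strictly; on failure raise OSError whose
    message contains an ASCII repr of the offending bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # 'from exc' keeps the original decoding error as __cause__.
        raise OSError("undecodable file name %r" % (raw,)) from exc


def decode_or_log(raw, log):
    """Usage sketch: log the problem and carry on without the name."""
    try:
        return decode_name_or_oserror(raw)
    except OSError as e:
        log("skipping entry: %s" % e)
        return None
```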
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On approximately 12/7/2008 10:56 AM, came the following characters from the keyboard of Adam Olsen:

You might receive a UTF-8 encoded file name from a malicious user, check if it contains something dangerous (like ../../../../../etc/password), then decode it. If your decoder isn't compliant (ie doesn't check for overly long sequences) then a b'\xC0\xAF' gets translated into u'/', bypassing your previous check.

You might indeed. But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form, specifying what sorts of Unicode strictness (such as overlong sequences) to check for during the decode process, and once the string is in canonical form, _then_ do checks for various attacks, such as the ../ sequence you mention?

And with that order of operation, even if you don't reject overlong sequences, you have canonized them, and can recognize the resulting characters as good or bad.

-- Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
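The "decode first, then check" ordering argued for above can be sketched as follows. The function name and the policy (relative POSIX paths only, no escape above the starting directory) are assumptions for the example; posixpath.normpath and posixpath.isabs are standard library functions.

```python
import posixpath

# Sketch: validate a path only after it has been decoded and normalised.

def is_safe_relative_path(raw, encoding="utf-8"):
    """Return True if raw decodes strictly and, once normalised, stays
    inside the directory it is resolved against."""
    try:
        text = raw.decode(encoding)  # strict: malformed input is rejected
    except UnicodeDecodeError:
        return False
    normalized = posixpath.normpath(text)
    if posixpath.isabs(normalized):
        return False
    # After normpath, any remaining ".." can only be at the front.
    return normalized != ".." and not normalized.startswith("../")
```

With this order the overlong b'\xC0\xAF' case never reaches the ".." check as text, because the strict decode fails first. A lenient decoder would still be safe here, since the check runs on the decoded result; the danger discussed in this thread arises when some other component has already done its check on the raw bytes.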
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Glenn Linderman writes:

But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form,

Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want.

If you want the convenience and risk, I believe you should ask for it by name (I suggest a name like own_me for the relaxed decoding flag <wink>). Failing that, it would be nice to have a global flag to change the default.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On approximately 12/7/2008 8:13 PM, came the following characters from the keyboard of Stephen J. Turnbull: Glenn Linderman writes: But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form, Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want. I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding? So I think it should be logically decoupled... do validation when/where it is needed for security reasons, and allow internal [de]coding to be faster. I'm mostly indifferent about which should be the default... maybe there shouldn't be a default! Use the vUTF-8 decoder for strict validation, and the fUTF-8 decoder for the faster, non-validating version. Or something like that. With appropriate documentation. Of course, UTF-8 already exists... as fUTF-8, so for compatibility, I guess it shouldn't change... but it could be deprecated. You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors? -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman [EMAIL PROTECTED] wrote: On approximately 12/7/2008 8:13 PM, came the following characters from the keyboard of Stephen J. Turnbull: Glenn Linderman writes: But if you are interested in checking for security issues, shouldn't you _first_ decode into some canonical form, Yes. That's all that is being asked for: that Python do strict decoding to a canonical form by default. That's a lot to ask, as it turns out, but that is what we (the minority of strict Unicode adherents, that is) want. I have no problem with having strict validation available. But doesn't validation take significantly longer than decoding? So I think it should be logically decoupled... do validation when/where it is needed for security reasons, and allow internal [de]coding to be faster. I'd like to see benchmarks of such a claim. I'm mostly indifferent about which should be the default... maybe there shouldn't be a default! Use the vUTF-8 decoder for strict validation, and the fUTF-8 decoder for the faster, non-validating version. Or something like that. With appropriate documentation. Of course, UTF-8 already exists... as fUTF-8, so for compatibility, I guess it shouldn't change... but it could be deprecated. You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors? Unicode is intended to allow interaction between various bits of software. It may be that a library checked it in UTF-8, then passed it to python. It would be nice if the library validated too, but a major advantage of UTF-8 is older libraries (or protocols!) intended for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their security checks continue to work, so long as nobody down stream introduces problems with a non-validating decoder. 
-- Adam Olsen, aka Rhamphoryncus
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman [EMAIL PROTECTED] wrote: On approximately 12/7/2008 9:11 PM, came the following characters from the keyboard of Adam Olsen: On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman [EMAIL PROTECTED] wrote: Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C, I wonder if I could find that code? Can you supply a validated decoder? Then we could run some benchmarks, eh? There is no point for me, as the behaviour of a real UTF-8 codec is clear. It is you who needs to justify a second non-standard UTF-8-ish codec. See below. You didn't address the issue that if the decoding to a canonical form is done first, many of the insecurities just go away, so why throw errors? Unicode is intended to allow interaction between various bits of software. It may be that a library checked it in UTF-8, then passed it to python. It would be nice if the library validated too, but a major advantage of UTF-8 is older libraries (or protocols!) intended for ASCII need only be 8-bit clean to be repurposed for UTF-8. Their security checks continue to work, so long as nobody down stream introduces problems with a non-validating decoder. So I don't understand how this is responsive to the decoding removes many insecurities issue? Yes, you might use libraries. Either they have insecurities, or not. Either they validate, or not. Either they decode, or not. They may be immune to certain attacks, because of their structure and code, or not. So when you examine a library for potential use, you have documentation or code to help you set your expectations about what it does, and whether or not it may have vulnerabilities, and whether or not those vulnerabilities are likely or unlikely, whether you can reduce the likelihood or prevent the vulnerabilities by wrapping the API, etc. And so you choose to use the library, or not. 
This whole discussion about libraries seems somewhat irrelevant to the question at hand, although it is certainly true that understanding how a library handles Unicode is an important issue for the potential user of a library. So how does a non-validating decoder introduce problems? I can see that it might not solve all problems, but how does it introduce problems? Wouldn't the problems be introduced by something else, and the use of a non-validating decoder may not catch the problem... but not be the cause of the problem? And then, if you would like to address the original issue, that would be fine too. Your non-validating encoder is translating an invalid sequence into a valid one, thus you are introducing the problem. A completely naive environment (8-bit clean ASCII) would leave it as an invalid sequence throughout. This is not a theoretical problem. See http://tools.ietf.org/html/rfc3629#section-10 . We MUST reject invalid sequences, or else we are not using UTF-8. There is no wiggle room, no debate. (The absoluteness is why the standard behaviour doesn't need a benchmark. You are essentially arguing that, when logging in as root over the internet, it's a lot faster if you use telnet rather than ssh. One is simply not an option.) -- Adam Olsen, aka Rhamphoryncus ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 05, 2008 at 08:37:45PM -0500, James Y Knight wrote:
On Dec 5, 2008, at 7:48 PM, Nick Coghlan wrote:
You can't display a non-decodable filename to the user, hence the user will have no idea what they're working on. Non-filesystem related apps have no business trying to deal with insane filenames.

Sigh, same arguments, all over again. Again, *both* KDE and Gnome apps display non-decodable filenames to the user, and let the user work with the files. They display as good a rendition as they can, using a replacement character as appropriate. In some earlier versions, KDE did not work at all on poorly-encoded files, and users submitted bug reports. People do care, it does happen in real life, and it is a bug in your software if you cannot deal with the users' files. They just want the software to work. If it shows something weird in the window titlebar, that's a bit irritating but at least it doesn't get in the way of working.

I agree 100%. Russian Unix users use at least 5 different encodings (koi8-r, cp1251 and utf-8 are the most frequent in use, cp866 and iso-8859-5 are less frequent). I have an FTP server with some filenames in koi8 encoding - these filenames are for unix clients, - and some filenames in cp1251 for w32 clients. Sometimes I run utf-8 xterm (I am a commandline/console unixhead) for my needs (read email, write files in utf-8 with characters beyond koi8-r, which is my primary encoding) - and I still can work with filenames in koi8/cp1251 encodings. My filemanager (Midnight Commander, for the matter) shows these files and directories as ?.???, but I can chdir to such directories, and I can open such files. It would be a big bad blow for me if filemanagers (or other programs) start to filter these filenames.

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ [EMAIL PROTECTED]
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sat, Dec 06, 2008 at 12:03:55PM +1100, Steven D'Aprano wrote:
I'd rather have the Python API report errors than silence them, at least by default.

+1 for encoding errors by default.

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ [EMAIL PROTECTED]
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sat, Dec 06, 2008 at 02:22:29AM +0100, Martin v. Löwis wrote:
And environment variables, command line arguments, and file names are not bytes, but characters.

There is no such thing as plain text! If you say these are characters you must also name the encoding for them. LANG/LC_ALL/LC_CTYPE provide a sensible default, but if a program has problems decoding bytes to characters there must be a way for the user to override the default. But the user must be notified about the error, so programs must not silently filter out non-decodable characters.

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ [EMAIL PROTECTED]
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan wrote:
Toshio Kuratomi wrote:
Nonsense. A program can do tons of things with a non-decodable filename. Where it's limited is non-decodable filedata.

You can't display a non-decodable filename to the user, hence the user will have no idea what they're working on. Non-filesystem related apps have no business trying to deal with insane filenames.

This is where we disagree. There are many ways to display the non-decodable filename to the user because the user is not a machine. The computer must know the unique sequence of bytes in order to access a file. The user, OTOH, usually only needs to know that the file exists. In most GUI-based end-user oriented desktop apps, it's enough to do str(filename, errors='replace'). For instance, the GNOME file manager displays:

    ? (Invalid encoding)

and Konqueror, the KDE file manager just displays:

    ?

The file can still be displayed this way, accessed via the raw bytes that the program keeps internally, and operated upon by applications.

For applications in which the user needs more information to differentiate the files the program has the option to display the raw byte sequences as if they were the filename. The *NIX shell and command line tools have this ability.

    $ LANG=en_US.utf8 ls -b
    á  í
    $ LANG=C ls -b
    .  ..  \303\241  \303\255
    $ mv $'\303\241' $'\303\263'
    $ LANG=C ls -b
    \303\255  \303\263
    $ LANG=en_US.utf8 ls -b
    í  ó

Linux is moving towards a standard of UTF-8 for filenames, and once we get to the point where the idea of encoding filenames and environment variables any other way is seen as crazy, then the Python 3 approach will work seamlessly.

nod With the caveat that I haven't seen movement by Linux and other Unix variants to enforce UTF-8. What I have seen are statements by kernel programmers that having the filesystem use bytes and not know about encoding is the correct thing to do.
This means that utf-8 will be a convention rather than a necessity for a very long time and consequently programs will need to worry about the problems of mixed encoding systems for an equally long time. (Remember, encoding is something that can be changed per user and per file. So on a multiuser OS, mixed encodings can be out of the control of the system administrator for perfectly valid reasons.) In the meantime, raw bytes APIs will provide an alternative for those that disagree with that philosophy. Oh I agree with the UTF-8 everywhere philosophy. I just know that there's tons of real-world systems out there that don't conform to my expectations for sanity and my code has to account for those :-) -Toshio signature.asc Description: OpenPGP digital signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
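The two display strategies described in this message (a replacement character for ordinary display, escaped raw bytes when the user needs to tell files apart) might look roughly like this in Python 3. This is a minimal sketch: the helper names `display_name`, `escaped_name` and `list_for_display` are made up for illustration, UTF-8 is assumed as the display encoding, and the escaping only approximates what `ls -b` does.

```python
import os


def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Best-effort rendition for showing to a user; undecodable bytes become U+FFFD."""
    return str(raw, encoding, errors="replace")


def escaped_name(raw: bytes) -> str:
    """Roughly like `ls -b`: printable ASCII as-is, other bytes as octal escapes."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "\\%03o" % b for b in raw)


def list_for_display(path: bytes = b"."):
    """Return (raw_bytes, display_text) pairs.

    The raw bytes are what the program keeps for open()/rename(); the text
    is only ever shown to the user.
    """
    return [(raw, display_name(raw)) for raw in os.listdir(path)]
```

The point of keeping the pair is that the program never has to round-trip the display text back into a filename; it always operates on the raw bytes it was given.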
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 5, 2008 at 10:18 PM, Bugbee, Larry [EMAIL PROTECTED] wrote: There has been some discussion here that users should use the str or byte function variant based on what is relevant to their system, for example when getting a list of file names or opening a file. That thought process really doesn't do much for those of us that write code that needs to run on any platform type, without alteration or the addition of complex if-statements and/or exceptions. Whatever the resolution here, and those of you addressing this thorny issue have my admiration, the solution should be such that it gives consistent behavior regardless of platform type and doesn't require the programmer to know of all the minute details of each possible target platform. My prediction is that it won't ever be possible to completely hide this difference between platforms. The platforms differ fundamentally in how they see filenames. An elaborate abstraction can certainly be created that smooths out most of the differences, but at some point useful functionality will have to be lost in order to maintain strict platform independence. This is the fate of most platform-independence abstractions by the way. For example, there are many elaborate packages for platform-independent I/O, but they generally don't provide access to all functionality that is available on a platform. Where they do, the application is once again placed in the position of having to use complex if-statements and/or exceptions. Consider just this example. Many programs have a need to ask their user for a filename to be created by the program. On systems where filenames are raw byte strings, do you want to provide the user with a way to specify an arbitrary byte string? (That is, in addition to the normal case of entering a text string that will be transformed into a filename using some encoding.) 
Your choices are either not to support the case of bytes that aren't a valid encoding in the current encoding, or add a UI element to select an encoding, or add a UI element to enter raw bytes. An abstraction package is likely to only support the first option (this is what Java does BTW), but this is not acceptable to all applications. That may not be possible for a while, so interim solutions should be such that it minimizes later pain. If that means hiding implementation details behind a new function, so be it. Then, at least, the body of one's app is not burdened with this problem later when conditions change. I believe the problem's severity is actually overstated. The interim solution with the least amount of pain that will work for almost all apps is to treat filenames as text strings encoded in some default encoding, and ignore filenames that aren't valid encodings of any text string. Yes, it is possible that you'll find that you can't completely remove or traverse certain directory trees. But that's a fact of life anyway (filesystems have many hidden failure modes), so you're better off dealing with *that* possibility than worrying over the issue of undecodable filenames. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Fri, Dec 5, 2008 at 8:57 PM, Tres Seaver [EMAIL PROTECTED] wrote:
Amen! The idea that paths, environment variables, and stuff pulled off of sockets can be treated as text rather than strings is just wishful thinking.

Unfortunately most of the programmers of the world *do* think that way(*), and it's not easy to wean them off the idea. It's a powerful meme that you can use your own name as a file name, even if you happen to be Czech or Vietnamese -- and it's promoted by the two most popular consumer operating systems.

(*) With the exception of sockets. Sockets are typically dealt with through protocols and APIs that provide guidance about how to convert between bytes and strings, and whether that is even a meaningful operation.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Bugbee, Larry wrote:
There has been some discussion here that users should use the str or byte function variant based on what is relevant to their system, for example when getting a list of file names or opening a file. That thought process really doesn't do much for those of us that write code that needs to run on any platform type, without alteration or the addition of complex if-statements and/or exceptions. Whatever the resolution here, and those of you addressing this thorny issue have my admiration, the solution should be such that it gives consistent behavior regardless of platform type and doesn't require the programmer to know of all the minute details of each possible target platform.

I've been thinking about this and I can only see one option. I don't think that it really makes less work for the programmer, though -- it just shifts the problem and makes it more apparent what your code is doing.

To avoid exceptions and if-thens in program code when accessing filenames, environment variables, etc, you would need to access each of these resources via the byte API. Then, to avoid having to keep track of what's a string and what's a byte in your other code, you probably want to convert those bytes to strings. This is where the burden gets shifted. You'll have your own routine(s) to do the conversion and have to have exception handling code to deal with undecodable filenames.

Note 1: your particular app might be able to get away without doing the conversion from bytes to string -- it depends on what you're planning on doing with the filename/environment data.

Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.environb in this instance and if it doesn't exist, use os.environ instead... and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.)

-Toshio
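The check described in Note 2 could be sketched as below. This assumes a Python version in which `os.environb` exists on POSIX platforms and is absent on Windows (at the time of this thread it was only a proposal). The helper names `environ_bytes` and `getenv_text` are hypothetical, and UTF-8 is an assumed default for the conversion step.

```python
import os


def environ_bytes(encoding: str = "utf-8") -> dict:
    """Return the environment as a bytes-to-bytes dict on every platform."""
    environb = getattr(os, "environb", None)
    if environb is not None:
        # Bytes API is available: take the values straight from it.
        return dict(environb)
    # No bytes API here: encode the str environment instead.
    return {k.encode(encoding): v.encode(encoding) for k, v in os.environ.items()}


def getenv_text(name: str, default=None, encoding: str = "utf-8", errors: str = "strict"):
    """Fetch one variable via the bytes view and convert it to str.

    With errors="strict" an undecodable value raises UnicodeDecodeError,
    which is where the caller's own exception handling comes in.
    """
    raw = environ_bytes(encoding).get(name.encode(encoding))
    if raw is None:
        return default
    return raw.decode(encoding, errors=errors)
```

A caller that prefers a lossy rendition over an exception can pass `errors="replace"`; either way the decision is made in one routine rather than scattered through the application.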
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 02:34 pm, [EMAIL PROTECTED] wrote: On Fri, Dec 05, 2008 at 08:37:45PM -0500, James Y Knight wrote: On Dec 5, 2008, at 7:48 PM, Nick Coghlan wrote: You can't display a non-decodable filename to the user, hence the user will have no idea what they're working on. Non-filesystem related apps have no business trying to deal with insane filenames. Sigh, same arguments, all over again. People do care, it does happen in real life, and it is a bug in your software if you cannot deal with the users' files. They just want the software to work. If it shows something weird in the window titlebar, that's a bit irritating but at least it doesn't get in the way of working. I agree 100%. Russian Unix users use at least 5 different encodings (koi8-r, cp1251 and utf-8 are the most frequent in use, cp866 and iso-8859-5 are less frequent). I have an FTP server with some filenames in koi8 encoding - these filenames are for unix clients, - and some filenames in cp1251 for w32 clients. Sometimes I run utf-8 xterm (I am a commandline/console unixhead) for my needs (read email, write files in utf-8 with characters beyond koi8-r, which is my primary encoding) - and I still can work with filenames in koi8/cp1251 encodings. My filemanager (Midnight Commander, for the matter) shows these files and directories as ?.???, but I can chdir to such directories, and I can open such files. It would be a big bad blow for me if filemanagers (or other programs) start to filter these filenames. I find it interesting to note that the only users in this discussion who actually have these problems in real life all have this attitude. It is expected that in an imperfect world we will have imperfect encodings, but it is super important that software which can open files can deal with not understanding the character translation of the filename. 
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sat, Dec 6, 2008 at 10:53 AM, [EMAIL PROTECTED] wrote: On 02:34 pm, [EMAIL PROTECTED] wrote: I agree 100%. Russian Unix users use at least 5 different encodings (koi8-r, cp1251 and utf-8 are the most frequent in use, cp866 and iso-8859-5 are less frequent). I have an FTP server with some filenames in koi8 encoding - these filenames are for unix clients, - and some filenames in cp1251 for w32 clients. Sometimes I run utf-8 xterm (I am a commandline/console unixhead) for my needs (read email, write files in utf-8 with characters beyond koi8-r, which is my primary encoding) - and I still can work with filenames in koi8/cp1251 encodings. My filemanager (Midnight Commander, for the matter) shows these files and directories as ?.???, but I can chdir to such directories, and I can open such files. It would be a big bad blow for me if filemanagers (or other programs) start to filter these filenames. I find it interesting to note that the only users in this discussion who actually have these problems in real life all have this attitude. It is expected that in an imperfect world we will have imperfect encodings, but it is super important that software which can open files can deal with not understanding the character translation of the filename. For file managers and similar tools I am absolutely 100% in agreement -- that's why the binary APIs are there. Most apps aren't file managers or ftp clients though. The sky is not falling. -- --Guido van Rossum (home page: http://www.python.org/~guido/) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Toshio Kuratomi wrote:
Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.environb in this instance and if it doesn't exist, use os.environ instead... and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.)

Note that this is why I personally think the binary API variants *should* exist on Windows, just with the sense of the system encoding flipped around.

That is, on *nix:
- underlying OS API uses bytes
- binary API just passes values straight through
- Unicode API uses the system encoding to encode Unicode names and values to be passed to the OS API and to decode bytes names and values received from the OS API

While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values to be passed to the OS API and to encode Unicode names and values received from the OS API

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
Re: [Python-Dev] Python-3.0, unicode, and os.environ
Nick Coghlan ncoghlan at gmail.com writes: If the binary APIs are missing from a major platform (i.e. Windows) then the choice to use them brings with it a major cross-platform portability problem that should really be handled by the standard library. +1 I might also add that providing binary APIs does not prevent us to implement some special representation of broken filenames when using the unicode APIs (for example using private Unicode characters - I'm not sure what the right terminology is - as sometimes suggested). Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
* Nick Coghlan wrote:
Toshio Kuratomi wrote:
Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.environb in this instance and if it doesn't exist, use os.environ instead... and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.)

Note that this is why I personally think the binary API variants *should* exist on Windows, just with the sense of the system encoding flipped around.

That is, on *nix:
- underlying OS API uses bytes
- binary API just passes values straight through
- Unicode API uses the system encoding to encode Unicode names and values to be passed to the OS API and to decode bytes names and values received from the OS API

While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values to be passed to the OS API and to encode Unicode names and values received from the OS API

Now that is somewhat strange. That way you'll have two unreliable APIs and need to switch depending on the platform again.

nd
Re: [Python-Dev] Python-3.0, unicode, and os.environ
André Malo wrote:
While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values to be passed to the OS API and to encode Unicode names and values received from the OS API

Now that is somewhat strange. That way you'll have two unreliable APIs and need to switch depending on the platform again.

Sorry, "system encoding" was probably a poor choice of words there, since that generally means mbcs when talking about Windows (which would indeed be a very poor choice of encoding). For binary wrappers around the Windows Unicode APIs, I was thinking specifically of using UTF-8, since that should be able to encode anything the Unicode APIs can handle.

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
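The "flipped" arrangement Nick describes can be modelled with a toy in-memory store. This is purely an illustration of the design, not a proposal for the real os module: `EnvStore` and its methods are invented names, and UTF-8 stands in for whatever encoding the non-native API would use.

```python
class EnvStore:
    """Toy model of an OS whose native environment type is bytes or str."""

    def __init__(self, native_is_bytes: bool, encoding: str = "utf-8"):
        self._native_is_bytes = native_is_bytes
        self._encoding = encoding
        self._data = {}

    def _to_native(self, value):
        # Convert only when the caller's type differs from the native type.
        if self._native_is_bytes and isinstance(value, str):
            return value.encode(self._encoding)
        if not self._native_is_bytes and isinstance(value, bytes):
            return value.decode(self._encoding)
        return value  # already native: passed straight through

    def set(self, name, value):
        self._data[self._to_native(name)] = self._to_native(value)

    def get_text(self, name) -> str:
        value = self._data[self._to_native(name)]
        return value.decode(self._encoding) if isinstance(value, bytes) else value

    def get_bytes(self, name) -> bytes:
        value = self._data[self._to_native(name)]
        return value.encode(self._encoding) if isinstance(value, str) else value


posix_like = EnvStore(native_is_bytes=True)     # binary API is the pass-through
windows_like = EnvStore(native_is_bytes=False)  # Unicode API is the pass-through
```

On the bytes-native store the text API can fail on undecodable values; on the str-native store the bytes API converts through the chosen encoding, which is the asymmetry André objects to and which Nick's choice of UTF-8 is meant to minimise.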
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On Sun, Dec 07, 2008, Nick Coghlan wrote: If the binary APIs are missing from a major platform (i.e. Windows) then the choice to use them brings with it a major cross-platform portability problem that should really be handled by the standard library. +1 -- Aahz ([EMAIL PROTECTED]) * http://www.pythoncraft.com/ It is easier to optimize correct code than to correct optimized code. --Bill Harlan ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Python-3.0, unicode, and os.environ
On 06:07 am, [EMAIL PROTECTED] wrote:
Guido van Rossum wrote:
On Sat, Dec 6, 2008 at 10:53 AM, [EMAIL PROTECTED] wrote:
I find it interesting to note that the only users in this discussion who actually have these problems in real life all have this attitude.

For file managers and similar tools I am absolutely 100% in agreement -- that's why the binary APIs are there. Most apps aren't file managers or ftp clients though. The sky is not falling.

Most apps aren't file managers or ftp clients but when they interact with files (for instance, a file selection dialog) they need to be able to show the user all the relevant files. So on an app-by-app basis the need for this is high.

While I tend to agree emphatically with this, the *real* solution here is a path-abstraction library. In separate discussions, the difficulty of getting such a thing into the standard library has been discussed, due to the wide variety of opinions as to what it should look like (and the shocking level of difficulty involved in making such a thing really work correctly). I'd be very happy to talk to you off-list about my ideas for such a thing, but I'd rather not resurrect yet another tedious discussion here just now :).

On a code basis, I'd hope that most file selection dialogs are pulled out into libraries... but that still doesn't help me identify when someone would expect that asking python for a list of all files in a directory or a specific set of files in a directory should, without warning, return only a subset of them. In what situations is this appropriate behaviour?

If you say listdir(unicode) on a POSIX OS, your program is saying "I only know how to deal with unicode results from this function, so please only give me those." If your program is smart enough to deal with bytes, then you would have asked for bytes, no? Returning only filenames which can be properly decoded makes sense. Otherwise everyone needs to learn about this highly confusing issue, even for the simplest scripts.
Skipping undecodable values is good enough that it will work 90% of the time. When you need to get to 100%, it won't be impossible - the bytes APIs will be there. In the longer term, hopefully some path abstraction will eventually be there too. We should not wait for a perfectly correct path abstraction to arrive before providing the primitives to do it yourself, though. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
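The behaviour argued for here (return only what decodes, with the bytes API as the escape hatch) can be approximated on top of the bytes API itself. A rough sketch follows; `decodable_names` and `listdir_text` are hypothetical helpers, and this is not a claim about what os.listdir does in any particular Python release.

```python
import os
import sys


def decodable_names(raw_names, encoding: str = "utf-8"):
    """Split raw filenames into (decoded_text_names, skipped_raw_names)."""
    decoded, skipped = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Kept separately so a caller that cares can still see them.
            skipped.append(raw)
    return decoded, skipped


def listdir_text(path: bytes = b"."):
    """List a directory via the bytes API, decoding with the filesystem encoding."""
    return decodable_names(os.listdir(path), sys.getfilesystemencoding())
```

A simple script would use only the first element of the result; a program that needs the remaining cases can inspect the second element, or report that some names were skipped instead of dropping them silently.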