Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-15 Thread Ulrich Eckhardt

On Friday 12 December 2008, Adam Olsen wrote:
 Only pages like this, which indicate the underlying API is an array of
 WCHAR:

 http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx

Hmm, true. So even there, the encoding isn't known...

 char * is just fine.  You need only pass a length along with it.  All
 internal APIs *must* already do this, as they support nul bytes.  Also
 note that the underlying POSIX APIs prohibit nul bytes in filenames,
 so it's irrelevant for them.

Hmmm, I see things like Py_GetPath() in the 2.7 source code, which returns a 
plain char*. I really need to check whether 3.0 is better.

thanks for the info

Uli

-- 
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

**
   Visit our website at http://www.satorlaser.de/
**
This e-mail, including all attachments, is intended only for the addressee 
and may contain confidential information. Please notify the sender 
immediately if you are not the intended recipient. In that case the e-mail 
must be deleted and must not be read, forwarded, published or otherwise used.
E-mails can be read by third parties and may contain viruses and unauthorized 
changes. Sator Laser GmbH is not responsible for these consequences.

**

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-13 Thread Steven D'Aprano

On Fri, 12 Dec 2008 06:33:28 pm Toshio Kuratomi wrote:

 Also interesting, if you point your browser at:
   http://toshio.fedorapeople.org/u/

 You should see two other test files.  They're both
 (one-half)(enyei).html but one's encoded in utf-8 and the other in
 latin-1.

For what it's worth, Konqueror 3.5 displays the two files as 

(1/2)(n+tilde).html
(A+caret)(1/2)(A+tilde)(plusminus).html

It doesn't seem to have any trouble opening either of them.
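
The second display string is what one would expect if the UTF-8 bytes of the name were rendered as Latin-1. A minimal Python 3 sketch, assuming the file name is "½ñ.html" (reconstructed from the description above, not taken from the server):

```python
# Assumed file name, inferred from the "(one-half)(enyei)" description.
name = "\u00bd\u00f1.html"              # ½ñ.html

utf8_bytes = name.encode("utf-8")        # two bytes per non-ASCII character
latin1_bytes = name.encode("latin-1")    # one byte per non-ASCII character

# Interpreting the UTF-8 bytes as Latin-1 gives four characters instead of two:
# A-circumflex, one-half, A-tilde, plus-minus.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

The reverse mistake is not silent: decoding `latin1_bytes` as UTF-8 raises `UnicodeDecodeError`, because those bytes are not a valid UTF-8 sequence.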


-- 
Steven

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Ulrich Eckhardt

On Thursday 11 December 2008, Steve Holden wrote:
 Ulrich Eckhardt wrote:
  If readdir() returned Unicode text, people would start taking that for
  granted. If it returned bytes, just the same. Returning a completely
  unrelated type will give them enough hint that for this thing they have
  to rethink their assumptions. This runs along the lines of "In the face
  of ambiguity, refuse the temptation to guess.", as it makes guessing
  rather impossible.

 So you are suggesting this special object be used only to represent
 files to users? Now I understand.

Not only files, the same problem crops up when handling sys.argv and 
os.environ.

  I just don't see a case where using a separate path class would break
  things. Further, the special handling that is required would be made even
  clearer by using such a class.

 But it does have to be implemented ...

Well, it isn't really terribly difficult to do so; after all, it's just a 
container for either a byte string or a Unicode string plus some helper code to 
convert it to/from Unicode.
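
Such a container could look roughly like the following. This is only a sketch under assumptions: the class name `EnvString`, its method names and the UTF-8 default are hypothetical and are not part of Python or of any concrete proposal in this thread.

```python
class EnvString:
    """Hypothetical container for a name received from the OS."""

    def __init__(self, raw):
        if not isinstance(raw, (bytes, str)):
            raise TypeError("expected bytes or str")
        self._raw = raw

    def to_unicode(self, encoding="utf-8", errors="strict"):
        """Explicit conversion; the caller picks encoding and error policy."""
        if isinstance(self._raw, str):
            return self._raw
        return self._raw.decode(encoding, errors)

    def to_bytes(self, encoding="utf-8"):
        """Return the original bytes, or encode a name that arrived as text."""
        if isinstance(self._raw, bytes):
            return self._raw
        return self._raw.encode(encoding)

    def __repr__(self):
        # Deliberately opaque, so printing does not pretend to know the encoding.
        return "<env_string at 0x%x>" % id(self)


name = EnvString(b"\xff\xff")
print(name)                               # opaque repr, no guessing
print(name.to_unicode(errors="replace"))  # approximation, chosen explicitly
```

Because there is no implicit `__str__` conversion, code that wants text has to call `to_unicode()` and thereby decide what should happen to undecodable names.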

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Ulrich Eckhardt

On Thursday 11 December 2008, Adam Olsen wrote:
 The simplest solution there is to have Windows bytes APIs that return
 raw UTF-16 bytes (note that Windows file names are NOT guaranteed to be valid
 Unicode, despite that being much more likely than on Linux).

Actually, I'm not aware of this case. I only know that the OS refuses to mount 
media it can't decode, but that is on the OS-level. Can you give me a hint?

 The only real issue I see is that UTF-16 isn't an ASCII superset, so it
 won't print nicely.

True, but I personally couldn't care less. Actually, I would even prefer if 
printing a byte string always produced \x escaped byte values, that way it 
would at least be consistent. 
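
For illustration, this is the difference in Python 3 between the default bytes repr and an always-escaped form. The helper name `escape_all` is made up for this sketch.

```python
def escape_all(data):
    """Render every byte as a \\xNN escape, printable or not."""
    return "".join("\\x%02x" % b for b in data)


sample = b"foo\xff"
print(repr(sample))        # default repr mixes ASCII letters and \x escapes
print(escape_all(sample))  # every byte escaped the same way
```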

 In other words, bytes can be your special type.

That would actually be a lot of work to do, but I do agree that it would be 
one way to go. 

The problem though is that I have seen quite a few places in Python where such 
a byte string is passed as 'char*' and treated with the assumption that 
strlen() would yield a meaningful value there, so this calls at least for a 
distinct 'Py_Byte' type. Also, this still doesn't even remotely handle the 
problem that you do have two valid encodings on win32, even though the MBCS 
one could be called deprecated. People will try to interface to other 
libraries that use win32 CHAR strings and that will be much harder or even 
impossible. Further, and that is IMHO the worst part of it, things will fail 
too silently and programmers aren't encouraged to write portable code, but 
maybe I'm just too pessimistic.
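
The strlen() problem is easy to reproduce: UTF-16 encodes every ASCII character with a zero byte, so any code that looks for a terminating NUL stops after the first character. A small Python sketch, where `c_strlen` is a stand-in written for this example and not a real API:

```python
def c_strlen(buf):
    """Mimic C strlen(): count the bytes before the first NUL byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


utf16_name = "abc".encode("utf-16-le")   # b'a\x00b\x00c\x00'
utf8_name = "abc".encode("utf-8")        # b'abc'

print(len(utf16_name), c_strlen(utf16_name))
print(len(utf8_name), c_strlen(utf8_name))
```

The UTF-16 buffer is six bytes long, yet the strlen-style count sees only one, which is why an explicit length has to travel with the pointer.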

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Stephen J. Turnbull

Toshio Kuratomi writes:
  Adam Olsen wrote:
   On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org 
   wrote:
   Unfortunately, even programmers experienced in I18N like Martin, and
   those with intuition-that-has-the-force-of-law<wink> like Guido,
   express deliberate disbelief on this point.  They say that filesystem
   names and environment variable values are text, which is true from the
   semantic viewpoint but can't be fully supported by any implementation.
   
   With all the focus on backup tools and file managers I think we've
   lost perspective.  They're an important use case, but hardly the
   dominant one.

True.

   Please, as a user, if your app is creating new files, do NOT use
   bytes!  You have no excuse for creating garbage, and garbage doesn't
   help the user any.  Get the encoding right, use the unicode APIs,
   and don't pass the buck on to everything else.
   
  Uhmmm That's good advice but doesn't solve any problems :-(.

Exactly.  Furthermore, the problems *already exist*.  My current
locale is UTF-8 and all files dated since about 2002 have UTF-8 names,
*except* in my MIME-bodies garbage can, where only recently have I got
around to coercing my MUA to doing the right thing.  And of course
there are still legacy file names in EUC-JP, which I suppose I could
search for but since I only access a directory containing one once in
a pale blue moon, I'm not gonna bother.

It's just not reasonable to expect users or even sysadmins to go
around cleaning up legacy data.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread André Malo

* Adam Olsen wrote: 

 UTF-8 in percent encodings is becoming a de facto standard.  Otherwise
 the browser has to display the percent escapes in the address bar,
 rather than the intended text.

Duh! The address bar should contain the URL, which *is* the intended text. 
The escapes are there for a reason. If I pass some octets using percent 
escapes via the query string or request body, it's not text, not even 
intended. It's still a collection of octets. Translating them back (and 
forth when I press enter in the address bar) is a pretty ambiguous 
operation and therefore pretty wrong.

The de facto standard does not exist. There's a real one instead: RFC 2396.
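
The ambiguity can be shown with Python 3's `urllib.parse`. The escape `%F1` below is just an example value; which text it "means" depends entirely on the encoding the decoder assumes.

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%F1"

# Percent escapes map to octets without any assumption about encoding.
octets = unquote_to_bytes(escaped)

# Turning those octets into text requires choosing an encoding.
as_latin1 = unquote(escaped, encoding="latin-1")
as_utf8 = unquote(escaped, encoding="utf-8")   # errors="replace" is the default

print(octets, repr(as_latin1), repr(as_utf8))
```

Read as Latin-1 the octet is an n with tilde; read as UTF-8 it is invalid and becomes U+FFFD, the replacement character.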

nd

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Ulrich Eckhardt

On Friday 12 December 2008, Stephen J. Turnbull wrote:
 I gather that the BDFL's line on this thread of discussion is that
 forcing programmers to think about encodings every time they call out
 to the OS is unacceptable

That is exactly what is not necessary.

  # readdir() returns the proposed path type, which open() accepts directly
  for n in os.readdir('.'):
      f = open(n)
      if grep('foo', f):
          print('found foo!')

Now, if you actually wanted to output the filename, you could never do so 
reliably anyway, because even though it is supposed to be text, the encoding 
isn't known. So, an archiving program will probably do something like this:

   try:
       for n in os.readdir():
           b = n.encode('UTF-8')
           f = open(n)
           archive.write_file_header(b)
           archive.write_file(f)
   except UnicodeError:  # assumed; the exception type was left open
       print("oops, couldn't decode file '%s'" % n.unicode(error='replace'))

If you're writing a filemanager, you would store the path alongside an 
approximated Unicode representation.


 when most programs will work acceptably 
 almost all of the time with a rather naive approach.  This means that
 almost all Python programs will be technically broken for the
 foreseeable future, sorry, Ulrich.

Actually, they are already broken, only that few people notice it. :|

 And for the same pragmatic reasons, these functions are going to
 return strings (ie, Unicode), not bytes, I expect.  Sorry, Steve.

 What needs to be determined here is the best way to provide
 reliability to those who will go to the effort of asking for it if
 it's available.  I don't think "just return bytes" fits the bill for
 the reason above.

 What I would like to see is a type that is derived from string (so if
 you present it to an API expecting string, it is silently treated as
 string), but from which the original bytes can always be extracted on
 request.

I like that idea, this type would behave pretty much like the env_string I 
proposed. The main difference is that it does several implicit conversions 
where I personally would rather see explicit conversions. Other than that, 
I'm all for it.
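
A type along the lines Stephen describes might be sketched like this. `OSString` and its `raw` attribute are hypothetical names for this example, and UTF-8 with `errors="replace"` is only an assumed default.

```python
class OSString(str):
    """Hypothetical str subclass that remembers the bytes it was decoded from."""

    def __new__(cls, raw, encoding="utf-8", errors="replace"):
        self = super().__new__(cls, raw.decode(encoding, errors))
        self.raw = raw          # original bytes, always recoverable
        return self


name = OSString(b"caf\xe9.txt")      # Latin-1 bytes, not valid UTF-8
print(isinstance(name, str))         # usable wherever a str is expected
print(repr(str(name)), name.raw)     # approximation plus the exact bytes
```

The implicit conversions are the weak spot of this sketch: ordinary string methods such as `name.upper()` return a plain `str`, so the original bytes are lost as soon as the value is transformed.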

 If the original bytes cannot be sensibly decoded to a 
 string, then the string field in the object would either contain
 something that should normally cause an error in a string API, or some
 made-up string (presumably it would attempt to be a more or less
 faithful representation of the bytes) at the caller's option.
 Probably they'd also contain some metadata useful in guessing
 encodings (the read time locale in particular).

Well, I wouldn't provide an approximation. Considering the archiving software 
above, you would end up with a file named "undecodable file name" in an 
archive. For that kind of software, it would be fatal. But, and that is much 
more important than my preference, at least your approach would allow writing 
reliable software that properly handles such environment strings. Further, 
and that is where it differs from just returning bytes, it even makes it easy 
by using a distinct type.

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Adam Olsen

On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:
 * Adam Olsen wrote:

 UTF-8 in percent encodings is becoming a de facto standard.  Otherwise
 the browser has to display the percent escapes in the address bar,
 rather than the intended text.

 Duh! The address bar should contain the URL, which *is* the intended text.
 The escapes are there for a reason. If I pass some octets using percent
 escapes via the query string or request body, it's not text, not even
 intended. It's still a collection of octets. Translating them back (and
 forth when I press enter in the address bar) is a pretty ambiguous
 operation and therefore pretty wrong.

 The de facto standard does not exist. There's a real one instead: RFC 2396.

All the heaps of people using non-english wikipedia sites might
disagree with you.  There's only, what, a few *million* pages that
would be affected?

It'd be very interesting if someone at Google could provide some
statistics on URL encodings.


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Antoine Pitrou

Curt Hagenlocher curt at hagenlocher.org writes:

 
 
 On Thu, Dec 11, 2008 at 10:19 PM, Adam Olsen rhamph at gmail.com wrote:
 
 
 I doubt that UTF-16 is used very much (other than on windows).
 
 There's this other obscure platform called Java... ;)

Does it have a filesystem?



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Curt Hagenlocher

On Fri, Dec 12, 2008 at 5:06 AM, Antoine Pitrou solip...@pitrou.net wrote:

 Curt Hagenlocher curt at hagenlocher.org writes:

  There's this other obscure platform called Java... ;)

 Does it have a filesystem?

No, but it also has to interact with filesystems of possibly invalid
or indeterminate encodings.  What does java.io do?

--
Curt Hagenlocher
c...@hagenlocher.org

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Antoine Pitrou

Curt Hagenlocher curt at hagenlocher.org writes:
 
 No, but it also has to interact with filesystems of possibly invalid
 or indeterminate encodings.  What does java.io do?

My point was that Python doesn't have to interact with the Java IO libraries,
while it has to interact with the Unix and Windows IO APIs.




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Curt Hagenlocher

On Fri, Dec 12, 2008 at 6:19 AM, Antoine Pitrou solip...@pitrou.net wrote:
 Curt Hagenlocher curt at hagenlocher.org writes:

 No, but it also has to interact with filesystems of possibly invalid
 or indeterminate encodings.  What does java.io do?

 My point was that Python doesn't have to interact with the Java IO libraries,
 while it has to interact with the Unix and Windows IO APIs.

Of course.  But the Java IO libraries have to interact with the Unix
and Windows IO APIs as well. It might be interesting to know how they
handle similar situations.

--
Curt Hagenlocher
c...@hagenlocher.org

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Scott Dial

Curt Hagenlocher wrote:
 On Fri, Dec 12, 2008 at 6:19 AM, Antoine Pitrou solip...@pitrou.net wrote:
 Curt Hagenlocher curt at hagenlocher.org writes:
 No, but it also has to interact with filesystems of possibly invalid
 or indeterminate encodings.  What does java.io do?
 My point was that Python doesn't have to interact with the Java IO libraries,
 while it has to interact with the Unix and Windows IO APIs.
 
 Of course.  But the Java IO libraries have to interact with the Unix
 and Windows IO APIs as well. It might be interesting to know how they
 handle similar situations.

See the following email for a summary of existing practice (as of 2004):

http://www.mail-archive.com/unic...@unicode.org/msg27352.html

-Scott

-- 
Scott Dial
sc...@scottdial.com
scod...@cs.indiana.edu

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Toshio Kuratomi

Adam Olsen wrote:
 UTF-8 in percent encodings is becoming a de facto standard.  Otherwise
 the browser has to display the percent escapes in the address bar,
 rather than the intended text.
 
 IOW, inconsistent behaviour is a bug, but translating into UTF-8 is not. ;)
 
 
I think we should let this tangent drop because it's about bugs in
Firefox, not in Python :-)

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Lennart Regebro

On Fri, Dec 12, 2008 at 16:21, Scott Dial
scott+python-...@scottdial.com wrote:
 See the following email for a summary of existing practice (as of 2004):

 http://www.mail-archive.com/unic...@unicode.org/msg27352.html

Interesting. Quite a lot of them do just drop the undecodable
filenames. The Java solution of replacing the undecodable parts seems to
be a better idea at first glance, but what if you then end up with two
filenames that are the same? Possibly replacing with the ? character is a
good idea to signal that the file is there, but then fail to open it.
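
The collision is easy to construct. In this sketch the two byte strings are made-up Latin-1 file names, and UTF-8 with replacement characters stands in for whatever lossy decoding a platform might apply.

```python
# Two distinct on-disk names (hypothetical), both invalid as UTF-8.
names = [b"report-\xe9.txt", b"report-\xe8.txt"]

shown = [n.decode("utf-8", errors="replace") for n in names]
print(shown)

# A mapping keyed by the displayed name can no longer address both files.
by_display_name = {n.decode("utf-8", errors="replace"): n for n in names}
print(len(names), len(by_display_name))
```

Both names decode to the same string containing U+FFFD, so the display form cannot be used to find the file again.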

-- 
Lennart Regebro: Zope and Plone consulting.
http://www.colliberty.com/
+33 661 58 14 64

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread glyph


On 02:23 pm, c...@hagenlocher.org wrote:
 On Fri, Dec 12, 2008 at 6:19 AM, Antoine Pitrou solip...@pitrou.net wrote:
  Curt Hagenlocher curt at hagenlocher.org writes:
   No, but it also has to interact with filesystems of possibly invalid
   or indeterminate encodings.  What does java.io do?

  My point was that Python doesn't have to interact with the Java IO libraries,
  while it has to interact with the Unix and Windows IO APIs.

 Of course.  But the Java IO libraries have to interact with the Unix
 and Windows IO APIs as well. It might be interesting to know how they
 handle similar situations.


Apparently Java has the facilities to do the right thing, but actually 
it's just broken.


My locale says UTF-8.  However, if I create a non-decodable file with 
Python (2), there are three ways I can tell Java to open it: I can ask 
for it with a string (that won't work, because no valid UTF-8 string 
maps to an undecodable string, pretty much by definition).  I can list 
the directory that it's in (presuming that *that's* a directory) and get 
a java.io.File, which could be retaining all the interesting 
information, or I can use a URI, which is a string that resolves to 
octets before it resolves to characters again.


However, it looks like Java screws up in every case.

Here's a transcript from the ever-helpful jython:

gl...@nhuvasarim:~/tmp$ python
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> file("\xff\xff", "wb").write("lolz\n")

gl...@nhuvasarim:~/tmp$ jython
Jython 2.2.1 on java1.6.0_07
Type "copyright", "credits" or "license" for more information.
>>> from java.io import File
>>> fileList = File(".").listFiles()
>>> fileList
array(java.io.File,[./
>>> fileList[0].__class__
jclass java.io.File 1
>>> from java.io import FileReader
>>> FileReader(fileList[0])
Traceback (innermost last):
  File "<console>", line 1, in ?
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileReader.<init>(FileReader.java:55)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

java.io.FileNotFoundException: java.io.FileNotFoundException: ./ÿFDÿFD (No such file or directory)
>>> from java.net import URI
>>> u = URI("file:///home/glyph/tmp/%ff%ff")
>>> FileReader(File(u))
Traceback (innermost last):
  File "<console>", line 1, in ?
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileReader.<init>(FileReader.java:55)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

java.io.FileNotFoundException: java.io.FileNotFoundException: /home/glyph/tmp/ÿFDÿFD (No such file or directory)



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread André Malo

* Adam Olsen wrote:

 On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:
  * Adam Olsen wrote:
  UTF-8 in percent encodings is becoming a de facto standard.  Otherwise
  the browser has to display the percent escapes in the address bar,
  rather than the intended text.
 
  Duh! The address bar should contain the URL, which *is* the intended
  text. The escapes are there for a reason. If I pass some octets using
  percent escapes via the query string or request body, it's not text,
  not even intended. It's still a collection of octets. Translating them
  back (and forth when I press enter in the address bar) is a pretty
  ambiguous operation and therefore pretty wrong.
 
  The de facto standard does not exist. There's a real one instead: RFC
  2396.

 All the heaps of people using non-english wikipedia sites might
 disagree with you.  There's only, what, a few *million* pages that
 would be affected?

I'm not sure what you're trying to pull here. Is that supposed to be an 
argument? There's no page affected at all. It's a browser UI issue, not a 
page issue.

And even if it were interesting at all, how the URL escapes are displayed in 
the address bar, those millions of people would favour KOI8-R or Big 5 
over UTF-8 if you asked them.

Which leads to the exact point: The browser cannot know, nor should it even. 
It's opaque. The only entity which needs to understand the encoding of URL 
percent escapes in query or request body is the *server* selecting the 
resource.

But I'm sure I'm not telling you any news here.

nd
-- 
Gates's behaviour had proven to me that I need not count on him and his
two companions -- Karl May, Winnetou III

Something new in the West: http://pub.perlig.de/books.html#apache2

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread Adam Olsen

On Fri, Dec 12, 2008 at 9:47 PM, André Malo n...@perlig.de wrote:
 * Adam Olsen wrote:
 On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:
  * Adam Olsen wrote:
  UTF-8 in percent encodings is becoming a de facto standard.  Otherwise
  the browser has to display the percent escapes in the address bar,
  rather than the intended text.
 
  Duh! The address bar should contain the URL, which *is* the intended
  text. The escapes are there for a reason. If I pass some octets using
  percent escapes via the query string or request body, it's not text,
  not even intended. It's still a collection of octets. Translating them
  back (and forth when I press enter in the address bar) is a pretty
  ambiguous operation and therefore pretty wrong.
 
  The de facto standard does not exist. There's a real one instead: RFC
  2396.

 All the heaps of people using non-english wikipedia sites might
 disagree with you.  There's only, what, a few *million* pages that
 would be affected?

 I'm not sure what you're trying to pull here. Is that supposed to be an
 argument? There's no page affected at all. It's a browser UI issue, not a
 page issue.

 And even if it were interesting at all, how the URL escapes are displayed in
 the address bar, those millions of people would favour KOI8-R or Big 5
 over UTF-8 if you asked them.

 Which leads to the exact point: The browser cannot know, nor should it even.
 It's opaque. The only entity which needs to understand the encoding of URL
 percent escapes in query or request body is the *server* selecting the
 resource.

 But I'm sure I'm not telling you any news here.

You're arguing that text should be an opaque entity..

We've wasted enough of everybody's time on this already, I'm not going
to continue on this thread.  Send me a private email if you think it's
really important.


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-12 Thread André Malo

* Adam Olsen wrote:

 On Fri, Dec 12, 2008 at 9:47 PM, André Malo n...@perlig.de wrote:
  * Adam Olsen wrote:
  On Fri, Dec 12, 2008 at 2:11 AM, André Malo n...@perlig.de wrote:
   * Adam Olsen wrote:
   UTF-8 in percent encodings is becoming a de facto standard. 
   Otherwise the browser has to display the percent escapes in the
   address bar, rather than the intended text.
  
   Duh! The address bar should contain the URL, which *is* the intended
   text. The escapes are there for a reason. If I pass some octets
   using percent escapes via the query string or request body, it's not
   text, not even intended. It's still a collection of octets.
   Translating them back (and forth when I press enter in the address
   bar) is a pretty ambiguous operation and therefore pretty wrong.
  
   The de facto standard does not exist. There's a real one instead: RFC
   2396.
 
  All the heaps of people using non-english wikipedia sites might
  disagree with you.  There's only, what, a few *million* pages that
  would be affected?
 
  I'm not sure what you're trying to pull here. Is that supposed to be an
  argument? There's no page affected at all. It's a browser UI issue, not
  a page issue.
 
  And even if it were interesting at all, how the URL escapes are
  displayed in the address bar, those millions of people would favour
  KOI8-R or Big 5 over UTF-8 if you asked them.
 
  Which leads to the exact point: The browser cannot know, nor should it
  even. It's opaque. The only entity which needs to understand the
  encoding of URL percent escapes in query or request body is the
  *server* selecting the resource.
 
  But I'm sure I'm not telling you any news here.

 You're arguing that text should be an opaque entity..

No, actually I'm not. I'm arguing that escapes are opaque.

 We've wasted enough of everybody's time on this already, I'm not going
 to continue on this thread. 

Agreed.

nd
-- 
That reminds me, why is there actually no i in Unicode with a
little heart as its dot? That would be sooo sweet!

 -- Björn Höhrmann in darw

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Ulrich Eckhardt

On Wednesday 10 December 2008, Adam Olsen wrote:
 On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt

 [EMAIL PROTECTED] wrote:
  On Tuesday 09 December 2008, Adam Olsen wrote:
  The only thing separating this from a bikeshed discussion is that a
  bikeshed has many equally good solutions, while we have no good
  solutions.  Instead we're trying to find the least-bad one.  The
  unicode/bytes separation is pretty close to that.  Adding a warning
  gets even closer.  Adding magic makes it worse.
 
  Well, I see two cases:
  1. Converting from an uncertain representation to a known one.
  2. Converting from a known representation to a known one.

 Not quite:
 1. Using a garbage file name locally (within a single process, not
 talking to any libs)
 2. Using a unicode filename everywhere (libs, saved to config files,
 displayed to the user, etc.)

I think there is some misunderstanding. I was referring to conversions and 
whether it is good to perform them implicitly. For that, I saw the above two 
cases.

 On linux the bytes/unicode separation is perfect for this.  You decide
 which approach you're using and use it consistently.  If you mess up
 (mixing bytes and unicode) you'll consistently get an error.

 We currently don't follow this model on windows, so a garbage file
 name gets passed around as if it was unicode, but fails when passed to
 a lib, saved to a config file, is displayed to a user, etc.

I'm not sure I agree with this. Facts I know are:
1. On POSIX systems, there is no reliable encoding for filenames while the 
system APIs use char/byte strings.
2. On MS Windows, the encoding for filenames is Unicode/UTF-16.

Returning Unicode strings from readdir() is wrong because it can't handle the 
case 1 above. Returning byte strings is wrong because it can't handle case 2 
above because it gives you useless roundtrips from UTF-16 to either UTF-8 or, 
worst case, to the locale-dependent MBCS. Returning something different 
depending on the system is also broken because that would make Python code 
that uses this function and assumes a certain type unportable.
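
To make case 1 concrete, here is a small sketch (not part of the original mail) of the same file name bytes read under two different assumed encodings. The byte string and both encodings are illustrative assumptions.

```python
# The raw bytes a POSIX readdir() might hand back for one file name.
# Illustrative assumption: the file was created under a Latin-1 locale.
raw_name = b"caf\xe9.txt"

# Decoding with the encoding the creator used recovers the intended text.
as_latin1 = raw_name.decode("latin-1")

# Decoding with a different expected encoding fails outright, because
# 0xE9 followed by "." is not a valid UTF-8 sequence.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, utf8_ok)
```

Nothing in the bytes themselves says which of the two readings was intended, which is the uncertainty described above.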

Note that this doesn't get much better if you provide a separate readdirb() 
API or one that simply returns a byte string or Unicode string depending on 
its argument. It just shifts the brokenness from readdir() to the code that 
uses it, unless this code makes a distinction between the target systems. 
Since way too many programmers are not aware of the problem, they will not 
handle these systems differently, so code will become non-portable.

What I'd like some feedback on is the approach of returning a distinct type 
(neither a byte string nor a Unicode string) from readdir(). To use it, a 
programmer has to convert it explicitly; otherwise, printing it, for example, 
will just produce <env_string at 0x01234567>. This immediately confronts every 
programmer with the issue of unknown encodings, and they have to make the 
application-specific decision whether an approximation of the filename, an 
exception, or ignoring the file is the right choice. It also puts the options 
for doing this conversion in a single class, which I personally find much 
better than providing overloads for hundreds of functions.
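
A minimal sketch of what such a type might look like, purely for illustration. The class name `NativeName` and its methods `to_text()` and `to_native()` are hypothetical and do not exist in Python; the sketch only covers the POSIX case, where the native form is bytes.

```python
class NativeName:
    """Hypothetical opaque wrapper around a file name as the OS delivered it."""

    def __init__(self, raw):
        # In this sketch the native form is bytes (POSIX); a Windows
        # variant could hold the native UTF-16 data instead.
        self._raw = raw

    def __repr__(self):
        # Deliberately uninformative, so that printing the object does not
        # silently pick an encoding on the programmer's behalf.
        return "<NativeName at 0x%x>" % id(self)

    def to_text(self, encoding, errors="strict"):
        """Explicit conversion: the caller picks encoding and error policy."""
        return self._raw.decode(encoding, errors)

    def to_native(self):
        """Return the untouched representation for passing back to the OS."""
        return self._raw


name = NativeName(b"caf\xe9.txt")
approx = name.to_text("utf-8", errors="replace")  # lossy approximation
exact = name.to_text("latin-1")                   # caller knows the encoding
```

The three application-specific choices map onto the `errors` argument ("replace" for an approximation, "strict" for an exception) plus the caller's own decision to skip the entry.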


Sorry for ranting, but I'm a bit confused and desperate, because either I'm 
unable to explain what I mean or I'm really not understanding something that 
everybody else here seems to agree upon. I just know that using a distinct 
path type has helped me in C++ in the past, and I don't see why it shouldn't 
in Python.

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Steve Holden

Ulrich Eckhardt wrote:
 On Wednesday 10 December 2008, Adam Olsen wrote:
 On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt

 [EMAIL PROTECTED] wrote:
 On Tuesday 09 December 2008, Adam Olsen wrote:
 The only thing separating this from a bikeshed discussion is that a
 bikeshed has many equally good solutions, while we have no good
 solutions.  Instead we're trying to find the least-bad one.  The
 unicode/bytes separation is pretty close to that.  Adding a warning
 gets even closer.  Adding magic makes it worse.
 Well, I see two cases:
 1. Converting from an uncertain representation to a known one.
 2. Converting from a known representation to a known one.
 Not quite:
 1. Using a garbage file name locally (within a single process, not
 talking to any libs)
 2. Using a unicode filename everywhere (libs, saved to config files,
 displayed to the user, etc.)
 
 I think there is some misunderstanding. I was referring to conversions and 
 whether it is good to perform them implicitly. For that, I saw the above two 
 cases.
 
 On linux the bytes/unicode separation is perfect for this.  You decide
 which approach you're using and use it consistently.  If you mess up
 (mixing bytes and unicode) you'll consistently get an error.

 We currently don't follow this model on windows, so a garbage file
 name gets passed around as if it was unicode, but fails when passed to
 a lib, saved to a config file, is displayed to a user, etc.
 
 I'm not sure I agree with this. Facts I know are:
 1. On POSIX systems, there is no reliable encoding for filenames while the 
 system APIs use char/byte strings.
 2. On MS Windows, the encoding for filenames is Unicode/UTF-16.
 
 Returning Unicode strings from readdir() is wrong because it can't handle the 
 case 1 above. Returning byte strings is wrong because it can't handle case 2 
 above because it gives you useless roundtrips from UTF-16 to either UTF-8 or, 
 worst case, to the locale-dependent MBCS. Returning something different 
 depending on the system is also broken because that would make Python code 
 that uses this function and assumes a certain type unportable.
 
 Note that this doesn't get much better if you provide a separate readdirb() 
 API or one that simply returns a byte string or Unicode string depending on 
 its argument. It just shifts the brokenness from readdir() to the code that 
 uses it, unless this code makes a distinction between the target systems. 
 Since way too many programmers are not aware of the problem, they will not 
 handle these systems differently, so code will become non-portable.
 
 What I'd just like some feedback on is the approach to return a distinct type 
 (neither a byte string nor a Unicode string) from readdir(). In order to use 
 this, a programmer will have to convert it explicitly, otherwise e.g. 
 printing it will just produce <env_string at 0x01234567>. This will 
 immediately bump each programmer with their heads on the issue of unknown 
 encodings and they will have to make the application-specific choice whether 
 an approximation of the filename, an exception or ignoring the file is the 
 right choice. Also, it presents the options for doing this conversion in a 
 single class, which I personally find much better than providing overloads 
 for hundreds of functions.
 
 
 Sorry for ranting, but I'm a bit confused and desperate, because either I'm 
 unable to explain what I mean or I'm really not understanding something that 
 everybody else here seems to agree upon. I just know that using a distinct 
 path type has helped me in C++ in the past, and I don't see why it shouldn't 
 in Python.
 
Seems to me this just threatens to add to the confusion.

If you know what your filesystem produces, you can take the appropriate
action to convert it into a type that makes sense to the user. If you
don't, then at least if you have the string in its bytes form you can
re-present it to the filesystem to manipulate the file. What are we
supposed to do with the special type?

regards
 Steve
-- 
Steve Holden+1 571 484 6266   +1 800 494 3119
Holden Web LLC  http://www.holdenweb.com/


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Ulrich Eckhardt

On Thursday 11 December 2008, Steve Holden wrote:
 Ulrich Eckhardt wrote:
  What I'd just like some feedback on is the approach to return a distinct
  type (neither a byte string nor a Unicode string) from readdir(). In
  order to use this, a programmer will have to convert it explicitly,
  otherwise e.g. printing it will just produce <env_string at 0x01234567>.
  This will immediately bump each programmer with their heads on the issue
  of unknown encodings and they will have to make the application-specific
  choice whether an approximation of the filename, an exception or ignoring
  the file is the right choice. Also, it presents the options for doing
  this conversion in a single class, which I personally find much better
  than providing overloads for hundreds of functions.
[...]

 Seems to me this just threatens to add to the confusion.

 If you know what your filesystem produces, you can take the appropriate
 action to convert it into a type that makes sense to the user. If you
 don't, then at least if you have the string in its bytes form you can
                                                    ^^^^^

There are operating systems that don't use bytes to represent a file path, 
namely all the MS Windows variants. Even worse, when you use a byte string 
there, it typically means that you want to use the obsolete encoding that is 
based on codepages.

Why can we not preserve the representation of a path as it is? Why do we 
_have_ to convert it to anything at all, without even knowing if this 
conversion is needed? I just want to do something to a file's content, why 
does its path have to be converted to something and then be converted back in 
order for the system to digest it?

 re-present it to the filesystem to manipulate the file. What are we
 supposed to do with the special type?

You receive from readdir() and pass it to stat(), simple as that. No 
conversions from the native representation needed. If you need a textual 
representation, then you have to convert it and you have to do so explicitly 
according to whatever logic your application requires.

If readdir() returned Unicode text, people would start taking that for 
granted. If it returned bytes, just the same. Returning a completely 
unrelated type will give them enough hint that for this thing they have to 
rethink their assumptions. This runs along the lines of "In the face of 
ambiguity, refuse the temptation to guess", as it makes guessing rather 
impossible.

I just don't see a case where using a separate path class would break things. 
Further, the special handling that is required would be made even clearer by 
using such a class.

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Isaac Morland


On Thu, 11 Dec 2008, Ulrich Eckhardt wrote:

 On Thursday 11 December 2008, Steve Holden wrote:
  Ulrich Eckhardt wrote:
  Seems to me this just threatens to add to the confusion.

  If you know what your filesystem produces, you can take the appropriate
  action to convert it into a type that makes sense to the user. If you
  don't, then at least if you have the string in its bytes form you can
                                                     ^^^^^

 There are operating systems that don't use bytes to represent a file path,
 namely all the MS Windows variants. Even worse, when you use a byte string
 there, it typically means that you want to use the obsolete encoding that is
 based on codepages.

 Why can we not preserve the representation of a path as it is? Why do we
 _have_ to convert it to anything at all, without even knowing if this
 conversion is needed? I just want to do something to a file's content, why
 does its path have to be converted to something and then be converted back in
 order for the system to digest it?

  re-present it to the filesystem to manipulate the file. What are we
  supposed to do with the special type?

 You receive from readdir() and pass it to stat(), simple as that. No
 conversions from the native representation needed. If you need a textual
 representation, then you have to convert it and you have to do so explicitly
 according to whatever logic your application requires.


Not only would this address the issue with the local filesystem, it would 
also provide a principled way to deal with remote filesystems.  For 
example, an FTP interface library for Python could use this type to 
return paths of the sort actually supported by the raw FTP protocol.


Thinking of "the filesystem" is actually a misconception; always 
referring to "a filesystem" opens up all sorts of possibilities.  There is 
a lot of coding to do to allow this, but allowing programs to work with 
paths and files in the local filesystem, remote filesystems, and 
filesystems constructed from others (e.g., by expanding symlinks, changing 
the root similar to chroot, or encoding/unencoding pathnames) would open 
up lots of possibilities, including better test environments.


This is an interesting case of separating byte strings from character 
strings.  As long as the two are conflated, everything appears simple. 
But when they are separated, not only are there two types where before 
there was only one, it turns out that which type is correct in some 
circumstances depends on the platform.  Also, many objects which are byte 
strings at the protocol level are usually or always meant to be character 
strings of some sort, but how to translate them simply cannot be nailed 
down once and for all.


Isaac Morland   CSCF Web Guru
DC 2554C, x36650WWW Software Specialist

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Adam Olsen

On Thu, Dec 11, 2008 at 6:41 AM, Ulrich Eckhardt
eckha...@satorlaser.com wrote:
 On Thursday 11 December 2008, Steve Holden wrote:
 re-present it to the filesystem to manipulate the file. What are we
 supposed to do with the special type?

 You receive from readdir() and pass it to stat(), simple as that. No
 conversions from the native representation needed. If you need a textual
 representation, then you have to convert it and you have to do so explicitly
 according to whatever logic your application requires.

The simplest solution there is to have windows bytes APIs that return
raw UTF-16 bytes (note that Windows file names are NOT guaranteed to be valid
Unicode, despite that being much more likely than on Linux).  The only real
issue I see is that UTF-16 isn't an ASCII superset, so it won't print
nicely.

In other words, bytes can be your special type.
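
As a rough illustration of bytes acting as the pass-through type, the sketch below lists a directory as bytes and hands the names straight back to os.stat() without decoding them. It shows only the general round trip using functions from today's os module; it does not implement the raw UTF-16 idea for Windows, and the temporary directory and file name are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so there is something to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # Passing a bytes path makes os.listdir() return bytes names.
    bdir = os.fsencode(tmp)
    names = os.listdir(bdir)

    # The bytes names can be joined and passed straight back to the OS
    # without ever being decoded to text.
    sizes = [os.stat(os.path.join(bdir, n)).st_size for n in names]

print(names, sizes)
```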


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Stephen J. Turnbull

Steve Holden writes:
  Ulrich Eckhardt writes:

   What I'd just like some feedback on is the approach to return a
   distinct type (neither a byte string nor a Unicode string) from
   readdir().

This is presumably unacceptable on the grounds that it will break
existing code that does something more or less useful more or less
some of the time. <wink>

  If you know what your filesystem produces, you can take the appropriate
  action to convert it into a type that makes sense to the user.

Unfortunately, even programmers experienced in I18N like Martin, and
those with intuition-that-has-the-force-of-law <wink> like Guido,
express deliberate disbelief on this point.  They say that filesystem
names and environment variable values are text, which is true from the
semantic viewpoint but can't be fully supported by any implementation.

The implementation issue is why you want bytes, but I don't think it
is going to overcome the tide of (semantically-oriented) pragmatism.

  If you don't, then at least if you have the string in its bytes
  form you can re-present it to the filesystem to manipulate the
  file. What are we supposed to do with the special type?

Trivially convert it back to bytes and re-present it to the
filesystem, of course.

I gather that the BDFL's line on this thread of discussion is that
forcing programmers to think about encodings every time they call out
to the OS is unacceptable when most programs will work acceptably
almost all of the time with a rather naive approach.  This means that
almost all Python programs will be technically broken for the
foreseeable future, sorry, Ulrich.

And for the same pragmatic reasons, these functions are going to
return strings (ie, Unicode), not bytes, I expect.  Sorry, Steve.

What needs to be determined here is the best way to provide
reliability to those who will go to the effort of asking for it if
it's available.  I don't think "just return bytes" fits the bill for
the reason above.

What I would like to see is a type that is derived from string (so if
you present it to an API expecting string, it is silently treated as
string), but from which the original bytes can always be extracted on
request.  If the original bytes cannot be sensibly decoded to a
string, then the string field in the object would either contain
something that should normally cause an error in a string API, or some
made-up string (presumably it would attempt to be a more or less
faithful representation of the bytes) at the caller's option.
Probably they'd also contain some metadata useful in guessing
encodings (the read time locale in particular).

These objects probably shouldn't support string-like operations in a
general way (ie, maintaining both the string representation and the
bytes correctly).  Rather, using proper string operations on them
would use the string content and produce strings.  People who really
want to handle mixed-encoding pathnames and the like would have to
keep collections of these objects and handle them in an ad-hoc way.
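
A minimal sketch of the kind of object described above, under stated assumptions: the class name `OSString` and its `raw` and `encoding` attributes are hypothetical, and the default of UTF-8 with "replace" stands in for the "made-up string" option.

```python
class OSString(str):
    """Hypothetical str subclass that remembers the bytes it came from."""

    def __new__(cls, raw, encoding="utf-8", errors="replace"):
        # The text is a best-effort rendering; the raw bytes stay authoritative.
        self = super().__new__(cls, raw.decode(encoding, errors))
        self.raw = raw
        self.encoding = encoding  # metadata that may help later guessing
        return self


name = OSString(b"caf\xe9.txt")
shown = name.upper()   # ordinary str operation, yields a plain str
original = name.raw    # exact bytes to re-present to the filesystem
```

Because the built-in str methods return plain str objects, the result of `upper()` no longer carries the bytes, which matches the suggestion that string operations use the string content and produce strings.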

Unfortunately, implementing this is way beyond my skills and time
availability.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Adam Olsen

On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org wrote:
 Unfortunately, even programmers experienced in I18N like Martin, and
 those with intuition-that-has-the-force-of-law <wink> like Guido,
 express deliberate disbelief on this point.  They say that filesystem
 names and environment variable values are text, which is true from the
 semantic viewpoint but can't be fully supported by any implementation.

With all the focus on backup tools and file managers I think we've
lost perspective.  They're an important use case, but hardly the
dominant one.

Please, as a user, if your app is creating new files, do NOT use
bytes!  You have no excuse for creating garbage, and garbage doesn't
help the user any.  Get the encoding right, use the unicode APIs,
and don't pass the buck on to everything else.

The fact that unicode is easier is a bonus for doing the right thing.

-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Toshio Kuratomi

Adam Olsen wrote:
 On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
 Unfortunately, even programmers experienced in I18N like Martin, and
 those with intuition-that-has-the-force-of-law <wink> like Guido,
 express deliberate disbelief on this point.  They say that filesystem
 names and environment variable values are text, which is true from the
 semantic viewpoint but can't be fully supported by any implementation.
 
 With all the focus on backup tools and file managers I think we've
 lost perspective.  They're an important use case, but hardly the
 dominant one.
 
 Please, as a user, if your app is creating new files, do NOT use
 bytes!  You have no excuse for creating garbage, and garbage doesn't
 help the user any.  Get the encoding right, use the unicode APIs,
 and don't pass the buck on to everything else.
 
Uhmmm... That's good advice but doesn't solve any problems :-(.  No
matter what I create, the filenames will be bytes when the next person
reads them in.  If my locale is shift-jis and the person I'm sharing the
file with uses utf-8 things won't work.  Even if my locale is utf-8
(since I come from a European nation) and their locale is utf-16
(because they're from an Asian nation) the Unicode API won't work.

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Adam Olsen

On Thu, Dec 11, 2008 at 10:41 PM, Toshio Kuratomi a.bad...@gmail.com wrote:
 Adam Olsen wrote:
 On Thu, Dec 11, 2008 at 6:55 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
 Unfortunately, even programmers experienced in I18N like Martin, and
 those with intuition-that-has-the-force-of-law <wink> like Guido,
 express deliberate disbelief on this point.  They say that filesystem
 names and environment variable values are text, which is true from the
 semantic viewpoint but can't be fully supported by any implementation.

 With all the focus on backup tools and file managers I think we've
 lost perspective.  They're an important use case, but hardly the
 dominant one.

 Please, as a user, if your app is creating new files, do NOT use
 bytes!  You have no excuse for creating garbage, and garbage doesn't
 help the user any.  Get the encoding right, use the unicode APIs,
 and don't pass the buck on to everything else.

 Uhmmm That's good advice but doesn't solve any problems :-(.  No
 matter what I create, the filenames will be bytes when the next person
 reads them in.  If my locale is shift-jis and the person I'm sharing the
 file with uses utf-8 things won't work.  Even if my locale is utf-8
 (since I come from a European nation) and their locale is utf-16
 (because they're from an Asian nation) the Unicode API won't work.

So you'll open up the dir and find this collection:

??.txt
.png
???.html
.html
???.png
??.txt
??.txt
??.txt

A half-broken setup is still a broken setup.  Eventually you have to
tell people to stop screwing around and pick one encoding.

I doubt that UTF-16 is used very much (other than on windows).  I
haven't found any statistics on what distros use, but did find this
one of the web itself:
http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

I can't wait for next year's statistics.

-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Adam Olsen

On Thu, Dec 11, 2008 at 11:25 PM, Curt Hagenlocher c...@hagenlocher.org wrote:
 On Thu, Dec 11, 2008 at 10:19 PM, Adam Olsen rha...@gmail.com wrote:

 I doubt that UTF-16 is used very much (other than on windows).

 There's this other obscure platform called Java... ;)

Sorry, I should have said for interchange. :)

(CPython doesn't use UTF-8 internally either.  It uses UTF-16 or UTF-32.)


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Toshio Kuratomi

Adam Olsen wrote:

 A half-broken setup is still a broken setup.  Eventually you have to
 tell people to stop screwing around and pick one encoding.
 
But it's not a broken setup.  It's the way the world is because people
share things with each other.

 I doubt that UTF-16 is used very much (other than on windows).  I
 haven't found any statistics on what distros use, but did find this
 one of the web itself:
 http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html
 
UTF-16 is popular in Asian locales for the same reason that shift-jis and
big-5 are hanging in there.  utf-8 takes many more bytes to encode Asian
Unicode characters than utf-16.
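
The size difference is easy to check. The sample text below is an illustrative assumption (three CJK characters from the Basic Multilingual Plane), and the BOM-less "utf-16-le" codec is used so that only the characters themselves are counted.

```python
# Illustrative sample text: the three characters of "Japanese language".
text = "\u65e5\u672c\u8a9e"

utf8_size = len(text.encode("utf-8"))       # 3 bytes per character here
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per character, no BOM

print(utf8_size, utf16_size)
```

For ASCII-only text the ratio is reversed: UTF-8 needs one byte per character where UTF-16 needs two.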

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-11 Thread Toshio Kuratomi

Adam Olsen wrote:
 As a data point, firefox (when pointed at my home dir) DOES skip over
 garbage files.
 
 
That's not true.  However, it looks like Firefox is actually broken.
Take a look at this screenshot:
  firefox.png

That shows a directory with a folder that's not decodable in my utf-8
locale.  What's interesting to note is that I actually have two
nondecodable folders there but only one of them showed up.  So firefox
is inconsistent with its treatment, rendering some non-decodable files
and ignoring others.

Also interesting, if you point your browser at:
  http://toshio.fedorapeople.org/u/

You should see two other test files.  They're both
(one-half)(enyei).html but one's encoded in utf-8 and the other in
latin-1.  Firefox has some bugs in it related to this.  For instance, if
you mouseover the two links you'll see that firefox displays the same
symbolic names for each of the files (even though they're in two
different encodings).  Sometimes firefox is able to load both files and
sometimes it only loads one of them.  Firefox seems to be translating
the characters from ASCII percent encoding of bytes into their unicode
symbols and back to utf-8 in some circumstances related to whether it
has the pages in its cache or not.  In this case, it should be leaving
things as percent encoded bytes as it's the only way that apache is
going to know what to retrieve.
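
To illustrate why the percent-encoded bytes matter, here is a small sketch. It assumes the test files are named with a one-half sign followed by an n-with-tilde, which is a guess based on the description above; only the encodings differ between the two URLs.

```python
from urllib.parse import quote, unquote_to_bytes

# Assumed file name for illustration: one-half sign, n-with-tilde, ".html".
name = "\u00bd\u00f1.html"

utf8_url = quote(name.encode("utf-8"))      # '%C2%BD%C3%B1.html'
latin1_url = quote(name.encode("latin-1"))  # '%BD%F1.html'

# Turning the percent-escapes back into bytes is lossless; those bytes are
# what identify the file on the server.
utf8_bytes = unquote_to_bytes(utf8_url)
latin1_bytes = unquote_to_bytes(latin1_url)

print(utf8_url, latin1_url)
```

Both URLs display as the same characters once decoded to text, yet they name different byte sequences, so re-encoding the displayed text as UTF-8 would request the wrong file for the Latin-1 one.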

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-10 Thread Ulrich Eckhardt

On Tuesday 09 December 2008, Adam Olsen wrote:
 On Tue, Dec 9, 2008 at 11:31 AM, Ulrich Eckhardt

 [EMAIL PROTECTED] wrote:
  On Monday 08 December 2008, Adam Olsen wrote:
  At this point someone suggests we have a type that can store an
  arbitrary mix of unicode and bytes, so the undecodable portions stay
  in their original form. :P
 
  Well, not an arbitrary mix, but a type that just stores whatever comes
  from the system without further specifying it as either bytes or Unicode:
 
  * If you want a string for displaying it, you first have to extract a
  string from that thing and there you optionally specify the encoding and
  error behaviour.
  * If you want to append a string to it, it is automatically encoded in
  the default encoding, which obviously can fail.

 So the 2.x str, but with a more interesting default encoding than
 ASCII.  It'll work fine on the developer's system, but one day a user
 will present it with strange input, and boom.

If the system's representation of filenames can not represent a Unicode 
codepoint that the user entered, trying to open such a file must fail. If it 
can be represented, for convenience I would allow an implicit conversion.

  for i in readdir():
      copy(i, i + ".backup")
  ...

 You have to be pessimistic here.  The default operations should either
 always work or never work.  Using unicode internally and skipping
 garbage input means the operations always work.  Using a bytes API
 means mixing with unicode never works, unless the programmer
 explicitly converts, in which case the onus is on them to use proper
 error handling.

So, if I understand you correctly, you would prefer an explicit conversion to 
the system's representation:

  for i in readdir():
      copy(i, i + path(".backup"))
  ...

 The only thing separating this from a bikeshed discussion is that a
 bikeshed has many equally good solutions, while we have no good
 solutions.  Instead we're trying to find the least-bad one.  The
 unicode/bytes separation is pretty close to that.  Adding a warning
 gets even closer.  Adding magic makes it worse.

Well, I see two cases:
1. Converting from an uncertain representation to a known one.
2. Converting from a known representation to a known one.

The uncertain one is the one used by the filesystem or environment. The known 
representations are the expected(!) encoding for filesystem and environment 
and the internal text in Unicode. For case 1, I would require an explicit 
conversion to make the programmer really aware of the fact that it can fail. 
For the second case, I would allow an implicit conversion even though it can 
fail. Anyhow, that is a matter of taste, and I can actually live with your 
point of view.

However, one question still remains: What about the approach in general, i.e. 
that these texts with an uncertain representation are handled as a separate 
type? I find this much more appealing than duplicating APIs like readdir() 
using either overloading on the arguments or a separate readdirb().

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-10 Thread Adam Olsen

On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt
[EMAIL PROTECTED] wrote:
 On Tuesday 09 December 2008, Adam Olsen wrote:
 The only thing separating this from a bikeshed discussion is that a
 bikeshed has many equally good solutions, while we have no good
 solutions.  Instead we're trying to find the least-bad one.  The
 unicode/bytes separation is pretty close to that.  Adding a warning
 gets even closer.  Adding magic makes it worse.

 Well, I see two cases:
 1. Converting from an uncertain representation to a known one.
 2. Converting from a known representation to a known one.

Not quite:
1. Using a garbage file name locally (within a single process, not
talking to any libs)
2. Using a unicode filename everywhere (libs, saved to config files,
displayed to the user, etc.)

Note that if you have a GUI doing the former, all you technically need
is a placeholder like "undecodable filename".  You might try to
extract some ASCII out of it, but that's just a minor bonus.

On linux the bytes/unicode separation is perfect for this.  You decide
which approach you're using and use it consistently.  If you mess up
(mixing bytes and unicode) you'll consistently get an error.

We currently don't follow this model on Windows, so a garbage file
name gets passed around as if it were unicode, but fails when passed to
a lib, saved to a config file, displayed to a user, etc.
(Depending on the API, as many won't validate either.)
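
As a rough illustration of the "pick one approach and use it consistently" model (a sketch, not part of the original message): in Python 3, os.listdir returns bytes names for a bytes path and str names for a str path, and mixing the two types fails immediately with TypeError. The temporary directory and the file name are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create one file with a plain ASCII name (hypothetical example file).
    open(os.path.join(d, "example.txt"), "w").close()
    str_names = os.listdir(d)                  # str path in, str names out
    bytes_names = os.listdir(os.fsencode(d))   # bytes path in, bytes names out

# Mixing the two representations is an error every time, not only on odd input.
try:
    mixed = bytes_names[0] + ".bak"
except TypeError as exc:
    mixing_error = exc
else:
    mixing_error = None
```

The point is that the failure does not depend on the data: it shows up on the developer's machine, not only on the user's.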


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Anders J. Munch

On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
 try:
  files = os.listdir(somedir, errors="strict")
 except OSError as e:
  log("verbose error message that includes somedir and e")
  files = os.listdir(somedir)

Instead of a codecs error handler name, how about a callback for
converting bytes to str?

os.listdir(somedir, decoder=bytes.decode)
os.listdir(somedir, decoder=lambda b: b.decode(preferredencoding, 
errors='xmlcharrefreplace'))
os.listdir(somedir, decoder=repr)

ISTM that would be simpler and more flexible than going over the
codecs registry.  One caveat though is that there's no obvious way of
telling listdir to skip a name.  But if the default behaviour for
decoder=None is to skip with a warning, then the need to explicitly
ask for files to be skipped would be small.

Terry's example would then be:

 try:
  files = os.listdir(somedir, decoder=bytes.decode)
 except UnicodeDecodeError as e:
  log("verbose error message that includes somedir and e")
  files = os.listdir(somedir)
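
The decoder-callback idea can be sketched on top of the bytes API that exists today. The names decode_names and listdir_with_decoder are hypothetical helpers, not proposed spellings, and os.listdir has no decoder parameter. The lenient decoder below uses errors="replace", since 'xmlcharrefreplace' is only accepted when encoding, not when decoding.

```python
import os

def decode_names(raw_names, decoder):
    """Apply a bytes -> str callback to every raw name (hypothetical helper)."""
    return [decoder(name) for name in raw_names]

def listdir_with_decoder(path, decoder):
    """List `path` through the bytes API, then decode each name with `decoder`."""
    return decode_names(os.listdir(os.fsencode(path)), decoder)

# Three callbacks in the spirit of the examples above.
strict = bytes.decode                                     # raises on bad input
lenient = lambda b: b.decode("utf-8", errors="replace")   # U+FFFD for bad bytes
as_repr = repr                                            # always succeeds

sample = [b"ok.txt", b"bad\xff.txt"]
lenient_names = decode_names(sample, lenient)
repr_names = decode_names(sample, as_repr)
```

With the strict callback the same sample raises UnicodeDecodeError, which is what the try/except version above relies on.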

- Anders

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Nick Coghlan

Glenn Linderman wrote:
 On approximately 12/8/2008 9:30 AM, came the following characters from
 the keyboard of [EMAIL PROTECTED]:
 PS: I'd like to see a similar warning issued when an access attempt
 is made through os.environ to a variable that cannot be decoded.
 
 
 And argv ?  Seems like the warning technique could be useful for _any_
 interface that has been traditionally bytes, because that's the kind of
 characters that were, but now should move to (Unicode) characters.
 
 The warnings could be the same, or very similar.
 
 The question is if one global control should handle all types of bytes
 problems, or if there should be individual controls for each bytes
 problem, or both.  I tend to believe in both; the paranoid can set
 exactly the ones they've coded for, the aggressive can set the global
 one.  In this manner, new cases can be added to the global settings over
 time, if more are discovered -- it should be documented to handle future
 similar issues in a similar manner.

The warnings system provides that level of granularity for 'free' (so
long as we set the stack level appropriately in the C-API warnings call).
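
A minimal sketch of that granularity, assuming a small hierarchy of warning categories. The class names are invented for illustration and nothing like them exists in the standard library. A filter on the base class is the "global control"; a filter on one subclass is the individual one.

```python
import warnings

class UndecodableWarning(UnicodeWarning):
    """Hypothetical base category: the 'global' switch."""

class UndecodableFilenameWarning(UndecodableWarning):
    """Hypothetical category for file names."""

class UndecodableEnvironWarning(UndecodableWarning):
    """Hypothetical category for environment variables."""

def warn_undecodable(raw, category):
    # stacklevel=2 attributes the warning to the caller, not to this helper.
    warnings.warn("cannot decode %r" % (raw,), category, stacklevel=2)

def demo():
    results = {}
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # The "paranoid" setting: only file names become hard errors.
        warnings.simplefilter("error", UndecodableFilenameWarning)
        try:
            warn_undecodable(b"\xff", UndecodableFilenameWarning)
        except UndecodableFilenameWarning:
            results["filename"] = "raised"
        warn_undecodable(b"\xff", UndecodableEnvironWarning)
        results["environ"] = [w.category for w in caught]
    return results
```

Replacing the second filter with one on UndecodableWarning would turn both cases into exceptions at once.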

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread M.-A. Lemburg

On 2008-12-09 09:41, Anders J. Munch wrote:
 On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
 try:
  files = os.listdir(somedir, errors="strict")
 except OSError as e:
  log("verbose error message that includes somedir and e")
  files = os.listdir(somedir)
 
 Instead of a codecs error handler name, how about a callback for
 converting bytes to str?
 
 os.listdir(somedir, decoder=bytes.decode)
 os.listdir(somedir, decoder=lambda b: b.decode(preferredencoding, 
 errors='xmlcharrefreplace'))
 os.listdir(somedir, decoder=repr)
 
 ISTM that would be simpler and more flexible than going over the
 codecs registry.  One caveat though is that there's no obvious way of
 telling listdir to skip a name.  But if the default behaviour for
 decoder=None is to skip with a warning, then the need to explicitly
 ask for files to be skipped would be small.
 
 Terry's example would then be:
 
 try:
  files = os.listdir(somedir, decoder=bytes.decode)
 except UnicodeDecodeError as e:
  log("verbose error message that includes somedir and e")
  files = os.listdir(somedir)

Well, this is not too far away from just putting the whole decoding
logic into the application directly:

files = [filename.decode(filesystemencoding, errors='warnreplace')
 for filename in os.listdir(dir)]

(or os.listdirb() if that's where the discussion is heading)

... and that also tells us something about this discussion: we're
trying to come up with some magic to work around writing two
lines of Python code.

I'd just have all the os APIs return bytes and leave whatever
conversion to Unicode might be necessary to a higher level API.

Think of it: You really only need the Unicode values if you
ever want to output those values in text form somewhere.

In those cases, it's usually a human reading a log file or
screen output. Most other cases just care about getting
some form of file identifier in order to open the file
and don't really care about the encoding of the file name
at all.

It's probably better to have two helper functions in the os module
that take care of the conversion on demand rather than trying
to force this conversion even in cases where the application
never really needs to write the filename somewhere, e.g.
os.decodefilename() and os.encodefilename().

These should then provide some reasonable default logic, e.g.
use a 'warnreplace' error handler. Applications are then
free to use these converters or implement their own.
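
A sketch of what such helpers could look like, under stated assumptions: decodefilename, encodefilename and the 'warnreplace' handler are the hypothetical names from this message, not existing os or codecs features. The handler is registered through codecs.register_error, which does exist.

```python
import codecs
import sys
import warnings

def warnreplace(exc):
    """Error handler: warn about undecodable bytes, then substitute U+FFFD."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]
    warnings.warn("replaced undecodable bytes %r" % (bad,), UnicodeWarning)
    return ("\ufffd" * (exc.end - exc.start), exc.end)

# Make the handler available under the name used in the message above.
codecs.register_error("warnreplace", warnreplace)

def decodefilename(raw, encoding=None):
    """bytes -> str, warning (not failing) on undecodable input."""
    return raw.decode(encoding or sys.getfilesystemencoding(), "warnreplace")

def encodefilename(name, encoding=None):
    """str -> bytes, strict: a name the encoding cannot express is an error."""
    return name.encode(encoding or sys.getfilesystemencoding(), "strict")
```

Note that the replacement is lossy: the decoded name is fine for a log line but cannot be encoded back to the original bytes.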

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread André Malo

* M.-A. Lemburg wrote: 


 On 2008-12-09 09:41, Anders J. Munch wrote:
  On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
  try:
   files = os.listdir(somedir, errors="strict")
  except OSError as e:
   log("verbose error message that includes somedir and e")
   files = os.listdir(somedir)
 
  Instead of a codecs error handler name, how about a callback for
  converting bytes to str?
 
  os.listdir(somedir, decoder=bytes.decode)
  os.listdir(somedir, decoder=lambda b: b.decode(preferredencoding,
  errors='xmlcharrefreplace')) os.listdir(somedir, decoder=repr)
 
  ISTM that would be simpler and more flexible than going over the
  codecs registry.  One caveat though is that there's no obvious way of
  telling listdir to skip a name.  But if the default behaviour for
  decoder=None is to skip with a warning, then the need to explicitly
  ask for files to be skipped would be small.
 
  Terry's example would then be:
  try:
   files = os.listdir(somedir, decoder=bytes.decode)
  except UnicodeDecodeError as e:
   log("verbose error message that includes somedir and e")
   files = os.listdir(somedir)

 Well, this is not too far away from just putting the whole decoding
 logic into the application directly:

 files = [filename.decode(filesystemencoding, errors='warnreplace')
  for filename in os.listdir(dir)]

 (or os.listdirb() if that's where the discussion is heading)

 ... and that also tells us something about this discussion: we're
 trying to come up with some magic to work around writing two
 lines of Python code.

 I'd just have all the os APIs return bytes and leave whatever
 conversion to Unicode might be necessary to a higher level API.

[...]

What I'm saying ;-)

+1.

nd

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Anders J. Munch

M.-A. Lemburg wrote:
 
 Well, this is not too far away from just putting the whole decoding
 logic into the application directly:
 
 files = [filename.decode(filesystemencoding, errors='warnreplace')
  for filename in os.listdir(dir)]
 
 (or os.listdirb() if that's where the discussion is heading)

I see what you mean, and yes, I think os.listdirb will do just as
well.  There is no need for any extra parameters to os.listdir.  The
typical application will just obliviously use os.listdir(dir) and get
the default elide-and-warn behaviour for un-decodable names.  That
rare special application that needs more control can use os.listdirb
and handle decoding itself.
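
The elide-and-warn default can be approximated like this. It is a sketch of the behaviour being discussed, not of any actual os.listdir implementation; decode_or_skip is a hypothetical name and it works on a plain list of raw names so that it can be tried without a directory of broken file names.

```python
import warnings

def decode_or_skip(raw_names, encoding="utf-8"):
    """Decode each raw name; skip the undecodable ones and warn about them."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            warnings.warn("skipping undecodable name %r" % (raw,),
                          UnicodeWarning, stacklevel=2)
    return names

# Default mode: the program keeps working and the problem is reported.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    visible = decode_or_skip([b"a.txt", b"b\xff.txt"])

# Strict mode, chosen once per application: the same call now raises.
with warnings.catch_warnings():
    warnings.simplefilter("error", UnicodeWarning)
    try:
        decode_or_skip([b"b\xff.txt"])
        strict_failed = False
    except UnicodeWarning:
        strict_failed = True
```

An application that needs every name would skip this helper entirely and work with the bytes list.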

Using a global registry of error handlers would just get in the way of
an application that needs more control.

- Anders

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Ulrich Eckhardt

On Monday 08 December 2008, Adam Olsen wrote:
 At this point someone suggests we have a type that can store an
 arbitrary mix of unicode and bytes, so the undecodable portions stay
 in their original form. :P

Well, not an arbitrary mix, but a type that just stores whatever comes from 
the system without further specifying it as either bytes or Unicode:

* If you want a string for displaying it, you first have to extract a string 
from that thing and there you optionally specify the encoding and error 
behaviour.
* If you want to append a string to it, it is automatically encoded in the 
default encoding, which obviously can fail.
* Similarly, e.g. globbing is done on the underlying representation's level, 
so *.py will first have to be converted according to the default encoding.
* If you just print it, you will get something that you can make out the 
decodable parts from, but it will probably be like {Unicode:u'abcde'} 
or {bytes:b'ab\xf0\x0fcd'}.
* If you don't want to display it, but just want to pass it to the system, 
just use it as is.

Yes, this inconveniences application programmers who until now have
assumed that they received a list of strings from os.listdir(), but
that's the way with false assumptions. In any case, they will be aware (from
reading the docs) of what the problem is and why there is no way to return
text. Further, they will get tools to convert these paths or environment vars
to text, so it will simply be a matter of replacing os.listdir()
with map(to_unicode, os.listdir()).


Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Adam Olsen

On Tue, Dec 9, 2008 at 11:31 AM, Ulrich Eckhardt
[EMAIL PROTECTED] wrote:
 On Monday 08 December 2008, Adam Olsen wrote:
 At this point someone suggests we have a type that can store an
 arbitrary mix of unicode and bytes, so the undecodable portions stay
 in their original form. :P

 Well, not an arbitrary mix, but a type that just stores whatever comes from
 the system without further specifying it as either bytes or Unicode:

 * If you want a string for displaying it, you first have to extract a string
 from that thing and there you optionally specify the encoding and error
 behaviour.
 * If you want to append a string to it, it is automatically encoded in the
 default encoding, which obviously can fail.

So the 2.x str, but with a more interesting default encoding than
ASCII.  It'll work fine on the developer's system, but one day a user
will present it with strange input, and boom.

You have to be pessimistic here.  The default operations should either
always work or never work.  Using unicode internally and skipping
garbage input means the operations always work.  Using a bytes API
means mixing with unicode never works, unless the programmer
explicitly converts, in which case the onus is on them to use proper
error handling.

The only thing separating this from a bikeshed discussion is that a
bikeshed has many equally good solutions, while we have no good
solutions.  Instead we're trying to find the least-bad one.  The
unicode/bytes separation is pretty close to that.  Adding a warning
gets even closer.  Adding magic makes it worse.


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-09 Thread Toshio Kuratomi

James Y Knight wrote:
 On Dec 9, 2008, at 6:04 AM, Anders J. Munch wrote:
 The typical application will just obliviously use os.listdir(dir) and
 get the default elide-and-warn behaviour for un-decodable names. That
 rare special application
 
 I guess this is a new definition of "rare special application": an
 application which deals with user-specified files.
 
 This is the problem I see in having two parallel APIs: people keep
 saying most applications can just go ahead and use the [broken] unicode
 string API. If there was a unicode API and a bytes API, but everyone
 was clear that "always use the bytes API" is the right thing to do,
 that'd be okay... But, since even python-dev members are saying that
 only a rare special app needs to care about working with users' existing
 files, I'm rather worried this API design will cause most programs
 written in python to be broken. Which seems a shame.
 
I agree with you, which was part of why I raised this subject, but I also
think that using the warnings module to issue a warning and ignore the
entire problematic entry is a reasonable compromise.  Hopefully it will
become obvious to people that it's a python3 wart at some point in the
future and we'll re-examine the default.  But until then, having a
printed warning that individual apps can turn into an exception seems
like it is less broken than the other alternatives the "rare special
application" people can live with :-)

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Stephen J. Turnbull

Glenn Linderman writes:

  "significantly" seems to be the only word in question; it seems that
  there are a fair number of validation checks that could be performed; 
  the numeric part of UTF-8 decoding is just a sequence of shifts, masks, 
  and ORs, so can be coded pretty tightly in C or assembly language.
  
  Anything extra would be slower; how much slower is hard to predict prior 
  to the implementation.

Not much, see my previous response.
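
For readers who want to see the shifts, masks and ORs together with the validity checks under discussion, here is a minimal sketch in Python. It is illustrative only: decode_utf8_char is a made-up name, and CPython's real decoder is written in C and structured differently.

```python
def decode_utf8_char(data, pos=0):
    """Decode one code point from `data` at `pos`; return (code_point, next_pos)."""
    b0 = data[pos]
    if b0 < 0x80:                       # single byte: plain ASCII
        return b0, pos + 1
    if 0xC2 <= b0 <= 0xDF:              # lead byte of a 2-byte sequence
        need, cp, minimum = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:            # lead byte of a 3-byte sequence
        need, cp, minimum = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:            # lead byte of a 4-byte sequence
        need, cp, minimum = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid start byte at %d" % pos)
    tail = data[pos + 1:pos + 1 + need]
    if len(tail) < need:
        raise ValueError("truncated sequence at %d" % pos)
    for b in tail:
        if b & 0xC0 != 0x80:            # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)     # the shift, mask and OR
    # The extra validity checks: a few comparisons per character.
    if cp < minimum:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("lone surrogate")
    if cp > 0x10FFFF:
        raise ValueError("beyond the Unicode range")
    return cp, pos + 1 + need
```

The checks at the end are the "anything extra" in question: three integer comparisons after the arithmetic that has to happen anyway.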

  This also seems to be supported by Stephen's comment "That's a lot
  to ask, as it turns out."

Not what I meant.  Inefficiency is not an objection to checking for
validity at the level a codec can handle.  The objection is that we
don't want *any* exceptions thrown that we didn't explicitly ask for,
and adding validation certainly will violate that.

  So I don't understand how this is responsive to the "decoding removes
  many insecurities" issue?

Because you have to recheck every time the data crosses from Python
into your code.  To the extent that Python codecs promise validation
and keep that promise, internal code *never* has to make those checks.
That is a significant savings in programmer effort, because auditing a
large body of code for *any* I/O from Python is going to be costly.

  So when you examine a library for potential use, you have documentation 
  or code to help you set your expectations about what it does, and 
  whether or not it may have vulnerabilities, and whether or not those 
  vulnerabilities are likely or unlikely, whether you can reduce the 
  likelihood or prevent the vulnerabilities by wrapping the API, etc.  And 
  so you choose to use the library, or not.

Python is precisely such a component that people will choose to use,
or not, based on whether they can expect that when Python hands them a
Unicode object freshly input from the outside world, it won't contain
lone surrogates, or invalid UTF-8 characters that got through a
3rd-party spam filter, or whatever.

  This whole discussion about libraries seems somewhat irrelevant to the 
  question at hand,

No, it's the *only* point that matters.  IMO, speed is not relevant
here.  The question is whether throwing a Unicode exception on invalid
encoding by default generally does more good than harm.  Guido seems
to think "not!", which gives me pause. <wink>  I still disagree, though.


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Ulrich Eckhardt

On Friday 05 December 2008, James Y Knight wrote:
 On Dec 5, 2008, at 5:27 AM, Ulrich Eckhardt wrote:
  Using the byte variant is equally fubar, because e.g. on MS Windows
  it is not supported, except through a very lossy roundtrip through
  the locale's codepage, limiting your functionality.

 Yeah, IMO the whole mess could have been avoided by keeping the filename/
 args/environ simply *bytes*, like it really is, on unix. Then, make
 the Windows version of python use (always! not dependent upon locale!)
 utf-8 to decode the utf-8 bytestring to the UTF-16 that the Windows
 platform APIs expect (and vice versa).

If possible, I would try to avoid this useless roundtrip from UTF-16 to UTF-8 
and back.

 And never use the ASCII variant of the windows APIs.

That's okay, but I'm afraid it's not possible. The problem is not so much 
doing it, but finding all those places where it is currently done. Those 
could be outside of Python itself. So, even to Python code, there could still 
be APIs that would need the MBCS-encoded strings.
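
To make the two conversions concrete, a small sketch: the UTF-8 to UTF-16 round trip is lossless, while a detour through a legacy code page is not. The file name is invented, and cp1252 merely stands in for "the locale's codepage", which is an assumption for the example.

```python
name = "日本語.txt"   # hypothetical file name outside the Western code pages

# UTF-8 <-> UTF-16 (what the wide Windows APIs use): nothing is lost.
utf8 = name.encode("utf-8")
utf16 = utf8.decode("utf-8").encode("utf-16-le")
round_tripped = utf16.decode("utf-16-le").encode("utf-8")

# Through a single-byte code page: the unrepresentable characters are gone.
via_codepage = name.encode("cp1252", errors="replace")
```

After the code page detour only ".txt" survives, which is the lossy round trip mentioned earlier in the thread.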

Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Glenn Linderman

On approximately 12/8/2008 12:57 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:



Internal decoding is (or should be) an oxymoron.  Why would your
software be passing around text in any format other than internal?  So
decoding will happen (a) on I/O, which is itself almost certainly
slower than making a few checks for Unicode hygiene, or (b) on receipt
of data from other software whose sanitation you shouldn't trust
more than you trust the Internet.

Encoding isn't a problem, AFAICS.



So I can see validating user supplied data, which always comes in via I/O.

But during manipulation of internal data, including file and database 
I/O, there is a need for encoding and decoding also.  If all the data 
has already been validated, then there would be no need to revalidate on 
every conversion.


I hear you when you say that clever coding can make the validation 
nearly free, and I applaud that: the UTF-8 coder that I wrote predated 
most of the rules that have been created since, so I didn't attempt to 
be clever in that regard.


Thanks to you and Adam for your explanations; I see your points, and if 
it is nearly free, I withdraw most of my negativity on this topic.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Nick Coghlan

Terry Reedy wrote:
 This to me is an argument for keeping the default the current behavior,
 but not for rejecting flexibility.  The computing world seems to be
 messier than we would like and worse than I realized until this week. As
 you say below, people need to better anticipate the future, and an
 errors parameter would help do that.

It just occurred to me that this seems like a perfect situation to
address via the warning system. The normal warnings mechanics can then
be used to turn it into an exception if so desired, and this can be done
once per application rather than having to pass a separate argument
every time the affected APIs are called.

And the decoding problems don't pass silently either - they just get
emitted as a warning by default instead of causing the application to crash.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Ulrich Eckhardt

On Sunday 07 December 2008, Guido van Rossum wrote:
 My problem with raising exceptions *by default* when an undecodable
 name exists is that it may render an app completely useless in a
 situation where the developer is no longer around. This happened all
 the time with the 2.x Unicode API, where the developer hadn't
 anticipated a particular input potentially containing non-ASCII bytes,
 and the user fed the application non-ASCII text. Making os.listdir
 raise an exception when a directory contains a single undecodable file
 means that the entire directory can't be read, and most likely the
 entire app crashes at that point. Most likely the developer never
 anticipated this situation (since in most places it is either
 impossible or very unlikely) -- after all, if they had anticipated it
 they would have used the bytes API in the first place.

There is another way to handle this that noisily signals errors but doesn't 
cause programs to suddenly fail. Using os.listdir as example, the problem 
there is that the OS actually returns a list of strings that can not be 
reliably decoded, so I would propose to simply not decode them.

Now, the idea is what if this function simply returned neither a byte string 
nor a Unicode string, but e.g. an environment string type (called env_str)? 
os.listdir would only fail if it really failed to read the dir. If a user 
wants to display an element from the returned list, they would get something 
akin to what repr() returns, i.e. a recognisable string that can be written 
to a logfile. However, this thing will also include additional markup that 
makes it clear that it is not just a piece of text and not suitable to 
display to the end user.

This type distinction is important, because it means that any developer will 
immediately see that something unexpected is going on here. They will 
invoke type(lst[0]) and see the unexpected type env_str, which will (via 
documentation) redirect them to the issue with different encodings and that 
all they have to do is 'map( unicode, lst)' in order to get at a list of real 
text strings, but they will also read that this operation might fail, forcing 
an informed decision.

If they don't care about a textual representation at all but only want to 
invoke os.popen with arguments received from the commandline, then everything 
is fine, too, because that function will take the strings as they are and 
just give them back to the OS. This allows roundtripping from OS over Python 
and back to the OS without any conversions and thus without any conversions 
that could fail. In the case of e.g. a backup program, this is exactly what 
is needed.

Now, if you have any hard-coded strings in your program but a function like 
os.popen needs an env_str object, this string is converted via a default 
encoding, i.e. the same that is used when converting an env_str object to 
Unicode. In this case, I would go so far as to say that os.popen should accept
normal str strings, too, and perform that conversion itself. An alternative 
way would be to reject the string because it is the wrong type, but since 
this internal string's encoding is known, there is no reason to force users 
to convert explicitly, it is just that the conversion might fail.

Similarly, when modifying such an env_str object, like e.g. bak = 
sys.argv[1]+'.backup'. In this case, the string '.backup' is converted 
according to the default encoding and then appended to the commandline 
argument, the result would again be an env_str object.
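
A rough sketch of such a type, to make the proposal easier to picture. Everything here is an assumption for illustration: the class name EnvStr, its methods, and the choice of the filesystem encoding as the "default encoding". The marked-up repr follows the {Unicode:...} / {bytes:...} form suggested elsewhere in this thread.

```python
import sys

class EnvStr:
    """Hypothetical wrapper around whatever bytes the OS handed over."""

    def __init__(self, raw, encoding=None):
        self.raw = bytes(raw)
        self.encoding = encoding or sys.getfilesystemencoding()

    def to_unicode(self, errors="strict"):
        # The explicit, possibly failing step towards real text.
        return self.raw.decode(self.encoding, errors)

    def __add__(self, other):
        if isinstance(other, EnvStr):
            other = other.raw
        elif isinstance(other, str):
            # Text is encoded with the default encoding; this may raise.
            other = other.encode(self.encoding)
        elif not isinstance(other, bytes):
            return NotImplemented
        return EnvStr(self.raw + other, self.encoding)

    def __repr__(self):
        # Marked up so that nobody mistakes it for user-presentable text.
        try:
            return "{Unicode:%r}" % self.to_unicode()
        except UnicodeDecodeError:
            return "{bytes:%r}" % self.raw

    def __fspath__(self):
        # Hand the untouched bytes back to the OS (os.PathLike protocol).
        return self.raw

bak = EnvStr(b"report\xff", "utf-8") + ".backup"
```

The design choice that matters is that nothing is decoded until to_unicode is called, so a backup-style program never hits a decoding error at all.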


Note: There is an option in this design, and that is to make the default 
behaviour in case of nonconvertable env_str objects configurable. A 
filemanager would then replace the undecodable bytes by an approximation, a 
backup program would use strict mode and a music player would perhaps simply 
skip and ignore such strings. The problem there is that changing this option 
would possibly affect other library code that one doesn't even know about 
because it is only used indirectly and its implementation is unknown. For 
that reason, I would rather not make this policy a configurable element. If 
you want that, you can easily code it yourself.

BTW: there was a PEP that proposed a new path class, which was rejected. This 
class was actually pretty similar, except that it also included several other 
features (globbing, path handling, opening files and the kitchen sink) which 
eventually made it too bloated. Otherwise, the idea of creating a separate 
type for these strings is the same.


Uli


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread M.-A. Lemburg

On 2008-12-06 01:48, Nick Coghlan wrote:
 You can't display a non-decodable filename to the user, hence the user
 will have no idea what they're working on. Non-filesystem related apps
 have no business trying to deal with insane filenames.

This is not entirely true: OSes, shells, and applications will
typically represent the file names using either ?-replacements or
some form of hex or decimal escapes for the characters they can't
decode. Since humans are usually very good at pattern recognition,
this goes a long way.

Of course, how the application maps that partially converted file name
back to the real thing is another issue and that's something that
Python should not make harder than it should be.
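
The display conventions mentioned above can be reproduced with the standard codec error handlers. The file name is invented. The last two lines use the surrogateescape handler, which postdates this thread and is shown only as one possible answer to the "map it back" problem.

```python
raw = b"r\xe9sum\xe9.txt"   # Latin-1 bytes, not valid UTF-8 (hypothetical name)

# Replacement characters: readable, but the original bytes are lost.
replaced = raw.decode("utf-8", errors="replace")
question_marks = replaced.replace("\ufffd", "?")

# Hex escapes: still readable, and the bad bytes remain visible.
escaped = raw.decode("utf-8", errors="backslashreplace")

# A reversible form: undecodable bytes become lone surrogates and back again.
smuggled = raw.decode("utf-8", errors="surrogateescape")
restored = smuggled.encode("utf-8", errors="surrogateescape")
```

Only the last form lets the application get from the displayed value back to the exact bytes the OS expects.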

 Linux is moving towards a standard of UTF-8 for filenames, and once we
 get to the point where the idea of encoding filenames and environment
 variables any other way is seen as crazy, then the Python 3 approach
 will work seamlessly.

It's going to take a long time before file names, environment variables
and command line parameters are all encoded using UTF-8, so "practicality
beats purity" will have to get more attention in this thread.

Python APIs should work out of the box most of the time.

Currently, if you live in a non-ASCII and non-pure-UTF-8 environment,
you have to deal with different and mixed encodings on a regular
basis.

Whether it's a USB stick you're trying to read, a ZIP file
you're trying to open, a mounted network drive, etc., the problem
pops up in many different areas.

If I write "do_something.py *", I expect Python to indeed work on
all the files in my directory, not just the ones that happen to
fit a particular encoding.

If I hook up a CGI script written in Python with a web server,
I expect all data to be received by the script, not just data
that happens to be UTF-8 encoded.

 In the meantime, raw bytes APIs will provide an alternative for those
 that disagree with that philosophy.

I think that's a wrong way to put it: The problems are not made
up by people who disagree with the one-encoding-for-everything
strategy.

The problems occur in real-life IT processing all the time - maybe
not so much in places where English scripts dominate, but certainly
in most other places with non-English scripts.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread rdmurray


On Sun, 7 Dec 2008 at 13:33, Guido van Rossum wrote:

My problem with raising exceptions *by default* when an undecodable
name exists is that it may render an app completely useless in a
situation where the developer is no longer around. This happened all


I think Nick Coghlan's suggestion of emitting warnings would be an
excellent solution that addresses both your concerns and the concerns
Toshio has expressed (and with which I agree 100%).

The above is the only use case I've heard in this thread for ignoring
files with names that can't be decoded:  so that a user can use the
program on those files whose names can be decoded even when the user does
not have the resources to get the program fixed to handle undecodable
filenames.  I agree that that is a worthwhile goal.

If warnings were emitted, then files would not be silently ignored,
yet the program could still be used.
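
A minimal sketch of the warn-by-default behaviour described above. The helper names listdir_warn and decode_names are hypothetical (nothing like them exists in the os module), and UnicodeWarning is only a stand-in category. The idea is to read names through the bytes API, decode them with the file system encoding, and warn about any name that has to be left out:

```python
import os
import sys
import warnings


def decode_names(raw_names, encoding):
    """Decode byte file names; warn about (and skip) the ones that cannot be decoded."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            # Not silent: the skipped name is reported through the warnings system.
            warnings.warn(f"skipping undecodable file name {raw!r}",
                          UnicodeWarning, stacklevel=2)
    return names


def listdir_warn(path):
    """Hypothetical os.listdir variant: str names out, a warning per undecodable name."""
    raw_names = os.listdir(os.fsencode(path))  # bytes in, bytes out
    return decode_names(raw_names, sys.getfilesystemencoding())
```

With this shape the program keeps running on the names it can decode, and the user still gets a trace of what was skipped.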

--RDM

PS: I'd like to see a similar warning issued when an access attempt
is made through os.environ to a variable that cannot be decoded.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Bill Janssen

Nick Coghlan [EMAIL PROTECTED] wrote:

 - I think the binary and Unicode APIs should be available (and fully
 functional) on all platforms (including Windows) so that app developers
 don't create portability problems for themselves when they make the
 decision as to which API to use

+1

I'm perhaps biased here; most of my Python programs don't have user
interfaces, because they don't talk to people, they talk to other
programs.  The binary APIs for the OS are essential.  I use and
deeply appreciate all the string handling features in Python,
particularly its firm grip on Unicode issues, but that's *useful*
instead of *essential*.

Bill

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Terry Reedy


Nick Coghlan wrote:

 Terry Reedy wrote:

 This to me is an argument for keeping the default the current behavior,
 but not for rejecting flexibility.  The computing world seems to be
 messier than we would like and worse than I realized until this week. As
 you say below, people need to better anticipate the future, and an
 errors parameter would help do that.

 It just occurred to me that this seems like a perfect situation to
 address via the warning system.


I disagree.

 The normal warnings mechanics can then
 be used to turn it into an exception if so desired, and this can be done
 once per application rather than having to pass a separate argument
 every time the affected APIs are called.


The warning mechanism, as far as I know, because I have never dealt with 
it (and do not want to) is for version issues.  In any case, the snippet 
that you clipped


try:
    files = os.listdir(somedir, errors='strict')
except OSError as e:
    log(f"cannot decode a file name in {somedir}: {e}")
    files = os.listdir(somedir)

specifically requires a per call parameter.
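
For illustration only, a per-call errors parameter could be sketched on top of the bytes API. Everything here is an assumption: listdir_with_errors and decode_names are hypothetical names, the value 'skip' stands for the current drop-the-file behaviour, and any other value is passed through as an ordinary codec error handler such as 'replace'. In strict mode the decoding failure is re-raised as OSError so that the try/except above would work as written:

```python
import os
import sys


def decode_names(raw_names, encoding, errors="skip"):
    """Decode byte file names according to a per-call `errors` policy."""
    names = []
    for raw in raw_names:
        # 'skip' and 'strict' both decode strictly; they differ in how failure is handled.
        handler = "strict" if errors in ("skip", "strict") else errors
        try:
            names.append(raw.decode(encoding, handler))
        except UnicodeDecodeError as exc:
            if errors == "strict":
                # Surface the expected exception type, with the raw bytes in the message.
                raise OSError(f"undecodable file name {raw!r}") from exc
            # errors == "skip": leave this name out of the result.
    return names


def listdir_with_errors(path, errors="skip"):
    """Hypothetical os.listdir with an `errors` argument (not the real signature)."""
    raw_names = os.listdir(os.fsencode(path))
    return decode_names(raw_names, sys.getfilesystemencoding(), errors)
```

Because the policy is an argument, one call site can ask for strictness while the rest of the application keeps the default.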


 And the decoding problems don't pass silently either - they just get
 emitted as a warning by default instead of causing the application to crash.


Do they get automatically logged?  In any case, the errors parameter has
an in-between option to neither ignore nor raise but to replace and give
*something* printable.


This situation seems like an ideal situation for a parameter which gives
the application programmer who uses Python a range of options for working
with an un-ideal world.  I am really flabbergasted why there is so much
opposition to doing so in favor of more difficult or less functional
alternatives.


Terry Jan Reedy


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Guido van Rossum

On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
 Guido van Rossum wrote:

 On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy [EMAIL PROTECTED] wrote:

 Toshio Kuratomi wrote:

  - If this is true, a definition of os.listdir(type 'str') that would
 better meet programmer expectation would be: Give me all files in a
 directory with the output as str type.  The definition of
 os.listdir(type 'bytes') would be Give me all files in a directory
 with the output as bytes type.  Raising an exception when the filenames
 are undecodable is perfectly reasonable in this situation.

 Your examples (snipped) pretty well convince me that there is a use case
 for
 raising exceptions.  We should move beyond arguing over which one way is
 right.  I think there should be a second argument 'ignorebad=False' to
 ignore undecodable files rather than raise the exception (or
 'strict=True'
 to stop and raise exception on non-decodable names -- then code is 'if
 strict: raise ...').  I believe other functions have a similar parameter.

 I was thinking of the normal Unicode 'errors' parameter, as described by
 Nick.

 If you want the exceptions, just use the bytes API and try to decode
 the byte strings using the system encoding.

 If it was a matter of adding a new method, I might agree.  But:

 1. We already have a method that does exactly what you describe.  It is only
 a matter of adding flexibility to the response to problems, for which there
 is already precedent.

 2. Suggesting that people who want strings and not bytes should have to deal
 with bytes, just to get an error notification, seems to negate the point of
 moving to 3.0.

 3. A builtin would probably do so better than most programmers would, with
 little touches such as the one suggested below.

 4. An error parameter would ALERT programmers to the possibility of a
 PROBLEM, both in the present and future.  As you say below, people need to
 better anticipate the future.

 My problem with raising exceptions *by default* when an undecodable
 name exists is that it may render an app completely useless in a
 situation where the developer is no longer around. This happened all
 the time with the 2.x Unicode API, where the developer hadn't
 anticipated a particular input potentially containing non-ASCII bytes,
 and the user fed the application non-ASCII text. Making os.listdir
 raise an exception when a directory contains a single undecodable file
 means that the entire directory can't be read, and most likely the
 entire app crashes at that point. Most likely the developer never
 anticipated this situation (since in most places it is either
 impossible or very unlikely) -- after all, if they had anticipated it
 they would have used the bytes API in the first place. (It's worse
 because the exception being raised would be UnicodeError -- most
 people expect os.listdir to raise OSError, not other errors.)

 This to me is an argument for keeping the default the current behavior, but
 not for rejecting flexibility.  The computing world seems to be messier than
 we would like and worse than I realized until this week. As you say below,
 people need to better anticipate the future, and an errors parameter would
 help do that.

I'm fine with whatever API enhancements you can come up with (assuming
others like them too :-) as long as the default remains the current
behavior.

 Is Windows really immune?  What about when it reads the directory of
 possibly old removable media with whatever byte name encodings?  Is this a
 possible source of 'unanticipated' problems?

 As to your last sentence, os.listdir() with an errors parameter could
 convert a decoding UnicodeError to "OSError: undecodable file name
 <ascii+hex repr>", thereby supplying the expected exception as well as an
 extractable representation of the problematic raw bytes.

 Here is a possible use case: I want filenames as 3.0 strings and I
 anticipate no problems at present but, as you say above, something might
 happen years in the future.  I am using 3.0 *because* of the strings ==
 unicode feature.  I would like to write

 try:
     files = os.listdir(somedir, errors='strict')
 except OSError as e:
     log(f"cannot decode a file name in {somedir}: {e}")
     files = os.listdir(somedir)

 and go on without the problem file but not without logging the problem so a
 future maintainer can consider what to do about it, but only when there is
 an actual need to think about it.

 Terry Jan Reedy





-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Guido van Rossum

On Mon, Dec 8, 2008 at 10:34 AM,  [EMAIL PROTECTED] wrote:
 On Mon, 8 Dec 2008 at 13:16, Terry Reedy wrote:

  And the decoding problems don't pass silently either - they just get
  emitted as a warning by default instead of causing the application to
  crash.

 Do they get automatically logged?  In any case, the errors parameter has
 an in-between option to neither ignore nor raise but to replace and give
 *something* printable.

 This situation seems like an ideal situation for a parameter which gives
 the application programmer who uses Python a range of options for working with
 an un-ideal world.  I am really flabbergasted why there is so much
 opposition to doing so in favor of more difficult or less functional
 alternatives.

 I'm in favor of an option to control what happens.

 I just really really don't want the _default_ to be ignore.  Defaulting
 to a warning is fine with me, as would be defaulting to a traceback.

 But defaulting to silently ignore, as we have now, is just asking for user
 confusion and debugging headaches, as detailed by Toshio.  A _worse_ user
 experience, IMO, than having a program fail when undecodable filenames
 match the selection criteria.

Do you really not care about the risk where apps that weren't written
to be prepared to handle this will be rendered completely useless if a
single file in a directory has an unencodable name? This is similar to
an issue that Python had for a long time where it wouldn't start up if
the current directory contained non-ASCII characters.

Given that most developers will not have this issue in their own
environment, most apps will not be prepared for this issue, and that
makes it worse for the app's user!

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Scott Dial

Guido van Rossum wrote:
 On Mon, Dec 8, 2008 at 10:34 AM,  [EMAIL PROTECTED] wrote:
 On Mon, 8 Dec 2008 at 13:16, Terry Reedy wrote:
  And the decoding problems don't pass silently either - they just get
  emitted as a warning by default instead of causing the application to
  crash.
 Do they get automatically logged?  In any case, the errors parameter has
 an in-between option to neither ignore nor raise but to replace and give
 *something* printable.

 I just really really don't want the _default_ to be ignore.  Defaulting
 to a warning is fine with me, as would be defaulting to a traceback.
 
 Do you really not care about the risk where apps that weren't written
 to be prepared to handle this will be rendered completely useless if a
 single file in a directory has an unencodable name?

Since when do warnings cause apps to be rendered completely useless? I
think it's easy to agree that defaulting to an exception is not good for
the reason you give, but I don't see how that applies to a warning. And,
it seems like a warning covers the issues that the other people want as
well. If there is a warning, then there is at least a record of the fact
that some filenames were ignored. Presumably if I was responsible for
the correctness of some piece of code, I would see the warning in a log
of some sort and could investigate it further (if I cared), otherwise I
could choose to ignore it. I don't see os.listdir(name) as one of
those situations where emitting a warning is a nuisance at all.

-Scott

-- 
Scott Dial
[EMAIL PROTECTED]
[EMAIL PROTECTED]

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Bugbee, Larry

 I'm perhaps biased here; most of my Python programs don't have user 
 interfaces, because they don't talk to people, they talk to other 
 programs.  The binary APIs for the OS are essential.  I use and 
 deeply appreciate all the string handling features in Python, 
 particularly its firm grip on Unicode issues, but that's *useful* 
 instead of *essential*.

Exactly!  Another +1.

Larry



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread rdmurray


On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:

 On Mon, Dec 8, 2008 at 10:34 AM,  [EMAIL PROTECTED] wrote:

 I'm in favor of an option to control what happens.

 I just really really don't want the _default_ to be ignore.  Defaulting
 to a warning is fine with me, as would be defaulting to a traceback.

 But defaulting to silently ignore, as we have now, is just asking for user
 confusion and debugging headaches, as detailed by Toshio.  A _worse_ user
 experience, IMO, than having a program fail when undecodable filenames
 match the selection criteria.


 Do you really not care about the risk where apps that weren't written
 to be prepared to handle this will be rendered completely useless if a
 single file in a directory has an unencodable name? This is similar to
 an issue that Python had for a long time where it wouldn't start up if
 the current directory contained non-ASCII characters.


No, I do care.  In another message I agreed with you that having the
app not fail was a reasonable goal.  What I'm saying is that having it
ignore the undecodable files _silently_ is bad.  And not picking
up a file that matches some selection criteria (ex: *.py) because it is
undecodable is a _failure_, in my opinion, that is _worse_ than getting
a traceback because there's an undecodable file in the directory.

But I'm happy with just issuing a warning by default.  That would mean
it doesn't fail silently, but neither does it crash.  Seems like the
best compromise with the broken nature of the real world IT
environment.


 Given that most developers will not have this issue in their own
 environment, most apps will not be prepared for this issue, and that
 makes it worse for the app's user!


It is exactly because most developers won't have the issue in their own
environment that ignoring files silently is a problem.  If they did,
they'd fix their code before it went out the door.  Since they don't,
when their code is used by somebody in a mixed encoding environment,
the programs _will_ fail by ignoring files that they should process.
The question, it seems to me, is do they fail silently and mysteriously
by failing to process files they are supposed to, or do they fail with
at least a little bit of noise?

--RDM

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Guido van Rossum

On Mon, Dec 8, 2008 at 12:07 PM,  [EMAIL PROTECTED] wrote:
 On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:

 On Mon, Dec 8, 2008 at 10:34 AM,  [EMAIL PROTECTED] wrote:

 I'm in favor of an option to control what happens.

 I just really really don't want the _default_ to be ignore.  Defaulting
 to a warning is fine with me, as would be defaulting to a traceback.

 But defaulting to silently ignore, as we have now, is just asking for
 user
 confusion and debugging headaches, as detailed by Toshio.  A _worse_ user
 experience, IMO, than having a program fail when undecodable filenames
 match the selection criteria.

 Do you really not care about the risk where apps that weren't written
 to be prepared to handle this will be rendered completely useless if a
 single file in a directory has an unencodable name? This is similar to
 an issue that Python had for a long time where it wouldn't start up if
 the current directory contained non-ASCII characters.

 No, I do care.  In another message I agreed with you that having the
 app not fail was a reasonable goal.  What I'm saying is that having it
 ignore the undecodable files _silently_ is bad.  And not picking
 up a file that matches some selection criteria (ex: *.py) because it is
 undecodable is a _failure_, in my opinion, that is _worse_ than getting
 a traceback because there's an undecodable file in the directory.

 But I'm happy with just issuing a warning by default.  That would mean
 it doesn't fail silently, but neither does it crash.  Seems like the
 best compromise with the broken nature of the real world IT
 environment.

OK, I can live with that too.

 Given that most developers will not have this issue in their own
 environment, most apps will not be prepared for this issue, and that
 makes it worse for the app's user!

 It is exactly because most developers won't have the issue in their own
 environment that ignoring files silently is a problem.  If they did,
 they'd fix their code before it went out the door.  Since they don't,
 when their code is used by somebody in a mixed encoding environment,
 the programs _will_ fail by ignoring files that they should process.
 The question, it seems to me, is do they fail silently and mysteriously
 by failing to process files they are supposed to, or do they fail with
 at least a little bit of noise?

A warning is fine. Whether the app *fails* or *succeeds* when the
warning is issued depends on what the app is trying to do and what the
user expects. There certainly are valid use cases for both, but I
expect that succeeding noisily is going to be at least as common as
failing (in the sense of not doing the right thing, not necessarily
crashing) noisily. This is an improvement over always crashing.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread M.-A. Lemburg

On 2008-12-08 19:26, Guido van Rossum wrote:
 On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:
 Here is a possible use case: I want filenames as 3.0 strings and I
 anticipate no problems at present but, as you say above, something might
 happen years in the future.  I am using 3.0 *because* of the strings ==
 unicode feature.  I would like to write

 try:
     files = os.listdir(somedir, errors='strict')
 except OSError as e:
     log(f"cannot decode a file name in {somedir}: {e}")
     files = os.listdir(somedir)

 and go on without the problem file but not without logging the problem so a
 future maintainer can consider what to do about it, but only when there is
 an actual need to think about it.

If that error parameter is the same as in unicode(value, errors),
then this would be a useful feature:

People could then choose among the already existing error handlers
('strict', 'ignore', 'replace', 'xmlcharrefreplace') or register
their own ones via the codecs module.

Such application specific error handlers could then also apply
whatever fancy round-trip safe encoding of non-decodable bytes
to Unicode escapes, private code points, etc. as seen fit by the
application.
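
As a rough sketch of such an application-specific handler, the codecs module's register_error() can install a decode-error handler that maps each undecodable byte to a private code point. The handler name 'private-escape', the base offset U+F700 and the helper restore_bytes are all arbitrary choices made for this example, not an established convention:

```python
import codecs

PRIVATE_BASE = 0xF700  # arbitrary private-use offset chosen for this sketch


def escape_to_private(exc):
    """Decode-error handler: map each undecodable byte to U+F700 + byte value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding
    bad = exc.object[exc.start:exc.end]
    replacement = "".join(chr(PRIVATE_BASE + b) for b in bad)
    return replacement, exc.end  # resume decoding after the bad bytes


codecs.register_error("private-escape", escape_to_private)


def restore_bytes(text, encoding):
    """Inverse mapping: turn escaped code points back into the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PRIVATE_BASE <= cp <= PRIVATE_BASE + 0xFF:
            out.append(cp - PRIVATE_BASE)
        else:
            out.extend(ch.encode(encoding))
    return bytes(out)
```

Note the limitation of this approach: a name that legitimately contains code points in the chosen range would be restored to the wrong bytes, so the round trip is only safe if the application treats that range as reserved.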

Perhaps we should also add an 'encoding' parameter that can be
set on a per directory basis (if necessary) and defaults to the
global file system encoding.

If an application hits a directory that is known to cause problems,
it could then choose to receive the file names in a different,
more suitable encoding. This allows implementing fallback
mechanisms with a list of common encodings for a locale.
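
The fallback idea can be sketched independently of os.listdir, working on the raw bytes of a single name. The function name decode_with_fallbacks and the candidate list are assumptions for illustration; a real application would pick the encodings common for its locale:

```python
def decode_with_fallbacks(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes `raw` cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding, "strict"), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"none of {list(encodings)!r} can decode {raw!r}")
```

Order matters: an encoding such as Latin-1 accepts every byte sequence, so it only makes sense as the last candidate.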

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Antoine Pitrou

M.-A. Lemburg mal at egenix.com writes:
 
 Such application specific error handlers could then also apply
 whatever fancy round-trip safe encoding of non-decodable bytes
 to Unicode escapes, private code points, etc. as seen fit by the
 application.

I'd argue that such fancy round-trip safe error handler should be provided by
Python. It's not reasonable to expect application coders to come up with their
own codec variation based on subtle details of the unicode spec.

Regards

Antoine.



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread M.-A. Lemburg

On 2008-12-08 21:45, Antoine Pitrou wrote:
 M.-A. Lemburg mal at egenix.com writes:
 Such application specific error handlers could then also apply
 whatever fancy round-trip safe encoding of non-decodable bytes
 to Unicode escapes, private code points, etc. as seen fit by the
 application.
 
 I'd argue that such fancy round-trip safe error handler should be provided by
 Python. It's not reasonable to expect application coders to come up with their
 own codec variation based on subtle details of the unicode spec.

Fair enough. We could add some, e.g.:

 * a round-trip safe escape error handler that uses a Unicode private
   code point area which we officially reserve for the Python
   interpreter

 * a human readable escape error handler that encodes the problem
   bytes to say hex escapes, e.g. gives Andr\xe9 for a Latin-1
   encoded directory name instead of failing

 * a warning error handler that replaces the problem cases with
   a question mark and issues a warning through the warning
   framework
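
For the human-readable variant in the list above, later Python versions (3.5 and newer) accept the existing 'backslashreplace' handler when decoding, which gives roughly the behaviour described. A small sketch, with readable_name as a hypothetical helper name:

```python
def readable_name(raw, encoding="utf-8"):
    """Decode `raw`, showing undecodable bytes as \\xNN escapes instead of failing."""
    return raw.decode(encoding, errors="backslashreplace")


# A Latin-1 encoded name read as UTF-8 keeps its problem byte visible.
print(readable_name(b"Andr\xe9"))
```

The result is printable but not reversible in general, since a name that already contains a literal backslash sequence looks the same afterwards.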

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Nick Coghlan

Terry Reedy wrote:
 Nick Coghlan wrote:
 Terry Reedy wrote:
 This to me is an argument for keeping the default the current behavior,
 but not for rejecting flexibility.  The computing world seems to be
 messier than we would like and worse than I realized until this week. As
 you say below, people need to better anticipate the future, and an
 errors parameter would help do that.

 It just occurred to me that this seems like a perfect situation to
 address via the warning system.
 
 I disagree.
 
 The normal warnings mechanics can then
 be used to turn it into an exception if so desired, and this can be done
 once per application rather than having to pass a separate argument
 every time the affected APIs are called.
 
 The warning mechanism, as far as I know, because I have never dealt with
 it (and do not want to) is for version issues.

No, it's just DeprecationWarning in particular that is specific to
versioning issues. That's obviously the one that comes up most often for
core development, but there are other warnings as well (e.g. the
off-by-default ImportWarning when potential packages are skipped because
__init__.py is missing).

For this particular case, I would suggest adding something like
EnvironmentWarning (to parallel the EnvironmentError that is the common
parent of OSError and IOError).
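
A sketch of what such a category could look like in application code. EnvironmentWarning is defined locally here as a hypothetical class following the suggestion above, and report_undecodable is an invented helper:

```python
import warnings


class EnvironmentWarning(Warning):
    """Hypothetical category for problems with data coming from the OS environment."""


def report_undecodable(raw_name):
    """Warn about a name that could not be decoded, without raising."""
    warnings.warn(f"undecodable name {raw_name!r}", EnvironmentWarning, stacklevel=2)


# Collect the warnings instead of printing them, to show they carry the category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_undecodable(b"caf\xe9")

categories = [w.category for w in caught]
```

A dedicated category is what lets an application filter these warnings separately from DeprecationWarning and the rest.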

  In any case, the snippet
 that you clipped
 
 try:
     files = os.listdir(somedir, errors='strict')
 except OSError as e:
     log(f"cannot decode a file name in {somedir}: {e}")
     files = os.listdir(somedir)
 
 specifically requires a per call parameter.

True, but the decision to have errors=warn as the default behaviour is
independent of the decision of whether or not to allow the behaviour to
be changed on a case-by-case basis. There is nothing stopping us from
doing both.

 And the decoding problems don't pass silently either - they just get
 emitted as a warning by default instead of causing the application to
 crash.
 
 Do they get automatically logged?

By default warnings are written to sys.stderr. Whether that gets logged
or not will depend on the nature of the application.

There are also mechanisms in warnings that allow an application to
override the handling of warnings (and for 2.7/3.1, there are mechanisms
in logging to make it easy to hook the warning system and the logging
system together, so that warnings are automatically logged).
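
Both mechanisms can be sketched with the standard warnings and logging modules. UnicodeWarning stands in for whatever category the decoding warning would use, and the function names are invented for this example:

```python
import logging
import warnings


def note_undecodable(raw_name):
    """Stand-in for a library call that warns about a name it could not decode."""
    warnings.warn(f"undecodable file name {raw_name!r}", UnicodeWarning, stacklevel=2)


def escalates(func, *args):
    """Return True if `func` raises once UnicodeWarning is promoted to an error."""
    with warnings.catch_warnings():
        # One filter for the whole application, no per-call argument needed.
        warnings.simplefilter("error", UnicodeWarning)
        try:
            func(*args)
        except UnicodeWarning:
            return True
    return False


def route_warnings_to_logging():
    """Send warnings to the 'py.warnings' logger instead of sys.stderr."""
    logging.captureWarnings(True)
```

The same filter can also be set from the command line with the -W option, so a user can change the behaviour without touching the code.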

  In any case, the errors parameter has
 an in-between option to neither ignore nor raise but to replace and give
 *something* printable.

That's true, and why I would actually support doing both. Adding the
warning is a more pressing need though, since it is what will prevent
the errors from passing silently in the default case.

 This situation seems like an ideal situation for a parameter which gives
 the application programmer who uses Python a range of options for working
 with an un-ideal world.  I am really flabbergasted why there is so much
 opposition to doing so in favor of more difficult or less functional
 alternatives.

A warning will stop the failure from passing silently in the default
case - that's solving a different problem to the one that the error
handling argument will solve. I do agree that being able to override the
handling on a per-call basis could be a useful feature.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Adam Olsen

On Mon, Dec 8, 2008 at 1:45 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:
 M.-A. Lemburg mal at egenix.com writes:

 Such application specific error handlers could then also apply
 whatever fancy round-trip safe encoding of non-decodable bytes
 to Unicode escapes, private code points, etc. as seen fit by the
 application.

 I'd argue that such fancy round-trip safe error handler should be provided by
 Python. It's not reasonable to expect application coders to come up with their
 own codec variation based on subtle details of the unicode spec.

Except they're clearly NOT part of the unicode spec.

Moreover, whatever tricks you use vary depending on if your garbage
input is from UTF-8, UTF-16, or UTF-32 (or any other arbitrary
encoding, like CP-1252 or Shift-JIS.)

At this point someone suggests we have a type that can store an
arbitrary mix of unicode and bytes, so the undecodable portions stay
in their original form. :P

-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Antoine Pitrou

Adam Olsen rhamph at gmail.com writes:
 
 Except they're clearly NOT part of the unicode spec.

This is always the same discussion going in circles. I know they're not part of
the unicode spec, but practicality beats purity and if the said error handler
comes with an appropriate warning in the official doc, then why not?

In any case, +1 to Marc-André's proposal.



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Adam Olsen

On Mon, Dec 8, 2008 at 2:01 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 On 2008-12-08 21:45, Antoine Pitrou wrote:
 M.-A. Lemburg mal at egenix.com writes:
 Such application specific error handlers could then also apply
 whatever fancy round-trip safe encoding of non-decodable bytes
 to Unicode escapes, private code points, etc. as seen fit by the
 application.

 I'd argue that such fancy round-trip safe error handler should be provided by
 Python. It's not reasonable to expect application coders to come up with 
 their
 own codec variation based on subtle details of the unicode spec.

 Fair enough. We could add some e.g.

  * a round-trip safe escape error handler that uses a Unicode private
   code point area which we officially reserve for the Python
   interpreter

This would of course alter the behaviour of those private code points,
preventing them from round-tripping properly.

I don't think round-tripping can be done from an error handler.  You
need a full codec to do it.  A simple option is 8859-1.  Or, ya know,
bytes.  This has long since gotten repetitive..
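
The 8859-1 option works because Latin-1 maps every byte value 0-255 to the code point with the same number, so decoding never fails and encoding restores the input exactly. A short sketch, with roundtrip_latin1 as a hypothetical helper name:

```python
def roundtrip_latin1(raw):
    """Decode arbitrary bytes as Latin-1 and encode them back unchanged."""
    text = raw.decode("latin-1")   # never raises: every byte is a valid code point
    return text, text.encode("latin-1")


text, restored = roundtrip_latin1(bytes(range(256)))
```

The price is that the text is only meaningful if the bytes really were Latin-1; for any other encoding it is a lossless but unreadable carrier.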


  * a human readable escape error handler that encodes the problem
   bytes to say hex escapes, e.g. gives Andr\xe9 for a Latin-1
   encoded directory name instead of failing

Similar to 'ö'.encode('ascii', 'backslashreplace')?  I'm +1 on making that work.


  * a warning error handler that replaces the problem cases with
   a question mark and issues a warning through the warning
   framework

I dub thee errors='warnreplace'.
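
A handler along those lines can be sketched with codecs.register_error(). The name 'warnreplace' is taken from the message above; the implementation and the use of UnicodeWarning are assumptions made for this example:

```python
import codecs
import warnings


def warnreplace(exc):
    """Decode-error handler: substitute '?' for bad bytes and issue a warning."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding
    bad = exc.object[exc.start:exc.end]
    warnings.warn(f"replacing undecodable bytes {bad!r}", UnicodeWarning, stacklevel=2)
    return "?" * len(bad), exc.end  # resume decoding after the bad bytes


codecs.register_error("warnreplace", warnreplace)
```

Once registered, the handler is available to any decode call by name, for example raw.decode('utf-8', 'warnreplace').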


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Toshio Kuratomi

Guido van Rossum wrote:
 On Mon, Dec 8, 2008 at 12:07 PM,  [EMAIL PROTECTED] wrote:
 On Mon, 8 Dec 2008 at 11:25, Guido van Rossum wrote:
 But I'm happy with just issuing a warning by default.  That would mean
 it doesn't fail silently, but neither does it crash.  Seems like the
 best compromise with the broken nature of the real world IT
 environment.
 
 OK, I can live with that too.
 
Same here.  This lets the application specify globally what should
happen (exception, warning, ignore via the warnings filters) and should
give enough context that it doesn't become a mysterious error in the
program.
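As a rough illustration of that global control, the sketch below uses the standard warnings filters to turn one and the same warning into an exception or to silence it. The helper report_bad_name and the choice of UnicodeWarning as the category are assumptions made for this example; nothing in the thread fixes either.

```python
import warnings


def report_bad_name(raw):
    """Hypothetical helper: warn about a file name that could not be decoded."""
    warnings.warn("undecodable file name %r" % (raw,), UnicodeWarning)


# Application-wide choice 1: escalate the warning to an exception.
with warnings.catch_warnings():
    warnings.simplefilter("error", UnicodeWarning)
    try:
        report_bad_name(b"Andr\xe9")
        escalated = False
    except UnicodeWarning:
        escalated = True

# Application-wide choice 2: ignore the warning entirely.
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("ignore", UnicodeWarning)
    report_bad_name(b"Andr\xe9")
```

The library code that issues the warning stays the same in both cases; only the filter installed by the application changes.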

The per-method addition of an errors argument, so that this is overridable
locally as well as globally, is also a nice touch but can be done
separately from this step.

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread M.-A. Lemburg

On 2008-12-08 22:32, Adam Olsen wrote:
 On Mon, Dec 8, 2008 at 2:01 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 On 2008-12-08 21:45, Antoine Pitrou wrote:
 M.-A. Lemburg mal at egenix.com writes:
 Such application specific error handlers could then also apply
 whatever fancy round-trip safe encoding of non-decodable bytes
 to Unicode escapes, private code points, etc. as seen fit by the
 application.
 I'd argue that such fancy round-trip safe error handler should be provided
 by Python. It's not reasonable to expect application coders to come up with
 their own codec variation based on subtle details of the unicode spec.
 Fair enough. We could add some e.g.

  * a round-trip safe escape error handler that uses a Unicode private
   code point area which we officially reserve for the Python
   interpreter
 
 This would of course alter the behaviour of those private code points,
 preventing them from round-tripping properly.
 
 I don't think round-tripping can be done from an error handler.  You
 need a full codec to do it.  A simple option is 8859-1.  Or, ya know,
 bytes.  This has long since gotten repetitive..

The error handler would just map the problem bytes to the private
area. The application would then have to decide what to do with
them, ie. the error handler only provides one half of the round-
tripping.

And that's on purpose: I don't believe we can come up with some magic
solution for the encodings problem. This is essentially something
that applications will have to solve on a case-by-case basis.

  * a human readable escape error handler that encodes the problem
   bytes to say hex escapes, e.g. gives Andr\xe9 for a Latin-1
   encoded directory name instead of failing
 
 Similar to 'ö'.encode('ascii', 'backslashreplace')?
 I'm +1 on making that work.

Yes.
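For reference, a small sketch of what the human-readable escape looks like in both directions. It relies on the built-in 'backslashreplace' handler also accepting decode errors, which is the case in Python 3.5 and later (so it postdates this thread); the sample bytes are only an illustration.

```python
raw = b"Andr\xe9"  # Latin-1 bytes, not valid UTF-8

# Encoding direction, as in the quoted example.
encoded = "ö".encode("ascii", "backslashreplace")

# Decoding direction: the undecodable byte becomes a visible hex escape.
# Requires Python 3.5 or later.
readable = raw.decode("utf-8", "backslashreplace")
```

The decoded result is meant for display; it is not a round-trip safe form, because a name that legitimately contains a backslash followed by "xe9" would look the same.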

  * a warning error handler that replaces the problem cases with
   a question mark and issues a warning through the warning
   framework
 
 I dub thee errors='warnreplace'.

Yep, something along those lines.

Perhaps there are more and better alternatives. These suggestions
are just to show how the idea could be put to some real-life use.
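A minimal sketch of the third suggestion, assuming the handler is registered through codecs.register_error under the name 'warnreplace' coined above. The handler body, the '?' replacement and the UnicodeWarning category are choices made for this example, not an agreed design.

```python
import codecs
import warnings


def warnreplace(exc):
    """Hypothetical handler: replace undecodable bytes with '?' and warn."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]
    warnings.warn("undecodable bytes %r replaced" % (bad,), UnicodeWarning)
    # Return the replacement text and the position at which to resume.
    return ("?" * (exc.end - exc.start), exc.end)


codecs.register_error("warnreplace", warnreplace)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    name = b"Andr\xe9.txt".decode("utf-8", "warnreplace")
```

Because the handler goes through the warnings framework, an application can still promote the warning to an exception or suppress it with the usual filters.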

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Dec 08 2008)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2008-12-02: Released mxODBC.Connect 1.0.0  http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Adam Olsen

On Mon, Dec 8, 2008 at 1:12 PM, Guido van Rossum [EMAIL PROTECTED] wrote:
 On Mon, Dec 8, 2008 at 12:07 PM,  [EMAIL PROTECTED] wrote:
 But I'm happy with just issuing a warning by default.  That would mean
 it doesn't fail silently, but neither does it crash.  Seems like the
 best compromise with the broken nature of the real world IT
 environment.

 OK, I can live with that too.

+1


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread M.-A. Lemburg

On 2008-12-08 22:39, Victor Stinner wrote:
 ('strict', 'ignore', 'replace', 'xmlcharrefreplace')
 
 replace (or xmlcharrefreplace) is just useless because you will not be able
 to open or rename the file... You just know that there is a strange file in
 the directory.

Right, but that's already a lot better than not knowing of the
file's existence at all :-)

Note that the above are standard error handlers for Unicode
conversions. The rest of the email you cut away has more useful
error handlers for the purpose in question.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Adam Olsen

On Mon, Dec 8, 2008 at 2:44 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 On 2008-12-08 22:32, Adam Olsen wrote:
 On Mon, Dec 8, 2008 at 2:01 PM, M.-A. Lemburg [EMAIL PROTECTED] wrote:
 On 2008-12-08 21:45, Antoine Pitrou wrote:
 M.-A. Lemburg mal at egenix.com writes:
 Such application specific error handlers could then also apply
 whatever fancy round-trip safe encoding of non-decodable bytes
 to Unicode escapes, private code points, etc. as seen fit by the
 application.
 I'd argue that such fancy round-trip safe error handler should be provided
 by Python. It's not reasonable to expect application coders to come up with
 their own codec variation based on subtle details of the unicode spec.
 Fair enough. We could add some e.g.

  * a round-trip safe escape error handler that uses a Unicode private
   code point area which we officially reserve for the Python
   interpreter

 This would of course alter the behaviour of those private code points,
 preventing them from round-tripping properly.

 I don't think round-tripping can be done from an error handler.  You
 need a full codec to do it.  A simple option is 8859-1.  Or, ya know,
 bytes.  This has long since gotten repetitive..

 The error handler would just map the problem bytes to the private
 area. The application would then have to decide what to do with
 them, ie. the error handler only provides one half of the round-
 tripping.

By that point it's already too late.  You've already conflated garbage
PUA with legitimate PUA.

To make it work you need to treat those legitimate PUA scalars as
errors too, transforming them.  A common example is how escaping
replaces a single '\' with '\\'.

Hrm.  nul-escaping should work.  Obviously it can't be used outside
the filesystem though, as they may introduce a legitimate nul.
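The conflation can be shown with a small sketch. The handler name 'pua_escape' and the base code point 0xF700 are arbitrary assumptions for this example; the point is only that an undecodable byte and a legitimately encoded private-use character end up as the same string.

```python
import codecs

PUA_BASE = 0xF700  # hypothetical: arbitrary private-use block for this sketch


def pua_escape(exc):
    """Hypothetical handler: map each undecodable byte to PUA_BASE + byte."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return ("".join(chr(PUA_BASE + b) for b in bad), exc.end)


codecs.register_error("pua_escape", pua_escape)

garbage = b"caf\xe9.txt"  # Latin-1 bytes, invalid as UTF-8
legit = ("caf" + chr(PUA_BASE + 0xE9) + ".txt").encode("utf-8")  # valid UTF-8

decoded_garbage = garbage.decode("utf-8", "pua_escape")
decoded_legit = legit.decode("utf-8", "pua_escape")

# Two different byte strings now map to one str, so the original bytes
# cannot be recovered unless legitimate PUA characters are escaped as well.
collision = decoded_garbage == decoded_legit and garbage != legit
```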


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Terry Reedy


M.-A. Lemburg wrote:


On Sun, Dec 7, 2008 at 3:53 PM, Terry Reedy [EMAIL PROTECTED] wrote:



try:
    files = os.listdir(somedir, errors='strict')  # proposed parameter
except OSError as e:
    # verbose error message that includes somedir and e
    log("listdir(%r) failed: %s" % (somedir, e))
    files = os.listdir(somedir)


  If that error parameter is the same as in unicode(value, errors),

then this would be a useful feature:


Except that unicode becomes str in 3.0, that is exactly my intention.


People could then choose among the already existing error handlers
('strict', 'ignore', 'replace', 'xmlcharrefreplace') or register
their own ones via the codecs module.


These could be passed through from listdir or getenv to str.

[Side questions:
1. 'xmlcharrefreplace' is not in the 3.0 LibRef doc or doc string.
Should it be, or is 'xmlcharrefreplace' an addition for a later version?
2. A garbage value for errors (such as 'blah') is silently ignored (so I 
cannot test the above).  Intended or a bug?]


Someone else proposed a new option 'warn', which Guido has accepted to 
be the default instead of the current 'ignore'.  It could not be passed 
through (unless str were changed or something registered).  I believe 
the implementation of that would be to call str with 'strict' but catch 
errors and warn instead.  Whether there should be 1 warning for each 
problematic bytes encountered or 1 for each listdir (or whatever) call, 
possibly with the number of problems, I leave to others to decide.
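A sketch of that implementation idea, taking the "one warning per call, with the number of problems" variant. The function name, the message text and the UnicodeWarning category are assumptions for illustration; the input is a list of raw names such as the bytes API would return.

```python
import warnings


def decode_names_warn(raw_names, encoding="utf-8"):
    """Hypothetical 'warn' mode: strict decoding, skip failures, warn once."""
    decoded = []
    problems = 0
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            problems += 1
    if problems:
        warnings.warn(
            "%d undecodable file name(s) skipped" % problems, UnicodeWarning
        )
    return decoded


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    names = decode_names_warn([b"ok.txt", b"Andr\xe9.txt", b"\xff"])
```

Moving the warnings.warn call inside the except clause would give the other variant, one warning per problematic name.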



Such application specific error handlers could then also apply
whatever fancy round-trip safe encoding of non-decodable bytes
to Unicode escapes, private code points, etc. as seen fit by the
application.

Perhaps we should also add an ''encoding'' parameter that can be
set on a per directory basis (if necessary) and defaults to the
global file system encoding.


That could also be passed through, but I will let others make the
argument for it.


If an application hits a directory that is known to cause problems,
it could then choose to receive the file names in a different,
more suitable encoding. This allows implementing fallback
mechanisms with a list of common encodings for a locale.
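Such a fallback mechanism might look like the sketch below, which works on raw bytes rather than on a real 'encoding' parameter of os.listdir (no such parameter exists). The function name and the candidate list are assumptions for this example.

```python
def decode_with_fallback(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw strictly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding, "strict"), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode %r" % (list(encodings), raw))


# Illustrative list. latin-1 accepts every byte, so it has to come last.
candidates = ("utf-8", "latin-1")

utf8_result = decode_with_fallback("André".encode("utf-8"), candidates)
latin1_result = decode_with_fallback(b"Andr\xe9", candidates)
```

The order of the list matters: a permissive single-byte encoding placed first would hide every later candidate.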


Terry Jan Reedy



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-08 Thread Glenn Linderman

On approximately 12/8/2008 9:30 AM, came the following characters from 
the keyboard of [EMAIL PROTECTED]:



If warnings were emitted, then files would not be silently ignored,
yet the program could still be used.



Yep, this is sounding useful.



PS: I'd like to see a similar warning issued when an access attempt
is made through os.environ to a variable that cannot be decoded.



And argv ?  Seems like the warning technique could be useful for _any_ 
interface that has been traditionally bytes, because that's the kind of 
characters that were, but now should move to (Unicode) characters.


The warnings could be the same, or very similar.

The question is if one global control should handle all types of bytes 
problems, or if there should be individual controls for each bytes 
problem, or both.  I tend to believe in both; the paranoid can set 
exactly the ones they've coded for, the aggressive can set the global 
one.  In this manner, new cases can be added to the global settings over 
time, if more are discovered -- it should be documented to handle future 
similar issues in a similar manner.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Hagen Fürstenau

 If the Unicode APIs only have correct unicode, sure.  If not you'll
 get errors translating to UTF-8 (and the byte APIs are supposed to
 pass bad names through unaltered.)  Kinda ironic, no?

As far as I can see all Python Unicode strings can be encoded to UTF-8,
even things like lone surrogates because Python doesn't care about them.
So both the Unicode API and the binary API would be fail-safe on Windows.

- Hagen


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen

On Sun, Dec 7, 2008 at 2:35 AM, Hagen Fürstenau [EMAIL PROTECTED] wrote:
 As far as I can see all Python Unicode strings can be encoded to UTF-8,
 even things like lone surrogates because Python doesn't care about them.
 So both the Unicode API and the binary API would be fail-safe on Windows.

 Python is broken and needs to be fixed.

 http://bugs.python.org/issue3672
 http://bugs.python.org/issue3297

 But the question of whether Python should care about lone surrogates or
 not is at best tangential to the issue at hand.  If you have lone
 surrogates in the Unicode API (and didn't raise an exception on the way
 getting there), then the sensible thing is to encode them into lone
 UTF-8 surrogates.  Even if you wanted to prevent lone surrogates,
 encoding to UTF-8 for the binary API would not be the place to enforce it.

No.  Unicode *requires* them to be treated as errors.  If you want to
pass them through then you're creating a custom encoding... which you
might argue for in this case, but it needs to be clearly separate from
the real UTF-8.
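For comparison, this is how the separation looks in CPython releases that postdate this thread (3.1 and later, as far as I know): the strict UTF-8 codec refuses a lone surrogate, and passing it through requires the explicitly named 'surrogatepass' handler. The snippet is only a demonstration of that behaviour.

```python
lone = "\ud800"  # a lone high surrogate, not a valid Unicode scalar value

# Strict UTF-8 treats the lone surrogate as an error.
try:
    lone.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# The separately named handler produces the three-byte form on request.
passed_through = lone.encode("utf-8", "surrogatepass")
round_tripped = passed_through.decode("utf-8", "surrogatepass")
```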


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Toshio Kuratomi

[EMAIL PROTECTED] wrote:
 
 On 06:07 am, [EMAIL PROTECTED] wrote:
 Most apps aren't file managers or ftp clients but when they interact
 with files (for instance, a file selection dialog) they need to be able
 to show the user all the relevant files.  So on an app-by-app basis the
 need for this is high.
 
 While I tend to agree emphatically with this, the *real* solution here
 is a path-abstraction library.

Why don't you send me some information offlist.  I'm not sure I agree
that a path-abstraction library can work correctly but if it can it
would be nice to have that at a level higher than the file-dialog
libraries that I was envisioning.

[snip]

 ... but that still
 doesn't help me identify when someone would expect that asking python
 for a list of all files in a directory or a specific set of files in a
 directory should, without warning, return only a subset of them.  In
 what situations is this appropriate behaviour?
 
 If you say listdir(unicode) on a POSIX OS, your program is saying "I
 only know how to deal with unicode results from this function, so please
 only give me those."

No.  (explained below)

  If your program is smart enough to deal with
 bytes, then you would have asked for bytes, no?

Yes (explained below)

  Returning only
 filenames which can be properly decoded makes sense.  Otherwise everyone
 needs to learn about this highly confusing issue, even for the simplest
 scripts.

os.listdir(unicode) (currently) means that the *programmer* is asking
that the stdlib return the decodable filenames from this directory.  The
question is whether the programmer understood that this is what they
were asking for and whether it is what they most likely want.  I would
make the following statements with respect to this:

1) The programmer most likely does not want the decodable filenames and only
the decodable filenames.  If they did, we'd see a lot of python2.x code that
turns pathnames into unicode and discards everything that wasn't
decodable.  No one has given a use case for finding only the *decodable*
subset of files.  If I request to see all *.py files in a directory, I
want to see all of the *.py files in the directory, decodable or not.
If you can show how programmers intend 90% of their calls to
os.listdir()/glob.glob('*.txt') to show only the decodable subset of the
results, then the foundation of my arguments is gone.  So please, give
examples to prove this wrong.

  - If this is true, a definition of os.listdir(type 'str') that would
better meet programmer expectation would be: "Give me all files in a
directory with the output as str type".  The definition of
os.listdir(type 'bytes') would be "Give me all files in a directory
with the output as bytes type".  Raising an exception when the filenames
are undecodable is perfectly reasonable in this situation.

2) For the programmer to understand the difference between
os.listdir(type 'bytes') and os.listdir(type 'str') they have to
understand the highly confusing issue and what it means for their
code.  So the current method is forcing programmers to understand it
even for the simplest scripts if their environment is not uniform with
no clue from the interpreter that there is an issue.

  - Similarly, raising an exception on undecodable values means that the
programmer can ignore the issue in any scripts in sane environments and
will be told that they need to deal with it (via an exception) when
their script runs in a non-sane environment.

3) The usage of unicode vs bytes is easy to miss for someone starting
with py2.x or windows and moving to a multi-platform or unix project.
Even simple testing won't reveal the problem unless the programmer knows
that they have to test what happens when encodings are mixed.  Once
again, this is requiring the programmer to understand the encoding issue
 without help from the interpreter.

 Skipping undecodable values is good enough that it will work 90% of the
 time.

You and Guido have now made this claim to defend not raising an
exception but I still don't have a use case.

Here are use cases that I see:

* Bill is coding an application for use inside his company.  His company
only uses utf-8.  His code naively uses os.listdir(type 'str').

  - The code does not throw an exception whether we use the current
os.listdir() or one that could throw an exception because the system
admins have sanitised the environment.  Bill did not need to understand
the implications of encoding for his code to work in this script whether
simple or complex.

* Mary is coding an application for use inside her company.  It finds
all html files on a system and updates her company's copyright, privacy
policy, and other legal boilerplate.  Her expectation is that after her
program runs every file will have been updated.  Her environment is a
mixture of different filename encodings due to having many legacy
documents for users in different locales.  Mary's code also naively uses
os.listdir(type 'str').  Her test case checks that the code does

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Michael Urman

On Sun, Dec 7, 2008 at 11:35, Adam Olsen [EMAIL PROTECTED] wrote:
 http://bugs.python.org/issue3672
 http://bugs.python.org/issue3297

 No.  Unicode *requires* them to be treated as errors.  If you want to
 pass them through then you're creating a custom encoding... which you
 might argue for in this case, but it needs to be clearly separate from
 the real UTF-8.

I suspect it is a common and convenient but (according to what you
say) misconceived expectation that using UTF-8 to encode any Unicode
string will not raise an exception. This behavior is not something
which should be discarded lightly.

I see little reason that this couldn't be a new codec or error handler
that allowed people to choose between correct pure UTF-8 behavior or
the technically incorrect but very practical behavior it currently
has.

[My apologies, Adam, for sending this only to you the first time]
-- 
Michael Urman

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen

On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman [EMAIL PROTECTED] wrote:
 On Sun, Dec 7, 2008 at 11:35, Adam Olsen [EMAIL PROTECTED] wrote:
 http://bugs.python.org/issue3672
 http://bugs.python.org/issue3297

 No.  Unicode *requires* them to be treated as errors.  If you want to
 pass them through then you're creating a custom encoding... which you
 might argue for in this case, but it needs to be clearly separate from
 the real UTF-8.

 I suspect it is a common and convenient but (according to what you
 say) misconceived expectation that using UTF-8 to encode any Unicode
 string will not raise an exception. This behavior is not something
 which should be discarded lightly.

It is *not* a valid Unicode string in the first place.  Therein lies
the problem.


 I see little reason that this couldn't be a new codec or error handler
 that allowed people to choose between correct pure UTF-8 behavior or
 the technically incorrect but very practical behavior it currently
 has.

Note that many of the restrictions were added for security reasons.
You might receive a UTF-8 encoded file name from a malicious user,
check if it contains something dangerous (like
../../../../../etc/password), then decode it.  If your decoder isn't
compliant (ie doesn't check for overly long sequences) then a
b'\xC0\xAF' gets translated into u'/', bypassing your previous check.
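The overlong case can be reproduced in a few lines. The function naive_decode_two_byte is a deliberately non-compliant decoder written for this illustration; the second half shows that Python's own UTF-8 codec rejects the same bytes.

```python
def naive_decode_two_byte(raw):
    """Deliberately non-compliant: decodes a 2-byte sequence without range checks."""
    lead, trail = raw
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


overlong_slash = b"\xc0\xaf"

# A decoder that skips the overlong check happily produces a slash.
naive_result = naive_decode_two_byte(overlong_slash)

# A compliant decoder treats the same bytes as an error.
try:
    overlong_slash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```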

However, in this context we only need to allow lone surrogates.
CESU-8 comes to mind.  (It is a perverse world we live in.)

-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Terry Reedy


Toshio Kuratomi wrote:


  - If this is true, a definition of os.listdir(type 'str') that would
better meet programmer expectation would be: Give me all files in a
directory with the output as str type.  The definition of
os.listdir(type 'bytes') would be Give me all files in a directory
with the output as bytes type.  Raising an exception when the filenames
are undecodable is perfectly reasonable in this situation.


Your examples (snipped) pretty well convince me that there is a use case 
for raising exceptions.  We should move beyond arguing over which one 
way is right.  I think there should be a second argument 
'ignorebad=False' to ignore undecodable files rather than raise the 
exception (or 'strict=True' to stop and raise exception on non-decodable 
names -- then code is 'if strict: raise ...').  I believe other 
functions have a similar parameter.


tjr


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Guido van Rossum

On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy [EMAIL PROTECTED] wrote:
 Toshio Kuratomi wrote:

  - If this is true, a definition of os.listdir(type 'str') that would
 better meet programmer expectation would be: Give me all files in a
 directory with the output as str type.  The definition of
 os.listdir(type 'bytes') would be Give me all files in a directory
 with the output as bytes type.  Raising an exception when the filenames
 are undecodable is perfectly reasonable in this situation.

 Your examples (snipped) pretty well convince me that there is a use case for
 raising exceptions.  We should move beyond arguing over which one way is
 right.  I think there should be a second argument 'ignorebad=False' to
 ignore undecodable files rather than raise the exception (or 'strict=True'
 to stop and raise exception on non-decodable names -- then code is 'if
 strict: raise ...').  I believe other functions have a similar parameter.

If you want the exceptions, just use the bytes API and try to decode
the byte strings using the system encoding.
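A minimal sketch of that approach, assuming Python 3.2 or later for os.fsencode. The wrapper name listdir_strict is made up for this example; a UnicodeDecodeError from any single name propagates to the caller.

```python
import os
import sys
import tempfile


def listdir_strict(path):
    """List a directory via the bytes API, decoding each name strictly."""
    encoding = sys.getfilesystemencoding()
    raw_names = os.listdir(os.fsencode(path))  # bytes in, bytes out
    return [raw.decode(encoding, "strict") for raw in raw_names]


# Usage on a throwaway directory containing one well-behaved name.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.txt"), "w").close()
    names = listdir_strict(tmp)
```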

My problem with raising exceptions *by default* when an undecodable
name exists is that it may render an app completely useless in a
situation where the developer is no longer around. This happened all
the time with the 2.x Unicode API, where the developer hadn't
anticipated a particular input potentially containing non-ASCII bytes,
and the user fed the application non-ASCII text. Making os.listdir
raise an exception when a directory contains a single undecodable file
means that the entire directory can't be read, and most likely the
entire app crashes at that point. Most likely the developer never
anticipated this situation (since in most places it is either
impossible or very unlikely) -- after all, if they had anticipated it
they would have used the bytes API in the first place. (It's worse
because the exception being raised would be UnicodeError -- most
people expect os.listdir to raise OSError, not other errors.)

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Nick Coghlan

Terry Reedy wrote:
 Toshio Kuratomi wrote:
 
   - If this is true, a definition of os.listdir(type 'str') that would
 better meet programmer expectation would be: Give me all files in a
 directory with the output as str type.  The definition of
 os.listdir(type 'bytes') would be Give me all files in a directory
 with the output as bytes type.  Raising an exception when the filenames
 are undecodable is perfectly reasonable in this situation.
 
 Your examples (snipped) pretty well convince me that there is a use case
 for raising exceptions.  We should move beyond arguing over which one
 way is right.  I think there should be a second argument
 'ignorebad=False' to ignore undecodable files rather than raise the
 exception (or 'strict=True' to stop and raise exception on non-decodable
 names -- then code is 'if strict: raise ...').  I believe other
 functions have a similar parameter.

If we were going to do anything like that for os.listdir() and other
filesystem APIs (like glob) that return multiple paths, we'd probably be
best advised to just have a normal Unicode 'errors' parameter which allowed:

'strict' - raise an Exception for malformed binary data
'replace' - insert '?' or some other symbol in place of malformed binary
data
'ignore' - simply leave out the malformed binary data
'skip' - run the underlying codec in strict mode, but skip over any
items which raise UnicodeDecodeError (default/current Py3k behaviour)

Obviously, 'skip' doesn't make any sense for APIs like getcwd() that
return a single value - a case could be made for those defaulting to
either replace or strict.
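The four modes can be sketched on top of a list of raw names. This is an illustration only: decode_names is a hypothetical helper, not an existing os function, and 'skip' is not a registered codec error handler, so it is handled separately.

```python
def decode_names(raw_names, encoding="utf-8", errors="skip"):
    """Hypothetical helper applying an 'errors' policy to a list of raw names."""
    if errors != "skip":
        # 'strict', 'replace' and 'ignore' are ordinary codec error handlers.
        return [raw.decode(encoding, errors) for raw in raw_names]
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            continue  # drop the whole item, not just the bad bytes
    return decoded


raw = [b"ok.txt", b"Andr\xe9.txt"]
skipped = decode_names(raw, errors="skip")
replaced = decode_names(raw, errors="replace")
ignored = decode_names(raw, errors="ignore")
```

Note that the built-in 'replace' handler inserts U+FFFD when decoding rather than a literal '?'.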

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Greg Ewing


Nick Coghlan wrote:


For binary wrappers around the Windows Unicode APIs, I was thinking
specifically of using UTF-8, since that should be able to encode
anything the Unicode APIs can handle.


Why shouldn't the binary interface just expose the raw
utf16 as bytes?

--
Greg

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Terry Reedy


Guido van Rossum wrote:

On Sun, Dec 7, 2008 at 1:20 PM, Terry Reedy [EMAIL PROTECTED] wrote:

Toshio Kuratomi wrote:


 - If this is true, a definition of os.listdir(type 'str') that would
better meet programmer expectation would be: Give me all files in a
directory with the output as str type.  The definition of
os.listdir(type 'bytes') would be Give me all files in a directory
with the output as bytes type.  Raising an exception when the filenames
are undecodable is perfectly reasonable in this situation.

Your examples (snipped) pretty well convince me that there is a use case for
raising exceptions.  We should move beyond arguing over which one way is
right.  I think there should be a second argument 'ignorebad=False' to
ignore undecodable files rather than raise the exception (or 'strict=True'
to stop and raise exception on non-decodable names -- then code is 'if
strict: raise ...').  I believe other functions have a similar parameter.


I was thinking of the normal Unicode 'errors' parameter, as described 
by Nick.



If you want the exceptions, just use the bytes API and try to decode
the byte strings using the system encoding.


If it was a matter of adding a new method, I might agree.  But:

1. We already have a method that does exactly what you describe.  It is 
only a matter of adding flexibility to the response to problems, for 
which there is already precedent.


2. Suggesting that people who want strings and not bytes should have to
deal with bytes, just to get an error notification, seems to negate the
point of moving to 3.0.


3. A builtin would probably do so better than most programmers would, 
with little touches such as the one suggested below.


4. An error parameter would ALERT programmers to the possibility of a 
PROBLEM, both in the present and future.  As you say below, people need 
to better anticipate the future.



My problem with raising exceptions *by default* when an undecodable
name exists is that it may render an app completely useless in a
situation where the developer is no longer around. This happened all
the time with the 2.x Unicode API, where the developer hadn't
anticipated a particular input potentially containing non-ASCII bytes,
and the user fed the application non-ASCII text. Making os.listdir
raise an exception when a directory contains a single undecodable file
means that the entire directory can't be read, and most likely the
entire app crashes at that point. Most likely the developer never
anticipated this situation (since in most places it is either
impossible or very unlikely) -- after all, if they had anticipated it
they would have used the bytes API in the first place. (It's worse
because the exception being raised would be UnicodeError -- most
people expect os.listdir to raise OSError, not other errors.)


This to me is an argument for keeping the default the current behavior,
but not for rejecting flexibility.  The computing world seems to be
messier than we would like and worse than I realized until this week.
As you say below, people need to better anticipate the future, and an
errors parameter would help do that.



Is Windows really immune?  What about when it reads the directory of 
possibly old removable media with whatever byte name encodings?  Is this 
a possible source of 'unanticipated' problems?


As to your last sentence, os.listdir() with an errors parameter could
convert a decoding UnicodeError to "OSError: undecodable file name"
followed by an ascii+hex repr, thereby supplying the expected exception
as well as an extractable representation of the problematic raw bytes.
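A sketch of that conversion for a single name, under the assumption that the repr of the bytes object is an acceptable "ascii+hex" form. The function name is hypothetical.

```python
def decode_name_or_oserror(raw, encoding="utf-8"):
    """Hypothetical helper: re-raise a decoding failure as OSError."""
    try:
        return raw.decode(encoding, "strict")
    except UnicodeDecodeError as exc:
        # repr() of bytes is pure ASCII with hex escapes for the odd bytes.
        raise OSError("undecodable file name: %r" % (raw,)) from exc


try:
    decode_name_or_oserror(b"Andr\xe9.txt")
    message = None
except OSError as err:
    message = str(err)
```

Chaining with "from exc" keeps the original UnicodeDecodeError available for anyone who wants the exact byte offsets.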


Here is a possible use case: I want filenames as 3.0 strings and I 
anticipate no problems at present but, as you say above, something might 
happen years in the future.  I am using 3.0 *because* of the strings == 
unicode feature.  I would like to write


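try:
  # 'errors' is the parameter proposed here, not an existing argument
  files = os.listdir(somedir, errors='strict')
except OSError as e:
  # log() stands for whatever logging facility the app uses
  log('undecodable file name in %r: %s' % (somedir, e))
  files = os.listdir(somedir)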

and go on without the problem file but not without logging the problem 
so a future maintainer can consider what to do about it, but only when 
there is an actual need to think about it.
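
For comparison, much the same effect can be approximated without a new 
parameter by listing the directory through the bytes API and decoding 
strictly by hand.  The following is only a sketch: listdir_strict and 
decode_names are hypothetical helper names, UTF-8 is an assumed encoding, 
and os.fsencode exists only in later Python 3 releases (3.2+), not in 3.0.

```python
import os


def decode_names(raw_names, encoding="utf-8"):
    """Split raw byte file names into (decoded, undecodable) lists."""
    good, bad = [], []
    for raw in raw_names:
        try:
            good.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError:
            bad.append(raw)
    return good, bad


def listdir_strict(path, encoding="utf-8"):
    """Hypothetical helper: like os.listdir, but raise OSError if any
    name cannot be decoded, including the raw bytes in the message."""
    raw_names = os.listdir(os.fsencode(path))  # bytes in, bytes out
    good, bad = decode_names(raw_names, encoding)
    if bad:
        raise OSError("undecodable file name(s) in %r: %r" % (path, bad))
    return good
```

A caller could catch the OSError, log it, and fall back to the plain 
os.listdir(path) call, as in the use case above.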


Terry Jan Reedy


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Glenn Linderman

On approximately 12/7/2008 10:56 AM, came the following characters from 
the keyboard of Adam Olsen:



You might receive a UTF-8 encoded file name from a malicious user,
check if it contains something dangerous (like
../../../../../etc/password), then decode it.  If your decoder isn't
compliant (ie doesn't check for overly long sequences) then a
b'\xC0\xAF' gets translated into u'/', bypassing your previous check.



You might indeed.

But if you are interested in checking for security issues, shouldn't you 
 _first_ decode into some canonical form, specifying what sorts of 
Unicode strictness (such as overlong sequences) to check for during the 
decode process, and once the string is in canonical form, _then_ do 
checks for various attacks, such as the ../ sequence you mention?


And with that order of operation, even if you don't reject overlong 
sequences, you have canonized them, and can recognize the resulting 
characters as good or bad.
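
A minimal sketch of the "decode first, then check" ordering, assuming 
UTF-8 input and a deliberately simplistic path check 
(is_safe_relative_path is a hypothetical name, not an existing API). 
Note that Python 3's built-in UTF-8 codec is strict, so the overlong 
sequence is rejected at the decode step here rather than canonized.

```python
def is_safe_relative_path(raw: bytes) -> bool:
    """Decode first, then inspect the decoded text for '..' components."""
    try:
        text = raw.decode("utf-8")  # strict decoding is the default
    except UnicodeDecodeError:
        return False  # invalid or overlong sequences never reach the check
    if text.startswith("/"):
        return False
    return ".." not in text.split("/")
```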



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Stephen J. Turnbull

Glenn Linderman writes:

  But if you are interested in checking for security issues, shouldn't you 
_first_ decode into some canonical form,

Yes.  That's all that is being asked for: that Python do strict
decoding to a canonical form by default.  That's a lot to ask, as it
turns out, but that is what we (the minority of strict Unicode
adherents, that is) want.

If you want the convenience and risk, I believe you should ask for it
by name (I suggest a name like own_me for the relaxed decoding
flag <wink>).  Failing that, it would be nice to have a global flag to
change the default.


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Glenn Linderman

On approximately 12/7/2008 8:13 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:

Glenn Linderman writes:

  But if you are interested in checking for security issues, shouldn't you 
_first_ decode into some canonical form,


Yes.  That's all that is being asked for: that Python do strict
decoding to a canonical form by default.  That's a lot to ask, as it
turns out, but that is what we (the minority of strict Unicode
adherents, that is) want.



I have no problem with having strict validation available.  But doesn't 
validation take significantly longer than decoding?  So I think it 
should be logically decoupled... do validation when/where it is needed 
for security reasons, and allow internal [de]coding to be faster.


I'm mostly indifferent about which should be the default... maybe there 
shouldn't be a default!  Use the vUTF-8 decoder for strict validation, 
and the fUTF-8 decoder for the faster, non-validating version.  Or 
something like that.  With appropriate documentation.  Of course, 
UTF-8 already exists... as fUTF-8, so for compatibility, I guess it 
shouldn't change... but it could be deprecated.



You didn't address the issue that if the decoding to a canonical form is 
done first, many of the insecurities just go away, so why throw errors?



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen

On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman [EMAIL PROTECTED] wrote:
 On approximately 12/7/2008 8:13 PM, came the following characters from the
 keyboard of Stephen J. Turnbull:

 Glenn Linderman writes:

   But if you are interested in checking for security issues, shouldn't
 you _first_ decode into some canonical form,

 Yes.  That's all that is being asked for: that Python do strict
 decoding to a canonical form by default.  That's a lot to ask, as it
 turns out, but that is what we (the minority of strict Unicode
 adherents, that is) want.


 I have no problem with having strict validation available.  But doesn't
 validation take significantly longer than decoding?  So I think it should be
 logically decoupled... do validation when/where it is needed for security
 reasons, and allow internal [de]coding to be faster.

I'd like to see benchmarks of such a claim.


 I'm mostly indifferent about which should be the default... maybe there
 shouldn't be a default!  Use the vUTF-8 decoder for strict validation, and
 the fUTF-8 decoder for the faster, non-validating version.  Or something
 like that.  With appropriate documentation.  Of course, UTF-8 already
 exists... as fUTF-8, so for compatibility, I guess it shouldn't change...
 but it could be deprecated.


 You didn't address the issue that if the decoding to a canonical form is
 done first, many of the insecurities just go away, so why throw errors?

Unicode is intended to allow interaction between various bits of
software.  It may be that a library checked it in UTF-8, then passed
it to python.  It would be nice if the library validated too, but a
major advantage of UTF-8 is older libraries (or protocols!) intended
for ASCII need only be 8-bit clean to be repurposed for UTF-8.  Their
security checks continue to work, so long as nobody down stream
introduces problems with a non-validating decoder.


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-07 Thread Adam Olsen

On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman [EMAIL PROTECTED] wrote:
 On approximately 12/7/2008 9:11 PM, came the following characters from the
 keyboard of Adam Olsen:
 On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman [EMAIL PROTECTED]
 wrote:

 Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C, I
 wonder if I could find that code?  Can you supply a validated decoder?  Then
 we could run some benchmarks, eh?

There is no point for me, as the behaviour of a real UTF-8 codec is
clear.  It is you who needs to justify a second non-standard UTF-8-ish
codec.  See below.


 You didn't address the issue that if the decoding to a canonical form is
 done first, many of the insecurities just go away, so why throw errors?

 Unicode is intended to allow interaction between various bits of
 software.  It may be that a library checked it in UTF-8, then passed
 it to python.  It would be nice if the library validated too, but a
 major advantage of UTF-8 is older libraries (or protocols!) intended
 for ASCII need only be 8-bit clean to be repurposed for UTF-8.  Their
 security checks continue to work, so long as nobody down stream
 introduces problems with a non-validating decoder.


 So I don't understand how this is responsive to the "decoding removes many
 insecurities" issue?

 Yes, you might use libraries.  Either they have insecurities, or not. Either
 they validate, or not.  Either they decode, or not.  They may be immune to
 certain attacks, because of their structure and code, or not.

 So when you examine a library for potential use, you have documentation or
 code to help you set your expectations about what it does, and whether or
 not it may have vulnerabilities, and whether or not those vulnerabilities
 are likely or unlikely, whether you can reduce the likelihood or prevent the
 vulnerabilities by wrapping the API, etc.  And so you choose to use the
 library, or not.

 This whole discussion about libraries seems somewhat irrelevant to the
 question at hand, although it is certainly true that understanding how a
 library handles Unicode is an important issue for the potential user of a
 library.

 So how does a non-validating decoder introduce problems?  I can see that it
 might not solve all problems, but how does it introduce problems? Wouldn't
 the problems be introduced by something else, and the use of a
 non-validating decoder may not catch the problem... but not be the cause of
 the problem?

 And then, if you would like to address the original issue, that would be
 fine too.

Your non-validating decoder is translating an invalid sequence into a
valid one, thus you are introducing the problem.  A completely naive
environment (8-bit clean ASCII) would leave it as an invalid sequence
throughout.

This is not a theoretical problem.  See
http://tools.ietf.org/html/rfc3629#section-10 .  We MUST reject
invalid sequences, or else we are not using UTF-8.  There is no wiggle
room, no debate.
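
To make the overlong-sequence case concrete, here is a small sketch.  The 
lax_decode_two_byte function is a hypothetical, deliberately unsafe 
decoder written only for illustration; it is not anyone's actual codec.  
It shows the invalid bytes C0 AF turning into '/', while the strict 
built-in codec refuses them.

```python
def lax_decode_two_byte(lead: int, cont: int) -> str:
    """UNSAFE: decode a 2-byte UTF-8 sequence without rejecting
    overlong forms (lead bytes 0xC0 and 0xC1)."""
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def strict_decode(raw: bytes) -> str:
    """Python's built-in codec; raises UnicodeDecodeError on C0 AF."""
    return raw.decode("utf-8")


# The invalid pair C0 AF becomes a plain slash under the lax decoder,
# so a check made on the bytes beforehand would have missed it.
smuggled = lax_decode_two_byte(0xC0, 0xAF)
```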

(The absoluteness is why the standard behaviour doesn't need a
benchmark.  You are essentially arguing that, when logging in as root
over the internet, it's a lot faster if you use telnet rather than
ssh.  One is simply not an option.)


-- 
Adam Olsen, aka Rhamphoryncus

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Oleg Broytmann

On Fri, Dec 05, 2008 at 08:37:45PM -0500, James Y Knight wrote:
 On Dec 5, 2008, at 7:48 PM, Nick Coghlan wrote:
 You can't display a non-decodable filename to the user, hence the user
 will have no idea what they're working on. Non-filesystem related apps
 have no business trying to deal with insane filenames.

 Sigh, same arguments, all over again.

 Again, *both* KDE and Gnome apps display non-decodable filenames to the 
 user, and let the user work with the files. They display as good a  
 rendition as they can, using a replacement character as appropriate. In 
 some earlier versions, KDE did not work at all on poorly-encoded files, 
 and, users submitted bug reports. People do care, it does happen in real 
 life, and it is a bug in your software if you cannot deal with the users' 
 files. They just want the software to work. If it shows something weird 
 in the window titlebar, that's a bit irritating but at least it doesn't 
 get in the way of working.

   I agree 100%. Russian Unix users use at least 5 different encodings
(koi8-r, cp1251 and utf-8 are the most frequent in use, cp866 and
iso-8859-5 are less frequent). I have an FTP server with some filenames in
koi8 encoding - these filenames are for unix clients, - and some filenames
in cp1251 for w32 clients. Sometimes I run utf-8 xterm (I am
a commandline/console unixhead) for my needs (read email, write files in
utf-8 with characters beyond koi8-r, which is my primary encoding) - and
I still can work with filenames in koi8/cp1251 encodings. My filemanager
(Midnight Commander, for the matter) shows these files and directories as
?.???, but I can chdir to such directories, and I can open such
files. It would be a big bad blow for me if filemanagers (or other
programs) start to filter these filenames.

Oleg.
-- 
 Oleg Broytmann    http://phd.pp.ru/    [EMAIL PROTECTED]
   Programmers don't die, they just GOSUB without RETURN.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Oleg Broytmann

On Sat, Dec 06, 2008 at 12:03:55PM +1100, Steven D'Aprano wrote:
 I'd rather have the Python API report errors than silence them, at least 
 by default.

   +1 for encoding errors by default.

Oleg.
-- 
 Oleg Broytmann    http://phd.pp.ru/    [EMAIL PROTECTED]
   Programmers don't die, they just GOSUB without RETURN.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Oleg Broytmann

On Sat, Dec 06, 2008 at 02:22:29AM +0100, Martin v. Löwis wrote:
 And environment variables, command line arguments, and file names
 are not bytes, but characters.

   There is no such thing as plain text! If you say these are
characters you must also name the encoding for them. LANG/LC_ALL/LC_CTYPE
provide a sensible default, but if a program has problems decoding bytes to
characters there must be a way for the user to override the default. But
the user must be notified about the error, so programs must not silently
filter out non-decodable characters.
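
A rough sketch of what "locale default plus user override" might look 
like; decode_with_override is a hypothetical helper, and the override 
argument stands in for whatever command-line switch or setting a real 
program would offer.

```python
import locale


def decode_with_override(raw: bytes, override=None) -> str:
    """Decode with the locale's preferred encoding unless the user
    names another one.  Decoding is strict, so a wrong guess surfaces
    as UnicodeDecodeError instead of being silently filtered."""
    encoding = override or locale.getpreferredencoding(False)
    return raw.decode(encoding)


name = "\u0444\u0430\u0439\u043b"  # Cyrillic word for "file"
raw = name.encode("koi8-r")

# The same bytes give different text depending on the encoding named.
as_koi8 = decode_with_override(raw, override="koi8-r")
as_cp1251 = decode_with_override(raw, override="cp1251")
```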

Oleg.
-- 
 Oleg Broytmann    http://phd.pp.ru/    [EMAIL PROTECTED]
   Programmers don't die, they just GOSUB without RETURN.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Toshio Kuratomi

Nick Coghlan wrote:
 Toshio Kuratomi wrote:

 Nonsense.  A program can do tons of things with a non-decodable
 filename.  Where it's limited is non-decodable filedata.
 
 You can't display a non-decodable filename to the user, hence the user
 will have no idea what they're working on. Non-filesystem related apps
 have no business trying to deal with insane filenames.
 
This is where we disagree.  There are many ways to display the
non-decodable filename to the user because the user is not a machine.
The computer must know the unique sequence of bytes in order to access a
file. The user, OTOH, usually only needs to know that the file exists.
In most GUI-based end-user oriented desktop apps, it's enough to do
str(filename, errors='replace').  For instance, the GNOME file manager
displays:
  ? (Invalid encoding)
and Konqueror, the KDE file manager just displays:
  ?

The file can still be displayed this way, accessed via the raw bytes
that the program keeps internally, and operated upon by applications.
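
As a minimal sketch of that split between "name for display" and "name 
for access" (display_name and escaped_name are hypothetical helper names, 
and UTF-8 is an assumed encoding; the "backslashreplace" handler works 
for decoding only in later Python releases, 3.5+):

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Lossy rendition for showing to the user; never raises."""
    return raw.decode(encoding, errors="replace")


def escaped_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Rendition that keeps undecodable bytes visible as \\xNN escapes,
    similar in spirit to what 'ls -b' prints."""
    return raw.decode(encoding, errors="backslashreplace")


# The program keeps the raw bytes for open(), rename(), and so on, and
# uses the decoded string only for display.
raw = b"caf\xe9"          # Latin-1 bytes, not valid UTF-8
shown = display_name(raw)
escaped = escaped_name(raw)
```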

For applications in which the user needs more information to
differentiate the files the program has the option to display the raw
byte sequences as if they were the filename.  The *NIX shell and command
line tools have this ability.

$ LANG=en_US.utf8 ls -b
á
í
$ LANG=C ls -b
.
..
\303\241
\303\255
$ mv $'\303\241' $'\303\263'
$ LANG=C ls -b
\303\255
\303\263
$ LANG=en_US.utf8 ls -b
í
ó

 Linux is moving towards a standard of UTF-8 for filenames, and once we
 get to the point where the idea of encoding filenames and environment
 variables any other way is seen as crazy, then the Python 3 approach
 will work seamlessly.
 
<nod>  With the caveat that I haven't seen movement by Linux and other
Unix variants to enforce UTF-8.  What I have seen are statements by
kernel programmers that having the filesystem use bytes and not know
about encoding is the correct thing to do.

This means that utf-8 will be a convention rather than a necessity for a
very long time and consequently programs will need to worry about the
problems of mixed encoding systems for an equally long time.  (Remember,
encoding is something that can be changed per user and per file.  So on
a multiuser OS, mixed encodings can be out of the control of the system
administrator for perfectly valid reasons.)

 In the meantime, raw bytes APIs will provide an alternative for those
 that disagree with that philosophy.
 
Oh I agree with the UTF-8 everywhere philosophy.  I just know that
there's tons of real-world systems out there that don't conform to my
expectations for sanity and my code has to account for those :-)

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Guido van Rossum

On Fri, Dec 5, 2008 at 10:18 PM, Bugbee, Larry [EMAIL PROTECTED] wrote:
 There has been some discussion here that users should use the str or
 byte function variant based on what is relevant to their system, for
 example when getting a list of file names or opening a file.  That
 thought process really doesn't do much for those of us that write code
 that needs to run on any platform type, without alteration or the
 addition of complex if-statements and/or exceptions.

 Whatever the resolution here, and those of you addressing this thorny
 issue have my admiration, the solution should be such that it gives
 consistent behavior regardless of platform type and doesn't require the
 programmer to know of all the minute details of each possible target
 platform.

My prediction is that it won't ever be possible to completely hide
this difference between platforms. The platforms differ fundamentally
in how they see filenames. An elaborate abstraction can certainly be
created that smooths out most of the differences, but at some point
useful functionality will have to be lost in order to maintain strict
platform independence. This is the fate of most platform-independence
abstractions by the way. For example, there are many elaborate
packages for platform-independent I/O, but they generally don't
provide access to all functionality that is available on a platform.
Where they do, the application is once again placed in the position of
having to use complex if-statements and/or exceptions.

Consider just this example. Many programs have a need to ask their
user for a filename to be created by the program. On systems where
filenames are raw byte strings, do you want to provide the user with a
way to specify an arbitrary byte string? (That is, in addition to the
normal case of entering a text string that will be transformed into a
filename using some encoding.) Your choices are either not to support
the case of bytes that aren't a valid encoding in the current
encoding, or add a UI element to select an encoding, or add a UI
element to enter raw bytes. An abstraction package is likely to only
support the first option (this is what Java does BTW), but this is not
acceptable to all applications.

 That may not be possible for a while, so interim solutions should be
 such that it minimizes later pain.  If that means hiding implementation
 details behind a new function, so be it.  Then, at least, the body of
 one's app is not burdened with this problem later when conditions
 change.

I believe the problem's severity is actually overstated. The interim
solution with the least amount of pain that will work for almost all
apps is to treat filenames as text strings encoded in some default
encoding, and ignore filenames that aren't valid encodings of any text
string. Yes, it is possible that you'll find that you can't completely
remove or traverse certain directory trees. But that's a fact of life
anyway (filesystems have many hidden failure modes), so you're better
off dealing with *that* possibility than worrying over the issue of
undecodable filenames.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Guido van Rossum

On Fri, Dec 5, 2008 at 8:57 PM, Tres Seaver [EMAIL PROTECTED] wrote:
 Amen!  The idea that paths, environment variables, and stuff pulled off
 of sockets can be treated as text rather than strings is just wishful
 thinking.

Unfortunately most of the programmers of the world *do* think that
way(*), and it's not easy to wean them off the idea. It's a powerful
meme that you can use your own name as a file name, even if you happen
to be Czech or Vietnamese -- and it's promoted by the two most popular
consumer operating systems.

(*) With the exception of sockets. Sockets are typically dealt with
through protocols and APIs that provide guidance about how to convert
between bytes and strings, and whether that is even a meaningful
operation.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Toshio Kuratomi

Bugbee, Larry wrote:
 There has been some discussion here that users should use the str or
 byte function variant based on what is relevant to their system, for
 example when getting a list of file names or opening a file.  That
 thought process really doesn't do much for those of us that write code
 that needs to run on any platform type, without alteration or the
 addition of complex if-statements and/or exceptions.
 
 Whatever the resolution here, and those of you addressing this thorny
 issue have my admiration, the solution should be such that it gives
 consistent behavior regardless of platform type and doesn't require the
 programmer to know of all the minute details of each possible target
 platform.  
 
I've been thinking about this and I can only see one option.  I don't
think that it really makes less work for the programmer, though -- it
just shifts the problem and makes it more apparent what your code is doing.

To avoid exceptions and if-then's in program code when accessing
filenames, environment variables, etc, you would need to access each of
these resources via the byte API.  Then, to avoid having to keep track
of what's a string and what's a byte in your other code, you probably
want to convert those bytes to strings.  This is where the burden gets
shifted.  You'll have your own routine(s) to do the conversion and have
to have exception handling code to deal with undecodable filenames.

Note 1: your particular app might be able to get away without doing the
conversion from bytes to string -- it depends on what you're planning on
doing with the filename/environment data.

Note 2: If there isn't a parallel API on all platforms, for instance,
Guido's proposal to not have os.environb on Windows, then you'll still
have to have a platform specific check. (Likely you should try to access
os.environb in this instance and if it doesn't exist, use os.environ
instead... and remember that you need to either change os.environ's data
into str type or change os.environb's data into byte type.)
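
The check described in Note 2 might be sketched as follows.  This assumes 
an os.environb attribute that is present on some platforms and absent on 
others (which is how later Python 3 releases ended up behaving); 
environ_as_bytes is a hypothetical helper name, and UTF-8 is an assumed 
fallback encoding.

```python
import os


def environ_as_bytes(encoding: str = "utf-8") -> dict:
    """Return the environment as a dict of bytes keys and values."""
    environb = getattr(os, "environb", None)
    if environb is not None:
        # Bytes API is available: pass the values straight through.
        return dict(environb)
    # No bytes API on this platform: encode the str environment so the
    # rest of the program sees a single type.
    return {
        key.encode(encoding): value.encode(encoding)
        for key, value in os.environ.items()
    }
```

The rest of the application can then work with one type regardless of 
platform, at the cost of doing its own decoding wherever text is needed.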

-Toshio




Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread glyph


On 02:34 pm, [EMAIL PROTECTED] wrote:

On Fri, Dec 05, 2008 at 08:37:45PM -0500, James Y Knight wrote:
On Dec 5, 2008, at 7:48 PM, Nick Coghlan wrote:
You can't display a non-decodable filename to the user, hence the user
will have no idea what they're working on. Non-filesystem related apps
have no business trying to deal with insane filenames.

Sigh, same arguments, all over again.

People do care, it does happen in real life, and it is a bug in your
software if you cannot deal with the users' files. They just want the
software to work. If it shows something weird in the window titlebar,
that's a bit irritating but at least it doesn't get in the way of
working.

  I agree 100%. Russian Unix users use at least 5 different encodings
(koi8-r, cp1251 and utf-8 are the most frequent in use, cp866 and
iso-8859-5 are less frequent). I have an FTP server with some filenames in
koi8 encoding - these filenames are for unix clients, - and some filenames
in cp1251 for w32 clients. Sometimes I run utf-8 xterm (I am
a commandline/console unixhead) for my needs (read email, write files in
utf-8 with characters beyond koi8-r, which is my primary encoding) - and
I still can work with filenames in koi8/cp1251 encodings. My filemanager
(Midnight Commander, for the matter) shows these files and directories as
?.???, but I can chdir to such directories, and I can open such
files. It would be a big bad blow for me if filemanagers (or other
programs) start to filter these filenames.


I find it interesting to note that the only users in this discussion who 
actually have these problems in real life all have this attitude.  It is 
expected that in an imperfect world we will have imperfect encodings, 
but it is super important that software which can open files can deal 
with not understanding the character translation of the filename.


Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Guido van Rossum

On Sat, Dec 6, 2008 at 10:53 AM,  [EMAIL PROTECTED] wrote:
 On 02:34 pm, [EMAIL PROTECTED] wrote:
  I agree 100%. Russian Unix users use at least 5 different encodings
 (koi8-r, cp1251 and utf-8 are the most frequent in use, cp866 and
 iso-8859-5 are less frequent). I have an FTP server with some filenames in
 koi8 encoding - these filenames are for unix clients, - and some filenames
 in cp1251 for w32 clients. Sometimes I run utf-8 xterm (I am
 a commandline/console unixhead) for my needs (read email, write files in
 utf-8 with characters beyond koi8-r, which is my primary encoding) - and
 I still can work with filenames in koi8/cp1251 encodings. My filemanager
 (Midnight Commander, for the matter) shows these files and directories as
 ?.???, but I can chdir to such directories, and I can open such
 files. It would be a big bad blow for me if filemanagers (or other
 programs) start to filter these filenames.

 I find it interesting to note that the only users in this discussion who
 actually have these problems in real life all have this attitude.  It is
 expected that in an imperfect world we will have imperfect encodings, but it
 is super important that software which can open files can deal with not
 understanding the character translation of the filename.

For file managers and similar tools I am absolutely 100% in agreement
-- that's why the binary APIs are there.

Most apps aren't file managers or ftp clients though. The sky is not falling.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Nick Coghlan

Toshio Kuratomi wrote:
 Note 2: If there isn't a parallel API on all platforms, for instance,
 Guido's proposal to not have os.environb on Windows, then you'll still
 have to have a platform specific check. (Likely you should try to access
 os.environb in this instance and if it doesn't exist, use os.environ
 instead... and remember that you need to either change os.environ's data
 into str type or change os.environb's data into byte type.)

Note that this is why I personally think the binary API variants
*should* exist on Windows, just with the sense of the system encoding
flipped around.

That is, on *nix:
- underlying OS API uses bytes
- binary API just passes values straight through
- Unicode API uses the system encoding to encode Unicode names and
values to be passed to the OS API and to decode bytes names and values
received from the OS API

While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values
to be passed to the OS API and to encode Unicode names and values
received from the OS API
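
The symmetry described above can be sketched with two toy classes.  
BytesBackedEnv and TextBackedEnv are hypothetical names standing in for 
the *nix and Windows cases; the dict passed in plays the role of the 
underlying OS API, and UTF-8 is an assumed conversion encoding.

```python
class BytesBackedEnv:
    """*nix-like: the underlying store holds bytes."""

    def __init__(self, store, encoding="utf-8"):
        self._store = store          # dict mapping bytes to bytes
        self._encoding = encoding

    def get_bytes(self, key: bytes) -> bytes:
        return self._store[key]      # binary API: straight through

    def get_text(self, key: str) -> str:
        raw = self._store[key.encode(self._encoding)]
        return raw.decode(self._encoding)


class TextBackedEnv:
    """Windows-like: the underlying store holds Unicode strings."""

    def __init__(self, store, encoding="utf-8"):
        self._store = store          # dict mapping str to str
        self._encoding = encoding

    def get_text(self, key: str) -> str:
        return self._store[key]      # Unicode API: straight through

    def get_bytes(self, key: bytes) -> bytes:
        text = self._store[key.decode(self._encoding)]
        return text.encode(self._encoding)
```

Calling code sees the same two methods on either platform; only the side 
on which the conversion happens differs.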

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Antoine Pitrou

Nick Coghlan ncoghlan at gmail.com writes:
 
 If the binary APIs are missing from a major platform (i.e. Windows) then
 the choice to use them brings with it a major cross-platform portability
 problem that should really be handled by the standard library.

+1

I might also add that providing binary APIs does not prevent us from implementing
some special representation of broken filenames when using the unicode APIs (for
example using private Unicode characters - I'm not sure what the right
terminology is - as sometimes suggested).
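
One scheme of this kind that exists in later Python releases (3.1+) is 
the "surrogateescape" error handler, which maps each undecodable byte to 
a lone surrogate code point rather than a private-use character.  A 
minimal sketch of the round trip, with to_text and to_bytes as 
hypothetical helper names and UTF-8 as an assumed encoding:

```python
def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode, smuggling undecodable bytes through as lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode, turning smuggled surrogates back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


raw = b"caf\xe9.txt"            # not valid UTF-8
text = to_text(raw)             # a str, usable with the Unicode APIs
round_tripped = to_bytes(text)  # the original bytes, unchanged
```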

Regards

Antoine.



Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread André Malo

* Nick Coghlan wrote:

 Toshio Kuratomi wrote:
  Note 2: If there isn't a parallel API on all platforms, for instance,
  Guido's proposal to not have os.environb on Windows, then you'll still
  have to have a platform specific check. (Likely you should try to
  access os.environb in this instance and if it doesn't exist, use
  os.environ instead... and remember that you need to either change
  os.environ's data into str type or change os.environb's data into byte
  type.)

 Note that this is why I personally think the binary API variants
 *should* exist on Windows, just with the sense of the system encoding
 flipped around.

 That is, on *nix:
 - underlying OS API uses bytes
 - binary API just passes values straight through
 - Unicode API uses the system encoding to encode Unicode names and
 values to be passed to the OS API and to decode bytes names and values
 received from the OS API

 While on Windows:
 - underlying OS API uses Unicode
 - Unicode API just passes values straight through
 - binary API uses the system encoding to decode bytes names and values
 to be passed to the OS API and to encode Unicode names and values
 received from the OS API

Now that is somewhat strange. That way you'll have two unreliable APIs and 
need to switch depending on the platform again.

nd
-- 
+[++-][-]++.[++-]+++.--.
+.
.[-]---.+++[-]+.+.
+.+++[-].---.+++[-]
+..+[]+.+[-]+.+++[-]+.--..++
[---].+[+-].++..--.+++[-]+.

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Nick Coghlan

André Malo wrote:
 While on Windows:
 - underlying OS API uses Unicode
 - Unicode API just passes values straight through
 - binary API uses the system encoding to decode bytes names and values
 to be passed to the OS API and to encode Unicode names and values
 received from the OS API
 
 Now that is somewhat strange. That way you'll have two unreliable APIs and 
 need to switch depending on the platform again.

Sorry, "system encoding" was probably a poor choice of words there, since
that generally means mbcs when talking about Windows (which would indeed
be a very poor choice of encoding).

For binary wrappers around the Windows Unicode APIs, I was thinking
specifically of using UTF-8, since that should be able to encode
anything the Unicode APIs can handle.
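
A small sketch of why the choice of encoding matters for such wrappers. The `mbcs` codec only exists on Windows, so `cp1252` is used here as a stand-in for a typical ANSI code page; that substitution, and the helper name `can_encode`, are assumptions made so the example runs anywhere.

```python
name = "\u65e5\u672c\u8a9e.txt"  # a filename containing Japanese characters

def can_encode(text, encoding):
    """Return True if text survives a round trip through encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False

print(can_encode(name, "cp1252"))  # a legacy code page cannot represent it
print(can_encode(name, "utf-8"))   # UTF-8 can
```

One caveat: the Windows wide-character APIs can carry unpaired surrogates, which Python's strict UTF-8 codec refuses to encode, so a real wrapper would likely need a non-default error handler for those.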

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread Aahz

On Sun, Dec 07, 2008, Nick Coghlan wrote:

 If the binary APIs are missing from a major platform (i.e. Windows) then
 the choice to use them brings with it a major cross-platform portability
 problem that should really be handled by the standard library.

+1
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

It is easier to optimize correct code than to correct optimized code.
--Bill Harlan

Re: [Python-Dev] Python-3.0, unicode, and os.environ

2008-12-06 Thread glyph



On 06:07 am, [EMAIL PROTECTED] wrote:

Guido van Rossum wrote:

On Sat, Dec 6, 2008 at 10:53 AM,  [EMAIL PROTECTED] wrote:


I find it interesting to note that the only users in this discussion who
actually have these problems in real life all have this attitude.



For file managers and similar tools I am absolutely 100% in agreement
-- that's why the binary APIs are there.


Most apps aren't file managers or ftp clients though. The sky is not 
falling.



Most apps aren't file managers or ftp clients but when they interact
with files (for instance, a file selection dialog) they need to be able
to show the user all the relevant files.  So on an app-by-app basis the
need for this is high.


While I tend to agree emphatically with this, the *real* solution here 
is a path-abstraction library.  In separate discussions, the difficulty 
of getting such a thing into the standard library has been discussed, 
due to the wide variety of opinions as to what it should look like (and 
the shocking level of difficulty involved in making such a thing really 
work correctly).


I'd be very happy to talk to you off-list about my ideas for such a 
thing, but I'd rather not resurrect yet another tedious discussion here 
just now :).

On a code basis, I'd hope that most file
selection dialogs are pulled out into libraries... but that still
doesn't help me identify when someone would expect that asking python
for a list of all files in a directory or a specific set of files in a
directory should, without warning, return only a subset of them.  In
what situations is this appropriate behaviour?


If you say listdir(unicode) on a POSIX OS, your program is saying "I 
only know how to deal with unicode results from this function, so please 
only give me those."  If your program is smart enough to deal with 
bytes, then you would have asked for bytes, no?  Returning only 
filenames which can be properly decoded makes sense.  Otherwise everyone 
needs to learn about this highly confusing issue, even for the simplest 
scripts.


Skipping undecodable values is good enough that it will work 90% of the 
time.  When you need to get to 100%, it won't be impossible - the bytes 
APIs will be there.  In the longer term, hopefully some path abstraction 
will eventually be there too.  We should not wait for a perfectly 
correct path abstraction to arrive before providing the primitives to do 
it yourself, though.
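
The behaviour described above can be sketched as a pure function over raw byte names. `decodable_names` is a hypothetical helper, and UTF-8 stands in for whatever the filesystem encoding happens to be.

```python
def decodable_names(raw_names, encoding="utf-8"):
    """Decode byte filenames, silently skipping undecodable ones."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Not valid in this encoding: the str view drops the entry.
            continue
    return result

# The last name is Latin-1 encoded, so it is not valid UTF-8.
raw = [b"readme.txt", b"caf\xc3\xa9.txt", b"caf\xe9.txt"]
names = decodable_names(raw)
```

A caller that needs all three entries works with `raw` directly, which is the role the bytes APIs play in this argument.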
