Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Ned Deily (ND) wrote:

ND> In article , Piet van Oostrum wrote:
>>> > Ronald Oussoren (RO) wrote:
>>> RO> For what it's worth, the OSX API's seem to behave as follows:
>>> RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the
>>> RO> system automaticly encodes the name.
>>>
>>> RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
>>> RO> instead of the name you'd expect on a unix system.
>>>
>>> Not for me (I am using Python 2.6.2).
>>>
>>> >>> f = open(chr(255), 'w')
>>> Traceback (most recent call last):
>>>   File "", line 1, in
>>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>>>
ND> What version of OSX are you using? On Tiger 10.4.11 I see the failure
ND> you see but on Leopard 10.5.6 the behavior Ronald reports.

Yes, I am using Tiger (10.4.11). Interesting that it has changed on
Leopard.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight writes:

 > in python. It seems like the most common reason why people want to use
 > SJIS is to make old pre-unicode apps work right in WINE -- in which
 > case it doesn't actually affect unix python at all.

Mounting external drives, especially USB memory sticks which tend to be
FAT-initialized by the manufacturers, is another common case. But I
don't understand why PEP 383 needs to care at all.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:

> Ronald Oussoren (RO) wrote:
RO> For what it's worth, the OSX API's seem to behave as follows:
RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the
RO> system automaticly encodes the name.

RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
RO> instead of the name you'd expect on a unix system.

Not for me (I am using Python 2.6.2).

>>> f = open(chr(255), 'w')
Traceback (most recent call last):
  File "", line 1, in
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

That's odd. Which version of OSX do you use?

[~/testdir] ron...@rivendell-2[0]$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.5.6
BuildVersion:   9G55
[~/testdir] ron...@rivendell-2[0]$ /usr/bin/python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']
>>>

And likewise with python 2.6.1+ (after cleaning the directory):

[~/testdir] ron...@rivendell-2[0]$ python2.6
Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']
>>>

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be
unpacked on a HFS+ filesystem.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote:
> You can get the same error on Linux:
>
> $ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more
> information.
>
> >>> f=open(chr(255),'w')
>
> Traceback (most recent call last):
>   File "", line 1, in
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

Works for me under Fedora using ext3 as the file system.

$ python2.6
Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13)
[GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
>>> f.close()
>>> import os
>>> os.remove(chr(255))
>>>

Given that chr(255) is a valid filename on my file system, I would
consider it a bug if Python couldn't deal with a file with that name.

--
Steven D'Aprano
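[For reference, this byte-255 filename is exactly the case PEP 383 addresses in Python 3. A sketch, assuming a POSIX filesystem such as ext3/ext4 that accepts arbitrary bytes in names, of how such a file comes back from os.listdir():]

```python
import os
import tempfile

# Create a file whose name is the single byte 0xFF, as in the
# chr(255) examples above (bytes paths bypass any text decoding).
d = tempfile.mkdtemp()
with open(os.path.join(d.encode(), b"\xff"), "wb") as f:
    f.write(b"x")

# Listing the directory with a str path returns str names; the
# undecodable byte 0xFF surfaces as the lone surrogate U+DCFF.
names = os.listdir(d)
assert names == ["\udcff"]

# The surrogate round-trips back to the original byte.
assert os.fsencode(names[0]) == b"\xff"
```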
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote:
>> Not for me (I am using Python 2.6.2).
>>
>> >>> f = open(chr(255), 'w')
>> Traceback (most recent call last):
>>   File "", line 1, in
>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>
> You can get the same error on Linux:
>
> $ python
> Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
> [GCC 4.3.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> f=open(chr(255),'w')
> Traceback (most recent call last):
>   File "", line 1, in
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>
> (Some file system drivers do not enforce valid utf8 yet, but I suspect
> they will in the future.)

Do you suspect that from discussing the issue with kernel developers or
reading a thread on lkml? If not, then your suspicion seems to be pretty
groundless.

The fact that VFAT enforces an encoding does not lend itself to your
argument, for two reasons:

1) VFAT is not a Unix filesystem. It's a filesystem that's compatible
with Windows/DOS. If Windows and DOS have filesystem encodings, then it
makes sense for that driver to enforce that as well. Filesystems
intended to be used natively on Linux/Unix do not necessarily make this
design decision.

2) The encoding is specified when mounting the filesystem. This means
that you can still mix encodings in a number of ways. If you mount with
an encoding that has full byte coverage, for instance, each user can put
filenames from different encodings on there. If you mount with utf8 on a
system which uses euc-jp as the default encoding, you can have full
paths that contain a mix of utf-8 and euc-jp. Etc.

-Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote:
> On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:
>> I think you are right. I have now excluded ASCII bytes from being
>> mapped, effectively not supporting any encodings that are not ASCII
>> compatible. Does that sound ok?
>
> Yes. The practical upshot of this is that users who brokenly use
> "ja_JP.SJIS" as their locale (which, note, first requires editing some
> files in /var/lib/locales manually to enable its use..) may still have
> python not work with invalid-in-shift-jis filenames. Since that locale
> is widely recognized as a bad idea to use, and is not supported by any
> distros, it certainly doesn't bother me that it isn't 100% supported
> in python.
>
> It seems like the most common reason why people want to use SJIS is to
> make old pre-unicode apps work right in WINE -- in which case it
> doesn't actually affect unix python at all.
>
> I'd personally be fine with python just declaring that the
> filesystem-encoding will *always* be utf-8b and ignore the
> locale...but I expect some other people might complain about that. Of
> course, application authors can decide to do that themselves by
> calling sys.setfilesystemencoding('utf-8b') at the start of their
> program.

It seems to me that the 3.1+ doc set (or wiki) could be usefully
extended with a How-to on working with filenames. I am not sure that
everything useful fits anywhere in particular in the ref manuals.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:

>>> How do get a printable unicode version of these path strings if
>>> they contain none unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid unicode
> that I can print to a UTF-8 console or place as text in a UTF-8 web
> page.
>
> I think your PEP gives me a string that will not encode to valid
> UTF-8 that the outside of python world likes. Did I get this point
> wrong?

You are right. However, if your *only* requirement is that it should be
printable, then this is fairly underspecified. One way to get a
printable string would be this function

    def printable_string(unprintable):
        return ""

Ha ha! Indeed this works, but I would have to try to turn enough of the
string into a reasonable hint at the name of the file, so the user has
some chance of knowing what is being reported.

This will always return a printable version of the input string...

>> In our application we are running fedora with the assumption that
>> the filenames are UTF-8. When Windows systems FTP files to our
>> system the files are in CP-1251(?) and not valid UTF-8.
>
> That would be a bug in your FTP server, no? If you want all file
> names to be UTF-8, then your FTP server should arrange for that.

Not a bug, it's the lack of a feature. We use ProFTPd, which has just
implemented what is required. I forget the exact details - they are at
work - but when the ftp client asks for the FEAT of the ftp server, the
server can say "use UTF-8". Supporting that in the server was
apparently non-trivial.

>> Having an algorithm that says if it's a string no problem, if it's
>> bytes deal with the exceptions seems simple.
>>
>> How do I do this detection with the PEP proposal? Do I end up using
>> the byte interface and doing the utf-8 decode myself?
>
> No, you should encode using the "strict" error handler, with the
> locale encoding. If the file name encodes successfully, it's correct,
> otherwise, it's broken.

O.k. I understand.

Barry
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> Not for me (I am using Python 2.6.2).
>>
>> >>> f = open(chr(255), 'w')
>> Traceback (most recent call last):
>>   File "", line 1, in
>> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>> >>>

You can get the same error on Linux:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
Traceback (most recent call last):
  File "", line 1, in
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>>>

(Some file system drivers do not enforce valid utf8 yet, but I suspect
they will in the future.)

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:

> I think you are right. I have now excluded ASCII bytes from being
> mapped, effectively not supporting any encodings that are not ASCII
> compatible. Does that sound ok?

Yes. The practical upshot of this is that users who brokenly use
"ja_JP.SJIS" as their locale (which, note, first requires editing some
files in /var/lib/locales manually to enable its use..) may still have
python not work with invalid-in-shift-jis filenames. Since that locale
is widely recognized as a bad idea to use, and is not supported by any
distros, it certainly doesn't bother me that it isn't 100% supported in
python.

It seems like the most common reason why people want to use SJIS is to
make old pre-unicode apps work right in WINE -- in which case it
doesn't actually affect unix python at all.

I'd personally be fine with python just declaring that the
filesystem-encoding will *always* be utf-8b and ignore the locale...but
I expect some other people might complain about that. Of course,
application authors can decide to do that themselves by calling
sys.setfilesystemencoding('utf-8b') at the start of their program.

James
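[A sketch, in today's Python 3 terms, of what the PEP's "utf-8b" amounts to: the filesystem encoding combined with the error handler that was eventually named "surrogateescape", later wrapped by the os.fsencode()/os.fsdecode() helpers. The byte value here is just an illustrative example:]

```python
import os

raw = b"caf\xe9"  # Latin-1 "café"; the 0xE9 byte is invalid UTF-8

# Explicitly, with the surrogateescape handler ("utf-8b" in the PEP):
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9"                      # 0xE9 -> U+DCE9
assert name.encode("utf-8", "surrogateescape") == raw

# os.fsdecode()/os.fsencode() apply the same treatment using the
# interpreter's filesystem encoding; the round trip is lossless.
assert os.fsencode(os.fsdecode(raw)) == raw
```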
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Barry Scott wrote:
> On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:
>>> How do get a printable unicode version of these path strings if
>>> they contain none unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid unicode
> that I can print to a UTF-8 console or place as text in a UTF-8 web
> page.
>
> I think your PEP gives me a string that will not encode to valid
> UTF-8 that the outside of python world likes. Did I get this point
> wrong?
>
> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply
> return a string if it's valid utf-8 and otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.

> In our application we are running fedora with the assumption that the
> filenames are UTF-8. When Windows systems FTP files to our system the
> files are in CP-1251(?) and not valid UTF-8.
>
> What we have to do is detect these non UTF-8 filenames and get the
> users to rename them.
>
> Having an algorithm that says if it's a string no problem, if it's
> bytes deal with the exceptions seems simple.
>
> How do I do this detection with the PEP proposal? Do I end up using
> the byte interface and doing the utf-8 decode myself?

What do you do currently? The PEP just offers a way of reading all
filenames as Unicode, if that's what you want. So what if the strings
can't be encoded to normal UTF-8! The filenames aren't valid UTF-8
anyway! :-)
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> How do get a printable unicode version of these path strings if
>>> they contain none unicode data?
>>
>> Define "printable". One way would be to use a regular expression,
>> replacing all codes in a certain range with a question mark.
>
> What I mean by printable is that the string must be valid unicode
> that I can print to a UTF-8 console or place as text in a UTF-8 web
> page.
>
> I think your PEP gives me a string that will not encode to valid
> UTF-8 that the outside of python world likes. Did I get this point
> wrong?

You are right. However, if your *only* requirement is that it should be
printable, then this is fairly underspecified. One way to get a
printable string would be this function

    def printable_string(unprintable):
        return ""

This will always return a printable version of the input string...

> In our application we are running fedora with the assumption that the
> filenames are UTF-8. When Windows systems FTP files to our system the
> files are in CP-1251(?) and not valid UTF-8.

That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.

> Having an algorithm that says if its a string no problem, if its a
> byte deal with the exceptions seems simple.
>
> How do I do this detection with the PEP proposal? Do I end up using
> the byte interface and doing the utf-8 decode myself?

No, you should encode using the "strict" error handler, with the locale
encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.

Regards,
Martin
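[Martin's detection recipe can be written down directly. A sketch (the function name is hypothetical) that flags names returned by os.listdir() which carry PEP 383 escape surrogates:]

```python
def is_broken_filename(name, encoding="utf-8"):
    """Return True if `name` (a str from os.listdir()) contains bytes
    that could not be decoded with the locale encoding, i.e. it
    carries PEP 383 escape surrogates."""
    try:
        name.encode(encoding, "strict")
        return False
    except UnicodeEncodeError:
        return True

assert not is_broken_filename("héllo.txt")     # clean UTF-8 name
assert is_broken_filename("caf\udce9")         # escaped Latin-1 byte
```

This is exactly the "encode with strict" test: well-formed names encode successfully, while names containing escape characters raise UnicodeEncodeError.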
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
In article , Piet van Oostrum wrote:
>> Ronald Oussoren (RO) wrote:
> RO> For what it's worth, the OSX API's seem to behave as follows:
> RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem
> RO> the system automaticly encodes the name.
>
> RO> That is, open(chr(255), 'w') will silently create a file named
> RO> '%FF' instead of the name you'd expect on a unix system.
>
> Not for me (I am using Python 2.6.2).
>
> >>> f = open(chr(255), 'w')
> Traceback (most recent call last):
>   File "", line 1, in
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
> >>>

What version of OSX are you using? On Tiger 10.4.11 I see the failure
you see but on Leopard 10.5.6 the behavior Ronald reports.

--
Ned Deily, n...@acm.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:

>> How do get a printable unicode version of these path strings if they
>> contain none unicode data?
>
> Define "printable". One way would be to use a regular expression,
> replacing all codes in a certain range with a question mark.

What I mean by printable is that the string must be valid unicode that
I can print to a UTF-8 console or place as text in a UTF-8 web page.

I think your PEP gives me a string that will not encode to valid UTF-8
that the outside of python world likes. Did I get this point wrong?

I'm guessing that an app has to understand that filenames come in two
forms, unicode and bytes, if it's not utf-8 data. Why not simply return
a string if it's valid utf-8 and otherwise return bytes?

In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system the
files are in CP-1251(?) and not valid UTF-8.

What we have to do is detect these non UTF-8 filenames and get the
users to rename them.

Having an algorithm that says if it's a string no problem, if it's
bytes deal with the exceptions seems simple.

How do I do this detection with the PEP proposal? Do I end up using the
byte interface and doing the utf-8 decode myself?

Barry
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Ronald Oussoren (RO) wrote:

RO> For what it's worth, the OSX API's seem to behave as follows:
RO> * If you create a file with an non-UTF8 name on a HFS+ filesystem the
RO> system automaticly encodes the name.

RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
RO> instead of the name you'd expect on a unix system.

Not for me (I am using Python 2.6.2).

>>> f = open(chr(255), 'w')
Traceback (most recent call last):
  File "", line 1, in
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
>>>

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be
unpacked on a HFS+ filesystem.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB wrote:
> One further question: should the encoder accept a string like
> u'\uDCC2\uDC80'? That would encode to b'\xC2\x80'

Indeed so.

> which, when decoded, would give u'\x80'.

Assuming the encoding is UTF-8, yes.

> Does the PEP only guarantee that strings decoded from the filesystem
> are reversible, but not check what might be de novo strings?

Exactly so.

Regards,
Martin
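[In today's Python 3 terms, where the PEP's handler is spelled "surrogateescape", MRAB's example plays out exactly as Martin describes: the encoding is only guaranteed reversible for strings that were produced by decoding, not for de novo strings:]

```python
# A string decoded from real bytes round-trips exactly:
b = b"\xc2\x80"                      # valid UTF-8 for U+0080
s = b.decode("utf-8", "surrogateescape")
assert s == "\x80"
assert s.encode("utf-8", "surrogateescape") == b

# A "de novo" string built from escape surrogates is NOT checked:
de_novo = "\udcc2\udc80"             # never came from decoding
encoded = de_novo.encode("utf-8", "surrogateescape")
assert encoded == b"\xc2\x80"        # same bytes as above...
assert encoded.decode("utf-8", "surrogateescape") == "\x80"
# ...so decoding gives back u'\x80', not the original de novo string.
```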
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
One further question: should the encoder accept a string like
u'\uDCC2\uDC80'? That would encode to b'\xC2\x80', which, when decoded,
would give u'\x80'. Does the PEP only guarantee that strings decoded
from the filesystem are reversible, but not check what might be de novo
strings?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Cameron Simpson writes:

 > On 29Apr2009 22:14, Stephen J. Turnbull wrote:
 > | Baptiste Carvello writes:
 > | > By contrast, if the new utf-8b codec would *supercede* the old
 > | > one, \udcxx would always mean raw bytes (at least on UCS-4
 > | > builds, where surrogates are unused). Thus ambiguity could be
 > | > avoided.
 > |
 > | Unfortunately, that's false. [Because Python strings are intended
 > | to be used as containers for widechars which are to be interpreted
 > | as Unicode when that makes sense, but there's no restriction
 > | against nonsense code points, including in UCS-4 Python.]
[...]
 > Wouldn't you then be bypassing the implicit encoding anyway, at
 > least to some extent, and thus not trip over the PEP?

Sure. I'm not really arguing the PEP here; the point is that under the
current definition of Python strings, ambiguity is unavoidable. The
best we can ask for is fewer exceptions, and an attempt to reduce
ambiguity to a bare minimum in the code paths that we open up when we
make a definition that allows a formerly erroneous computation to
succeed.

Martin is well aware of this; the PEP is clear enough about that (to
me, but I'm a mail and multilingual editor internals kinda guy). I'd
rather have more validation of strings, but *shrug* Martin's doing the
work. OTOH, the Unicode fans need to understand that the past policy of
Python is not to validate; Python is intended to provide all the tools
needed to write validating apps, but it isn't one itself.

Martin's PEP is quite narrow in that sense. All it is about is an
invertible encoding of broken encodings. It does have the downside that
it guarantees that Python itself can produce non-conforming strings,
but that's not the end of the world, and an app can keep track of them
or even refuse them by setting the error handler, if it wants to.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
[top-posting for once to preserve full quoting]

Glenn,

Could you please reduce your suggestions into sample text for the PEP?
We seem to be now at the stage where nobody is objecting to the PEP, so
the focus should be on making the PEP clearer. If you still want to
create an alternative PEP implementation, please provide step-by-step
walkthroughs, preferably in a new thread -- if you did previously
provide that, it's gotten lost in the flood of messages.

On Thu, Apr 30, 2009, Glenn Linderman wrote:
> On approximately 4/29/2009 8:46 PM, came the following characters
> from the keyboard of Terry Reedy:
>> Glenn Linderman wrote:
>>> On approximately 4/29/2009 1:28 PM, came the following characters
>>> from
>>>> So where is the ambiguity here?
>>>
>>> None. But not everyone can read all the Python source code to try
>>> to understand it; they expect the documentation to help them avoid
>>> that. Because the documentation is lacking in this area, it makes
>>> your concisely stated PEP rather hard to understand.
>>
>> If you think a section of the doc is grossly inadequate, and there
>> is no existing issue on the tracker, feel free to add one.
>>
>>> Thanks for clarifying the Windows behavior, here. A little more
>>> clarification in the PEP could have avoided lots of discussion. It
>>> would seem that a PEP, proposed to modify a poorly documented (and
>>> therefore likely poorly understood) area, should be educational
>>> about the status quo, as well as presenting the suggested change.
>>
>> Where the PEP proposes to change, it should start with the status
>> quo. But Martin's somewhat reasonable position is that since he is
>> not proposing to change behavior on Windows, it is not his
>> responsibility to document what he is not proposing to change more
>> adequately. This means, of course, that any observed change on
>> Windows would then be a bug, or at least a break of the promise. On
>> the other hand, I can see that this is enough related to what he is
>> proposing to change that better doc would help.
>
> Yes; the very fact that the PEP discusses Windows, speaks about
> cross-platform code, and doesn't explicitly state that no Windows
> functionality will change, is confusing.
>
> An example of how to initialize things within a sample cross-platform
> application might help, especially if that initialization only
> happens if the platform is POSIX, or is commented to the effect that
> it has no effect on Windows, but makes POSIX happy. Or maybe it is
> all buried within the initialization of Python itself, and is not
> exposed to the application at all. I still haven't figured that out,
> but was not (and am still not) as concerned about that as ensuring
> that the overall algorithms are functional and useful and
> user-friendly. Showing it might have been helpful in making it clear
> that no Windows functionality would change, however.
>
> A statement that additional features are being added to allow
> cross-platform programs to deal with non-decodable bytes obtained
> from POSIX APIs using the same code that already works on Windows
> would have made things much clearer. The present Abstract does, in
> fact, talk only about POSIX, but later statements about Windows muddy
> the water.
>
> Rationale paragraph 3 explicitly talks about cross-platform programs
> needing to work one way on Windows and another way on POSIX to deal
> with all the cases. It calls that a proposal, which I guess it is for
> command line and environment, but it is already implemented in both
> bytes and str forms for file names... so that further muddies the
> water.
>
> It is, of course, easier to point out deficiencies in a document than
> to write a better document; however, it is incumbent upon the PEP
> author to write a PEP that is good enough to get approved, and that
> means making it understandable enough that people are in favor... or
> to respond to the plethora of comments until people are in favor. I'm
> not sure which one is more time-consuming.
>
> I've reached the point, based on PEP and comment responses, where I
> now believe that the PEP is a solution to the problem it is trying to
> solve, and doesn't create ambiguities in the naming. I don't believe
> it is the best solution.
>
> The basic problem is the overuse of fake characters... normalizing
> them for display results in large data loss -- many characters would
> be translated to the same replacement characters.
>
> Solutions exist that would allow the use of fewer different fake
> characters in the strings, while still having a fake character as the
> escape character, to preserve the invariant that all the strings
> manipulated by python-escape from the PEP were, and become, strings
> containing fake characters (from a strict Unicode perspective), which
> is a nice invariant*. There even exist solutions that would use only
> one fake char
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I think it has to be excluded from mapping in order to not introduce
> security issues.

I think you are right. I have now excluded ASCII bytes from being
mapped, effectively not supporting any encodings that are not ASCII
compatible. Does that sound ok?

Regards,
Martin
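[The exclusion is visible in the behaviour of the handler that eventually shipped as "surrogateescape": only bytes >= 0x80 are ever mapped into the U+DC80..U+DCFF escape range, so ASCII bytes in a filename always mean themselves, and no surrogate can smuggle in an escaped '/' or other ASCII character. A quick sketch:]

```python
# Bytes 0x00-0x7F always decode as themselves; only bytes >= 0x80 can
# be escaped into U+DC80..U+DCFF (the security concern above).
name = b"a/b\xff".decode("utf-8", "surrogateescape")
assert name == "a/b\udcff"           # '/' stays a real '/'

# On the encode side the handler only accepts U+DC80..U+DCFF; a
# surrogate that would map to an ASCII byte is rejected outright.
try:
    "\udc41".encode("utf-8", "surrogateescape")   # would be 0x41, 'A'
    escaped_ascii = True
except UnicodeEncodeError:
    escaped_ascii = False
assert not escaped_ascii
```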
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.

Done!

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 7:50 PM, came the following characters from
the keyboard of Aahz:
> On Thu, Apr 30, 2009, Cameron Simpson wrote:
>> The lengthy discussion mostly revolves around:
>>
>> - Glenn points out that strings that came _not_ from listdir, and
>>   that are _not_ well-formed unicode (== "have bare surrogates in
>>   them") but that were intended for use as filenames will conflict
>>   with the PEP's scheme - programs must know that these strings came
>>   from outside and must be translated into the PEP's funny-encoding
>>   before use in the os.* functions. Previous to the PEP they would
>>   get used directly and encode differently after the PEP, thus
>>   producing different POSIX filenames. Breakage.
>>
>> - Glenn would like the encoding to use Unicode scalar values only,
>>   using a rare-in-filenames character. That would avoid the issue
>>   with "outside" strings that contain surrogates. To my mind it just
>>   moves the punning from rare illegal strings to merely uncommon but
>>   legal characters.
>>
>> - Some parties think it would be better to not return strings from
>>   os.listdir but a subclass of string (or at least a duck-type of
>>   string) that knows where it came from and is also handily
>>   recognisable as not-really-a-string for purposes of deciding
>>   whether it is PEP-funny-encoded by direct inspection.
>
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.

I'll agree that once other misconceptions were explained away, the
remaining issues are those Cameron summarized. Thanks for the summary!

Point two could be modified because I've changed my opinion; I like the
invariant Cameron first (I think) explicitly stated about the PEP as it
stands, and that I just reworded in another message: the strings that
are altered by the PEP in either direction are in the subset of strings
that contain fake (from a strict Unicode viewpoint) characters.

I still think an encoding that uses mostly real characters that have
assigned glyphs would be better than the encoding in the PEP, but would
now suggest that the escape character be a fake character.

I'll note here that while the PEP encoding causes illegal bytes to be
translated to one fake character, the 3-byte sequence that looks like
the range of fake characters would also be translated to a sequence of
3 fake characters. This is 512 combinations that must be translated,
and understood by the user (or at least by the programmer). The "escape
sequence" approach requires changing only 257 combinations, and each
altered combination would result in exactly 2 characters. Hence, this
seems simpler to understand, and to manually encode and decode for
debugging purposes.

--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration
Networking
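[Glenn's point about the 3-byte sequences can be checked concretely with today's "surrogateescape" handler: the UTF-8-style encoding of an escape character such as U+DC80 is the byte sequence ED B2 80, which strict UTF-8 rejects, so a filename literally containing those bytes decodes to three escape characters rather than one:]

```python
# b'\xed\xb2\x80' is the CESU-8-style encoding of U+DC80 itself.
# Strict UTF-8 rejects encoded surrogates, so each of the three bytes
# is individually escaped -- one odd name becomes three fake chars.
raw = b"\xed\xb2\x80"
s = raw.decode("utf-8", "surrogateescape")
assert s == "\udced\udcb2\udc80"

# The round trip is still lossless:
assert s.encode("utf-8", "surrogateescape") == raw
```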
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy:
> Glenn Linderman wrote:
>>> So where is the ambiguity here?
>>
>> None. But not everyone can read all the Python source code to try to
>> understand it; they expect the documentation to help them avoid that.
>> Because the documentation is lacking in this area, it makes your
>> concisely stated PEP rather hard to understand.
>
> If you think a section of the doc is grossly inadequate, and there is
> no existing issue on the tracker, feel free to add one.
>
>> Thanks for clarifying the Windows behavior, here. A little more
>> clarification in the PEP could have avoided lots of discussion. It
>> would seem that a PEP, proposed to modify a poorly documented (and
>> therefore likely poorly understood) area, should be educational about
>> the status quo, as well as presenting the suggested change.
>
> Where the PEP proposes to change, it should start with the status quo.
> But Martin's somewhat reasonable position is that since he is not
> proposing to change behavior on Windows, it is not his responsibility
> to document what he is not proposing to change more adequately. This
> means, of course, that any observed change on Windows would then be a
> bug, or at least a break of the promise. On the other hand, I can see
> that this is enough related to what he is proposing to change that
> better doc would help.

Yes; the very fact that the PEP discusses Windows, speaks about cross-platform code, and doesn't explicitly state that no Windows functionality will change, is confusing.

An example of how to initialize things within a sample cross-platform application might help, especially if that initialization only happens when the platform is POSIX, or is commented to the effect that it has no effect on Windows, but makes POSIX happy. Or maybe it is all buried within the initialization of Python itself, and is not exposed to the application at all. I still haven't figured that out, but was not (and am still not) as concerned about that as about ensuring that the overall algorithms are functional, useful, and user-friendly. Showing it might have been helpful in making it clear that no Windows functionality would change, however.

A statement that additional features are being added to allow cross-platform programs to deal with non-decodable bytes obtained from POSIX APIs, using the same code that already works on Windows, would have made things much clearer. The present Abstract does, in fact, talk only about POSIX, but later statements about Windows muddy the water. Rationale paragraph 3 explicitly talks about cross-platform programs needing to work one way on Windows and another way on POSIX to deal with all the cases. It calls that a proposal, which I guess it is for the command line and environment, but it is already implemented in both bytes and str forms for file names... so that further muddies the water.

It is, of course, easier to point out deficiencies in a document than to write a better document; however, it is incumbent upon the PEP author to write a PEP that is good enough to get approved, and that means making it understandable enough that people are in favor... or responding to the plethora of comments until people are in favor. I'm not sure which one is more time-consuming.

I've reached the point, based on PEP and comment responses, where I now believe that the PEP is a solution to the problem it is trying to solve, and doesn't create ambiguities in the naming. I don't believe it is the best solution. The basic problem is the overuse of fake characters: normalizing them for display results in large data loss -- many characters would be translated to the same replacement characters.

Solutions exist that would allow the use of fewer different fake characters in the strings, while still having a fake character as the escape character, to preserve the invariant that all the strings altered by python-escape from the PEP were, and become, strings containing fake characters (from a strict Unicode perspective), which is a nice invariant*. There even exist solutions that would use only one fake character (repeatedly if necessary), where all other characters generated would be displayable characters. This would ease the burden on the program in displaying the strings, and also on the user who might view the resulting mojibake while trying to differentiate one such string from another. Those are outlined in various emails in this thread, although some include my misconception that strings obtained via Unicode-enabled OS APIs would also need to be encoded and altered. If there is any interest in using a more readable encoding, I'd be glad to rework them to remove those misconceptions.

* It would be nice to point out that invariant in the PEP, also.

-- Glenn -- http://nevcal.com/
=== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, Apr 29, 2009 at 23:03, Terry Reedy wrote:
> Thomas Breuel wrote:
>>> Sure. However, that requires you to provide meaningful, reproducible
>>> counter-examples, rather than a steganographic formulation that might
>>> hint at some problem you apparently see (which I believe is just not
>>> there).
>>
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of
>> half surrogates.
>
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows
> that.

If we use conformance to Unicode 5.1 as the basis for our discussion, then PEP 383 is off the table anyway. I'm all for strict Unicode compliance. But apparently, the Python community doesn't care.

CESU-8 is described in Unicode Technical Report #26, so it at least has some official recognition. More importantly, it's also widely used. So, my question: what are the implications of PEP 383 for CESU-8 encodings in Python?

My meta-point is: there are probably many more such issues hidden away, and it is a really bad idea to rush something like PEP 383 out. Unicode is hard anyway, and tinkering with its semantics requires a lot of thought.

Tom
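The half-surrogate question can be checked directly in Python 3, where the strict utf-8 codec follows the Unicode 5.x rule Terry cites and rejects lone surrogates outright. The `surrogatepass` error handler (added later, in Python 3.1) is what produces the CESU-8-style three-byte sequences Tom describes; the 2.x behavior he refers to is not shown here.

```python
# Python 3's strict UTF-8 codec refuses to encode an unpaired surrogate,
# matching the Unicode 5.x definition of UTF-8.
lone = "\ud800"  # unpaired high surrogate

strict_rejects = False
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    strict_rejects = True
assert strict_rejects

# The "surrogatepass" error handler emits the CESU-8-style byte sequence
# for the surrogate, and can decode it back again.
cesu = lone.encode("utf-8", "surrogatepass")
assert cesu == b"\xed\xa0\x80"
assert cesu.decode("utf-8", "surrogatepass") == lone
```

So code that needs CESU-8-style round trips has an explicit opt-in; the default codec stays strictly conformant.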
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Thanks for clarifying the Windows behavior, here. A little more
> clarification in the PEP could have avoided lots of discussion. It
> would seem that a PEP, proposed to modify a poorly documented (and
> therefore likely poorly understood) area, should be educational about
> the status quo, as well as presenting the suggested change. Or is it
> the Python philosophy that the PEPs should be as incomprehensible as
> possible, to generate large discussions?

Certainly not. See PEP 277 for a specification of how file names are handled on Windows.

Large discussions could be reduced if readers would try to constructively comment on the PEP, rather than making counter-proposals, or making statements about the PEP without making their implied assumptions explicit.

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> How do I get a printable unicode version of these path strings if they
> contain non-unicode data?

Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply return
> str if it's valid utf-8, otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it.

Regards, Martin
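Martin's regular-expression suggestion can be sketched concretely. The range replaced here is U+DC80..U+DCFF, the half surrogates the PEP uses to smuggle non-decodable bytes; the function name is just for illustration.

```python
import re

def printable(name: str) -> str:
    """Replace PEP 383's smuggled-byte surrogates (U+DC80..U+DCFF)
    with '?' to get a safely displayable approximation of a filename."""
    return re.sub("[\udc80-\udcff]", "?", name)

# A name whose undecodable byte came back as a lone surrogate:
assert printable("caf\udce9.txt") == "caf?.txt"
# Ordinary names pass through unchanged:
assert printable("plain.txt") == "plain.txt"
```

The result is lossy (many distinct bytes collapse to '?'), which is exactly the display trade-off discussed elsewhere in this thread; the original string should be kept for passing back to os.* functions.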
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
>>> So where is the ambiguity here?
>>
>> None. But not everyone can read all the Python source code to try to
>> understand it; they expect the documentation to help them avoid that.
>> Because the documentation is lacking in this area, it makes your
>> concisely stated PEP rather hard to understand.

If you think a section of the doc is grossly inadequate, and there is no existing issue on the tracker, feel free to add one.

>> Thanks for clarifying the Windows behavior, here. A little more
>> clarification in the PEP could have avoided lots of discussion. It
>> would seem that a PEP, proposed to modify a poorly documented (and
>> therefore likely poorly understood) area, should be educational about
>> the status quo, as well as presenting the suggested change.

Where the PEP proposes to change, it should start with the status quo. But Martin's somewhat reasonable position is that since he is not proposing to change behavior on Windows, it is not his responsibility to document what he is not proposing to change more adequately. This means, of course, that any observed change on Windows would then be a bug, or at least a break of the promise. On the other hand, I can see that this is enough related to what he is proposing to change that better doc would help.

tjr
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Thu, Apr 30, 2009, Cameron Simpson wrote:
>
> The lengthy discussion mostly revolves around:
>
> - Glenn points out that strings that came _not_ from listdir, and that are
>   _not_ well-formed unicode (== "have bare surrogates in them") but that
>   were intended for use as filenames will conflict with the PEP's scheme -
>   programs must know that these strings came from outside and must be
>   translated into the PEP's funny-encoding before use in the os.*
>   functions. Previous to the PEP they would get used directly and
>   encode differently after the PEP, thus producing different POSIX
>   filenames. Breakage.
>
> - Glenn would like the encoding to use Unicode scalar values only,
>   using a rare-in-filenames character.
>   That would avoid the issue with "outside" strings that contain
>   surrogates. To my mind it just moves the punning from rare illegal
>   strings to merely uncommon but legal characters.
>
> - Some parties think it would be better to not return strings from
>   os.listdir but a subclass of string (or at least a duck-type of
>   string) that knows where it came from and is also handily
>   recognisable as not-really-a-string for purposes of deciding
>   whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be incorporated into the PEP.

-- Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/ "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur." --Red Adair
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 23:41, Barry Scott wrote:
> On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Forgive me if this has been covered. I've been reading this thread for a
> long time and still have a hundred-odd replies to go...
>
> How do I get a printable unicode version of these path strings if they
> contain non-unicode data?

Personally, I'd use repr(). One might ask, what would you expect to see if you were printing such a string?

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not utf-8 data. Why not simply return
> str if it's valid utf-8, otherwise return bytes? Then in the app you
> check the type of the object, str or bytes, and deal with reporting
> errors appropriately.

Because it complicates the app enormously, for every app. It would be _nice_ to just call os.listdir() et al with strings, get strings, and not worry.

With strings becoming unicode in Python3, on POSIX you have the issue of deciding how to get its filenames-are-bytes into a string and the reverse. One could naively map the byte values to the same Unicode code points, but that results in strings that do not contain the same characters as the user/app expects for byte values above 127.

Since POSIX does not really have a filesystem-level character encoding, just a user environment setting that says how the current user encodes characters into bytes (UTF-8 is increasingly common and useful, but it is not universal), it is more useful to decode filenames on the assumption that they represent characters in the user's (current) encoding convention; that way when things are displayed they are meaningful, and they interoperate well with strings made by the user/app.

If all the filenames were actually encoded that way when made, that works. But different users may adopt different conventions, and indeed a user may have used ASCII or an ISO8859-* coding in the past and be transitioning to something else now, so they will have a bunch of files in different encodings.

The PEP uses the user's current encoding with a handler for byte sequences that don't decode to valid Unicode scalar values, in a fashion that is reversible. That is, you get "strings" out of listdir() and those strings will go back in (eg to open()) perfectly robustly.

Previous approaches would either silently hide non-decodable names in listdir() results, or throw exceptions when the decode failed, or mangle things non-reversibly. I believe Python3 went with the first option there. The PEP at least lets programs naively access all files that exist, and create a filename from any well-formed unicode string provided that the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

- Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== "have bare surrogates in them") but that were intended for use as filenames will conflict with the PEP's scheme: programs must know that these strings came from outside and must be translated into the PEP's funny-encoding before use in the os.* functions. Previous to the PEP they would get used directly, and they encode differently after the PEP, thus producing different POSIX filenames. Breakage.

- Glenn would like the encoding to use Unicode scalar values only, using a rare-in-filenames character. That would avoid the issue with "outside" strings that contain surrogates. To my mind it just moves the punning from rare illegal strings to merely uncommon but legal characters.

- Some parties think it would be better to not return strings from os.listdir but a subclass of string (or at least a duck-type of string) that knows where it came from and is also handily recognisable as not-really-a-string for purposes of deciding whether it is PEP-funny-encoded by direct inspection.

Cheers,
-- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ The peever can look at the best day in his life and sneer at it. - Jim Hill, JennyGfest '95
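The round-trip invariant Cameron describes ("strings out of listdir() will go back in perfectly robustly") is what later shipped in Python 3 as os.fsdecode() and os.fsencode(), built on the PEP's error handler (eventually named "surrogateescape"). A quick check of the invariant, without touching the filesystem; the exact decoded string depends on the locale, so only the round trip is asserted:

```python
import os

# A POSIX-style filename that is not valid UTF-8.
raw = b"ab\xffc"

# fsdecode() applies the filesystem encoding with the PEP 383 error
# handler: any non-decodable byte is smuggled into the str as a lone
# surrogate (U+DCFF here, under a UTF-8 or ASCII locale on POSIX).
name = os.fsdecode(raw)

# fsencode() reverses the smuggling, restoring the exact original bytes.
assert os.fsencode(name) == raw
```

So an application can pass listdir() results straight back to open() and friends without ever inspecting the smuggled bytes.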
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Forgive me if this has been covered. I've been reading this thread for a long time and still have a hundred-odd replies to go...

How do I get a printable unicode version of these path strings if they contain non-unicode data?

I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not utf-8 data. Why not simply return str if it's valid utf-8, otherwise return bytes? Then in the app you check the type of the object, str or bytes, and deal with reporting errors appropriately.

Barry
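The utf-8b behavior Martin describes shipped in later Pythons as the "surrogateescape" error handler on the regular utf-8 codec, so the mapping of non-decodable bytes into U+DC80..U+DCFF can be seen directly:

```python
raw = b"ab\xff.txt"  # not valid UTF-8: 0xff cannot appear in UTF-8

# Non-decodable bytes (>= 0x80) come back as half surrogates
# U+DC80..U+DCFF, exactly as the utf-8b description says.
name = raw.decode("utf-8", "surrogateescape")
assert name == "ab\udcff.txt"

# Encoding with the same handler restores the original bytes exactly.
assert name.encode("utf-8", "surrogateescape") == raw
```

The resulting str is not well-formed Unicode (it contains a bare surrogate), which is precisely why printing it needs special handling, as discussed in the replies.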
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 22:14, Stephen J. Turnbull wrote:
| Baptiste Carvello writes:
| > By contrast, if the new utf-8b codec would *supercede* the old one,
| > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
| > surrogates are unused). Thus ambiguity could be avoided.
|
| Unfortunately, that's false. It could have come from a literal string
| (similar to the text above ;-), a C extension, or a string slice (on
| 16-bit builds), and there may be other ways to do it. The only way to
| avoid ambiguity is to change the definition of a Python string to be
| *valid* Unicode (possibly with Python extensions such as PEP 383 for
| internal use only). But Guido has rejected that in the past;
| validation is the application's problem, not Python's.
|
| Nor is a UCS-4 build exempt. IIRC Guido specifically envisioned
| Python strings being used to build up code point sequences to be
| directly output, which means that a UCS-4 string might nonetheless
| contain surrogates being added to a string intended to be sent as
| UTF-16 output simply by truncating the 32-bit code units to 16 bits.

Wouldn't you then be bypassing the implicit encoding anyway, at least to some extent, and thus not trip over the PEP?

-- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Clemson is the Harvard of cardboard packaging. - overheard by WIRED at the Intelligent Printing conference Oct2006
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 17:03, Terry Reedy wrote:
> Thomas Breuel wrote:
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a steganographic formulation that might
>> hint at some problem you apparently see (which I believe is just not
>> there).
>>
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of
>> half surrogates.
>
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

5.0 also disallows it. No surprise I guess.

-- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Out on the road, feeling the breeze, passing the cars. - Bob Seger
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 1:28 PM, came the following characters from the keyboard of Martin v. Löwis:
>>>> C. File on disk with the invalid surrogate code, accessed via the
>>>> str interface, no decoding happens, matches in memory the file on disk
>>>> with the byte that translates to the same surrogate, accessed via the
>>>> bytes interface. Ambiguity.
>>>
>>> What does that mean? What specific interface are you referring to to
>>> obtain file names?
>>
>> os.listdir("")
>>
>> os.listdir(b"")
>>
>> So I guess I'd better suggest that a specific, equivalent directory
>> name be passed in either bytes or str form.
>
> [Leaving the issue of the empty string apparently having different
> meanings aside ...]
>
> Ok. Now I understand the example. So you do
>
> os.listdir("c:/tmp")
> os.listdir(b"c:/tmp")
>
> and you have a file in c:/tmp that is named "abc\uDC10".
>
>> So what you are saying here is that Python doesn't use the "A" forms
>> of the Windows APIs for filenames, but only the "W" forms, and uses
>> lossy decoding (from MS) to the current code page (which can never be
>> UTF-8 on Windows).
>
> Actually, it does use the A form, in the second listdir example. This,
> in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
> a byte string; the listdirs should give
>
> ["abc\uDC10"]
> [b"abc?"]
>
> (not quite sure about the second - I only guess that CP_ACP will
> replace the half surrogate with a question mark). So where is the
> ambiguity here?

None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentation is lacking in this area, it makes your concisely stated PEP rather hard to understand.

Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well as presenting the suggested change. Or is it the Python philosophy that the PEPs should be as incomprehensible as possible, to generate large discussions?

-- Glenn -- http://nevcal.com/
=== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote:
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a steganographic formulation that might
>> hint at some problem you apparently see (which I believe is just not
>> there).
>
> Well, here's another one: PEP 383 would disallow UTF-8 encodings of
> half surrogates.

By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

> But such encodings are currently supported by Python, and they are used
> as part of CESU-8 coding. That's, in fact, a common way of converting
> UTF-16 to UTF-8. How are you going to deal with existing code that
> relies on being able to code half surrogates as UTF-8?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
> On approximately 4/29/2009 4:36 AM, came the following characters from
> the keyboard of Cameron Simpson:
>> On 29Apr2009 02:56, Glenn Linderman wrote:
>>> os.listdir(b"")
>>>
>>> I find that on my Windows system, with all ASCII path file names,
>>> that I get quite different results when I pass os.listdir an empty
>>> str vs an empty bytes. Rather than keep you guessing, I get the root
>>> directory contents from the empty str, and the current directory
>>> contents from an empty bytes. That is rather unexpected.
>>>
>>> So I guess I'd better suggest that a specific, equivalent directory
>>> name be passed in either bytes or str form.
>>
>> I think you may have uncovered an implementation bug rather than an
>> encoding issue (because I'd expect "" and b"" to be equivalent).
>
> Me too.

Sounds like an issue for the tracker.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> So while out of scope of the PEP, I don't think it's at all
> artificial.

Sure - but I see this as the same case as "the file got renamed". If you have an LRU list in your app, and a file gets renamed, then the LRU list breaks (unless you also store the inode number in the LRU list, and look up the file by inode number - or object UUID on NTFS, possibly using distributed link tracking).

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> C. File on disk with the invalid surrogate code, accessed via the
>>> str interface, no decoding happens, matches in memory the file on disk
>>> with the byte that translates to the same surrogate, accessed via the
>>> bytes interface. Ambiguity.
>>
>> What does that mean? What specific interface are you referring to to
>> obtain file names?
>
> os.listdir("")
>
> os.listdir(b"")
>
> So I guess I'd better suggest that a specific, equivalent directory name
> be passed in either bytes or str form.

[Leaving the issue of the empty string apparently having different meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".

> So what you are saying here is that Python doesn't use the "A" forms of
> the Windows APIs for filenames, but only the "W" forms, and uses lossy
> decoding (from MS) to the current code page (which can never be UTF-8 on
> Windows).

Actually, it does use the A form, in the second listdir example. This, in turn (inside Windows), uses the lossy CP_ACP encoding. You get back a byte string; the listdirs should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace the half surrogate with a question mark). So where is the ambiguity here?

> You are further saying that Python doesn't give the programmer control
> over the codec that is used to convert from W results to bytes, so that
> on Windows, it is impossible to obtain a bytes result containing UTF-8
> from os.listdir, even though sys.setfilesystemencoding exists, and
> sys.getfilesystemencoding is affected by it, and the latter is
> documented as returning "mbcs", and as returning the codec that should
> be used by the application to convert str to bytes for filenames.
> (Python 3.0.1).

Not exactly. You *can* do setfilesystemencoding on Windows, but it has no effect, as the Python file system encoding is never used on Windows. For a string, it passes it to the W API as-is; for bytes, it passes it to the A API as-is. Python never invokes any codec here.

> While I can hear a "that is outside the scope of the PEP" coming, this
> documentation is confusing, to say the least.

Only because you are apparently unaware of the status quo. If you would study the current Python source code, it would be all very clear.

> Things are a little clearer in the documentation for
> sys.setfilesystemencoding, which does say the encoding isn't used by
> Windows -- so why is it permitted to change it, if it has no effect?

As in many cases: because nobody contributed code to make it behave otherwise. It's not that the file system encoding is "mbcs" - the file system encoding is simply unused on Windows (but that wasn't always the case, in particular not when Windows 9x still had to be supported).

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a steganographic formulation that might
>> hint at some problem you apparently see (which I believe is just not
>> there).
>
> Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
> surrogates. But such encodings are currently supported by Python, and
> they are used as part of CESU-8 coding. That's, in fact, a common way
> of converting UTF-16 to UTF-8. How are you going to deal with existing
> code that relies on being able to code half surrogates as UTF-8?

Can you please elaborate? What code specifically are you talking about?

Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis" writes:
> I find the case pretty artificial, though: if the locale encoding
> changes, all file names will look incorrect to the user, so he'll
> quickly switch back, or rename all the files.

It's not necessarily the case that the locale encoding changes, but rather the name of the file. I have a couple of directories where I have Japanese in both EUC-JP and UTF-8, for example. (The applications where I never bothered to do a conversion from EUC to UTF-8 are things like stripping MIME attachments from messages and saving them to files when I changed my default.) So I have a little Emacs Lisp function that tries EUC or UTF-8 depending on date, and falls back to the other on a decode error.

Another possible situation would be a user program in the user's locale communicating with a daemon running in some other locale (quite likely POSIX).

So while out of scope of the PEP, I don't think it's at all artificial.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Baptiste Carvello writes:
> By contrast, if the new utf-8b codec would *supercede* the old one,
> \udcxx would always mean raw bytes (at least on UCS-4 builds, where
> surrogates are unused). Thus ambiguity could be avoided.

Unfortunately, that's false. It could have come from a literal string (similar to the text above ;-), a C extension, or a string slice (on 16-bit builds), and there may be other ways to do it. The only way to avoid ambiguity is to change the definition of a Python string to be *valid* Unicode (possibly with Python extensions such as PEP 383 for internal use only). But Guido has rejected that in the past; validation is the application's problem, not Python's.

Nor is a UCS-4 build exempt. IIRC Guido specifically envisioned Python strings being used to build up code point sequences to be directly output, which means that a UCS-4 string might nonetheless contain surrogates, added to a string intended to be sent as UTF-16 output simply by truncating the 32-bit code units to 16 bits.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 4:36 AM, came the following characters from the keyboard of Cameron Simpson:
> On 29Apr2009 02:56, Glenn Linderman wrote:
>> os.listdir(b"")
>>
>> I find that on my Windows system, with all ASCII path file names, that
>> I get quite different results when I pass os.listdir an empty str vs
>> an empty bytes. Rather than keep you guessing, I get the root
>> directory contents from the empty str, and the current directory
>> contents from an empty bytes. That is rather unexpected.
>>
>> So I guess I'd better suggest that a specific, equivalent directory
>> name be passed in either bytes or str form.
>
> I think you may have uncovered an implementation bug rather than an
> encoding issue (because I'd expect "" and b"" to be equivalent).

Me too.

> In ancient times, "" was a valid UNIX name for the working directory.
> POSIX disallows that, and requires people to use ".". Maybe you're
> seeing an artifact; did python move from UNIX to Windows or the other
> way around in its porting history? I'd guess the former.
>
> Do you get differing results from listdir(".") and listdir(b".") ?

No. Both are the same as b"".

> How's python2 behave for ""? (Since there's no b"" in python2.)

Python2 os.listdir("") produces the same thing as Python3 os.listdir(b"").
Python2 os.listdir(u"") produces the same thing as Python3 os.listdir("").

Another phenomenon of note: I created a directory named ábc. (Windows XP, Python 3.0.1, Python 2.6.1, SetConsoleOutputCP(65001))

Python3 os.listdir(b".") prints it as b"\xe1bc"
Python2 os.listdir(".") prints it as b"\xe1bc"
Python2 os.listdir(u".") prints it as u"\xe1bc"
Python3 os.listdir(".") prints it as "bc"

-- Glenn -- http://nevcal.com/
=== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 4:07 AM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible. Only if you as the programmer decode it. Now, I don't understand the subtleties of Unicode enough to know if Martin has already successfully addressed this concern in another fashion, but personally I think that if you as a programmer are comparing funnydecoded-str strings gotten via a string interface with normal-decoded strings gotten via a bytes interface, that we could claim that your program has a bug. Hopefully Martin will clarify the PEP as I suggested in another branch of this thread. He has eventually convinced me that this ambiguity is not possible, via email discussion, but the PEP is certainly less than sufficiently explanatory to make that obvious. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 29Apr2009 02:56, Glenn Linderman wrote: > os.listdir(b"") > > I find that on my Windows system, with all ASCII path file names, that I > get quite different results when I pass os.listdir an empty str vs an > empty bytes. > > Rather than keep you guessing, I get the root directory contents from > the empty str, and the current directory contents from an empty bytes. > That is rather unexpected. > > So I guess I'd better suggest that a specific, equivalent directory name > be passed in either bytes or str form. I think you may have uncovered an implementation bug rather than an encoding issue (because I'd expect "" and b"" to be equivalent). In ancient times, "" was a valid UNIX name for the working directory. POSIX disallows that, and requires people to use ".". Maybe you're seeing an artifact; did python move from UNIX to Windows or the other way around in its porting history? I'd guess the former. Do you get differing results from listdir(".") and listdir(b".") ? How's python2 behave for ""? (Since there's no b"" in python2.) Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ 'Supposing a tree fell down, Pooh, when we were underneath it?' 'Supposing it didn't,' said Pooh after careful thought.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible. Only if you as the programmer decode it. Now, I don't understand the subtleties of Unicode enough to know if Martin has already successfully addressed this concern in another fashion, but personally I think that if you as a programmer are comparing funnydecoded-str strings gotten via a string interface with normal-decoded strings gotten via a bytes interface, that we could claim that your program has a bug. --David
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 12:29 AM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is *no* ambiguity in the case you construct. No Martin, the point of reviewing the PEP is to _not_ trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards. Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. What does that mean? What specific interface are you referring to to obtain file names? Most of the time, file names are obtained by the user entering them on the keyboard. GUI applications are completely out of the scope of the PEP. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes. Ok, so perhaps you might be talking about os.listdir here. Communication would be much easier if I would not need to guess what you may mean. os.listdir("") Also, why is U+DC10 four 16-bit codes? It isn't. 
First 16-bit code is U+0061 Second 16-bit code is U+0062 Third 16-bit code is U+0063 Fourth 16-bit code is U+DC10 Then, ask for the same filename via the bytes interface, using UTF-8 encoding. How do you do that on Windows? You cannot just pick an encoding, such as UTF-8, and pass that to the byte interface, and expect it to work. If you use the byte interface, you need to encode in the file system encoding, of course. Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU USING Please use specific python code. os.listdir(b"") I find that on my Windows system, with all ASCII path file names, that I get quite different results when I pass os.listdir an empty str vs an empty bytes. Rather than keep you guessing, I get the root directory contents from the empty str, and the current directory contents from an empty bytes. That is rather unexpected. So I guess I'd better suggest that a specific, equivalent directory name be passed in either bytes or str form. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can be seen as two different names in memory. You are relying on false assumptions here, namely that the UTF-8 encoding would play any role. What would happen instead is that the "mbcs" encoding would be used. The "mbcs" encoding, by design from Microsoft, will never report an error, so the error handler will not be invoked at all. So what you are saying here is that Python doesn't use the "A" forms of the Windows APIs for filenames, but only the "W" forms, and uses lossy decoding (from MS) to the current code page (which can never be UTF-8 on Windows). 
You are further saying that Python doesn't give the programmer control over the codec that is used to convert from W results to bytes, so that on Windows, it is impossible to obtain a bytes result containing UTF-8 from os.listdir, even though sys.setfilesystemencoding exists, and sys.getfilesystemencoding is affected by it, and the latter is documented as returning "mbcs", and as returning the codec that should be used by the application to convert str to bytes for filenames. (Python 3.0.1). While I can hear a "that is outside the scope of the PEP" coming, this documentation is confusing, to say the least. Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90. Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP. You were just making false assumptions in your reasoning, assumptions that are way beyond the scope of the PEP. Absolutely correct. I was making what seemed to be reasonable assumptions about Python internals on Windows, and several of them are false, including misleading documentation for listdir (which doesn't specify that bytes and str parameters affect whether or
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 12:38 AM, came the following characters from the keyboard of Baptiste Carvello: Glenn Linderman wrote: 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP's strategy is that 1 character stays 1 character. Baptiste Except for half-surrogates that are in the file names already, which get converted to 3 characters. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Sure. However, that requires you to provide meaningful, reproducible > counter-examples, rather than a stenographic formulation that might > hint some problem you apparently see (which I believe is just not > there). Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surrogates. But such encodings are currently supported by Python, and they are used as part of CESU-8 coding. That's, in fact, a common way of converting UTF-16 to UTF-8. How are you going to deal with existing code that relies on being able to code half surrogates as UTF-8? Tom
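For reference, current Python 3 resolved this the other way: the strict UTF-8 codec rejects lone surrogates, and code that needs the CESU-8-style byte sequences Tom describes must opt in explicitly with the "surrogatepass" error handler. A sketch of that behavior:

```python
# Strict UTF-8 refuses to encode a lone half surrogate...
try:
    "\ud800".encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# ...but "surrogatepass" produces the CESU-8-style three-byte
# sequence for it, and decodes it back again losslessly.
cesu = "\ud800".encode("utf-8", "surrogatepass")
assert cesu == b"\xed\xa0\x80"
assert cesu.decode("utf-8", "surrogatepass") == "\ud800"
```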
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates? The problem with your "escape character" scheme is that the meaning is lost with slicing of the strings, which is a very common operation. I thought half-surrogates were illegal in well formed Unicode. I confess to being weak in this area. By "legitimate" above I meant things like half-surrogates which, like quarks, should not occur alone? "Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. [...] Python could as well *specify* that lone surrogates are illegal, as their meaning is undefined by Unicode. If this rule is respected language-wise, there is no ambiguity. It might be unrealistic on windows, though. This rule could even be specified only for strings that represent filesystem paths. Sure, they are the same type as other strings, but the programmer usually knows if a given string is intended to be a path or not. Baptiste
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Lino Mastrodomenico wrote: Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). This is questionable. This would have the consequence that \udcxx in a python string would sometimes mean a surrogate, and sometimes mean raw bytes, depending on the history of the string. By contrast, if the new utf-8b codec would *supersede* the old one, \udcxx would always mean raw bytes (at least on UCS-4 builds, where surrogates are unused). Thus ambiguity could be avoided. Baptiste
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote: If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here: >>> '\xff'.decode('iso-8859-15') u'\xff' >>> '\xc3\xbf'.decode('iso-8859-15') u'\xc3\xbf' Here is what I mean by "switch to iso8859-15" only in the presence of undecodable UTF-8:

def file_name_to_unicode(fn, encoding):
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')

Now, assume a UTF-8 locale and try to use it on the provided example file names. >>> file_name_to_unicode(b'\xff', 'utf-8') 'ÿ' >>> file_name_to_unicode(b'\xc3\xbf', 'utf-8') 'ÿ' That is the ambiguity I was referring to -- two different byte sequences result in the same unicode string.
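The ambiguity demonstrated above, and the PEP's answer to it, can both be checked in Python 3, where the PEP's error handler was eventually named "surrogateescape" (a sketch; the fallback function mirrors the one in the message):

```python
def file_name_to_unicode(fn, encoding):
    """Decode with the locale encoding, falling back to iso-8859-15."""
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode("iso-8859-15")

# The fallback maps two different byte names to the same string:
a = file_name_to_unicode(b"\xff", "utf-8")      # undecodable -> fallback
b = file_name_to_unicode(b"\xc3\xbf", "utf-8")  # valid UTF-8 for U+00FF
assert a == b == "\xff"                         # the ambiguity

# The PEP's handler keeps them distinct, and round-trippable:
c = b"\xff".decode("utf-8", "surrogateescape")
d = b"\xc3\xbf".decode("utf-8", "surrogateescape")
assert c == "\udcff" and d == "\xff" and c != d
assert c.encode("utf-8", "surrogateescape") == b"\xff"
```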
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP's strategy is that 1 character stays 1 character. Baptiste
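The slicing hazard can be made concrete with a toy example. Here U+203D stands in as a hypothetical escape character (both encodings below are illustrative, not anything specified by the PEP or the proposal):

```python
ESC = "\u203d"                      # hypothetical escape character
two_cp = "dir" + ESC + "\u01ff"     # byte 0xFF as ESC + U+01FF (2 codepoints)
one_cp = "dir" + "\udcff"           # byte 0xFF as one surrogate (PEP style)

# Slicing off the last character strands a bare escape marker...
assert two_cp[:-1].endswith(ESC)
# ...whereas the one-codepoint encoding drops the whole escape, or none of it.
assert one_cp[:-1] == "dir"
```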
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk > with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is that an alternative to A and B? >>> I guess it is an adjunct to case B, the current PEP. >>> >>> It is what happens when using the PEP on a system that provides both >>> bytes and str interfaces, and both get used. >> >> Your formulation is a bit too stenographic to me, but please trust me >> that there is *no* ambiguity in the case you construct. > > > No Martin, the point of reviewing the PEP is to _not_ trust you, even > though you are generally very knowledgeable and very trustworthy. It is > much easier to find problems before something is released, or even > coded, than it is afterwards. Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). > You assumed, and maybe I wasn't clear in my statement. > > By "accessed via the str interface" I mean that (on Windows) the wide > string interface would be used to obtain a file name. What does that mean? What specific interface are you referring to to obtain file names? Most of the time, file names are obtained by the user entering them on the keyboard. GUI applications are completely out of the scope of the PEP. > Now, suppose that > the file name returned contains "abc" followed by the half-surrogate > U+DC10 -- four 16-bit codes. Ok, so perhaps you might be talking about os.listdir here. Communication would be much easier if I would not need to guess what you may mean. Also, why is U+DC10 four 16-bit codes? > Then, ask for the same filename via the bytes interface, using UTF-8 > encoding. How do you do that on Windows? 
You cannot just pick an encoding, such as UTF-8, and pass that to the byte interface, and expect it to work. If you use the byte interface, you need to encode in the file system encoding, of course. Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU USING Please use specific python code. > The PEP says that the above name would get translated to > "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes > used to represent the half-surrogate that is actually in the file name, > specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can > be seen as two different names in memory. You are relying on false assumptions here, namely that the UTF-8 encoding would play any role. What would happen instead is that the "mbcs" encoding would be used. The "mbcs" encoding, by design from Microsoft, will never report an error, so the error handler will not be invoked at all. > Now posit another file which, when accessed via the str interface, has > the name "abc" followed by U+DCED U+DCB0 U+DC90. > > Looks ambiguous to me. Now if you have a scheme for handling this case, > fine, but I don't understand it from what is written in the PEP. You were just making false assumptions in your reasoning, assumptions that are way beyond the scope of the PEP. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is *no* ambiguity in the case you construct. No Martin, the point of reviewing the PEP is to _not_ trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards. By "accessed via the str interface", I assume you do something like fn = "some string" open(fn) You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??). What happens instead is that fn gets *encoded* with the file system encoding, and the python-escape handler. This will *not* produce an ambiguity. You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes. Then, ask for the same filename via the bytes interface, using UTF-8 encoding. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. 
This means that one name on disk can be seen as two different names in memory. Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90. Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP. If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. *Of course* you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that. Yes, this would be a ridiculous interpretation of "ambiguous". -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I'm more concerned with your (yours? someone else's?) mention of shift > characters. I'm unfamiliar with these encodings: to translate such a > thing into a Latin example, is it the case that there are schemes with > valid encodings that look like: > > [SHIFT] a b c > > which would produce "ABC" in unicode, which is ambiguous with: > > A B C > > which would also produce "ABC"? No: the "shift" in "shift-jis" is not really about the shift key. See http://en.wikipedia.org/wiki/Shift-JIS Regards, Martin
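To make the "not about the shift key" point concrete: Shift JIS is a stateless multibyte encoding. A lead byte changes how the immediately following byte is read, but there is no standalone mode-setting [SHIFT] character, so the ambiguity sketched in the question cannot arise. A quick check:

```python
# 0x82 0xA0 is one two-byte Shift JIS character (HIRAGANA LETTER A, U+3042),
# not a "shift" byte followed by an independent character.
assert b"\x82\xa0".decode("shift_jis") == "\u3042"

# ASCII bytes decode to themselves; no mode state is carried along.
assert b"abc".decode("shift_jis") == "abc"
```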
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> The Python UTF-8 codec will happily encode half-surrogates; people argue >> that it is a bug that it does so, however, it would help in this >> specific case. > > Can we use this encoding scheme for writing into files as well? We've > turned the filename with undecodable bytes into a string with half > surrogates. Putting that string into a file has to turn them into bytes > at some level. Can we use the python-escape error handler to achieve > that somehow? Sure: if you are aware that what you write to the stream is actually a file name, you should encode it with the file system encoding, and the python-escape handler. However, it's questionable that the same approach is right for the rest of the data that goes into the file. If you use a different encoding on the stream, yet still use the python-escape handler, you may end up with completely non-sensical bytes. In practice, it probably won't be that bad - python-escape has likely escaped all non-ASCII bytes, so that on re-encoding with a different encoding, only the ASCII characters get encoded, which likely will work fine. Regards, Martin
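Martin's advice can be sketched concretely, assuming a UTF-8 file system encoding and using "surrogateescape", the name the python-escape handler eventually received in Python 3:

```python
fs_encoding = "utf-8"  # stand-in for sys.getfilesystemencoding()

# A listdir-style name containing one undecodable byte (0xFF):
name = b"log.\xff".decode(fs_encoding, "surrogateescape")
assert name == "log.\udcff"

# Re-encoding with the same encoding and handler restores the
# original bytes exactly, undecodable byte included.
assert name.encode(fs_encoding, "surrogateescape") == b"log.\xff"

# A plain text codec without the handler refuses the surrogate:
try:
    name.encode(fs_encoding)
    failed = False
except UnicodeEncodeError:
    failed = True
assert failed
```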
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> C. File on disk with the invalid surrogate code, accessed via the str >>> interface, no decoding happens, matches in memory the file on disk with >>> the byte that translates to the same surrogate, accessed via the bytes >>> interface. Ambiguity. >> >> Is that an alternative to A and B? > > I guess it is an adjunct to case B, the current PEP. > > It is what happens when using the PEP on a system that provides both > bytes and str interfaces, and both get used. Your formulation is a bit too stenographic to me, but please trust me that there is *no* ambiguity in the case you construct. By "accessed via the str interface", I assume you do something like fn = "some string" open(fn) You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??). What happens instead is that fn gets *encoded* with the file system encoding, and the python-escape handler. This will *not* produce an ambiguity. If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. *Of course* you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that. Regards, Martin
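Martin's no-ambiguity claim is mechanically checkable with the handler PEP 383 introduced ("surrogateescape" in Python 3): bytes that happen to spell the ill-formed UTF-8 encoding of a half surrogate are escaped byte-for-byte into three surrogates, so they can never collide with the decoding of a single undecodable byte:

```python
# b"\xed\xb0\x90" is the ill-formed UTF-8 byte sequence for U+DC10.
# The utf-8b decoder rejects it and escapes each byte separately:
s = b"abc\xed\xb0\x90".decode("utf-8", "surrogateescape")
assert s == "abc\udced\udcb0\udc90"

# A single undecodable byte decodes to a single, different surrogate:
t = b"abc\x90".decode("utf-8", "surrogateescape")
assert t == "abc\udc90" and s != t

# Both round-trip to their original, distinct byte strings:
assert s.encode("utf-8", "surrogateescape") == b"abc\xed\xb0\x90"
assert t.encode("utf-8", "surrogateescape") == b"abc\x90"
```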
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson: I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). Close. You at least resolved what you thought my issue was. And, you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced. Or, to use the bytes APIs directly to get file names for 3rd party libraries -- but the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked to do something different than they did before. On 27Apr2009 23:52, Glenn Linderman wrote: On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: [...] There may be puns. So what? Use the right strings for the right purpose and all will be well. I think what is missing here, and missing from Martin's PEP, is some utility functions for the os.* namespace. PROPOSAL: add to the PEP the following functions: os.fsdecode(bytes) -> funny-encoded Unicode This is what os.listdir() does to produce the strings it hands out. os.fsencode(funny-string) -> bytes This is what open(filename,..) does to turn the filename into bytes for the POSIX open. os.pathencode(your-string) -> funny-encoded-Unicode This is what you must do to a de novo string to turn it into a string suitable for use by open. Importantly, for most strings not hand crafted to have weird sequences in them, it is a no-op. But it will recode your puns for survival. [...] So assume a non-decodable sequence in a name. 
That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns. See my proposal above. Does it address your concerns? A program still must know the providence of the string, and _if_ you're working with non-decodable sequences in a names then you should transmute then into the funny encoding using the os.pathencode() function described above. In this way the punning issue can be avoided. _Lacking_ such a function, your punning concern is valid. Seems like one would also desire os.pathdecode to do the reverse. Yes. And also versions that take or produce bytes from funny-encoded strings. Isn't that the first two functions above? Yes, sorry. Then, if programs were re-coded to perform these transformations on what you call de novo strings, then the scheme would work. But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists. I agree no such scheme exists. I don't think it can, just using strings. But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such. The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged. Right. And I don't intend to generate ill-formed Unicode strings, in my programs. 
But I might well read their names from other sources. It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms described below. If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of
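Cameron's proposed os.fsdecode/os.fsencode pair can be sketched directly in terms of the PEP's error handler ("surrogateescape" in Python 3). The explicit encoding parameter here is illustrative; a real implementation would consult sys.getfilesystemencoding():

```python
def fsdecode(raw: bytes, encoding: str = "utf-8") -> str:
    """What os.listdir() would do to produce funny-encoded str names."""
    return raw.decode(encoding, "surrogateescape")

def fsencode(name: str, encoding: str = "utf-8") -> bytes:
    """What open() would do to turn a funny-encoded name back into bytes."""
    return name.encode(encoding, "surrogateescape")

# Round trip for a name with an undecodable byte; a no-op for ordinary text.
assert fsencode(fsdecode(b"caf\xe9")) == b"caf\xe9"
assert fsdecode(b"caf\xc3\xa9") == "café"
```

(Functions with exactly these names and semantics were later added to the os module.)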
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 13:37, Glenn Linderman wrote: > On approximately 4/28/2009 1:25 PM, came the following characters from > the keyboard of Martin v. Löwis: >>> The UTF-8b representation suffers from the same potential ambiguities as >>> the PUA characters... >> >> Not at all the same ambiguities. Here, again, the two choices: >> >> A. use PUA characters to represent undecodable bytes, in particular for >>UTF-8 (the PEP actually never proposed this to happen). >>This introduces an ambiguity: two different files in the same >>directory may decode to the same string name, if one has the PUA >>character, and the other has a non-decodable byte that gets decoded >>to the same PUA character. >> >> B. use UTF-8b, representing the byte will ill-formed surrogate codes. >>The same ambiguity does *NOT* exist. If a file on disk already >>contains an invalid surrogate code in its file name, then the UTF-8b >>decoder will recognize this as invalid, and decode it byte-for-byte, >>into three surrogate codes. Hence, the file names that are different >>on disk are also different in memory. No ambiguity. > > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is this a Windows example, or (now I think on it) an equivalent POSIX example of using the PEP where the locale encoding is UTF-16? In either case, I would say one could make an argument for being stricter in reading in OS-native sequences. Grant that NTFS doesn't prevent half-surrogates in filenames, and likewise that POSIX won't because to the OS they're just bytes. On decoding, require well-formed data. When you hit ill-formed data, treat the nasty half surrogate as a PAIR of bytes to be escaped in the resulting decode. Ambiguity avoided. I'm more concerned with your (yours? someone else's?) mention of shift characters. 
I'm unfamiliar with these encodings: to translate such a thing into a Latin example, is it the case that there are schemes with valid encodings that look like: [SHIFT] a b c which would produce "ABC" in unicode, which is ambiguous with: A B C which would also produce "ABC"? Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ Helicopters are considerably more expensive [than fixed wing aircraft], which is only right because they don't actually fly, but just beat the air into submission.- Paul Tomblin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote: >> Since the serialization of the Unicode string is likely to use UTF-8, >> and the string for such a file will include half surrogates, the >> application may raise an exception when encoding the names for a >> configuration file. These encoding exceptions will be as rare as the >> unusual names (which the careful I18N aware developer has probably >> eradicated from his system), and thus will appear late. > > There are trade-offs to any solution; if there was a solution without > trade-offs, it would be implemented already. > > The Python UTF-8 codec will happily encode half-surrogates; people argue > that it is a bug that it does so, however, it would help in this > specific case. Can we use this encoding scheme for writing into files as well? We've turned the filename with undecodable bytes into a string with half surrogates. Putting that string into a file has to turn them into bytes at some level. Can we use the python-escape error handler to achieve that somehow? -Toshio
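[Editorial note: this is essentially how it turned out. In the released implementation the handler is named `surrogateescape`, and `open()` accepts it via the `errors` argument, so the half-surrogate string can round-trip through a text file. A sketch against modern Python:]

```python
import os
import tempfile

# A filename decoded with surrogateescape may contain lone surrogates:
name = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')  # 'caf\udce9.txt'

# Passing the same error handler to open() turns those surrogates back
# into the original bytes on the way out, and recovers them on the way in:
d = tempfile.mkdtemp()
cfg = os.path.join(d, 'names.txt')
with open(cfg, 'w', encoding='utf-8', errors='surrogateescape') as f:
    f.write(name + '\n')
with open(cfg, encoding='utf-8', errors='surrogateescape') as f:
    assert f.read().rstrip('\n') == name
```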
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 14:37, Thomas Breuel wrote: | But the biggest problem with the proposal is that it isn't needed: if you | want to be able to turn arbitrary byte sequences into unicode strings and | back, just set your encoding to iso8859-15. That already works and it | doesn't require any changes. No it doesn't. It does transcode without throwing exceptions. On POSIX. (On Windows? I doubt it - windows isn't using an 8-bit scheme. I believe.) But it utterly destroys any hope of working in any other locale nicely. The PEP lets you work losslessly in other locales. It _may_ require some app care for particular very weird strings that don't come from the filesystem, but as far as I can see only in circumstances where such care would be needed anyway i.e. you've got to do special stuff for weirdness in the first place. Weird == "ill-formed unicode string" here. Cheers, -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/ I just kept it wide-open thinking it would correct itself. Then I ran out of talent. - C. Fittipaldi
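[Editorial note: Cameron's "destroys any hope of working in any other locale" is easy to demonstrate; decoding everything as iso8859-15 never raises, but it mangles text from every other encoding. A quick illustration in modern Python:]

```python
# UTF-8 encoded 'é' read through an iso8859-15 lens becomes two mojibake
# characters, so every correctly encoded non-ASCII name is ruined:
raw = 'é'.encode('utf-8')                 # b'\xc3\xa9'
assert raw.decode('iso8859-15') == 'Ã©'   # readable name destroyed

# The PEP's approach (surrogateescape in released Python) leaves cleanly
# decodable names intact and only escapes the truly undecodable bytes:
assert raw.decode('utf-8', 'surrogateescape') == 'é'
```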
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity. --David
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote:
> On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
>> If you switch to iso8859-15 only in the presence of undecodable UTF-8,
>> then you have the same round-trip problem as the PEP: both b'\xff' and
>> b'\xc3\xbf' will be converted to u'\u00ff' without a way to
>> unambiguously recover the original file name.
>
> Why do you say that? It seems to work as I expected here:
> >>> '\xff'.decode('iso-8859-15')
> u'\xff'
> >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'
> >>> '\xff'.decode('cp1252')
> u'\xff'
> >>> '\xc3\xbf'.decode('cp1252')
> u'\xc3\xbf'

You're not showing that this is a fallback path. What won't work is first trying a local encoding (in the following example, utf-8) and then if that doesn't work, trying a one-byte encoding like iso8859-15:

try:
    file1 = '\xff'.decode('utf-8')
except UnicodeDecodeError:
    file1 = '\xff'.decode('iso8859-15')
print repr(file1)

try:
    file2 = '\xc3\xbf'.decode('utf-8')
except UnicodeDecodeError:
    file2 = '\xc3\xbf'.decode('iso8859-15')
print repr(file2)

That prints:

u'\xff'
u'\xff'

The two encodings can map different bytes to the same unicode code point so you can't do this type of thing without recording what encoding was used in the translation. -Toshio
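[Editorial note: Toshio's Python 2 example transliterates directly to Python 3, and the same test shows how the PEP's handler avoids the collision. A sketch, assuming modern Python's `surrogateescape`:]

```python
def decode_fallback(raw: bytes) -> str:
    # Toshio's two-step decode: try the local encoding, fall back to a
    # one-byte encoding when that fails.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso8859-15')

# Both byte strings collapse to the same text -- the ambiguity:
assert decode_fallback(b'\xff') == decode_fallback(b'\xc3\xbf') == '\xff'

# surrogateescape keeps them distinct, so each round-trips unambiguously:
assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'
assert b'\xc3\xbf'.decode('utf-8', 'surrogateescape') == '\xff'
```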
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB: Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes. Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s). OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them? My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement. Anyway, I think we are communicating successfully. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP. 3. 
Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception". Yes, your rephrasing is clearer, regarding your intention. For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00. Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either. I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though... 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, this is easier for humans to comprehend. 2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it. 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. 4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF would raise an exception. 5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces. This differs from my previous proposal in three ways: A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then). B. 
Allows for a choice of escape codepoint, the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice. C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another. Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters.
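[Editorial note: a rough sketch of the scheme proposed above. The function names `to_visible`/`from_visible` and the use of `surrogateescape` as the stand-in for the underlying funny-decode are illustrative assumptions, not any shipped implementation; `'?'` (U+003F) is the escape codepoint the proposal suggests.]

```python
ESC = '\u003f'  # '?', the suggested escape codepoint (illustrative choice)

def to_visible(s: str) -> str:
    """Escape a funny-decoded string into displayable form (rules 2 and 3)."""
    out = []
    for ch in s:
        if ch == ESC:
            out.append(ESC + ESC)  # rule 2: double the escape codepoint
        elif '\udc00' <= ch <= '\udcff':  # a half surrogate for byte 0xPQ
            out.append(ESC + chr(0x0100 + ord(ch) - 0xDC00))  # rule 3: U+01PQ
        else:
            out.append(ch)
    return ''.join(out)

def from_visible(s: str) -> str:
    """Reverse the escaping (rule 4); malformed sequences raise."""
    out, i = [], 0
    while i < len(s):
        ch = s[i]
        if ch != ESC:
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(s):
            raise ValueError('dangling escape codepoint')
        nxt = s[i + 1]
        if nxt == ESC:
            out.append(ESC)
        elif 0x0100 <= ord(nxt) <= 0x01FF:
            out.append(chr(0xDC00 + ord(nxt) - 0x0100))
        else:
            raise ValueError('bad escape sequence')
        i += 2
    return ''.join(out)

# An undecodable byte shows up as '?' plus a visible Latin Extended char,
# and the transformation is reversible:
name = b'ab\xff'.decode('utf-8', 'surrogateescape')   # 'ab\udcff'
assert to_visible(name) == 'ab?' + chr(0x01FF)
assert from_visible(to_visible(name)) == name
```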
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). On 27Apr2009 23:52, Glenn Linderman wrote: > On approximately 4/27/2009 7:11 PM, came the following characters from > the keyboard of Cameron Simpson: [...] >> There may be puns. So what? Use the right strings for the right purpose >> and all will be well. >> >> I think what is missing here, and missing from Martin's PEP, is some >> utility functions for the os.* namespace. >> >> PROPOSAL: add to the PEP the following functions: >> >> os.fsdecode(bytes) -> funny-encoded Unicode >> This is what os.listdir() does to produce the strings it hands out. >> os.fsencode(funny-string) -> bytes >> This is what open(filename,..) does to turn the filename into bytes >> for the POSIX open. >> os.pathencode(your-string) -> funny-encoded-Unicode >> This is what you must do to a de novo string to turn it into a >> string suitable for use by open. >> Importantly, for most strings not hand crafted to have weird >> sequences in them, it is a no-op. But it will recode your puns >> for survival. [...] >>> So assume a non-decodable sequence in a name. That puts us into >>> Martin's funny-decode scheme. His funny-decode scheme produces a >>> bare string, indistinguishable from a bare string that would be >>> produced by a str API that happens to contain that same sequence. >>> Data puns. >>> >> >> See my proposal above. Does it address your concerns? A program still >> must know the provenance of the string, and _if_ you're working with >> non-decodable sequences in names then you should transmute them into >> the funny encoding using the os.pathencode() function described above. >> >> In this way the punning issue can be avoided. >> _Lacking_ such a function, your punning concern is valid. > Seems like one would also desire os.pathdecode to do the reverse. Yes. > And > also versions that take or produce bytes from funny-encoded strings. 
Isn't that the first two functions above? > Then, if programs were re-coded to perform these transformations on what > you call de novo strings, then the scheme would work. > But I think a large part of the incentive for the PEP is to try to > invent a scheme that intentionally allows for the puns, so that programs > do not need to be recoded in this manner, and yet still work. I don't > think such a scheme exists. I agree no such scheme exists. I don't think it can, just using strings. But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such. The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged. > If there is going to be a required transformation from de novo strings > to funny-encoded strings, then why not make one that people can actually > see and compare and decode from the displayable form, by using > displayable characters instead of lone surrogates? Because that would _not_ be a no-op for well formed Unicode strings. That reason is sufficient for me. I consider the fact that well-formed Unicode -> funny-encoded is a no-op to be an enormous feature of Martin's scheme. Unless I'm missing something, there _are_no_puns_ between funny-encoded strings and well formed unicode strings. I suppose if your program carefully constructs a unicode string riddled with half-surrogates etc and imagines something specific should happen to them on the way to being POSIX bytes then you might have a problem... >>> Right. Or someone else's program does that. 
I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a coffee, reading section 3.9 (Unicode Encoding Forms). I now do not believe your scenario makes sense. Someone can construct a Python3 string containing code points that includes surrogates. Granted. However such a string is not meaningful because it is not well-formed (D85). It's ill-formed (D84). It is not sane to expect it to translate into a POSIX byte sequence, be it UTF-8 or anything else, unless it is accompanied by some kind of explicit mapping provided by the programmer. Absent that mapping, it's nonsense in much the same way that a non-decodable UTF-8 byte sequence is nonsense. For example, Martin's funny-encoding is such an explicit mapping. >>>I only want to use >>> Unicode file names. But if those other file names exist, I want to >>> be able to access them, and not accidentally get a different file. But those other
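[Editorial note: the first two helpers Cameron proposes, `os.fsdecode` and `os.fsencode`, did eventually land (in Python 3.2) with essentially the proposed behavior: filesystem encoding plus the PEP's error handler. A quick check against modern Python:]

```python
import os

# On a POSIX system with a UTF-8 filesystem encoding, an undecodable
# byte becomes a lone half surrogate, and the round trip is lossless:
name = os.fsdecode(b'ab\xff')           # 'ab\udcff' under UTF-8
assert os.fsencode(name) == b'ab\xff'   # recodes back to the same bytes
```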
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the byte will ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte, into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an adjunct to case B, the current PEP. It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used. On a Windows system, perhaps the ambiguous case would be the use of the str API and bytes APIs producing different memory names for the same file that contains a (Unicode-illegal) half surrogate. The half-surrogate would seem to get decoded to 3 half surrogates if accessed via the bytes interface, but only one via the str interface. 
The version with 3 half surrogates could match another name that actually contains 3 half surrogates, that is accessed via the str interface. I can't actually tell by reading the PEP whether it affects Windows bytes interfaces or is only implemented on POSIX, so that POSIX has a str interface. If it is only implemented on POSIX, then the current scheme (now escaping the hundreds of escape codes) could work, within a single platform... but it would still suffer from displaying garbage (sequences of replacement characters) in file listings displayed or printed. There is no way, once the string is adjusted to contain replacement characters for display, to distinguish one file name from another, if they are identical except for a same-length sequence of different undecodable bytes. The concept of a function that allows the same decoding and encoding process for 3rd party interfaces is still missing from the PEP; implementation of the PEP would require that all interfaces to 3rd party software that accept file names would have to be transcoded by the interface layer. Or else such software would have to use the bytes interfaces directly, and if they do, there is no need for the PEP. So I see the PEP as a partial solution to a limited problem, that on the one hand potentially produces indistinguishable sequences of replacement characters in filenames, rather than the mojibake (which is at least distinguishable), and on the other hand, doesn't help software that also uses 3rd party libraries to avoid the use of bytes APIs for accessing file names. There are other encodings that produce more distinguishable mojibake, and would work in the same situations as the PEP. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: > On approximately 4/28/2009 1:25 PM, came the following characters from > the keyboard of Martin v. Löwis: >>> The UTF-8b representation suffers from the same potential ambiguities as >>> the PUA characters... >> >> Not at all the same ambiguities. Here, again, the two choices: >> >> A. use PUA characters to represent undecodable bytes, in particular for >>UTF-8 (the PEP actually never proposed this to happen). >>This introduces an ambiguity: two different files in the same >>directory may decode to the same string name, if one has the PUA >>character, and the other has a non-decodable byte that gets decoded >>to the same PUA character. >> >> B. use UTF-8b, representing the byte with ill-formed surrogate codes. >>The same ambiguity does *NOT* exist. If a file on disk already >>contains an invalid surrogate code in its file name, then the UTF-8b >>decoder will recognize this as invalid, and decode it byte-for-byte, >>into three surrogate codes. Hence, the file names that are different >>on disk are also different in memory. No ambiguity. > > C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is that an alternative to A and B? Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes. Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s). 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception". For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00. I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though... 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, this is easier for humans to comprehend. 
2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it. 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. 4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF would raise an exception. 5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces. This differs from my previous proposal in three ways: A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then). B. Allows for a choice of escape codepoint, the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice. C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another. Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. Perhaps the escape character should be U+005C. 
;-) It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Others have made this suggestion, and it is helpful to the PEP, but not > sufficient. As implemented as an error handler, I'm not sure that the > b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 > decoder is happy with it. Which, in my testing, it is. Rest assured that the utf-8b codec will work the way it is specified. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the byte with ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte, into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico: 2009/4/28 Glenn Linderman : The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names. It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Wrong. An 8859-1 locale allows any byte sequence to placed into a POSIX filename. And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. 
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
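Glenn's doubt above turns on whether the UTF-8 decoder accepts b'\xed\xb3\xbf' (Python 3.0's decoder did). Under the strict UTF-8 definition the PEP assumes, which later Pythons implement, the sequence is rejected and so does reach the error handler; a minimal check:

```python
# Python 3.0's decoder accepted UTF-8-encoded surrogates; the PEP assumes
# a strict decoder. Under a strict decoder (Python 3.1 and later), the
# three-byte encoding of a lone surrogate is invalid and is handed to the
# error handler rather than silently decoded to '\udcff'.
try:
    b'\xed\xb3\xbf'.decode('utf-8')
    reaches_handler = False
except UnicodeDecodeError:
    reaches_handler = True

assert reaches_handler
```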
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though... 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, this is easier for humans to comprehend. 2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it. 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. 4. 
When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF would raise an exception. 5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces. This differs from my previous proposal in three ways: A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then). B. Allows for a choice of escape codepoint, the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice. C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another. Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. 
This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
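Glenn's five rules can be sketched as a pair of Python functions. This is only an illustration of the proposal, not an implementation anyone has adopted: the helper names are invented here, U+003F ('?') is used as the escape codepoint per Glenn's recommendation, and the maximal-prefix decode loop is deliberately naive:

```python
ESCAPE = '\u003f'  # '?', the escape codepoint Glenn suggests (rule 1)

def escape_decode(raw: bytes, encoding: str = 'utf-8') -> str:
    """Decode file name bytes; double literal escapes (rule 2) and map an
    undecodable byte 0xPQ to ESCAPE + U+01PQ (rule 3)."""
    out, i = [], 0
    while i < len(raw):
        # Try to decode a maximal valid prefix starting at i.
        for j in range(len(raw), i, -1):
            try:
                chunk = raw[i:j].decode(encoding)
            except UnicodeDecodeError:
                continue
            out.append(chunk.replace(ESCAPE, ESCAPE * 2))
            i = j
            break
        else:
            out.append(ESCAPE + chr(0x0100 + raw[i]))  # byte 0xPQ -> U+01PQ
            i += 1
    return ''.join(out)

def escape_encode(name: str, encoding: str = 'utf-8') -> bytes:
    """Reverse the escaping (rule 4); lone escapes raise an exception."""
    out, i = bytearray(), 0
    while i < len(name):
        c = name[i]
        if c == ESCAPE:
            if i + 1 < len(name) and name[i + 1] == ESCAPE:
                out += ESCAPE.encode(encoding)        # doubled -> literal
                i += 2
            elif i + 1 < len(name) and 0x0100 <= ord(name[i + 1]) <= 0x01FF:
                out.append(ord(name[i + 1]) - 0x0100)  # U+01PQ -> byte 0xPQ
                i += 2
            else:
                raise ValueError('lone escape codepoint at index %d' % i)
        else:
            out += c.encode(encoding)
            i += 1
    return bytes(out)

# An undecodable byte round-trips, and a literal '?' survives doubling:
assert escape_encode(escape_decode(b'caf\xe9')) == b'caf\xe9'
assert escape_decode(b'a?b') == 'a??b'
assert escape_encode('a??b') == b'a?b'
```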
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The UTF-8b representation suffers from the same potential ambiguities as > the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the bytes with ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte, into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents). A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent. 
From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. 
But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify. Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to: If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified. iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? For what it is worth, what we have previously planned to do for the Tahoe project is the second of these -- decode using some 1-byte encoding such as iso-8859-1, iso-8859-15, or windows-1252 only in the case that attempting to decode the bytes using the local alleged encoding failed. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here:

>>> '\xff'.decode('iso-8859-15')
u'\xff'
>>> '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'
>>> '\xff'.decode('cp1252')
u'\xff'
>>> '\xc3\xbf'.decode('cp1252')
u'\xc3\xbf'

Regards, Zooko
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. 
RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. 
But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. I think I've covered all the possibilities. :-)
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:00 AM, came the following characters from the keyboard of Martin v. Löwis: An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... perhaps slightly less likely in practice, due to the use of Unicode-illegal characters, but exactly the same theoretical likelihood in the space of Python-acceptable character codes. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. 
RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. 
James
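James's restricted proposal (only 0x80-0xFF mapped to U+DC80-U+DCFF, with ASCII-range bytes refused for the security reason discussed above) can be sketched as a Python error handler registered under a made-up name; this is essentially the behaviour that later shipped as 'surrogateescape', and the handler name and code here are illustrative only:

```python
import codecs

def python_escape_sketch(exc):
    # Decoding: map each undecodable byte 0x80-0xFF to U+DC80-U+DCFF.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if any(b < 0x80 for b in bad):
            raise exc  # never escape ASCII-range bytes (no U+DC2F -> '/')
        return ''.join(chr(0xDC00 + b) for b in bad), exc.end
    # Encoding: map U+DC80-U+DCFF back to the original byte.
    if isinstance(exc, UnicodeEncodeError):
        chars = exc.object[exc.start:exc.end]
        if not all(0xDC80 <= ord(c) <= 0xDCFF for c in chars):
            raise exc
        return bytes(ord(c) - 0xDC00 for c in chars), exc.end
    raise exc

codecs.register_error('python-escape-sketch', python_escape_sketch)

# The byte 0xFF round-trips through the handler losslessly:
s = b'\xff'.decode('utf-8', 'python-escape-sketch')
assert s == '\udcff'
assert s.encode('utf-8', 'python-escape-sketch') == b'\xff'
```

Note the encode-side handler returns bytes directly, which the codec machinery splices into the output; this is what makes the reverse mapping possible without touching the codec itself.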
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> If the PEP depends on this being changed, it should be mentioned in the > PEP. The PEP says that the utf-8b codec decodes invalid bytes into low surrogates. I have now clarified that a strict definition of UTF-8 is assumed for utf-8b. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Since the serialization of the Unicode string is likely to use UTF-8, > and the string for such a file will include half surrogates, the > application may raise an exception when encoding the names for a > configuration file. These encoding exceptions will be as rare as the > unusual names (which the careful I18N aware developer has probably > eradicated from his system), and thus will appear late. There are trade-offs to any solution; if there was a solution without trade-offs, it would be implemented already. The Python UTF-8 codec will happily encode half-surrogates; people argue that it is a bug that it does so, however, it would help in this specific case. An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. > Or say de/serialization succeeds. Since the resulting Unicode string > differs depending on the encoding (which is a good thing; it is > supposed to make most cases mostly readable), when the filesystem > encoding changes (say from legacy to UTF-8), the "name" changes, and > deserialized references to it become stale. That problem has nothing to do with the PEP. If the encoding changes, LRU entries may get stale even if there were no encoding errors at all. Suppose the old encoding was Latin-1, and the new encoding is KOI8-R, then all file names are decodable before and afterwards, yet the string representation changes. Applications that want to protect themselves against that happening need to store byte representations of the file names, not character representations. Depending on the configuration file format, that may or may not be possible. 
I find the case pretty artificial, though: if the locale encoding changes, all file names will look incorrect to the user, so he'll quickly switch back, or rename all the files. As an application supporting an LRU list, I would remove/hide all entries that don't correlate to existing files - after all, the user may have as well deleted the file in the LRU list. Regards, Martin
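Martin's Latin-1/KOI8-R point is easy to illustrate: the same on-disk bytes decode cleanly under both encodings, yet yield different strings, so character-based references go stale on a locale switch even with no decoding errors anywhere (the sample bytes below are arbitrary):

```python
raw = b'\xc4\xc5'                    # some on-disk file name bytes
as_latin1 = raw.decode('latin-1')    # decodes fine under the old locale
as_koi8r = raw.decode('koi8-r')      # decodes fine under the new one, too
assert as_latin1 != as_koi8r         # ...but the string has changed
```

Storing the byte representation instead sidesteps the problem, which is why Martin recommends it where the configuration file format allows.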
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is > not a valid Unicode character (not a character at all, really) and the > only way you can put this in a POSIX filename is if you use a very > lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. > > Since this byte sequence doesn't represent a valid character when > decoded with UTF-8, it should simply be considered an invalid UTF-8 > sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* > '\udcff'). > > Martin: maybe the PEP should say this explicitly? Sure, will do. Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore writes: > But it seems to me that there is an assumption that problems will > arise when code gets a potentially funny-decoded string and doesn't > know where it came from. > > Is that a real concern? Yes, it's a real concern. I don't think it's possible to show a small piece of code one could point at and say "without a better API I bet you can't write this correctly," though. Rather, my experience with Emacs and various mail packages is that without type information it is impossible to keep track of the myriad bits and pieces of text that are recombining like pig flu, and eventually one breaks out and causes an error. It's usually easy to fix, but so are the next hundred similar regressions, and in the meantime a hundred users have suffered more or less damage or at least annoyance. There's no question that dealing with escapes of funny-decoded strings to unprepared code paths is mission creep compared to Martin's stated purpose for PEP 383, but it is also a real problem.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull wrote: > Nobody said we were at the stage of *saving* the [attachment]! But speaking of saving files, I think that's the biggest hole in this that has been nagging at the back of my mind. This PEP intends to allow easy access to filenames and other environment strings which are not restricted to known encodings. What happens if the detected encoding changes? There may be difficulties de/serializing these names, such as for an MRU list. Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the unusual names (which the careful I18N aware developer has probably eradicated from his system), and thus will appear late. Or say de/serialization succeeds. Since the resulting Unicode string differs depending on the encoding (which is a good thing; it is supposed to make most cases mostly readable), when the filesystem encoding changes (say from legacy to UTF-8), the "name" changes, and deserialized references to it become stale. This can probably be handled through careful use of the same encoding/decoding scheme, if relevant, but that sounds like we've just moved the problem from fs/environment access to serialization. Is that good enough? For other uses the API knew whether it was environmentally aware, but serialization probably will not. Should this PEP make recommendations about how to save filenames in configuration files? -- Michael Urman
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Hrvoje Niksic : > Lino Mastrodomenico wrote: >> >> Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid >> character when >> decoded with UTF-8, it should simply be considered an invalid UTF-8 >> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* >> '\udcff'). > > "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder > happily accepts it and returns u'\udcff': > b'\xed\xb3\xbf'.decode('utf-8') > '\udcff' Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). -- Lino Mastrodomenico
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder happily accepts it and returns u'\udcff':

>>> b'\xed\xb3\xbf'.decode('utf-8')
'\udcff'

If the PEP depends on this being changed, it should be mentioned in the PEP.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman : > The switch from PUA to half-surrogates does not resolve the issues with the > encoding not being a 1-to-1 mapping, though. The very fact that you think > you can get away with use of lone surrogates means that other people might, > accidentally or intentionally, also use lone surrogates for some other > purpose. Even in file names. It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' -- Lino Mastrodomenico
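The two round-trips Lino describes can be checked directly in a later Python, where the PEP's utf-8b behaviour ships as the 'surrogateescape' error handler (an anachronism relative to this thread, where the handler did not yet exist):

```python
# A raw byte 0xFF in a file name decodes to the lone surrogate U+DCFF
# and encodes back to the same byte:
name1 = b'\xff'.decode('utf-8', 'surrogateescape')
assert name1 == '\udcff'
assert name1.encode('utf-8', 'surrogateescape') == b'\xff'

# The three bytes a lenient encoder would emit for U+DCFF are themselves
# invalid UTF-8, so they decode byte-for-byte into three different
# surrogates -- no collision with name1, and the round-trip still holds:
name2 = b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape')
assert name2 == '\udced\udcb3\udcbf'
assert name2.encode('utf-8', 'surrogateescape') == b'\xed\xb3\xbf'
assert name1 != name2
```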
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodable bytes are encountered? If you unconditionally set encoding to iso8859-15, then you are effectively reverting to treating file names as bytes, regardless of the locale. You're also angering a lot of European users who expect iso8859-2, etc. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name.
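The collision in the second case is easy to demonstrate (a sketch; 0xFF is 'ÿ' in both ISO 8859-1 and 8859-15):

```python
# Two distinct byte filenames...
raw_invalid = b'\xff'        # not valid UTF-8
raw_valid = b'\xc3\xbf'      # valid UTF-8 for U+00FF 'ÿ'

# ...collapse to the same string under a "fall back to iso8859-15
# for undecodable names" policy:
assert raw_invalid.decode('iso8859-15') == '\xff'
assert raw_valid.decode('utf-8') == '\xff'

# Re-encoding that string recovers only one of the two originals,
# so the other filename becomes unreachable.
assert '\xff'.encode('utf-8') == raw_valid
```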
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> > Yep, that's the problem. Lots of theoretical problems no one has ever > encountered > brought up against a PEP which resolves some actual problems people > encounter on > a regular basis. How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do. But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
For what it's worth, the OSX APIs seem to behave as follows: * If you create a file with a non-UTF-8 name on a HFS+ filesystem the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system. * If you mount an NFS filesystem from a linux host and that directory contains a file named chr(255) - unix-level tools will see a file with the expected name (just like on linux) - Cocoa's NSFileManager returns u"?" as the filename, that is when the filename cannot be decoded using UTF-8 the name returned by the high-level API is mangled. This is regardless of the setting of LANG. - I haven't found a way yet to access files whose names are not valid UTF-8 using the high-level Cocoa APIs. The latter two are interesting because Cocoa has a unicode filesystem API on top of a POSIX C-API, just like Python 3.x. I guess the chosen behaviour works out on OSX (where users are unlikely to run into this issue), but could be more problematic on other POSIX systems. Ronald On 28 Apr, 2009, at 14:03, Michael Foord wrote: Paul Moore wrote: 2009/4/28 Antoine Pitrou : Paul Moore gmail.com> writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1). In case it's not clear, I am also +1 on the PEP as it stands. Me 2 Michael Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore wrote: 2009/4/28 Antoine Pitrou : Paul Moore gmail.com> writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1). In case it's not clear, I am also +1 on the PEP as it stands. Me 2 Michael Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Antoine Pitrou : > Paul Moore gmail.com> writes: >> >> I've yet to hear anyone claim that they would have an actual problem >> with a specific piece of code they have written. > > Yep, that's the problem. Lots of theoretical problems no one has ever > encountered > brought up against a PEP which resolves some actual problems people encounter > on > a regular basis. > > For the record, I'm +1 on the PEP being accepted and implemented as soon as > possible (preferably before 3.1). In case it's not clear, I am also +1 on the PEP as it stands. Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman : > So assume a non-decodable sequence in a name. That puts us into Martin's > funny-decode scheme. His funny-decode scheme produces a bare string, > indistinguishable from a bare string that would be produced by a str API > that happens to contain that same sequence. Data puns. > > So when open is handed the string, should it open the file with the name > that matches the string, or the file with the name that funny-decodes to the > same string? It can't know, unless it knows that the string is a > funny-decoded string or not. Sorry for picking on Glenn's comment - it's only one of many in this thread. But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from. Is that a real concern? How many programs really don't know where their data came from? Maybe a general-purpose library routine *might* just need to document explicitly how it handles funny-encoded data (I can't actually imagine anything that would, but I'll concede it may be possible) but that's just a matter of documenting your assumptions - no better or worse than many other cases. This all sounds similar to the idea of "tainted" data in security - if you lose track of untrusted data from the environment, you expose yourself to potential security issues. So the same techniques should be relevant here (including ignoring it if your application isn't such that it's a concern!) I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. (NB, if such a claim has been made, feel free to point me to it - I admit I've been skimming this thread at times). Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Does the PEP take into consideration the normalising behaviour of Mac > OSX ? We've had some ongoing challenges related to this in bzr. No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards, Martin
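The normalisation issue Rob refers to is that HFS+ stores filenames in a decomposed form, so a name created with composed characters can come back byte-for-byte different. A minimal illustration of the underlying Unicode behaviour with the stdlib (a sketch; not tied to any filesystem):

```python
import unicodedata

nfc = 'caf\u00e9'  # 'café' with the composed character U+00E9
nfd = unicodedata.normalize('NFD', nfc)  # 'cafe' + combining acute,
                                         # roughly what HFS+ stores

# The two spellings render identically but compare unequal as str,
# which is why naive filename comparisons break on Mac OS X.
assert nfc != nfd
assert unicodedata.normalize('NFC', nfd) == nfc
```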
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: On 27Apr2009 18:15, Glenn Linderman wrote: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes. Why is it necessary that you are able to make this distinction? It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding. I would say this isn't so. It's important that programs know if they're dealing with strings-for-filenames, but not that they be able to figure that out "a priori" if handed a bare string (especially since they can't:-) So you agree they can't... that there are data puns. (OK, you may not have thought that through) I agree you can't examine a string and know if it came from the os.* munging or from someone else's munging. I totally disagree that this is a problem. There may be puns. So what? Use the right strings for the right purpose and all will be well. I think what is missing here, and missing from Martin's PEP, is some utility functions for the os.* namespace. PROPOSAL: add to the PEP the following functions: os.fsdecode(bytes) -> funny-encoded Unicode This is what os.listdir() does to produce the strings it hands out. os.fsencode(funny-string) -> bytes This is what open(filename,..) does to turn the filename into bytes for the POSIX open. os.pathencode(your-string) -> funny-encoded-Unicode This is what you must do to a de novo string to turn it into a string suitable for use by open. Importantly, for most strings not hand crafted to have weird sequences in them, it is a no-op. But it will recode your puns for survival.
and for me, I would like to see: os.setfilesystemencoding(coding) Currently os.getfilesystemencoding() returns you the encoding based on the current locale, and (I trust) the os.* stuff encodes on that basis. setfilesystemencoding() would override that, unless coding==None in which case it reverts to the former "use the user's current locale" behaviour. (We have locale "C" for what one might otherwise expect None to mean:-) The idea here is to let the program control the codec used for filenames for special purposes, without working indirectly through the locale. If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file. Hmm. I had thought that legitimate unicode strings already get transcoded to bytes via the mapping specified by sys.getfilesystemencoding() (the user's locale). That already happens I believe, and Martin's scheme doesn't change this. He's just funny-encoding non-decodable byte sequences, not the decoded stuff that surrounds them. So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns. See my proposal above. Does it address your concerns? A program still must know the provenance of the string, and _if_ you're working with non-decodable sequences in names then you should transmute them into the funny encoding using the os.pathencode() function described above. In this way the punning issue can be avoided. _Lacking_ such a function, your punning concern is valid. Seems like one would also desire os.pathdecode to do the reverse. And also versions that take or produce bytes from funny-encoded strings. Then, if programs were re-coded to perform these transformations on what you call de novo strings, then the scheme would work.
But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists. If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates? So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not. True. open() should always expect a funny-encoded name. So it is already the case that strings get decoded to bytes by cal
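Functions with exactly the names Cameron proposes, os.fsencode() and os.fsdecode(), were later added to the standard library (in Python 3.2). Their POSIX behaviour can be sketched with explicit codec calls (a simplified model: it hard-codes UTF-8, whereas the real functions consult sys.getfilesystemencoding()):

```python
FS_ENCODING = 'utf-8'  # the real os.fsdecode/os.fsencode use
                       # sys.getfilesystemencoding() instead

def fsdecode(raw: bytes) -> str:
    # What os.listdir() does to produce str names: undecodable
    # bytes become lone surrogates in U+DC80..U+DCFF.
    return raw.decode(FS_ENCODING, 'surrogateescape')

def fsencode(name: str) -> bytes:
    # What open() does to turn a str name back into bytes for the
    # POSIX call: the lone surrogates turn back into raw bytes.
    return name.encode(FS_ENCODING, 'surrogateescape')

# Round trip for a name that is not valid UTF-8:
raw = b'caf\xe9'  # 'café' in latin-1, invalid as UTF-8
assert fsdecode(raw) == 'caf\udce9'
assert fsencode(fsdecode(raw)) == raw
```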
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: > Hopefully it can be assumed that your locale encoding really is a > non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? > I'm a bit scared at the prospect that U+DCAF could turn into "/", that > just screams security vulnerability to me. So I'd like to propose that > only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be > encoded/decoded via the error handler. It would actually be U+DC2F that would turn into /. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. Regards, Martin
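The restriction James asks for is what was eventually shipped: the `surrogateescape` handler only maps bytes 0x80-0xFF to U+DC80-U+DCFF, so U+DC2F can never be smuggled back out as a '/' byte. A quick check of the behaviour on current Python 3:

```python
# Bytes >= 0x80 round-trip through the dedicated surrogate range...
assert b'\xaf'.decode('utf-8', 'surrogateescape') == '\udcaf'
assert '\udcaf'.encode('utf-8', 'surrogateescape') == b'\xaf'

# ...but U+DC2F (which would map to byte 0x2F, '/') is refused
# outright, closing the path-injection hole.
try:
    '\udc2f'.encode('utf-8', 'surrogateescape')
    escaped = True
except UnicodeEncodeError:
    escaped = False
print(escaped)  # False
```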
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/27/2009 8:39 PM, came the following characters from the keyboard of Martin v. Löwis: I'm not suggesting the PEP should solve the problem of mounting foreign file systems, although if it doesn't it should probably point that out. I'm just suggesting that if the people that write software to solve the problem of mounting foreign file systems have already solved the naming problem, then it might be a source of a good solution. On the other hand, it might be the source of a mediocre or bad solution. However, if those mounting systems have good solutions, it would be good to be compatible with them, rather than have yet another solution. It was in that sense, of thinking about possibly existing practice, and leveraging an existing solution, that caused me to bring up the topic. I think you make quite a lot of assumptions here. It would be better to research the state of the art first, and only then propose to follow it. I didn't propose to follow it. I only proposed an area that could be researched as a source of ideas and/or potential solutions. Apparently there wasn't, but there could have been someone listening that had the results of such research on the tip of their tongue, and might have piped up with the techniques used. I did, in fact, begin researching the topic after making the suggestion, and thus far haven't found any brilliant solutions from that arena. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote: > > Indeed, that was the missing piece. I'd forgotten about the > encodings > that use escape sequences, rather than UTF-8, and DBCS. I don't > think > those encodings are permitted by POSIX file systems, but I suppose > they > could sneak in via Environment variable values, and the like. This may already have been discussed, and if so I apologise for the noise. Does the PEP take into consideration the normalising behaviour of Mac OSX ? We've had some ongoing challenges related to this in bzr. -Rob
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis: It's a private use area. It will never carry an official character assignment. I know that U+F0000 - U+FFFFD is a private use area. I don't find a definition of U+F01xx to know what the notation means. Are you picking a particular character within the private use area, or a particular range, or what? It's a range. The lower-case 'x' denotes a variable half-byte, ranging from 0 to F. So this is the range U+F0100..U+F01FF, giving 256 code points. So you only need 128 code points, so there is something else unclear. (please understand that this is history now, since the PEP has stopped using PUA characters). Yes, but having found the latest PEP finally (at least I hope the one at python.org is the latest, it has quit using PUA anyway), I confirm it is history. But the same issue applies to the range of half-surrogates. No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general: py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence All bytes are below 128, yet it fails to decode. Indeed, that was the missing piece. I'd forgotten about the encodings that use escape sequences, rather than UTF-8, and DBCS. I don't think those encodings are permitted by POSIX file systems, but I suppose they could sneak in via Environment variable values, and the like. The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though.
The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names. -- Glenn
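Martin's point, that bytes below 128 do not guarantee successful decoding, still holds for escape-sequence encodings in Python 3 (a sketch; the byte string is his example rewritten in 3.x syntax):

```python
data = b"\x1b$B' \x1b(B"  # ESC $ B enters JIS X 0208 mode; the
                          # following pair 0x27 0x20 is not a
                          # valid two-byte code point.
assert all(byte < 128 for byte in data)  # every byte is ASCII-range

try:
    data.decode('iso-2022-jp')
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)  # False: ASCII-range bytes, yet not decodable
```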
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote: No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general: py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence All bytes are below 128, yet it fails to decode. Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly forbidden by POSIX, if I'm not mistaken...and I can't see how it would work, considering that it uses all the bytes from 0x20-0x7f, including 0x2f ("/"), to represent non-ascii characters. Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. James
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Tony Nelson writes: > At 16:09 + 04/27/2009, Antoine Pitrou wrote: > >Stephen J. Turnbull xemacs.org> writes: > >> > >> I hate to break it to you, but most stages of mail processing have > >> very little to do with SMTP. In particular, processing MIME > >> attachments often requires dealing with file names. > > > >AFAIK, the file name is only there as an indication for the user > >when he wants to save the file. If it's garbled a bit, no big > >deal. Nobody said we were at the stage of *saving* the file!
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Michael Foord writes: > The problem you don't address, which is still the reality for most > programmers (especially Mac OS X where filesystem encoding is UTF 8), is > that programmers *are* going to treat filenames as strings. > The proposed PEP allows that to work for them - whatever platform their > program runs on. Sure, for values of "work" == "No exception will be raised in my module, and some content will actually be returned." It doesn't say anything about what happens once those strings escape the immediate context. So it *encourages* those programmers to pass any problems downstream, but only after discarding the resources needed to deal with problems effectively. It's not that hard to overcome that problem, but it does require a slightly more complex API, and one that doesn't return a string but rather a stringlike object annotated with the information about how it was decoded. Conversion to a string *should* be trivial; I just think it should be invoked explicitly to make it clear where information is being discarded. Without an implicit conversion, the nature of the data (ie, context-dependent structure) is made explicit. There's a natural place to document the problem that context must be used to interpret the data accurately, and even add more robust processing (in a new PEP, of course!), etc. Then in the future this interface could be used as the basis of a more robust API. With good design (and luck) it might be subclassable or extensible to a path object API, for example. PEP 383 on the other hand is a dead end as it stands. AFAICS it gives the best possible treatment of conversion of OS data to plain string, but we've already got developers lining up to say "I can't use it". :-(
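Stephen's "stringlike object annotated with the information about how it was decoded" could be sketched roughly like this (a hypothetical design for illustration only, not an actual or proposed API; the class and method names are invented):

```python
class OSName(str):
    """A str subclass that remembers the bytes and encoding it came
    from, so conversion back to the OS form is explicit."""

    def __new__(cls, raw: bytes, encoding: str,
                errors: str = 'surrogateescape'):
        self = super().__new__(cls, raw.decode(encoding, errors))
        self.raw = raw            # the original OS bytes
        self.encoding = encoding  # how they were decoded
        return self

    def os_bytes(self) -> bytes:
        # Explicit, lossless conversion back to the original bytes.
        return self.raw

name = OSName(b'caf\xe9', 'utf-8')   # not valid UTF-8 input
assert name == 'caf\udce9'           # usable as a plain str
assert name.os_bytes() == b'caf\xe9' # provenance is preserved
```

The point of the design is that the annotated object behaves as a string everywhere, while the lossy step (treating it as plain text) and the lossless step (going back to bytes) are both visible in the code.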