Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Hrvoje Niksic Wed, 29 Apr 2009 01:30:03 -0700

Zooko O'Whielacronx wrote:

If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name.
Why do you say that?  It seems to work as I expected here:

 >>> '\xff'.decode('iso-8859-15')
u'\xff'
 >>> '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'

Here is what I mean by "switch to iso8859-15" only in the presence of undecodable UTF-8:


def file_name_to_unicode(fn, encoding):
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')

Now, assume a UTF-8 locale and try to use it on the provided example file names.


>>> file_name_to_unicode(b'\xff', 'utf-8')
'ÿ'
>>> file_name_to_unicode(b'\xc3\xbf', 'utf-8')
'ÿ'

That is the ambiguity I was referring to -- two different byte sequences result in the same unicode string.
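
To make the round-trip failure concrete, here is a minimal sketch, assuming Python 3 bytes/str semantics. The helper unicode_to_file_name is a hypothetical, naive inverse introduced only for illustration. It is not part of the PEP or of any proposal in this thread.

```python
def file_name_to_unicode(fn: bytes, encoding: str) -> str:
    # Same fallback strategy as above: try the locale encoding first,
    # then fall back to iso-8859-15 if the bytes do not decode.
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')


def unicode_to_file_name(name: str, encoding: str) -> bytes:
    # Hypothetical inverse (an assumption for illustration): it has no way
    # of knowing which branch of the decoder produced the string.
    return name.encode(encoding)


names = [b'\xff', b'\xc3\xbf']
decoded = [file_name_to_unicode(fn, 'utf-8') for fn in names]
round_tripped = [unicode_to_file_name(s, 'utf-8') for s in decoded]

for original, text, back in zip(names, decoded, round_tripped):
    print(original, '->', repr(text), '->', back, 'ok' if back == original else 'LOST')
```

Both inputs decode to the same string, so whichever encoding the inverse picks, it can reproduce at most one of the two original file names.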

