Zooko O'Whielacronx wrote:
Following-up to my own post to correct a major error:


On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx <zoo...@gmail.com> wrote:
Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
binary names from the filesystem and store them so that I can regenerate
the same byte string later, but it also requires that I *know* whether
what I got was a valid string in the expected encoding (which might be
utf-8) or whether it was not and I need to fall back to storing the
bytes.

Okay, I am wrong about this.  Having a flag to remember whether I had to
fall back to the utf-8b trick is one method to implement my requirement,
but my actual requirement is this:
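
A minimal sketch of the flag method described above, assuming Python 3.1 or later, where the PEP 383 handler ships under the name 'surrogateescape' (the thread still calls it 'python-escape'). The function name decode_name is hypothetical, not part of Tahoe or of the PEP:

```python
def decode_name(raw, encoding="utf-8"):
    """Return (text, escaped); escaped is True when raw was not valid in encoding.

    decode_name is a hypothetical helper used only for illustration.
    """
    try:
        # Strict decoding succeeds only for byte strings valid in the encoding.
        return raw.decode(encoding), False
    except UnicodeDecodeError:
        # Fall back to the escaping handler so the original bytes stay recoverable.
        return raw.decode(encoding, "surrogateescape"), True


valid_text, valid_flag = decode_name("caf\u00e9".encode("utf-8"))
odd_text, odd_flag = decode_name(b"caf\xe9")  # a lone 0xE9 is not valid utf-8
```

The flag records which case occurred, but the escaped string already carries that information on its own, which is the point made in the rest of this message.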

Requirement: either the unicode string or the bytes are faithfully
transmitted from one system to another.

That is: if you read a filename from the filesystem, and transmit that
filename to another system and use it, then there are two cases:

Requirement 1: the byte string was valid in the encoding of the source
system, in which case the unicode name is faithfully transmitted
(i.e. the bytes that finally land on the target system are the result of
sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding)).

Requirement 2: the byte string was not valid in the encoding of the source
system, in which case the bytes are faithfully transmitted (i.e. the
bytes that finally land on the target system are the same as the bytes
that originated in the source system).

Now I finally understand how fiendishly clever MvL's PEP 383
generalization of Markus Kuhn's utf-8b trick is!  The only thing
necessary to achieve both of those requirements above is that the
'python-escape' error handler is used on the target system .encode() as
well as on the source system .decode()!
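
As a rough illustration of using the handler on both ends, here is a sketch assuming Python 3.1 or later, where the handler is named 'surrogateescape' rather than 'python-escape'. The function name transfer and the sample byte strings are made up for this example:

```python
HANDLER = "surrogateescape"  # Python 3 name of the PEP 383 error handler


def transfer(srcbytes, src_encoding, dst_encoding):
    """Hypothetical helper: decode on the source system, encode on the target."""
    text = srcbytes.decode(src_encoding, HANDLER)
    return text.encode(dst_encoding, HANDLER)


# Requirement 1: valid source bytes are transcoded like a strict decode/encode.
valid = "caf\u00e9".encode("utf-8")
landed_valid = transfer(valid, "utf-8", "latin-1")
strict_result = valid.decode("utf-8").encode("latin-1")

# Requirement 2: bytes that are invalid in the source encoding arrive unchanged.
invalid = b"caf\xe9"
landed_invalid = transfer(invalid, "utf-8", "utf-8")

# Without the handler on the target side, the escaped string cannot be encoded.
try:
    invalid.decode("utf-8", HANDLER).encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```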

Well, I'm going to have to let this sink in and maybe write some code to
see if I really understand it.

But if this is right, then I can do away with some of the mechanism that
I've built up, and instead:

Backport PEP 383 to Python 2.

And, document the PEP 383 trick in some generic, widely respected format
such as an Internet Draft so that I can explain to other users of the
Tahoe data (many of whom use other languages than Python) what they have
to do if they find invalid utf-8 in the data.  Oh good, I just realized
that Tahoe emits only utf-8, so all I have to do is point them to the
utf-8b documents (such as they are) and explain that to read filenames
produced by Tahoe they have to implement utf-8b.  That's really good
that they don't have to implement MvL's generalization of that trick to
other encodings, since utf-8b is already understood by some folks.
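
For readers who have to implement utf-8b themselves, the following is only a sketch of the idea, written as a custom error handler for Python 3 and registered under the made-up name 'utf8b_sketch'. It is not the PEP's implementation and not a Python 2 backport; it assumes each undecodable byte 0x80..0xFF maps to the lone surrogate U+DC80..U+DCFF and back:

```python
import codecs


def utf8b_sketch(exc):
    """Hypothetical error handler mimicking the utf-8b escaping rules."""
    if isinstance(exc, UnicodeDecodeError):
        escaped = []
        for byte in exc.object[exc.start:exc.end]:
            if byte < 0x80:
                break  # ASCII bytes are never escaped
            escaped.append(chr(0xDC00 + byte))
        if not escaped:
            raise exc
        # Resume decoding right after the bytes that were escaped.
        return "".join(escaped), exc.start + len(escaped)
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # not an escaped byte, so a genuine encoding error
            out.append(code - 0xDC00)
        # Python 3 lets an encode error handler return bytes directly.
        return bytes(out), exc.end
    raise exc


codecs.register_error("utf8b_sketch", utf8b_sketch)

samples = [b"\x80", b"caf\xe9", b"\xe2\x82", b"\xff\xfeabc", b"\xe2\x82\xac\xe2A"]
decoded = [s.decode("utf-8", "utf8b_sketch") for s in samples]
restored = [d.encode("utf-8", "utf8b_sketch") for d in decoded]
```

On these samples the sketch is meant to agree with the built-in 'surrogateescape' handler, so that handler can serve as a reference when porting the logic to another language.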


Okay, I find it surprisingly easy to make subtle errors in this encoding
stuff, so please let me know if you spot one.  Is it true that
srcbytes.decode(srcencoding, 'python-escape').encode('utf-8',
'python-escape') will always produce srcbytes ?  That is my Requirement
2.

No, but srcbytes.decode('utf-8', 'python-escape').encode('utf-8',
'python-escape') == srcbytes. The encodings on both ends need to be the
same.

For example:

>>> b'\x80'.decode('windows-1252')
u'\u20ac'
>>> u'\u20ac'.encode('utf-8')
'\xe2\x82\xac'

Currently:

>>> b'\x80'.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    b'\x80'.decode('utf-8')
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

But under this PEP:

>>> b'\x80'.decode('utf-8', 'python-escape')
u'\udc80'
>>> u'\udc80'.encode('utf-8', 'python-escape')
'\x80'
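
The same comparison can be tried on Python 3.1 or later, where the handler is available as 'surrogateescape'. This is a sketch of the two cases above; the variable names are chosen only for illustration:

```python
src = b"\x80"

# 0x80 is valid in windows-1252 (the euro sign), so it is transcoded to utf-8
# and the bytes that come out differ from the bytes that went in.
euro = src.decode("windows-1252", "surrogateescape")
as_utf8 = euro.encode("utf-8", "surrogateescape")

# 0x80 alone is not valid utf-8, so it is escaped to a lone surrogate and
# restored unchanged when the same encoding is used on the way back.
escaped = src.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")
```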
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev