Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

Glenn Linderman Wed, 06 May 2009 13:17:43 -0700

On approximately 5/6/2009 12:18 PM, came the following characters from the keyboard of Zooko Wilcox-O'Hearn:

On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
Zooko Wilcox-O'Hearn <zooko <at> zooko.com> writes:
I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames...
Well, if the filenames are generated by Python (as opposed to read from an existing directory on disk), they should be regular unicode objects without any lone surrogates, so I don't see the compatibility problem.
I meant that the application reads filenames from an existing directory on disk, saves those filenames, and then later, using a future version of Python, wants to read them and use them.

Regarding future versions of Python: in the worst case, even if Python's default behavior changes, the transcoding done by PEP 383 can be done in other software too. It is a straightforward, fully specified, 1-to-1, reversible transcoding process, affecting and generating only invalid byte encodings on one side, and invalid Unicode sequences on the other.

So if Python's default behavior should change, the transcoding implemented by PEP 383 could easily be reimplemented to enable a future version of a Python application to manipulate the transcoded, saved filenames.


By easily, I mean that I could code it in a couple hours, max.
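
To give a feel for the size of that job, here is a minimal sketch of such a reimplementation. It assumes the saved names were decoded from UTF-8 and that each undecodable byte 0x80..0xFF was mapped to the lone surrogate U+DC80..U+DCFF, as PEP 383 specifies. The function names decode_escaped and encode_escaped are made up for illustration; they are not part of Python or of the PEP.

```python
def decode_escaped(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, mapping each undecodable byte 0xNN to U+DCNN."""
    pieces = []
    pos = 0
    while pos < len(raw):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(raw[pos:].decode(encoding, "strict"))
            break
        except UnicodeDecodeError as exc:
            # exc.start is relative to the slice that was being decoded.
            bad_at = pos + exc.start
            pieces.append(raw[pos:bad_at].decode(encoding, "strict"))
            # Escape exactly one offending byte, then retry after it.
            pieces.append(chr(0xDC00 + raw[bad_at]))
            pos = bad_at + 1
    return "".join(pieces)


def encode_escaped(name: str, encoding: str = "utf-8") -> bytes:
    """Inverse of decode_escaped: turn U+DC80..U+DCFF back into raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        else:
            # Any other lone surrogate is an error, so encode strictly.
            out += ch.encode(encoding, "strict")
    return bytes(out)


raw = b"caf\xe9-\xff.txt"          # not valid UTF-8
name = decode_escaped(raw)
print(ascii(name))
print(encode_escaped(name) == raw)
```

On a Python that ships the PEP 383 handler, the result can be cross-checked against raw.decode("utf-8", "surrogateescape").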

I'm not saying that I know this would be a problem. I'm saying that I personally can't tell whether it would be a problem or not, and the extensive discussions so far have not convinced me that there is anyone who both understands PEP 383 and considers this use case.



Does the above help?

Many people who apparently understand encoding issues well have said something to the effect that there is no problem, but those people haven't yet managed to get through my thick skull how I would use PEP 383 safely for this sort of use case -- the one where data generated by os.listdir() travels forward in time or the one where that data travels sideways to other systems, including Windows or other systems that validate incoming unicode.



Regarding data traveling sideways, some comments:

1) PEP 383's effect could be recoded in other languages as easily as it is in Python (or the C in which Python is implemented). So that could be a solution.

2) You mention "Windows" and "other systems that validate incoming unicode" in the same phrase, as if you think that "Windows" qualifies as one of the "other systems that validate incoming unicode", but it does not (at least not universally).

That's why I am a bit uncomfortable about PEP 383 being quickly implemented and deployed in Python 3.1.



Does the above help?

By the way, much of the detailed discussion about what Tahoe requires and how that may or may not benefit from PEP 383 has now moved to the tahoe-dev mailing list: http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .

I have no background with Tahoe, nor particular interest, although it sounds like a useful project... so I won't be joining that list. I have no idea if there is an installed base of existing Tahoe file systems; my suggestions below assume that there is not, and that you are presently inventing them. Therefore, I provide no migration path, although I could invent one, but it would take longer to describe.

However, since I'm responding here, and have read what you have posted here, it seems like the following could be true.


Assumptions from your emails:

A) Tahoe wants to provide a UTF-8 file name system

B) Tahoe wants to interface to POSIX systems that use (and do not validate) byte interfaces.

C) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with no validation.

D) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with validation.

Uncertainties: I'm not clear on what your goals are for Tahoe filenames. There seem to be 2 possibilities:

1) you want to reject attempts to use non-validating Unicode, be it from a 16-bit interface, or a bytes interface.

2) you don't want to reject non-validating Unicode, but you want to convert it to valid Unicode for (D) systems.

3) Orthogonally, you might want to store only Valid Unicode in the names, or you might not care, if you can meet the other goals.


Truisms:

If you want to support (D), and (2), then you must transform names at some point, using some scheme, because not all names supplied by (B) systems will be acceptable to (D) systems. You can choose to do this transformation when a (B) system provides an invalid (per Unicode) name, or you can choose to do the transformation when a (D) system accesses a file with an invalid (per Unicode) name.

If the (B) and (D) systems talk to each other outside of Tahoe, they will have to do similar transformations, or, if they both access the same Tahoe system, they will have to do the identical transformation, to be sure that they can access the same file.

All transcoding schemes have the possibility of data puns between non-transcoded names and transcoded names. In order to successfully and properly manipulate a name, you must know whether or not it has been transcoded, and how.
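
As an illustration of such a pun, consider a purely hypothetical scheme (my invention for this example, not something PEP 383 or Tahoe specifies) that makes names acceptable to (D) systems by rewriting each escaped byte as "%XX". A name that arrived with an undecodable byte and a perfectly valid name that happens to contain the same characters then collide:

```python
def percent_escape(name: str) -> str:
    """Hypothetical scheme: rewrite each escaped byte U+DCxx as '%XX'."""
    return "".join(
        "%{:02X}".format(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in name
    )


# A name that came from undecodable bytes on a (B) system ...
from_bad_bytes = b"caf\xe9".decode("utf-8", "surrogateescape")
# ... and an ordinary, valid name that merely looks like an escaped one.
literal = "caf%E9"

# Both end up as the same string, so the output alone cannot tell you
# whether a given name was transcoded.
print(percent_escape(from_bad_bytes) == percent_escape(literal))
```

Avoiding the collision would take an extra rule (for instance, also escaping the literal "%"), which is exactly the "whether or not it has been transcoded, and how" knowledge mentioned above.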

PEP 383 limits its transcoding to names that are invalid (per Unicode). Names that cannot be properly decoded to Unicode are decoded to invalid Unicode. Names that are invalid Unicode are encoded to invalid byte sequences (per the encoding scheme specified).

For PEP 383 and Python, transcoded names can be distinguished by checking for the existence of lone surrogates in the str form of the filename, or by attempting to do a strict decoding of the bytes form of the filename, depending on what you have (generally, the former).

For PEP 383 and Python, the names will round trip from the POSIX bytes interfaces to the program, and back to POSIX bytes interfaces, as long as only Python wrappers of system functions are used, and the filesystem encoding is not changed between calls (or is restored). Passing them to 3rd party libraries or other systems requires extra work, if there is a desire to manipulate files with names that are not decodable to Unicode by the standard decoding algorithm for that encoding.
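
Both checks are short. The sketch below assumes UTF-8 as the filesystem encoding and uses helper names of my own choosing (has_escaped_bytes, is_strictly_decodable); neither is an existing library function.

```python
def has_escaped_bytes(name: str) -> bool:
    """True if the str form contains a lone surrogate in U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def is_strictly_decodable(raw: bytes, encoding: str = "utf-8") -> bool:
    """True if the bytes form decodes without needing any escaping."""
    try:
        raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return False
    return True


good = "caf\u00e9.txt".encode("utf-8")   # valid UTF-8
bad = b"caf\xe9.txt"                     # Latin-1 bytes, invalid as UTF-8

for raw in (good, bad):
    name = raw.decode("utf-8", "surrogateescape")
    print(ascii(name), has_escaped_bytes(name), is_strictly_decodable(raw))
```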
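
A small sketch of those conditions, using explicit codecs instead of the real filesystem encoding so the behavior does not depend on the machine it runs on. The helper strict_utf8_accepts is a stand-in I made up for "a system that validates incoming unicode".

```python
# Valid UTF-8 for U+00E9 followed by a stray 0xFF byte.
raw = b"\xc3\xa9\xff"

# Decode the way PEP 383 does for a UTF-8 filesystem encoding.
name = raw.decode("utf-8", "surrogateescape")

# Same encoding on the way back: the original bytes are recovered.
same_encoding = name.encode("utf-8", "surrogateescape")

# Encoding changed between the calls: the escaped byte survives, but the
# properly decoded character is re-encoded differently.
changed_encoding = name.encode("latin-1", "surrogateescape")


def strict_utf8_accepts(text: str) -> bool:
    """Stand-in for a validating consumer: strict UTF-8 encoding."""
    try:
        text.encode("utf-8", "strict")
    except UnicodeEncodeError:
        return False
    return True


print(same_encoding == raw)
print(changed_encoding == raw)
print(strict_utf8_accepts(name))
```

The last check is the "extra work" case: a consumer that insists on valid Unicode will refuse the name as-is, so it has to be transformed or passed along in its bytes form.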



--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
