On approximately 5/6/2009 12:18 PM, came the following characters from the keyboard of Zooko Wilcox-O'Hearn:
On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:

Zooko Wilcox-O'Hearn <zooko <at> zooko.com> writes:

I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames...

Well, if the filenames are generated by Python (as opposed to read from an existing directory on disk), they should be regular unicode objects without any lone surrogates, so I don't see the compatibility problem.

I meant that the application reads filenames from an existing directory on disk, saves those filenames, and then later, using a future version of Python, wants to read them and use them.


Regarding future versions of Python: in the worst case, even if Python's default behavior changes, the transcoding specified by PEP 383 can be done in other software too. It is a straightforward, fully specified, 1-to-1, reversible transcoding process, affecting and generating only invalid byte encodings on one side, and invalid Unicode sequences on the other.

So if Python's default behavior should change, the transcoding implemented by PEP 383 could be easily reimplemented to enable a future version of a Python application to manipulate the transcoded, saved, filenames.

By easily, I mean that I could code it in a couple hours, max.
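
To make "straightforward" concrete, here is a minimal sketch of such a reimplementation, written in Python only for convenience. It assumes the mapping as PEP 383 specifies it (each undecodable byte 0x80..0xFF becomes the lone surrogate U+DC80..U+DCFF, and back) and it leans on a strict decoder to locate the invalid bytes. The function names are my own invention, not part of any library.

```python
def decode_escaped(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, mapping each undecodable byte to U+DC00 + byte."""
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode(encoding, "strict"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice raw[pos:].
            out.append(raw[pos:pos + err.start].decode(encoding, "strict"))
            bad = raw[pos + err.start:pos + err.end]
            out.extend(chr(0xDC00 + byte) for byte in bad)
            pos += err.end
    return "".join(out)


def encode_escaped(name: str, encoding: str = "utf-8") -> bytes:
    """Inverse mapping: U+DC80..U+DCFF become the single bytes 0x80..0xFF."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        else:
            # Any other lone surrogate raises UnicodeEncodeError here.
            out.extend(ch.encode(encoding, "strict"))
    return bytes(out)


# Hypothetical filename containing bytes that are not valid UTF-8.
saved = decode_escaped(b"caf\xe9-\xff.txt")
restored = encode_escaped(saved)
```

The same two loops could be written in any language that has a strict decoder for the encoding in question.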


I'm not saying that I know this would be a problem. I'm saying that I personally can't tell whether it would be a problem or not, and the extensive discussions so far have not convinced me that there is anyone who both understands PEP 383 and considers this use case.


Does the above help?


Many people who apparently understand encoding issues well have said something to the effect that there is no problem, but those people haven't yet managed to get through my thick skull how I would use PEP 383 safely for this sort of use case -- the one where data generated by os.listdir() travels forward in time or the one where that data travels sideways to other systems, including Windows or other systems that validate incoming unicode.


Regarding data traveling sideways, some comments:

1) PEP 383's effect could be recoded in other languages as easily as it is in Python (or the C in which Python is implemented). So that could be a solution.

2) You mention "Windows" and "other systems that validate incoming unicode" in the same phrase, as if you think that "Windows" qualifies as one of the "other systems that validate incoming unicode", but it does not (at least not universally).


That's why I am a bit uncomfortable about PEP 383 being quickly implemented and deployed in Python 3.1.


Does the above help?


By the way, much of the detailed discussion about what Tahoe requires and how that may or may not benefit from PEP 383 has now moved to the tahoe-dev mailing list: http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .


I have no background with Tahoe, nor particular interest, although it sounds like a useful project... so I won't be joining that list. I have no idea if there is an installed base of existing Tahoe file systems; my suggestions below assume that there is not, and that you are presently inventing them. Therefore, I provide no migration path. I could invent one, but it would take longer to describe.

However, since I'm responding here, and have read what you have posted here, it seems like the following could be true.

Assumptions from your emails:

A) Tahoe wants to provide a UTF-8 file name system
B) Tahoe wants to interface to POSIX systems that use (and do not validate) byte interfaces.
C) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with no validation.
D) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with validation.

Uncertainties: I'm not clear on what your goals are for Tahoe filenames. There seem to be 2 possibilities:

1) you want to reject attempts to use non-validating Unicode, be it from a 16-bit interface, or a bytes interface.
2) you don't want to reject non-validating Unicode, but you want to convert it to valid Unicode for (D) systems.

3) Orthogonally, you might want to store only Valid Unicode in the names, or you might not care, if you can meet the other goals.

Truisms:

If you want to support (D), and (2), then you must transform names at some point, using some scheme, because not all names supplied by (B) systems will be acceptable to (D) systems. You can choose to do this transformation when a (B) system provides an invalid (per Unicode) name, or you can choose to do the transformation when a (D) system accesses a file with an invalid (per Unicode) name.

If the (B) and (D) systems talk to each other outside of Tahoe, they will have to do similar transformations, or, if they both access the same Tahoe system, they will have to do the identical transformation, to be sure that they can access the same file.

All transcoding schemes have the possibility of data puns between non-transcoded names and transcoded names. In order to successfully and properly manipulate a name, you must know whether or not it has been transcoded, and how.
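
As an illustration of such a pun, consider a made-up scheme (not PEP 383, and not anything Tahoe does as far as I know) that writes each undecodable byte as %XX. The function name and the sample filenames are hypothetical.

```python
def percent_transcode(raw: bytes) -> str:
    """Hypothetical scheme: show each undecodable byte as '%XX'."""
    text = raw.decode("utf-8", "surrogateescape")
    return "".join(
        "%{:02X}".format(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


literal = b"a%FF"   # a valid name that really contains '%', 'F', 'F'
escaped = b"a\xff"  # an invalid name containing the byte 0xFF

# Two different files on disk end up with the same transcoded name.
pun = percent_transcode(literal) == percent_transcode(escaped)
```

Unless you separately record which names were transcoded, there is no way to tell from the result alone which of the two files is meant.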

PEP 383 limits its transcoding to names that are invalid (per Unicode). Names that cannot be properly decoded to Unicode are decoded to invalid Unicode. Names that are invalid Unicode are encoded to invalid byte sequences (per the encoding scheme specified).

For PEP 383 and Python, transcoded names can be distinguished by checking for the existence of lone surrogates in the str form of the filename, or by attempting to do a strict decoding of the bytes form of the filename, depending on what you have (generally, the former).
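
A sketch of both checks, assuming Python 3.3 or later (where a str is a sequence of code points, so any surrogate found in it is a lone one) and UTF-8 as the encoding. The helper names are hypothetical.

```python
def has_escaped_bytes(name: str) -> bool:
    """True if the str form carries PEP 383 escapes (U+DC80..U+DCFF)."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def is_strictly_decodable(raw: bytes, encoding: str = "utf-8") -> bool:
    """True if the bytes form decodes without needing any escapes."""
    try:
        raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical name as it would come back from a bytes interface.
raw_name = b"caf\xe9.txt"
str_name = raw_name.decode("utf-8", "surrogateescape")
```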

For PEP 383 and Python, the names will round trip from the POSIX bytes interfaces to the program, and back to POSIX bytes interfaces, as long as only Python wrappers of system functions are used, and the filesystem encoding is not changed between calls (or is restored). Passing them to third-party libraries or other systems requires extra work, if there is a desire to manipulate files with names that are not decodable to Unicode by the standard decoding algorithm for that encoding.
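
The round trip, and the two ways it can go wrong, can be sketched with the "surrogateescape" error handler that PEP 383 defines. The wrapper names and sample filenames here are hypothetical, and the encoding is passed in explicitly rather than taken from the system so that the example does not depend on the local configuration.

```python
def from_posix(raw: bytes, fs_encoding: str) -> str:
    """What a Python wrapper does with a name received from the system."""
    return raw.decode(fs_encoding, "surrogateescape")


def to_posix(name: str, fs_encoding: str) -> bytes:
    """What a Python wrapper does with a name handed back to the system."""
    return name.encode(fs_encoding, "surrogateescape")


# 1. Same encoding both ways: an undecodable name survives unchanged.
undecodable = b"report-\xff.txt"
round_trip_ok = to_posix(from_posix(undecodable, "utf-8"), "utf-8") == undecodable

# 2. Encoding changed between calls: even a valid name comes back different.
valid = "caf\u00e9.txt".encode("utf-8")
changed = to_posix(from_posix(valid, "utf-8"), "latin-1") != valid

# 3. A library that encodes strictly cannot accept the escaped name at all.
try:
    from_posix(undecodable, "utf-8").encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
```

For case 3, the extra work is typically to hand the library the original bytes (via to_posix) rather than the str, assuming the library has a bytes interface.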


--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
