On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson:
On 27Apr2009 18:15, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.
Why is it necessary that you are able to make this distinction?
It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding.
I would say this isn't so. It's important that programs know if they're
dealing with strings-for-filenames, but not that they be able to figure
that out "a priori" if handed a bare string (especially since they
can't:-)
So you agree they can't... that there are data puns. (OK, you may not have thought that through)

I agree you can't examine a string and know if it came from the os.* munging
or from someone else's munging.

I totally disagree that this is a problem.

There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) -> funny-encoded Unicode
    This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) -> bytes
    This is what open(filename,..) does to turn the filename into bytes
    for the POSIX open.
  os.pathencode(your-string) -> funny-encoded-Unicode
    This is what you must do to a de novo string to turn it into a
    string suitable for use by open.
    Importantly, for most strings not hand crafted to have weird
    sequences in them, it is a no-op. But it will recode your puns
    for survival.
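
A sketch of how these might look, assuming the PEP's escape of
undecodable bytes onto the low surrogates U+DC80..U+DCFF is exposed as
a codec error handler named 'surrogateescape'; the three names above
are this proposal's, not anything that exists today:

  import sys

  def fsdecode(b):
      # bytes -> funny-encoded str, as os.listdir() would hand out.
      # Undecodable bytes become lone surrogates U+DC80..U+DCFF.
      return b.decode(sys.getfilesystemencoding(), 'surrogateescape')

  def fsencode(s):
      # funny-encoded str -> bytes, as open() would do for POSIX.
      # Lone surrogates U+DC80..U+DCFF turn back into the original bytes.
      return s.encode(sys.getfilesystemencoding(), 'surrogateescape')

  def pathencode(s):
      # de novo str -> funny-encoded str.  For ordinary strings this is
      # the identity; how pun-prone lone surrogates should be recoded is
      # exactly what is under discussion, so this placeholder refuses to
      # guess rather than pick a convention.
      if any(0xDC80 <= ord(c) <= 0xDCFF for c in s):
          raise ValueError("de novo string contains escape surrogates")
      return s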

and for me, I would like to see:

  os.setfilesystemencoding(coding)

Currently sys.getfilesystemencoding() returns you the encoding based on
the current locale, and (I trust) the os.* stuff encodes on that basis.
setfilesystemencoding() would override that, unless coding==None, in which
case it reverts to the former "use the user's current locale" behaviour.
(We have locale "C" for what one might otherwise expect None to mean:-)

The idea here is to let the program control the codec used for filenames
for special purposes, without working indirectly through the locale.
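
No such setter exists today, so this is only a sketch of the intended
behaviour: a module-level override that falls back to the locale-derived
default when coding is None.

  import sys

  _override = None   # None means "use the user's current locale"

  def setfilesystemencoding(coding):
      # Override the codec used for filename bytes<->str translation,
      # or revert to the locale default when coding is None.
      global _override
      _override = coding

  def getfilesystemencoding():
      return _override if _override is not None \
             else sys.getfilesystemencoding()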

If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file.
Hmm. I had thought that legitimate unicode strings already get transcoded
to bytes via the mapping specified by sys.getfilesystemencoding()
(the user's locale). That already happens I believe, and Martin's
scheme doesn't change this. He's just funny-encoding non-decodable byte
sequences, not the decoded stuff that surrounds them.
So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns.

See my proposal above. Does it address your concerns? A program still
must know the provenance of the string, and _if_ you're working with
non-decodable sequences in a name then you should transmute them into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.

_Lacking_ such a function, your punning concern is valid.
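
To make the pun concrete (again assuming the PEP's handler lands under
the name 'surrogateescape'): the two sources below yield equal strings,
and nothing about the strings themselves says which is which.

  # Source 1: a funny-decoded name from a bytes API.
  # 0xE9 is not valid UTF-8 on its own, so it escapes to U+DCE9.
  from_listdir = b'caf\xe9'.decode('utf-8', 'surrogateescape')

  # Source 2: a de novo string hand-crafted with the same surrogate.
  de_novo = 'caf\udce9'

  assert from_listdir == de_novo   # the pun: indistinguishable

A program that funny-encodes both gets the same bytes, b'caf\xe9',
whatever the author of the de novo string may have meant.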

Seems like one would also desire os.pathdecode to do the reverse. And also versions that take or produce bytes from funny-encoded strings.

Then, if programs were recoded to perform these transformations on what you call de novo strings, the scheme would work.

But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists.

If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates?
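
For instance, something in the spirit of percent-escaping, purely
illustrative: undecodable bytes and the escape character itself become
visible hex escapes, so results are displayable, comparable, and still
reversible.

  def visible_decode(b, coding='utf-8'):
      # bytes -> displayable str: '%' escapes itself as '%25',
      # undecodable bytes become '%XX'.
      out = []
      for ch in b.decode(coding, 'surrogateescape'):
          if ch == '%':
              out.append('%25')
          elif 0xDC80 <= ord(ch) <= 0xDCFF:
              out.append('%%%02X' % (ord(ch) - 0xDC00))
          else:
              out.append(ch)
      return ''.join(out)

  # visible_decode(b'caf\xe9') -> 'caf%E9'
  # visible_decode(b'50%')     -> '50%25'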


So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.

True. open() should always expect a funny-encoded name.

So it is already the case that strings get decoded to bytes by
calls like open(). Martin isn't changing that.
I thought the process of converting strings to bytes is called encoding. You seem to be calling it decoding?

My head must be standing in the wrong place. Yes, I probably mean
encoding here. I'm trying to accompany these terms with little pictures
like "string->bytes" to avoid confusion.

I suppose if your program carefully constructs a unicode string riddled
with half-surrogates etc and imagines something specific should happen
to them on the way to being POSIX bytes then you might have a problem...
Right. Or someone else's program does that. I only want to use Unicode file names. But if those other file names exist, I want to be able to access them, and not accidentally get a different file.

Point taken. And I think addressed by the utility function proposed
above.

[...snip normal versus odd chars for the funny-encoding ...]
Also, by avoiding reuse of legitimate characters in the encoding we can
avoid your issue with losing track of where a string came from;
legitimate characters are currently untouched by Martin's scheme, except
for the normal "bytes<->string via the user's locale" translation that
must already happen, and there you're aided by byets and strings being
different types.
There are abnormal characters, but there are no illegal characters.

I thought half-surrogates were illegal in well-formed Unicode. I confess
to being weak in this area. By "legitimate" above I meant to exclude things
like half-surrogates which, like quarks, should not occur alone?

"Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. Hence, while all of the systems under discussion can handle all Unicode characters in one way or another, none of them require that all Unicode rules are followed. Yes, you are correct that lone surrogates are illegal in Unicode. No, none of the accepted rules for these systems require Unicode.


NTFS permits any 16-bit "character" code, including abnormal ones, including half-surrogates, and including full surrogate sequences that decode to PUA characters. POSIX permits all byte sequences, including things that look like UTF-8, things that don't look like UTF-8, things that look like half-surrogates, and things that look like full surrogate sequences that decode to PUA characters.

Sure. I'm not really talking about what filesystem will accept at
the native layer, I was talking in the python funny-encoded space.

[..."escaping is necessary"... I agree...]
I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value.

Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP.
Please elucidate the "second source" of strings. I'm presuming you mean
strings generated from scratch rather than obtained by something like
listdir().
POSIX has byte APIs for strings, that's one source, that is most under discussion. Windows has both bytes and 16-bit APIs for strings... the 16-bit APIs are generally mapped directly to UTF-16, but are not checked for UTF-16 validity, so all of Martin's funny-decoded files could be used for Windows file names on the 16-bit APIs.

These are names of existing files; I'll take them as source 1. They get
funny-decoded for release by os.listdir() et al.

And yes, strings can be generated from scratch.

I take this to be source 2.

One variation of source 2 is reading output from other programs, such as ls (POSIX) or dir (Windows).

I think I agree with all the discussion that followed, and think the
real problem is the lack of utility functions to funny-encode source 2
strings for use. Hence the proposal above.

I think we understand each other now. I think your proposal could work, Cameron, although when recoding applications to use it, I'd find it easier to use the "file name object" that others have proposed. I think that because both your proposal and the object proposals require recoding the application, they will not be accepted. And I think that because PEP 383 allows data puns, it should not be accepted in its present form.

I think that if your proposal is accepted, it then becomes possible to use an encoding that uses visible characters, which makes it easier for people to understand and verify: an encoding such as the one I suggested, perhaps using a more obscure character, if there is one that doesn't violate true Unicode. It should transform all data, from str and bytes interfaces, and produce only str values containing conforming Unicode, escaping all the non-conforming sequences in some manner. This would make the strings truly readable, as long as fonts for all the characters are available.

I had already suggested the utility functions you are suggesting, actually, in my first tirade against PEP 383 (search for "The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme"). I really don't care who gets the credit for the idea (others may have suggested it before me), but I do care that the solution should provide functionality that works without ambiguity/data puns.

The solution that was proposed in the lead up to releasing Python 3.0 was to offer both bytes and str interfaces (so we have those), and then for those that want to have a single portable implementation that can access all data, an object that encapsulates the differences, and the variant system APIs. (File system is one, command line is another, environment is another; I'm not sure if there are more.) I haven't heard whether any progress on such an encapsulating object has been made; the people who proposed it have been rather quiet about this PEP. I would expect that an object implementation would provide display strings, and APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them.
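
One hypothetical shape for such an encapsulating object (all names
invented here): it keeps the native form, offers a display string that
is never fed back to the filesystem, and takes de novo str or bytes
only through explicit constructors.

  class Filename:
      # Wraps the native form of a name (bytes on POSIX) so callers
      # can never confuse a display string with a real name.
      def __init__(self, native_bytes):
          self._native = native_bytes

      @classmethod
      def from_str(cls, s, coding='utf-8'):
          # de novo str, encoded explicitly.
          return cls(s.encode(coding))

      @classmethod
      def from_bytes(cls, b):
          # de novo bytes, used as-is.
          return cls(b)

      def display(self, coding='utf-8'):
          # For humans only; lossy on undecodable bytes.
          return self._native.decode(coding, 'replace')

      def native(self):
          # The exact bytes to hand to the OS.
          return self._native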

Programs that want to use str interfaces on POSIX will see a subset of the files on systems that contain files whose byte filenames are not decodable. If a sysadmin wants to standardize on UTF-8 names universally, they can use something like convmv to clean up existing file names that don't conform. Programs that use str interfaces on POSIX systems will work fine, but with a subset of the files. When that is unacceptable, they can either be recoded to use the bytes interfaces, or the hopefully forthcoming object encapsulation. The issue then will be what technique is used to transform bytes into display names; but since the display names would never be fed back to the objects directly (the object would have an interface to accept de novo str and de novo bytes), it is just a display issue, and one that uses visible characters seems more useful to my mind than one that uses half-surrogates or PUAs.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
