On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson:
On 27Apr2009 18:15, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.
Why is it necessary that you are able to make this distinction?
It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding.
I would say this isn't so. It's important that programs know if they're
dealing with strings-for-filenames, but not that they be able to figure
that out "a priori" if handed a bare string (especially since they
can't:-)
So you agree they can't... that there are data puns. (OK, you may not have thought that through)

I agree you can't examine a string and know if it came from the os.* munging
or from someone else's munging.

I totally disagree that this is a problem.

There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) -> funny-encoded Unicode
    This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) -> bytes
    This is what open(filename,..) does to turn the filename into bytes
    for the POSIX open.
  os.pathencode(your-string) -> funny-encoded-Unicode
    This is what you must do to a de novo string to turn it into a
    string suitable for use by open.
    Importantly, for most strings not hand crafted to have weird
    sequences in them, it is a no-op. But it will recode your puns
    for survival.
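
A sketch of how these might look, assuming the PEP's escape of
undecodable bytes onto the low surrogates U+DC80..U+DCFF is exposed as
a codec error handler named 'surrogateescape'; the three names above
are this proposal's, not anything that exists today:

  import sys

  def fsdecode(b):
      # bytes -> funny-encoded str, as os.listdir() would hand out.
      # Undecodable bytes become lone surrogates U+DC80..U+DCFF.
      return b.decode(sys.getfilesystemencoding(), 'surrogateescape')

  def fsencode(s):
      # funny-encoded str -> bytes, as open() would do for POSIX.
      # Lone surrogates U+DC80..U+DCFF turn back into the original bytes.
      return s.encode(sys.getfilesystemencoding(), 'surrogateescape')

  def pathencode(s):
      # de novo str -> funny-encoded str.  For ordinary strings this is
      # the identity; how pun-prone lone surrogates should be recoded is
      # exactly what is under discussion, so this placeholder refuses to
      # guess rather than pick a convention.
      if any(0xDC80 <= ord(c) <= 0xDCFF for c in s):
          raise ValueError("de novo string contains escape surrogates")
      return s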

and for me, I would like to see:

  os.setfilesystemencoding(coding)

Currently sys.getfilesystemencoding() returns you the encoding based on
the current locale, and (I trust) the os.* stuff encodes on that basis.
setfilesystemencoding() would override that, unless coding==None, in which
case it reverts to the former "use the user's current locale" behaviour.
(We have locale "C" for what one might otherwise expect None to mean:-)

The idea here is to let the program control the codec used for filenames
for special purposes, without working indirectly through the locale.
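
No such setter exists today, so this is only a sketch of the intended
behaviour: a module-level override that falls back to the locale-derived
default when coding is None.

  import sys

  _override = None   # None means "use the user's current locale"

  def setfilesystemencoding(coding):
      # Override the codec used for filename bytes<->str translation,
      # or revert to the locale default when coding is None.
      global _override
      _override = coding

  def getfilesystemencoding():
      return _override if _override is not None \
             else sys.getfilesystemencoding()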

If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file.
Hmm. I had thought that legitimate unicode strings already get transcoded
to bytes via the mapping specified by sys.getfilesystemencoding()
(the user's locale). That already happens I believe, and Martin's
scheme doesn't change this. He's just funny-encoding non-decodable byte
sequences, not the decoded stuff that surrounds them.
So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns.

See my proposal above. Does it address your concerns? A program still
must know the provenance of the string, and _if_ you're working with
non-decodable sequences in a name then you should transmute them into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.

_Lacking_ such a function, your punning concern is valid.
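
To make the pun concrete (again assuming the PEP's handler lands under
the name 'surrogateescape'): the two sources below yield equal strings,
and nothing about the strings themselves says which is which.

  # Source 1: a funny-decoded name from a bytes API.
  # 0xE9 is not valid UTF-8 on its own, so it escapes to U+DCE9.
  from_listdir = b'caf\xe9'.decode('utf-8', 'surrogateescape')

  # Source 2: a de novo string hand-crafted with the same surrogate.
  de_novo = 'caf\udce9'

  assert from_listdir == de_novo   # the pun: indistinguishable

A program that funny-encodes both gets the same bytes, b'caf\xe9',
whatever the author of the de novo string may have meant.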

Seems like one would also desire os.pathdecode to do the reverse. And also versions that take or produce bytes from funny-encoded strings.

Then, if programs were recoded to perform these transformations on what you call de novo strings, the scheme would work.

But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists.

If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates?
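
For instance, something in the spirit of percent-escaping, purely
illustrative: undecodable bytes and the escape character itself become
visible hex escapes, so results are displayable, comparable, and still
reversible.

  def visible_decode(b, coding='utf-8'):
      # bytes -> displayable str: '%' escapes itself as '%25',
      # undecodable bytes become '%XX'.
      out = []
      for ch in b.decode(coding, 'surrogateescape'):
          if ch == '%':
              out.append('%25')
          elif 0xDC80 <= ord(ch) <= 0xDCFF:
              out.append('%%%02X' % (ord(ch) - 0xDC00))
          else:
              out.append(ch)
      return ''.join(out)

  # visible_decode(b'caf\xe9') -> 'caf%E9'
  # visible_decode(b'50%')     -> '50%25'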


So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.

True. open() should always expect a funny-encoded name.

So it is already the case that strings get decoded to bytes by
calls like open(). Martin isn't changing that.
I thought the process of converting strings to bytes is called encoding. You seem to be calling it decoding?

My head must be standing in the wrong place. Yes, I probably mean
encoding here. I'm trying to accompany these terms with little pictures
like "string->bytes" to avoid confusion.

I suppose if your program carefully constructs a unicode string riddled
with half-surrogates etc and imagines something specific should happen
to them on the way to being POSIX bytes then you might have a problem...
Right. Or someone else's program does that. I only want to use Unicode file names. But if those other file names exist, I want to be able to access them, and not accidentally get a different file.

Point taken. And I think addressed by the utility function proposed
above.

[...snip normal versus odd chars for the funny-encoding ...]
Also, by avoiding reuse of legitimate characters in the encoding we can
avoid your issue with losing track of where a string came from;
legitimate characters are currently untouched by Martin's scheme, except
for the normal "bytes<->string via the user's locale" translation that
must already happen, and there you're aided by byets and strings being
different types.
There are abnormal characters, but there are no illegal characters.

I thought half-surrogates were illegal in well-formed Unicode. I confess
to being weak in this area. By "legitimate" above I meant to exclude things
like half-surrogates which, like quarks, should not occur alone?

"Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. Hence, while all of the systems under discussion can handle all Unicode characters in one way or another, none of them require that all Unicode rules are followed. Yes, you are correct that lone surrogates are illegal in Unicode. No, none of the accepted rules for these systems require Unicode.


NTFS permits any 16-bit "character" code, including abnormal ones, including half-surrogates, and including full surrogate sequences that decode to PUA characters. POSIX permits all byte sequences, including things that look like UTF-8, things that don't look like UTF-8, things that look like half-surrogates, and things that look like full surrogate sequences that decode to PUA characters.

Sure. I'm not really talking about what filesystem will accept at
the native layer, I was talking in the python funny-encoded space.

[..."escaping is necessary"... I agree...]
I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value.

Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP.
Please elucidate the "second source" of strings. I'm presuming you mean
strings generated from scratch rather than obtained by something like
listdir().
POSIX has byte APIs for strings, that's one source, that is most under discussion. Windows has both bytes and 16-bit APIs for strings... the 16-bit APIs are generally mapped directly to UTF-16, but are not checked for UTF-16 validity, so all of Martin's funny-decoded files could be used for Windows file names on the 16-bit APIs.

These are names of existing files; I'll take them as source 1. They get
funny-decoded for release by os.listdir() et al.

And yes, strings can be generated from scratch.

I take this to be source 2.

One variation of source 2 is reading output from other programs, such as ls (POSIX) or dir (Windows).

I think I agree with all the discussion that followed, and think the
real problem is the lack of utility functions to funny-encode source 2
strings for use. Hence the proposal above.

I think we understand each other now. I think your proposal could work, Cameron, although when recoding applications to use it, I'd find it easier to use the "file name object" that others have proposed. I think that because both your proposal and the object proposals require recoding the application, they will not be accepted. And I think that because PEP 383 allows data puns, it should not be accepted in its present form.

I think that if your proposal is accepted, it then becomes possible to use an encoding that uses visible characters, which makes it easier for people to understand and verify: an encoding such as the one I suggested, perhaps using a more obscure character, if there is one that doesn't violate true Unicode. It should transform all data, from str and bytes interfaces, and produce only str values containing conforming Unicode, escaping all the non-conforming sequences in some manner. This would make the strings truly readable, as long as fonts for all the characters are available.

I had already suggested the utility functions you are suggesting, actually, in my first tirade against PEP 383 (search for "The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme"). I really don't care who gets the credit for the idea (others may have suggested it before me), but I do care that the solution should provide functionality that works without ambiguity/data puns.

The solution that was proposed in the lead up to releasing Python 3.0 was to offer both bytes and str interfaces (so we have those), and then for those that want to have a single portable implementation that can access all data, an object that encapsulates the differences, and the variant system APIs. (File system is one, command line is another, environment is another; I'm not sure if there are more.) I haven't heard whether any progress on such an encapsulating object has been made; the people who proposed it have been rather quiet about this PEP. I would expect that an object implementation would provide display strings, and APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them.
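
One hypothetical shape for such an encapsulating object (all names
invented here): it keeps the native form, offers a display string that
is never fed back to the filesystem, and takes de novo str or bytes
only through explicit constructors.

  class Filename:
      # Wraps the native form of a name (bytes on POSIX) so callers
      # can never confuse a display string with a real name.
      def __init__(self, native_bytes):
          self._native = native_bytes

      @classmethod
      def from_str(cls, s, coding='utf-8'):
          # de novo str, encoded explicitly.
          return cls(s.encode(coding))

      @classmethod
      def from_bytes(cls, b):
          # de novo bytes, used as-is.
          return cls(b)

      def display(self, coding='utf-8'):
          # For humans only; lossy on undecodable bytes.
          return self._native.decode(coding, 'replace')

      def native(self):
          # The exact bytes to hand to the OS.
          return self._native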

Programs that want to use str interfaces on POSIX will see a subset of the files on systems that contain files whose byte filenames are not decodable. If a sysadmin wants to standardize on UTF-8 names universally, they can use something like convmv to clean up existing file names that don't conform. Programs that use str interfaces on POSIX systems will work fine, but with a subset of the files. When that is unacceptable, they can either be recoded to use the bytes interfaces, or the hopefully forthcoming object encapsulation. The issue then will be what technique is used to transform bytes into display names; but since the display names would never be fed back to the objects directly (the object would have an interface to accept de novo str and de novo bytes), it is just a display issue, and one that uses visible characters seems more useful to my mind than one that uses half-surrogates or PUAs.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
