On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson:
I think I may be able to resolve Glenn's issues with the scheme lower
down (through careful use of definitions and hand waving).

Close. You at least resolved what you thought my issue was. And you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced -- or to use the bytes APIs directly to get file names for 3rd party libraries. But the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must then be tweaked to do something different from what they did before.


On 27Apr2009 23:52, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson:
[...]
There may be puns. So what? Use the right strings for the right purpose
and all will be well.

I think what is missing here, and missing from Martin's PEP, is some
utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

  os.fsdecode(bytes) -> funny-encoded Unicode
    This is what os.listdir() does to produce the strings it hands out.
  os.fsencode(funny-string) -> bytes
    This is what open(filename,..) does to turn the filename into bytes
    for the POSIX open.
  os.pathencode(your-string) -> funny-encoded-Unicode
    This is what you must do to a de novo string to turn it into a
    string suitable for use by open.
    Importantly, for most strings not hand crafted to have weird
    sequences in them, it is a no-op. But it will recode your puns
    for survival.
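
To make the proposal concrete, here is a rough sketch (assuming the PEP's
escape-on-decode error handler, spelled 'surrogateescape' below; these
helpers are not part of the PEP):

    import sys

    def fsdecode(raw):
        # What os.listdir() would do under the PEP: undecodable bytes
        # come back as lone surrogates (the "funny" encoding).
        return raw.decode(sys.getfilesystemencoding(), 'surrogateescape')

    def fsencode(name):
        # What open(filename, ...) would do before the POSIX open: the
        # lone surrogates turn back into the original bytes.
        return name.encode(sys.getfilesystemencoding(), 'surrogateescape')

    # os.pathencode() would be a no-op for well-formed strings; for a de
    # novo string that deliberately contains surrogates, the caller has
    # to decide what those mean, so it is not sketched here.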
[...]
So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns.
See my proposal above. Does it address your concerns? A program still
must know the provenance of the string, and _if_ you're working with
non-decodable sequences in names then you should transmute them into
the funny encoding using the os.pathencode() function described above.

In this way the punning issue can be avoided.
_Lacking_ such a function, your punning concern is valid.
Seems like one would also desire os.pathdecode to do the reverse.

Yes.

And also versions that take or produce bytes from funny-encoded strings.

Isn't that the first two functions above?

Yes, sorry.

Then, if programs were re-coded to perform these transformations on what you call de novo strings, the scheme would work. But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists.

I agree no such scheme exists. I don't think it can, just using strings.

But _unless_ you have made a de novo handcrafted string with
ill-formed sequences in it, you don't need to bother because you
won't _have_ puns. If Martin's using half surrogates to encode
"undecodable" bytes, then no normal string should conflict because a
normal string will contain _only_ Unicode scalar values. Half surrogate
code points are not such.

The advantage here is that unless you've deliberately constructed an
ill-formed unicode string, you _do_not_ need to recode into
funny-encoding, because you are already compatible. Somewhat like one
doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.
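
For illustration, a minimal sketch of that property (assuming the
handler ends up spelled 'surrogateescape'):

    # A well-formed name is untouched; an undecodable byte surfaces as a
    # lone surrogate, which no well-formed string can contain.
    good = 'café'
    assert good.encode('utf-8', 'surrogateescape') == good.encode('utf-8')

    raw = b'caf\xe9'                  # Latin-1 bytes, not valid UTF-8
    name = raw.decode('utf-8', 'surrogateescape')
    assert name == 'caf\udce9'        # the 0xE9 byte became lone surrogate U+DCE9
    assert name.encode('utf-8', 'surrogateescape') == raw   # lossless round trip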

Right. And I don't intend to generate ill-formed Unicode strings, in my programs. But I might well read their names from other sources.

It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms.... described below.

If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates?

Because that would _not_ be a no-op for well formed Unicode strings.

That reason is sufficient for me.

I consider the fact that well-formed Unicode -> funny-encoded is a no-op
to be an enormous feature of Martin's scheme.

Unless I'm missing something, there _are_no_puns_ between funny-encoded
strings and well formed unicode strings.

I think you are correct regarding where the puns are. I agree that not perturbing well-formed Unicode is a benefit.


I suppose if your program carefully constructs a unicode string riddled
with half-surrogates etc and imagines something specific should happen
to them on the way to being POSIX bytes then you might have a problem...
Right.  Or someone else's program does that.

I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a
coffee, reading section 3.9 (Unicode Encoding Forms).

I now do not believe your scenario makes sense.

Someone can construct a Python3 string containing code points that
includes surrogates. Granted.

However such a string is not meaningful because it is not well-formed
(D85).  It's ill-formed (D84). It is not sane to expect it to
translate into a POSIX byte sequence, be it UTF-8 or anything else,
unless it is accompanied by some kind of explicit mapping provided by
the programmer.  Absent that mapping, it's nonsense in much the same
way that a non-decodable UTF-8 byte sequence is nonsense.

For example, Martin's funny-encoding is such an explicit mapping.

Such a string can be meaningful if it is used as a file name... it is the name of the file. I will agree that it would not be a word in any language, because it is composed of code points that are not characters, if that is what you meant.


I only want to use Unicode file names. But if those other file names exist, I want to be able to access them, and not accidentally get a different file.

But those other names _don't_ exist.

They do if someone constructs them.

Also, by avoiding reuse of legitimate characters in the encoding we can
avoid your issue with losing track of where a string came from;
legitimate characters are currently untouched by Martin's scheme, except
for the normal "bytes<->string via the user's locale" translation that
must already happen, and there you're aided by bytes and strings being
different types.
There are abnormal characters, but there are no illegal characters.
I thought half-surrogates were illegal in well formed Unicode. I confess
to being weak in this area. By "legitimate" above I meant things like
half-surrogates which, like quarks, should not occur alone?
"Illegal" just means violating the accepted rules.

I think that either we've lost track of what each other is saying,
or you're wrong here. And my poor terminology hasn't been helping.

What we've got:

  (1) Byte sequence files names in the POSIX file system.
      It doesn't matter whether the underlying storage is a real POSIX
      filesystem or mostly POSIX one like MacOSX HFS or a remotely
      attached non-POSIX filesystem like a Windows one, because we're
      talking through the POSIX API, and it is handing us byte
      sequences, which we expect may contain anything except a NUL.

  (2) Under Martin's scheme, os.listdir() et al hand us (and accept)
      funny-encoded Python3 strings, which are strings of Unicode code
      units (D77).
      Particularly, if there were bytes in the POSIX byte string that
      did not decode into Unicode scalar values (D76) then each such
      byte is encoded as a surrogate (D71,72,73,74).

      It is important to note here that because surrogates are _not_
      Unicode scalar values, there is no punning between the two sets
      of values.

  (3) Other Python3 strings that have not been through Martin's mangler
      in either direction. Ordinary strings.

Your concern is that, handed a string, a programmer could misuse (3) as
(2) or vice versa because of punning.

In a well-formed unicode string there are no surrogates; surrogates only
occur in UTF-16 _encodings_ of Unicode strings (D75).

Therefore, it _is_ possible to inspect a string, if one cared, to see if
it is funny-encoded or "raw". One may get two different answers:

  - If there are surrogate code units then it must be funny-encoded
    and will therefore work perfectly if handed to an os.* interface.

  - If there are no surrogate code units then it may be funny-encoded or it
    may not have been through Martin's funny-encoder; you can't tell.
    However, this doesn't matter because the encoder is a no-op for such
    strings.
    Therefore it will work perfectly if handed to an os.* interface.
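
A sketch of that inspection (illustrative only):

    def certainly_funny_encoded(s):
        # Lone surrogate code units can only have come from the
        # funny-encoder (or from deliberate construction); if there are
        # none, the funny-encoder is a no-op for this string anyway.
        return any('\ud800' <= ch <= '\udfff' for ch in s)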

The only gap in this is a specially crafted string containing surrogate
code points that did not come via Martin's encoder. But such a string
cannot come from a user interface, which will accept only characters,
and those include only Unicode scalar values.

Such a string can only be explicitly constructed (eg with a \uD802
code point). And if something constructs such a string, it must have in
mind an explicit interpretation of those code points, which means it is
the _constructor_ on whom the burden of translation lies.

Does this make sense to you, or have you a counter example in mind?

Lots of configuration systems permit schemes like C's \x to be used to create strings. Whether you perceive that to be a user interface or not, or believe that such things should be part of a user interface or not, they exist. Whether they validate that such strings are properly constructed Unicode text or should or should not do such validation, is open for discussion, but I'd be surprised if there are not some such schemes that don't do such checking, and consider it a feature. Why make the file name longer than necessary, when you can just use all these nice illegal codepoints to keep it shorter instead? Instead of 5 characters for a filename sequence counter, someone might stuff it in 1 character, in binary, and think they were clever. I've seen such techniques, although not specifically in Python, since I'm fairly new to reading Python code.
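
Python itself will construct such a string without complaint, for example (illustrative only):

    # Nothing validates well-formedness when a string is built; checking
    # only happens if/when the string meets a strict codec.
    sneaky = 'report-\udc05'     # a lone surrogate used as a 1-"character" counter
    assert len(sneaky) == 8
    # sneaky.encode('utf-8')     # would raise UnicodeEncodeError (strict handler)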

So I consider it not beyond the realm of possibility to encounter lone surrogate code units in strings that haven't been through Martin's funny-encoder. Hence, I disbelieve that the gap you mention can be ignored.


In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. Hence, while all of the systems under discussion can handle all Unicode characters in one way or another, none of them require that all Unicode rules are followed. Yes, you are correct that lone surrogates are illegal in Unicode. No, none of the accepted rules for these systems require Unicode.

However, Martin's scheme explicitly translates these ill-formed
sequences into Python3 strings and back, losslessly. You can have
surrogates in the filesystem storage/API on Windows. You can have
non-UTF-8-decodable sequences in the POSIX filesystem layer too.
They're all taken in and handled.

It is still not clear (1) whether the PEP would be implemented on Windows, and (2) if it is, whether it prevents lone surrogates from being obtained from the str APIs by transcoding them into 3 lone surrogates. If it doesn't transcode from the str APIs, but does funny-decode from the bytes APIs, then it would seem there is still the possibility of data puns on Windows.

In Python3 space, one might have a bytes object with a raw POSIX
byte filename in it. Presumably one can also have a byte string with a
raw (UTF-16) Windows filename in it. They're not strings, so no
confusion.

But there's no _string_ for these things without a matching
string<->bytestring mapping associated with it.

If you have a Python3 string which is well-formed Unicode, then you can
hand it to the os.* interfaces and the Right Thing will happen (on
Windows just because it stored Unicode and on POSIX provided you agree
that your locale/getfilesystemencoding() is the right thing).

If you have a string that isn't well-formed, then the meaning of any
code points which are not Unicode scalar values is not well defined
without some auxiliary stuff in the app.

NTFS permits any 16-bit "character" code, including abnormal ones, including half-surrogates, and including full surrogate sequences that decode to PUA characters. POSIX permits all byte sequences, including things that look like UTF-8, things that don't look like UTF-8, things that look like half-surrogates, and things that look like full surrogate sequences that decode to PUA characters.

See above. I think this is addressed.

Without transcoding on the str APIs, which I haven't seen mentioned, I don't think so.

[...]
These are existing file objects, I'll take them as source 1. They get
encoded for release by os.listdir() et al.
And yes, strings can be  generated from scratch.
I take this to be source 2.
One variation of source 2 is reading output from other programs, such as ls (POSIX) or dir (Windows).

Sure. But that is reading byte sequences, and one must again know the
encoding. If that is known and the input decoded happily into Unicode
scalar values, then there is no issue. If the input didn't decode, then
one must make some decision about what the non-decodable bits mean.

Sure. So the PEP needs your functions, or the equivalent. Last I checked, they weren't there.


I think I agree with all the discussion that followed, and think the
real problem is lack of utility functions to funny-encode source 2
strings for use. Hence the proposal above.
I think we understand each other now. I think your proposal could work, Cameron, although when recoding applications to use your proposal, I'd find it easier to use the "file name object" that others have proposed. I think that because either your proposal or the object proposals require recoding the application, they will not be accepted. I think that because PEP 383 allows data puns, it should not be accepted in its present form.

I'm of the opinion now that the puns can only occur when the source 2
string has surrogates, and either those surrogates are chosen to match
the funny-encoding, in which case the pun is not a pun, or the
surrogates are chosen according to a different scheme in which case
source 2 is obliged to provide a mapping.

A source 2 string of only Unicode scalar values doesn't need remapping.

A correct translation of source 2 strings would be obliged to call one of your functions, that doesn't exist in the PEP, because it appears the PEP wants to assume that such strings don't exist, unless it creates them. So this takes porting effort for programs generating and consuming such strings, to avoid being mangled by the PEP. That isn't necessary today, only post-PEP.

I think if your proposal is accepted, it then becomes possible to use an encoding that uses visible characters, which makes it easier for people to understand and verify. An encoding such as the one I suggested, but perhaps using a more obscure character, if there is one, that still doesn't violate true Unicode.

I think any scheme that uses any Unicode scalar value as an escape
character _inherently_ introduces puns, and puns that are easier to
encounter.

I think the real strength of Martin's scheme is exactly that bytes strings
that needed the funny-encoding _do_ produce ill-formed Unicode strings,
because such strings _cannot_ conflict with well-formed strings.

I think your desire for a human readable encoding is valid, but it should
be a further purely "presentation" step, somewhat like quoted-printable
encoding in MIME, and not the scheme used by Martin.

Another step? Even more porting effort? For a PEP that is trying to avoid porting effort?

But maybe there is a compromise that mostly meets both goals: use U+DC10 as a (high-flying) escape character. It is not printable, so the substitution glyph will likely get displayed by display functions. Then transcode illegal bytes to the range U+0100 to U+01FF, and transcode existing U+DC10 to U+DC10 U+DC10.

1) This is an easy to understand scheme, and illegal byte values would become displayable, but would each be preceded by the substitution glyph for the U+DC10.

2) There would be no need to transcode other lone surrogates... on the other hand, any illegal code values could be treated as illegal bytes and transcoded, making the strings more nearly legal, and more uniformly displayable.

3) The property that all potential data puns are among ill-formed Unicode strings is still retained.

4) Because the result string is nearly legal Unicode (except for the escape characters U+DC10), it becomes uniformly comparable and different strings can be visibly different.

5) It is still necessary to transcode names from str interfaces, to escape any U+DC10 characters, at least, which is also required by this PEP to avoid data puns on systems that have both str and bytes interfaces.
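
A rough sketch of the byte-decoding half (point 1) and the de novo escape (point 5) of this scheme; one possible reading, purely illustrative:

    ESC = '\udc10'    # the escape character proposed above

    def visible_decode(raw, encoding='utf-8'):
        # Decode a bytes filename, mapping each undecodable byte to
        # ESC followed by a visible stand-in in the U+0100..U+01FF range.
        out = []
        i = 0
        while i < len(raw):
            # greedily decode the longest clean chunk starting at i
            for j in range(len(raw), i, -1):
                try:
                    out.append(raw[i:j].decode(encoding))
                    i = j
                    break
                except UnicodeDecodeError:
                    continue
            else:
                out.append(ESC + chr(0x0100 + raw[i]))   # escape one bad byte
                i += 1
        return ''.join(out)

    def escape_de_novo(name):
        # Transcode a de novo string: double any literal ESC character.
        return name.replace(ESC, ESC + ESC)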


I think it should transform all data, from str and bytes interfaces, and produce only str values containing conforming Unicode, escaping all the non-conforming sequences in some manner. This would make the strings truly readable, as long as fonts for all the characters are available.

But I think it would just move the punning. A human readable string with
readable escapes in it may be funny-encoded. _Or_ it may be "raw", with
funny-encoding yet to happen; after all, one might weirdly be dealing
with a filename which contained post-funny-encode visible sequences in
it.

So you're right back to _guessing_ what you're looking at.

With the surrogate scheme you only have to guess if there are surrogates,
but then you _know_ that you're dealing with a special encoding scheme;
it is certain - the guess is about which scheme.

I think you mean you don't have to guess if there are lone surrogates... you can look and see.

If you're working in a domain with no ill-formed strings you never need
to worry at all.

With a visible/printable encoding such as you advocate, the guess is about
whether the scheme has even been used, which is why I think it is worse.

So the above scheme, using a U+DC10 escape character, meets your desirable truisms about lone surrogates being the trigger for knowing that you are dealing with bizarro names, but being uncertain about which kind, and also makes the results lots more readable.

I still think there is a need to provide the encoding and decoding functions, for both bytes and de novo strings.


And I had already suggested the utility functions you are suggesting, actually, in my first tirade against PEP 383 (search for "The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme").

I must have missed that sentence. But it sounds like we want the same
facilities at least.

The solution that was proposed in the lead up to releasing Python 3.0 was to offer both bytes and str interfaces (so we have those), and then for those that want to have a single portable implementation that can access all data, an object that encapsulates the differences, and the variant system APIs. (file system is one, command line is another, environment is another, I'm not sure if there are more.) I haven't heard if any progress on such an encapsulating object has been made; the people that proposed such have been rather quiet about this PEP. I would expect that an object implementation would provide display strings, and APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them.

I think covering these other cases is quite messy, if only because
there's not even agreement amongst existing command line apps about all
that stuff.

Regarding "APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them" I think such a
facility for de novo strings must require the caller to provide a
handler/mapper for the not-well-formed parts of such strings if they
occur.

The caller shouldn't have to supply anything. The same encoding that is applied to str system interfaces that supply strings should be applied to de novo strings. It is just a matter of transcoding a de novo string into the "right form" so that it can then be encoded by the system encoder to produce the original string again, if it goes to a str interface, or an equivalent bytes string, if it goes to a bytes interface.

Programs that want to use str interfaces on POSIX will see a subset of files on systems that contain files whose bytes filenames are not decodable.

Not under Martin's scheme, because all bytes filenames _are_ decoded.

I think I was speaking of the status quo, here, not with the PEP.

If a sysadmin wants to standardize on UTF-8 names universally, they can use something like convmv to clean up existing file names that don't conform. Programs that use str interfaces on POSIX system will work fine, but with a subset of the files. When that is unacceptable, they can either be recoded to use the bytes interfaces, or the hopefully forthcoming object encapsulation. The issue then will be what technique will be used to transform bytes into display names, but since the display names would never be fed back to the objects directly (but the object would have an interface to accept de novo str and de novo bytes) then it is just a display issue, and one that uses visible characters would seem more useful in my mind, than one that uses half-surrogates or PUAs.

I agree it might be handy to have a display function, but isn't repr()
exactly that, now I think of it?

repr is a display function that produces rather ugly results in most non-ASCII cases. But then again, one could use repr as the funny-encoding scheme, too... I don't think we want to use repr for either case, actually. Of course, with Py 3, if the file names were objects, and could have reprlib customizations... :) :)
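
For example (illustrative, assuming the PEP's handler is spelled 'surrogateescape'):

    name = b'caf\xe9'.decode('utf-8', 'surrogateescape')
    print(repr(name))   # prints 'caf\udce9' -- visible and unambiguous, but ugly
    # printing name directly may raise, since a lone surrogate is unencodable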

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
