Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis:
>>>> C. File on disk with the invalid surrogate code, accessed via the str
>>>> interface, no decoding happens, matches in memory the file on disk with
>>>> the byte that translates to the same surrogate, accessed via the bytes
>>>> interface. Ambiguity.
>>>
>>> Is that an alternative to A and B?
>>
>> I guess it is an adjunct to case B, the current PEP. It is what happens
>> when using the PEP on a system that provides both bytes and str
>> interfaces, and both get used.
>
> Your formulation is a bit too stenographic for me, but please trust me
> that there is *no* ambiguity in the case you construct.

No Martin, the point of reviewing the PEP is to _not_ trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards.

> By "accessed via the str interface", I assume you do something like
>
>   fn = "some string"
>   open(fn)
>
> You are wrong in assuming "no decoding happens", and that "matches in
> memory the file on disk" (whatever that means - how do I match a file on
> disk in memory??). What happens instead is that fn gets *encoded* with
> the file system encoding, and the python-escape handler. This will *not*
> produce an ambiguity.

You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes. Then, ask for the same filename via the bytes interface, using UTF-8 encoding. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can be seen as two different names in memory. Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90. Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP.

> If you think there is an ambiguity in that you can use both the byte
> interface and the string interface to access the same file: this would
> be a ridiculous interpretation. *Of course* you can access /etc/passwd
> both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous
> about that.

Yes, this would be a ridiculous interpretation of "ambiguous".

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
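[A sketch of the scenario Glenn describes, using the error handlers that later shipped in CPython: surrogatepass here only models a file system that stores a wide-API name containing a lone surrogate, and surrogateescape stands in for the PEP's python-escape handler. The file names are hypothetical.]

    # A file whose wide-API (str) name contains the lone surrogate U+DC10:
    str_name = 'abc\udc10'

    # The same name seen through a UTF-8 bytes interface: the lone surrogate
    # occupies three bytes on disk.
    raw = str_name.encode('utf-8', 'surrogatepass')      # b'abc\xed\xb0\x90'

    # Decoding those bytes under the PEP escapes each byte separately:
    via_bytes = raw.decode('utf-8', 'surrogateescape')   # 'abc\udced\udcb0\udc90'

    # The two in-memory names differ - Martin's point - but a second file
    # actually named 'abc\udced\udcb0\udc90' via the str interface would
    # collide with via_bytes - Glenn's point.
    print(str_name == via_bytes)                          # False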
Re: [Python-Dev] PEP 383 (again)
On Wed, Apr 29, 2009 at 07:45, "Martin v. Löwis" wrote:
> Your claim was that PEP 383 may have unfortunate effects on Windows,

No, I simply think that PEP 383 is not sufficiently specified to be able to tell.

> and I'm telling you that it won't, because the behavior of Python on
> Windows won't change at all.

A justification for your proposal is that there are differences between Python on UNIX and Windows that you would like to reduce. But depending on where you introduce utf-8b coding on UNIX, you may also have to introduce it on Windows in order to keep the platforms consistent.

> So whatever the problem - it's there already, and the PEP is not going
> to change it.

OK, so you are saying that under PEP 383, utf-8b wouldn't be used anywhere on Windows by default. That's not clear from your proposal.

It's also not clear from your proposal where utf-8b will get used on UNIX systems. Some of the places that have been suggested are: open, os.listdir, sys.argv, os.getenv. There are other potential ones, like print, write, and os.system. And what about text file and string conversions: will utf-8b become the default, or optional, or unavailable? Each of those choices potentially has significant implications. I'm just asking what those choices are so that one can then talk about the implications and see whether this proposal is a good one or whether other alternatives are better.

Tom
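[For reference, a sketch of how these questions were eventually settled in the final PEP as implemented in CPython, where python-escape landed under the name surrogateescape: the handler applies at OS boundaries - file names, environment variables, command-line arguments - on POSIX, while ordinary text I/O stays strict unless explicitly opted in.]

    import sys

    sys.getfilesystemencoding()    # the locale encoding, e.g. 'utf-8'

    # OS-boundary functions decode with errors='surrogateescape' on POSIX,
    # so os.listdir(), sys.argv and os.environ never fail on odd bytes.

    # Ordinary text streams keep strict error handling by default:
    f = open('out.txt', 'w', encoding='utf-8')                            # strict
    g = open('out.txt', 'w', encoding='utf-8', errors='surrogateescape')  # opt-in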
Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces
> I would like utility functions to perform:
>   os-bytes->funny-encoded
>   funny-encoded->os-bytes
> or explicit example code snippets for same in the PEP text.

Done!

Martin
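[A minimal sketch of what such helpers can look like, assuming a POSIX-style locale encoding and the PEP's python-escape handler under the name it eventually shipped with, surrogateescape; the function names follow Cameron's proposal elsewhere in this thread.]

    import sys

    def fsdecode(data: bytes) -> str:
        """os bytes -> funny-encoded str, as os.listdir() would produce."""
        return data.decode(sys.getfilesystemencoding(), 'surrogateescape')

    def fsencode(name: str) -> bytes:
        """funny-encoded str -> os bytes, as open() would consume."""
        return name.encode(sys.getfilesystemencoding(), 'surrogateescape')

    # Round-trips arbitrary bytes, decodable or not:
    assert fsencode(fsdecode(b'caf\xe9')) == b'caf\xe9'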
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I'm more concerned with your (yours? someone else's?) mention of shift
> characters. I'm unfamiliar with these encodings: to translate such a
> thing into a Latin example, is it the case that there are schemes with
> valid encodings that look like:
>
>   [SHIFT] a b c
>
> which would produce "ABC" in unicode, which is ambiguous with:
>
>   A B C
>
> which would also produce "ABC"?

No: the "shift" in "shift-jis" is not really about the shift key. See

http://en.wikipedia.org/wiki/Shift-JIS

Regards,
Martin
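[Encodings with real mode-shifting do exist, though: the ISO 2022 family switches character sets with escape sequences, while Shift-JIS, despite its name, is stateless. A quick illustration in Python 3:]

    s = '\u3042'  # HIRAGANA LETTER A

    # ISO-2022-JP brackets the character with mode-switch escape sequences:
    print(s.encode('iso2022_jp'))   # b'\x1b$B$"\x1b(B'

    # Shift-JIS needs no shift bytes at all:
    print(s.encode('shift_jis'))    # b'\x82\xa0'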
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> The Python UTF-8 codec will happily encode half-surrogates; people argue
>> that it is a bug that it does so, however, it would help in this
>> specific case.
>
> Can we use this encoding scheme for writing into files as well? We've
> turned the filename with undecodable bytes into a string with half
> surrogates. Putting that string into a file has to turn them into bytes
> at some level. Can we use the python-escape error handler to achieve
> that somehow?

Sure: if you are aware that what you write to the stream is actually a file name, you should encode it with the file system encoding, and the python-escape handler.

However, it's questionable that the same approach is right for the rest of the data that goes into the file. If you use a different encoding on the stream, yet still use the python-escape handler, you may end up with completely non-sensical bytes. In practice, it probably won't be that bad - python-escape has likely escaped all non-ASCII bytes, so that on re-encoding with a different encoding, only the ASCII characters get encoded, which likely will work fine.

Regards,
Martin
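[A sketch of what Martin describes, with python-escape under its eventual name surrogateescape; the file name is hypothetical.]

    # A name with an undecodable byte, funny-decoded at the OS boundary:
    fn = b'caf\xe9'.decode('utf-8', 'surrogateescape')   # 'caf\udce9'

    # Writing it out with the same encoding and handler round-trips the
    # original bytes exactly:
    fn.encode('utf-8', 'surrogateescape')                 # b'caf\xe9'

    # A strict encoder refuses the half surrogate, which is how such rare
    # names surface as late errors in serialization code:
    try:
        fn.encode('utf-8')
    except UnicodeEncodeError as e:
        print(e)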
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> C. File on disk with the invalid surrogate code, accessed via the str
>>> interface, no decoding happens, matches in memory the file on disk with
>>> the byte that translates to the same surrogate, accessed via the bytes
>>> interface. Ambiguity.
>>
>> Is that an alternative to A and B?
>
> I guess it is an adjunct to case B, the current PEP.
>
> It is what happens when using the PEP on a system that provides both
> bytes and str interfaces, and both get used.

Your formulation is a bit too stenographic for me, but please trust me that there is *no* ambiguity in the case you construct.

By "accessed via the str interface", I assume you do something like

  fn = "some string"
  open(fn)

You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??). What happens instead is that fn gets *encoded* with the file system encoding, and the python-escape handler. This will *not* produce an ambiguity.

If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. *Of course* you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that.

Regards,
Martin
Re: [Python-Dev] PEP 383 (again)
> The wide APIs use UTF-16. UTF-16 suffers from the same problem as
> UTF-8: not all sequences of words are valid UTF-16 sequences. In
> particular, sequences containing isolated surrogates are not
> well-formed according to the Unicode standard. Therefore, the existence
> of a wide character API function does not guarantee that the wide
> character strings it returns can be converted into valid unicode
> strings. And, in fact, Windows Vista happily creates files with
> malformed UTF-16 encodings, and os.listdir() happily returns them.

Whatever. What does that have to do with PEP 383? Your claim was that PEP 383 may have unfortunate effects on Windows, and I'm telling you that it won't, because the behavior of Python on Windows won't change at all. So whatever the problem - it's there already, and the PEP is not going to change it.

I personally don't see a problem here - *of course* os.listdir will report invalid utf-16 encodings, if that's what is stored on disk. It doesn't matter whether the file names are valid wrt. some specification. What matters is that you can access all the files.

Regards,
Martin
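[A small demonstration of the underlying fact: Python's str happily holds an isolated surrogate, but the string cannot be strictly encoded to any UTF form, which is all "not well-formed" means here.]

    name = 'abc\ud800'         # isolated high surrogate: a legal str value

    try:
        name.encode('utf-16')  # strict encoding rejects it
    except UnicodeEncodeError as e:
        print(e)               # "... surrogates not allowed"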
Re: [Python-Dev] PEP 383 (again)
> It cannot crash Python; it can only crash hypothetical third-party
> programs or libraries with deficient error checking and unreasonable
> assumptions about input data.

The error checking isn't necessarily deficient. For example, a safe and legitimate thing to do is for third party libraries to throw a C++ exception, raise a Python exception, or delete the half surrogate. Any of those would break one of the use cases people have been talking about, namely being able to present the output from os.listdir() to the user, say in a file selector, and then access that file.

> (and, of course, you haven't even proven those programs or libraries
> exist)

PEP 383 is a proposal that suggests changing Python such that malformed unicode strings become a required part of Python and such that Python writes illegal UTF-8 encodings to UTF-8 encoded file systems. Those are big changes, and it's legitimate to ask that PEP 383 address the implications of that choice before it's made.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson:
> I think I may be able to resolve Glenn's issues with the scheme lower
> down (through careful use of definitions and hand waving).

Close. You at least resolved what you thought my issue was. And, you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced. Or, to use the bytes APIs directly to get file names for 3rd party libraries -- but the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked to do something different than they did before.

> On 27Apr2009 23:52, Glenn Linderman wrote:
>> On approximately 4/27/2009 7:11 PM, came the following characters from
>> the keyboard of Cameron Simpson:
> [...]
>>> There may be puns. So what? Use the right strings for the right purpose
>>> and all will be well.
>>>
>>> I think what is missing here, and missing from Martin's PEP, is some
>>> utility functions for the os.* namespace.
>>>
>>> PROPOSAL: add to the PEP the following functions:
>>>
>>>   os.fsdecode(bytes) -> funny-encoded Unicode
>>> This is what os.listdir() does to produce the strings it hands out.
>>>   os.fsencode(funny-string) -> bytes
>>> This is what open(filename,..) does to turn the filename into bytes
>>> for the POSIX open.
>>>   os.pathencode(your-string) -> funny-encoded-Unicode
>>> This is what you must do to a de novo string to turn it into a
>>> string suitable for use by open. Importantly, for most strings not
>>> hand crafted to have weird sequences in them, it is a no-op. But it
>>> will recode your puns for survival.
> [...]
>>>> So assume a non-decodable sequence in a name. That puts us into
>>>> Martin's funny-decode scheme. His funny-decode scheme produces a
>>>> bare string, indistinguishable from a bare string that would be
>>>> produced by a str API that happens to contain that same sequence.
>>>> Data puns.
>>>
>>> See my proposal above. Does it address your concerns? A program still
>>> must know the provenance of the string, and _if_ you're working with
>>> non-decodable sequences in names then you should transmute them into
>>> the funny encoding using the os.pathencode() function described above.
>>> In this way the punning issue can be avoided. _Lacking_ such a
>>> function, your punning concern is valid.
>>
>> Seems like one would also desire os.pathdecode to do the reverse.
>
> Yes.
>
>> And also versions that take or produce bytes from funny-encoded strings.
>
> Isn't that the first two functions above?

Yes, sorry.

>> Then, if programs were re-coded to perform these transformations on what
>> you call de novo strings, then the scheme would work. But I think a
>> large part of the incentive for the PEP is to try to invent a scheme
>> that intentionally allows for the puns, so that programs do not need to
>> be recoded in this manner, and yet still work. I don't think such a
>> scheme exists.
>
> I agree no such scheme exists. I don't think it can, just using strings.
>
> But _unless_ you have made a de novo handcrafted string with ill-formed
> sequences in it, you don't need to bother because you won't _have_ puns.
> If Martin's using half surrogates to encode "undecodable" bytes, then no
> normal string should conflict because a normal string will contain
> _only_ Unicode scalar values. Half surrogate code points are not such.
>
> The advantage here is that unless you've deliberately constructed an
> ill-formed unicode string, you _do_not_ need to recode into
> funny-encoding, because you are already compatible. Somewhat like one
> doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.

Right. And I don't intend to generate ill-formed Unicode strings, in my programs. But I might well read their names from other sources.

It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms, described below.

>> If there is going to be a required transformation from de novo strings
>> to funny-encoded strings, then why not make one that people can actually
>> see and compare and decode from the displayable form, by using
>> displayable characters instead of lone surrogates?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray:
> On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
>> C. File on disk with the invalid surrogate code, accessed via the str
>> interface, no decoding happens, matches in memory the file on disk with
>> the byte that translates to the same surrogate, accessed via the bytes
>> interface. Ambiguity.
>
> Unless I'm missing something, one of these is type str, and the other is
> type bytes, so no ambiguity.

You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 13:37, Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities
>>> as the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>> UTF-8 (the PEP actually never proposed this to happen).
>> This introduces an ambiguity: two different files in the same
>> directory may decode to the same string name, if one has the PUA
>> character, and the other has a non-decodable byte that gets decoded
>> to the same PUA character.
>>
>> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
>> The same ambiguity does *NOT* exist. If a file on disk already
>> contains an invalid surrogate code in its file name, then the UTF-8b
>> decoder will recognize this as invalid, and decode it byte-for-byte,
>> into three surrogate codes. Hence, the file names that are different
>> on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Is this a Windows example, or (now I think on it) an equivalent POSIX example of using the PEP where the locale encoding is UTF-16?

In either case, I would say one could make an argument for being stricter in reading in OS-native sequences. Grant that NTFS doesn't prevent half-surrogates in filenames, and likewise that POSIX won't because to the OS they're just bytes. On decoding, require well-formed data. When you hit ill-formed data, treat the nasty half surrogate as a PAIR of bytes to be escaped in the resulting decode. Ambiguity avoided.

I'm more concerned with your (yours? someone else's?) mention of shift characters. I'm unfamiliar with these encodings: to translate such a thing into a Latin example, is it the case that there are schemes with valid encodings that look like:

  [SHIFT] a b c

which would produce "ABC" in unicode, which is ambiguous with:

  A B C

which would also produce "ABC"?

Cheers,
-- 
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

Helicopters are considerably more expensive [than fixed wing aircraft], which is only right because they don't actually fly, but just beat the air into submission. - Paul Tomblin
[Python-Dev] Proposed: a new function-based C API for declaring Python types
EXECUTIVE SUMMARY

I've written a patch against py3k trunk creating a new function-based API for creating extension types in C. This allows PyTypeObject to become a (mostly) private structure.

THE PROBLEM

Here's how you create an extension type using the current API.

* First, find some code that already has a working type declaration. Copy and paste their fifty-line PyTypeObject declaration, then hack it up until it looks like what you need.

* Next--hey! There *is* no next, you're done. You can immediately create an object using your type and pass it into the Python interpreter and it would work fine. You are encouraged to call PyType_Ready(), but this isn't required and it's often skipped.

This approach causes two problems.

1) The Python interpreter *must support* and *cannot change* the PyTypeObject structure, forever. Any meaningful change to the structure will break every extension. This has many consequences:

   a) Fields that are no longer used must be left in place, forever, as ignored placeholders if need be. Py3k cleaned up a lot of these, but it's already picked up a new one ("tp_compare" is now "tp_reserved").

   b) Internal implementation details of the type system must be public.

   c) The interpreter can't even use a different structure internally, because extensions are free to pass in objects using PyTypeObjects the interpreter has never seen before.

2) As a programming interface this lacks a certain gentility. It clearly *works*, but it requires programmers to copy and paste with a large structure mostly containing NULLs, which they must pick carefully through to change just a few fields.

THE SOLUTION

My patch creates a new function-based extension type definition API. You create a type by calling PyType_New(), then call various accessor functions on the type (PyType_SetString and the like), and when your type has been completely populated you must call PyType_Activate() to enable it for use.

With this API available, extension authors no longer need to directly see the innards of the PyTypeObject structure. Well, most of the fields anyway. There are a few shortcut macros in CPython that need to continue working for performance reasons, so the "tp_flags" and "tp_dealloc" fields need to remain publicly visible.

One feature worth mentioning is that the API is type-safe. Many such APIs would have had one generic "PyType_SetPointer", taking an identifier for the field and a void * for its value, but this would have lost type safety. Another approach would have been to have one accessor per field ("PyType_SetAddFunction"), but this would have exploded the number of functions in the API. My API splits the difference: each distinct *type* has its own set of accessors ("PyType_GetSSizeT") which takes an identifier specifying which field you wish to get or set.

SIDE-EFFECTS OF THE API

The major change resulting from this API: all PyTypeObjects must now be *pointers* rather than static instances. For example, the external declaration of PyType_Type itself changes from this:

  PyAPI_DATA(PyTypeObject) PyType_Type;

to this:

  PyAPI_DATA(PyTypeObject *) PyType_Type;

This gives rise to the first headache caused by the API: type casts on type objects. It took me a day and a half to realize that this, from Modules/_weakref.c:

  PyModule_AddObject(m, "ref", (PyObject *) &_PyWeakref_RefType);

really needed to be this:

  PyModule_AddObject(m, "ref", (PyObject *) _PyWeakref_RefType);

Hopefully I've already found most of these in CPython itself, but this sort of code surely lurks in extensions yet to be touched.

(Pro-tip: if you're working with this patch, and you see a crash, and gdb shows you something like this at the top of the stack:

  #0 0x081056d8 in visit_decref (op=0x8247aa0, data=0x0)
      at Modules/gcmodule.c:323
  323   if (PyObject_IS_GC(op)) {

your problem is an errant &, likely on a type object you're passing in to the interpreter. Think--what did you touch recently? Or debug it by salting your code with calls to collect(NUM_GENERATIONS-1).)

Another irksome side-effect of the API: because of "tp_flags" and "tp_dealloc", I now have two declarations of PyTypeObject. There's the externally-visible one in Include/object.h, which lets external parties see "tp_dealloc" and "tp_flags". Then there's the internal one in Objects/typeprivate.h which is the real structure. Since declaring a type twice is a no-no, the external one is gated on

  #ifndef PY_TYPEPRIVATE

If you're a normal Python extension programmer, you'd include Python.h as normal:

  #include "Python.h"

Python implementation files that need to see the real PyTypeObject structure now look like this:

  #define PY_TYPEPRIVATE
  #include "Python.h"
  #include "../Objects/typeprivate.h"

Also, since the structure of
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
>> Since the serialization of the Unicode string is likely to use UTF-8,
>> and the string for such a file will include half surrogates, the
>> application may raise an exception when encoding the names for a
>> configuration file. These encoding exceptions will be as rare as the
>> unusual names (which the careful I18N aware developer has probably
>> eradicated from his system), and thus will appear late.
>
> There are trade-offs to any solution; if there was a solution without
> trade-offs, it would be implemented already.
>
> The Python UTF-8 codec will happily encode half-surrogates; people argue
> that it is a bug that it does so, however, it would help in this
> specific case.

Can we use this encoding scheme for writing into files as well? We've turned the filename with undecodable bytes into a string with half surrogates. Putting that string into a file has to turn them into bytes at some level. Can we use the python-escape error handler to achieve that somehow?

-Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 14:37, Thomas Breuel wrote:
| But the biggest problem with the proposal is that it isn't needed: if you
| want to be able to turn arbitrary byte sequences into unicode strings and
| back, just set your encoding to iso8859-15. That already works and it
| doesn't require any changes.

No it doesn't. It does transcode without throwing exceptions. On POSIX. (On Windows? I doubt it - windows isn't using an 8-bit scheme. I believe.) But it utterly destroys any hope of working in any other locale nicely.

The PEP lets you work losslessly in other locales. It _may_ require some app care for particular very weird strings that don't come from the filesystem, but as far as I can see only in circumstances where such care would be needed anyway i.e. you've got to do special stuff for weirdness in the first place. Weird == "ill-formed unicode string" here.

Cheers,
-- 
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

I just kept it wide-open thinking it would correct itself. Then I ran out of talent. - C. Fittipaldi
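[A quick illustration of the trade-off Cameron describes, assuming a UTF-8 file name viewed through an iso8859-15 decoder: decoding never fails and round-trips, but any non-ASCII name turns into mojibake.]

    raw = 'café'.encode('utf-8')     # b'caf\xc3\xa9' as stored on disk

    raw.decode('iso8859-15')         # 'cafÃ©' - lossless, but unreadable
    assert raw.decode('iso8859-15').encode('iso8859-15') == raw  # round-trips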
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity.

--David
Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 11:49, Antoine Pitrou wrote:
| Paul Moore gmail.com> writes:
| >
| > I've yet to hear anyone claim that they would have an actual problem
| > with a specific piece of code they have written.
|
| Yep, that's the problem. Lots of theoretical problems noone has ever
| encountered brought up against a PEP which resolves some actual problems
| people encounter on a regular basis.
|
| For the record, I'm +1 on the PEP being accepted and implemented as soon
| as possible (preferably before 3.1).

I am also +1 on this.

I would like utility functions to perform:
  os-bytes->funny-encoded
  funny-encoded->os-bytes
or explicit example code snippets for same in the PEP text.

-- 
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

This person is currently undergoing electric shock therapy at Agnews Developmental Center in San Jose, California. All his opinions are static, please ignore him. Thank you, Nurse Ratched - the sig quote of Bob "Another beer, please" Christ
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote:
> On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
>> If you switch to iso8859-15 only in the presence of undecodable UTF-8,
>> then you have the same round-trip problem as the PEP: both b'\xff' and
>> b'\xc3\xbf' will be converted to u'\u00ff' without a way to
>> unambiguously recover the original file name.
>
> Why do you say that? It seems to work as I expected here:
>
> >>> '\xff'.decode('iso-8859-15')
> u'\xff'
> >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'
> >>> '\xff'.decode('cp1252')
> u'\xff'
> >>> '\xc3\xbf'.decode('cp1252')
> u'\xc3\xbf'

You're not showing that this is a fallback path. What won't work is first trying a local encoding (in the following example, utf-8) and then if that doesn't work, trying a one-byte encoding like iso8859-15:

  try:
      file1 = '\xff'.decode('utf-8')
  except UnicodeDecodeError:
      file1 = '\xff'.decode('iso-8859-15')
  print repr(file1)

  try:
      file2 = '\xc3\xbf'.decode('utf-8')
  except UnicodeDecodeError:
      file2 = '\xc3\xbf'.decode('iso-8859-15')
  print repr(file2)

That prints:

  u'\xff'
  u'\xff'

The two encodings can map different bytes to the same unicode code point so you can't do this type of thing without recording what encoding was used in the translation.

-Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB:
> Glenn Linderman wrote:
>> On approximately 4/28/2009 11:55 AM, came the following characters from
>> the keyboard of MRAB:
>>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>>> decodable.
>>
>> UTF-8 is only mentioned in the sense of having special handling for
>> re-encoding; all the other locales/encodings are implicit. But I also
>> went down that path to some extent.
>>
>>> But if you're talking about using it with other encodings, eg
>>> shift-jisx0213, then I'd suggest the following:
>>>
>>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>>> half surrogates U+DC00 to U+DCFF.
>>
>> This makes 256 different escape codes.
>
> Speaking personally, I won't call them 'escape codes'. I'd use the term
> 'escape code' to mean a character that changes the interpretation of the
> next character(s).

OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them? My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement. Anyway, I think we are communicating successfully.

>>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>>> are treated as though they are undecodable bytes.
>>
>> This provides escaping for the 256 different escape codes, which is
>> lacking from the PEP.
>>
>>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>>> are encoded to bytes 0x00 to 0xFF.
>>
>> This reverses the escaping.
>>
>>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>>> be produced by decoding raise an exception.
>>
>> This is confusing. Did you mean "excluding" instead of "including"?
>
> Perhaps I should've said "Any codepoint which can't be produced by
> decoding should raise an exception".

Yes, your rephrasing is clearer, regarding your intention.

> For example, decoding with UTF-8b will never produce U+DC00, therefore
> attempting to encode U+DC00 should raise an exception and not produce
> 0x00.

Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either.

>>> I think I've covered all the possibilities. :-)
>>
>> You might have. Seems like there could be a simpler scheme, though...
>>
>> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or
>> pretty much any defined Unicode codepoint outside the range U+0100 to
>> U+01FF (see rule 3 for why). Only one escape codepoint is needed, this
>> is easier for humans to comprehend.
>>
>> 2. When the escape codepoint is decoded from the byte stream for a bytes
>> interface or found in a str on the str interface, double it.
>>
>> 3. When an undecodable byte 0xPQ is found, decode to the escape
>> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>>
>> 4. When encoding, a sequence of two escape codepoints would be encoded
>> as one escape codepoint, and a sequence of the escape codepoint followed
>> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not
>> followed by the escape codepoint, or by a codepoint in the range U+0100
>> to U+01FF would raise an exception.
>>
>> 5. Provide functions that will perform the same decoding and encoding as
>> would be done by the system calls, for both bytes and str interfaces.
>>
>> This differs from my previous proposal in three ways:
>>
>> A. Doesn't put a marker at the beginning of the string (which I said
>> wasn't necessary even then).
>>
>> B. Allows for a choice of escape codepoint, the previous proposal
>> suggested a specific one. But the final solution will only have a single
>> one, not a user choice, but an implementation choice.
>>
>> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
>> U+0000 to U+00FF. This avoids introducing the NULL character and escape
>> characters into the decoded str representation, yet still uses
>> characters for which glyphs are commonly available, are non-combining,
>> and are easily distinguishable one from another.
>>
>> Rationale: The use of codepoints with visible glyphs makes the escaped
>> string friendlier to display systems, and to people. I still recommend
>> using U+003F as the escape codepoint, but certainly one with a typically
>> visible glyph available. This avoids what I consider to be an annoyance
>> with the PEP, that the codepoints used are not ones that are easily
>> displayed, so undecodable names could easily result in long strings of
>> indistinguishable substitution characters.
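[A minimal sketch of the simpler scheme described above, assuming '?' (U+003F) as the escape codepoint and an ASCII-compatible locale encoding such as UTF-8. It leans on the surrogateescape handler - the name under which the PEP's python-escape later shipped - as the intermediate step, so only bytes 0x80-0xFF appear escaped, as U+0180-U+01FF. The function names are hypothetical.]

    ESC = '?'  # U+003F, the suggested escape codepoint (any visible glyph works)

    def glenn_decode(data: bytes, encoding: str = 'utf-8') -> str:
        """bytes -> displayable funny-encoded str (rules 1-3 above)."""
        # surrogateescape maps each undecodable byte 0xPQ to U+DCPQ first.
        s = data.decode(encoding, 'surrogateescape')
        # Rule 2: double every escape codepoint already present in the text.
        s = s.replace(ESC, ESC + ESC)
        # Rule 3: rewrite each escaped byte 0xPQ as ESC followed by U+01PQ.
        return ''.join(
            ESC + chr(0x100 + ord(c) - 0xDC00) if '\udc80' <= c <= '\udcff' else c
            for c in s)

    def glenn_encode(name: str, encoding: str = 'utf-8') -> bytes:
        """displayable funny-encoded str -> bytes (rule 4 above)."""
        out, i = [], 0
        while i < len(name):
            c = name[i]
            if c == ESC:
                i += 1
                if i == len(name):
                    raise ValueError('truncated escape sequence')
                nxt = name[i]
                if nxt == ESC:                      # ESC ESC -> ESC
                    out.append(ESC)
                elif 0x180 <= ord(nxt) <= 0x1FF:    # ESC U+01PQ -> byte 0xPQ
                    # (bytes 0x00-0x7F are always decodable here, so only
                    # U+0180-U+01FF ever occur)
                    out.append(chr(0xDC00 + ord(nxt) - 0x100))
                else:
                    raise ValueError('invalid escape sequence')
            else:
                out.append(c)
            i += 1
        return ''.join(out).encode(encoding, 'surrogateescape')

    # An undecodable byte becomes the visible pair '?\u01ff' instead of a
    # lone surrogate, and a literal '?' in the name survives as '??':
    assert glenn_decode(b'a?b\xff') == 'a??b?\u01ff'
    assert glenn_encode('a??b?\u01ff') == b'a?b\xff'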
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving).

On 27Apr2009 23:52, Glenn Linderman wrote:
> On approximately 4/27/2009 7:11 PM, came the following characters from
> the keyboard of Cameron Simpson:
[...]
>> There may be puns. So what? Use the right strings for the right purpose
>> and all will be well.
>>
>> I think what is missing here, and missing from Martin's PEP, is some
>> utility functions for the os.* namespace.
>>
>> PROPOSAL: add to the PEP the following functions:
>>
>>   os.fsdecode(bytes) -> funny-encoded Unicode
>> This is what os.listdir() does to produce the strings it hands out.
>>   os.fsencode(funny-string) -> bytes
>> This is what open(filename,..) does to turn the filename into bytes
>> for the POSIX open.
>>   os.pathencode(your-string) -> funny-encoded-Unicode
>> This is what you must do to a de novo string to turn it into a
>> string suitable for use by open. Importantly, for most strings not
>> hand crafted to have weird sequences in them, it is a no-op. But it
>> will recode your puns for survival.
[...]
>>> So assume a non-decodable sequence in a name. That puts us into
>>> Martin's funny-decode scheme. His funny-decode scheme produces a
>>> bare string, indistinguishable from a bare string that would be
>>> produced by a str API that happens to contain that same sequence.
>>> Data puns.
>>
>> See my proposal above. Does it address your concerns? A program still
>> must know the provenance of the string, and _if_ you're working with
>> non-decodable sequences in names then you should transmute them into
>> the funny encoding using the os.pathencode() function described above.
>>
>> In this way the punning issue can be avoided.
>> _Lacking_ such a function, your punning concern is valid.
>
> Seems like one would also desire os.pathdecode to do the reverse.

Yes.

> And also versions that take or produce bytes from funny-encoded strings.

Isn't that the first two functions above?

> Then, if programs were re-coded to perform these transformations on what
> you call de novo strings, then the scheme would work.
>
> But I think a large part of the incentive for the PEP is to try to
> invent a scheme that intentionally allows for the puns, so that programs
> do not need to be recoded in this manner, and yet still work. I don't
> think such a scheme exists.

I agree no such scheme exists. I don't think it can, just using strings.

But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such.

The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.

I consider the fact that well-formed Unicode -> funny-encoded is a no-op to be an enormous feature of Martin's scheme. Unless I'm missing something, there _are_no_puns_ between funny-encoded strings and well formed unicode strings.

>> I suppose if your program carefully constructs a unicode string riddled
>> with half-surrogates etc and imagines something specific should happen
>> to them on the way to being POSIX bytes then you might have a problem...
>
> Right. Or someone else's program does that.

I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a coffee, reading section 3.9 (Unicode Encoding Forms). I now do not believe your scenario makes sense.

Someone can construct a Python3 string containing code points that includes surrogates. Granted. However such a string is not meaningful because it is not well-formed (D85). It's ill-formed (D84). It is not sane to expect it to translate into a POSIX byte sequence, be it UTF-8 or anything else, unless it is accompanied by some kind of explicit mapping provided by the programmer. Absent that mapping, it's nonsense in much the same way that a non-decodable UTF-8 byte sequence is nonsense. For example, Martin's funny-encoding is such an explicit mapping.

> I only want to use Unicode file names. But if those other file names
> exist, I want to be able to access them, and not accidentally get a
> different file.

But those other
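[A small check of the no-op property Cameron highlights, assuming UTF-8 as the locale encoding and surrogateescape (the name the PEP's python-escape later shipped under) as the funny-encoding.]

    name = 'abc-café'   # well-formed Unicode: scalar values only

    # For any well-formed string, the funny encoding changes nothing:
    assert name.encode('utf-8', 'surrogateescape') == name.encode('utf-8')

    # Only funny-decoded (ill-formed) strings take the escape path:
    odd = b'abc\xff'.decode('utf-8', 'surrogateescape')   # 'abc\udcff'
    assert odd.encode('utf-8', 'surrogateescape') == b'abc\xff'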
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis:
> Glenn Linderman wrote:
>> On approximately 4/28/2009 1:25 PM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>>> The UTF-8b representation suffers from the same potential ambiguities
>>>> as the PUA characters...
>>>
>>> Not at all the same ambiguities. Here, again, the two choices:
>>>
>>> A. use PUA characters to represent undecodable bytes, in particular for
>>> UTF-8 (the PEP actually never proposed this to happen).
>>> This introduces an ambiguity: two different files in the same
>>> directory may decode to the same string name, if one has the PUA
>>> character, and the other has a non-decodable byte that gets decoded
>>> to the same PUA character.
>>>
>>> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
>>> The same ambiguity does *NOT* exist. If a file on disk already
>>> contains an invalid surrogate code in its file name, then the UTF-8b
>>> decoder will recognize this as invalid, and decode it byte-for-byte,
>>> into three surrogate codes. Hence, the file names that are different
>>> on disk are also different in memory. No ambiguity.
>>
>> C. File on disk with the invalid surrogate code, accessed via the str
>> interface, no decoding happens, matches in memory the file on disk with
>> the byte that translates to the same surrogate, accessed via the bytes
>> interface. Ambiguity.
>
> Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used.

On a Windows system, perhaps the ambiguous case would be the use of the str API and bytes APIs producing different memory names for the same file that contains a (Unicode-illegal) half surrogate. The half-surrogate would seem to get decoded to 3 half surrogates if accessed via the bytes interface, but only one via the str interface. The version with 3 half surrogates could match another name that actually contains 3 half surrogates, that is accessed via the str interface.

I can't actually tell by reading the PEP whether it affects Windows bytes interfaces or is only implemented on POSIX, so that POSIX has a str interface. If it is only implemented on POSIX, then the current scheme (now escaping the hundreds of escape codes) could work, within a single platform... but it would still suffer from displaying garbage (sequences of replacement characters) in file listings displayed or printed. There is no way, once the string is adjusted to contain replacement characters for display, to distinguish one file name from another, if they are identical except for a same-length sequence of different undecodable bytes.

The concept of a function that allows the same decoding and encoding process for 3rd party interfaces is still missing from the PEP; implementation of the PEP would require that all interfaces to 3rd party software that accept file names would have to be transcoded by the interface layer. Or else such software would have to use the bytes interfaces directly, and if they do, there is no need for the PEP.

So I see the PEP as a partial solution to a limited problem, that on the one hand potentially produces indistinguishable sequences of replacement characters in filenames, rather than the mojibake (which is at least distinguishable), and on the other hand, doesn't help software that also uses 3rd party libraries to avoid the use of bytes APIs for accessing file names.

There are other encodings that produce more distinguishable mojibake, and would work in the same situations as the PEP.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383 (again)
Thomas Breuel gmail.com> writes:
>
> And, in fact, Windows Vista happily creates files with malformed UTF-16
> encodings, and os.listdir() happily returns them.

The PEP won't change that, so what's the problem exactly?

> Under your proposal, passing the output from a correctly implemented
> file system or other OS function to a correctly written library using
> unicode strings may crash Python.

That's a very dishonest formulation. It cannot crash Python; it can only crash hypothetical third-party programs or libraries with deficient error checking and unreasonable assumptions about input data.

(and, of course, you haven't even proven those programs or libraries exist)

Antoine.
Re: [Python-Dev] PEP 383 (again)
> On Windows, the Wide APIs are already used throughout the code base,
> e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the
> specific API for a specific functionality, please read the source code.
[...]
> No, I don't assume that. I assume that all functions are strictly
> available in a Wide character version, and have verified that they are.

The wide APIs use UTF-16. UTF-16 suffers from the same problem as UTF-8: not all sequences of words are valid UTF-16 sequences. In particular, sequences containing isolated surrogates are not well-formed according to the Unicode standard. Therefore, the existence of a wide character API function does not guarantee that the wide character strings it returns can be converted into valid unicode strings. And, in fact, Windows Vista happily creates files with malformed UTF-16 encodings, and os.listdir() happily returns them.

> If you can crash Python that way, nothing gets worse by this PEP - you
> can then *already* crash Python in that way.

Yes, but AFAIK, Python does not currently have functions that, as part of correct usage and normal operation, are intended to generate malformed unicode strings. Under your proposal, passing the output from a correctly implemented file system or other OS function to a correctly written library using unicode strings may crash Python. In order to avoid that, every library that's built into Python would have to be checked and updated to deal with both the Unicode standard and your extension to it.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities
>>> as the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>> UTF-8 (the PEP actually never proposed this to happen).
>> This introduces an ambiguity: two different files in the same
>> directory may decode to the same string name, if one has the PUA
>> character, and the other has a non-decodable byte that gets decoded
>> to the same PUA character.
>>
>> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
>> The same ambiguity does *NOT* exist. If a file on disk already
>> contains an invalid surrogate code in its file name, then the UTF-8b
>> decoder will recognize this as invalid, and decode it byte-for-byte,
>> into three surrogate codes. Hence, the file names that are different
>> on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Is that an alternative to A and B?

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
> On approximately 4/28/2009 11:55 AM, came the following characters from
> the keyboard of MRAB:
>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>> decodable.
>
> UTF-8 is only mentioned in the sense of having special handling for
> re-encoding; all the other locales/encodings are implicit. But I also
> went down that path to some extent.
>
>> But if you're talking about using it with other encodings, eg
>> shift-jisx0213, then I'd suggest the following:
>>
>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>> half surrogates U+DC00 to U+DCFF.
>
> This makes 256 different escape codes.

Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s).

>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>> are treated as though they are undecodable bytes.
>
> This provides escaping for the 256 different escape codes, which is
> lacking from the PEP.
>
>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>> are encoded to bytes 0x00 to 0xFF.
>
> This reverses the escaping.
>
>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>> be produced by decoding raise an exception.
>
> This is confusing. Did you mean "excluding" instead of "including"?

Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception".

For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00.

>> I think I've covered all the possibilities. :-)
>
> You might have. Seems like there could be a simpler scheme, though...
>
> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or
> pretty much any defined Unicode codepoint outside the range U+0100 to
> U+01FF (see rule 3 for why). Only one escape codepoint is needed, this
> is easier for humans to comprehend.
>
> 2. When the escape codepoint is decoded from the byte stream for a bytes
> interface or found in a str on the str interface, double it.
>
> 3. When an undecodable byte 0xPQ is found, decode to the escape
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>
> 4. When encoding, a sequence of two escape codepoints would be encoded
> as one escape codepoint, and a sequence of the escape codepoint followed
> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not
> followed by the escape codepoint, or by a codepoint in the range U+0100
> to U+01FF would raise an exception.
>
> 5. Provide functions that will perform the same decoding and encoding as
> would be done by the system calls, for both bytes and str interfaces.
>
> This differs from my previous proposal in three ways:
>
> A. Doesn't put a marker at the beginning of the string (which I said
> wasn't necessary even then).
>
> B. Allows for a choice of escape codepoint, the previous proposal
> suggested a specific one. But the final solution will only have a single
> one, not a user choice, but an implementation choice.
>
> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
> U+0000 to U+00FF. This avoids introducing the NULL character and escape
> characters into the decoded str representation, yet still uses
> characters for which glyphs are commonly available, are non-combining,
> and are easily distinguishable one from another.
>
> Rationale: The use of codepoints with visible glyphs makes the escaped
> string friendlier to display systems, and to people. I still recommend
> using U+003F as the escape codepoint, but certainly one with a typically
> visible glyph available. This avoids what I consider to be an annoyance
> with the PEP, that the codepoints used are not ones that are easily
> displayed, so undecodable names could easily result in long strings of
> indistinguishable substitution characters.

Perhaps the escape character should be U+005C. ;-)

> It, like MRAB's proposal, also avoids data puns, which is a major
> problem with the PEP. I consider this proposal to be easier to
> understand than MRAB's proposal, or the PEP, because of the single
> escape codepoint and the use of visible characters. This proposal, like
> my initial one, also decodes and encodes (just the escape codes) values
> on the str interfaces. This is necessary to avoid data puns on systems
> that provide both types of interfaces. This proposal could be used for
> programs that use str values, and easily migrates to a solution that
> provides an object that provides an abstraction for system interfaces
> that have two forms.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Others have made this suggestion, and it is helpful to the PEP, but not
> sufficient. As implemented as an error handler, I'm not sure that the
> b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8
> decoder is happy with it. Which, in my testing, it is.

Rest assured that the utf-8b codec will work the way it is specified.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis:
>> The UTF-8b representation suffers from the same potential ambiguities
>> as the PUA characters...
>
> Not at all the same ambiguities. Here, again, the two choices:
>
> A. use PUA characters to represent undecodable bytes, in particular for
> UTF-8 (the PEP actually never proposed this to happen).
> This introduces an ambiguity: two different files in the same
> directory may decode to the same string name, if one has the PUA
> character, and the other has a non-decodable byte that gets decoded
> to the same PUA character.
>
> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
> The same ambiguity does *NOT* exist. If a file on disk already
> contains an invalid surrogate code in its file name, then the UTF-8b
> decoder will recognize this as invalid, and decode it byte-for-byte,
> into three surrogate codes. Hence, the file names that are different
> on disk are also different in memory. No ambiguity.

C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico:
> 2009/4/28 Glenn Linderman:
>> The switch from PUA to half-surrogates does not resolve the issues with
>> the encoding not being a 1-to-1 mapping, though. The very fact that you
>> think you can get away with use of lone surrogates means that other
>> people might, accidentally or intentionally, also use lone surrogates
>> for some other purpose. Even in file names.
>
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not
> a valid Unicode character (not a character at all, really) and the only
> way you can put this in a POSIX filename is if you use a very lenient
> UTF-8 encoder that gives you b'\xed\xb3\xbf'.

Wrong. An 8859-1 locale allows any byte sequence to be placed into a POSIX filename. And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF.

> Since this byte sequence doesn't represent a valid character when
> decoded with UTF-8, it should simply be considered an invalid UTF-8
> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
> '\udcff'). Martin: maybe the PEP should say this explicitly?
>
> Note that the round-trip works without ambiguities between '\udcff' in
> the filename:
>
>   b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'
>
> and b'\xff' in the filename, decoded by Python to '\udcff':
>
>   b'\xff' -> '\udcff' -> b'\xff'

Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
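[For reference, this is how the codec was eventually specified and implemented: the handler shipped as surrogateescape, and the strict UTF-8 codec was tightened to reject encoded surrogates, so the escape path does trigger. Checkable in any current Python 3:]

    # b'\xed\xb3\xbf' would be the (ill-formed) UTF-8 encoding of U+DCFF.
    # The strict decoder rejects it, so the handler escapes each byte:
    assert b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape') == '\udced\udcb3\udcbf'
    assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'

    # Both round-trip without ambiguity:
    assert '\udced\udcb3\udcbf'.encode('utf-8', 'surrogateescape') == b'\xed\xb3\xbf'
    assert '\udcff'.encode('utf-8', 'surrogateescape') == b'\xff'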
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, e.g. shift-jisx0213, then I'd suggest the following:
1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes.
2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP.
3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping.
4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though...
1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, which is easier for humans to comprehend.
2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it.
3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF, would raise an exception.
5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces.
This differs from my previous proposal in three ways:
A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then).
B. Allows for a choice of escape codepoint; the previous proposal suggested a specific one. But the final solution will only have a single one: not a user choice, but an implementation choice.
C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another.
Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. 
I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
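A rough sketch of Glenn's five rules, restricted to single-byte locales so the decode loop stays simple (ESC, the helper names, and the use of the ascii codec in the check at the end are illustrative assumptions, not part of any proposed API):

ESC = '?'   # U+003F, Glenn's recommended escape codepoint

def decode_name(raw, encoding):
    # Bytes -> str with escaping (rules 1-3).
    out = []
    for b in raw:
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            out.append(ESC + chr(0x0100 + b))        # rule 3: ESC then U+01PQ
            continue
        out.append(ESC + ESC if ch == ESC else ch)   # rule 2: double a real ESC
    return ''.join(out)

def encode_name(s, encoding):
    # str -> bytes, reversing the escaping (rule 4).
    out, i = bytearray(), 0
    while i < len(s):
        ch = s[i]
        if ch != ESC:
            out += ch.encode(encoding)
            i += 1
        elif i + 1 < len(s) and s[i + 1] == ESC:     # doubled ESC -> literal ESC
            out += ESC.encode(encoding)
            i += 2
        elif i + 1 < len(s) and 0x0100 <= ord(s[i + 1]) <= 0x01FF:
            out.append(ord(s[i + 1]) - 0x0100)       # ESC, U+01PQ -> byte 0xPQ
            i += 2
        else:
            raise ValueError('ill-formed escape at index %d' % i)
    return bytes(out)

# Rule 5 in miniature: an undecodable byte and a literal '?' both round-trip.
assert encode_name(decode_name(b'ab\xff?', 'ascii'), 'ascii') == b'ab\xff?'

Note the cost Glenn accepts in exchange for visible glyphs: unlike the PEP's half surrogates, '?' and U+0100-U+01FF are ordinary characters, so every pre-existing '?' in every name has to be escaped as well.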
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The UTF-8b representation suffers from the same potential ambiguities as > the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the bytes with ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
MRAB wrote: > Martin v. Löwis wrote: >>> Furthermore, I don't believe that PEP 383 works consistently on Windows, >> >> What makes you say that? PEP 383 will have no effect on Windows, >> compared to the status quo, whatsoever. >> > You could argue that if Windows is actually returning UTF-16 with half > surrogates that they should be altered to conform to what UTF-8 would > have returned. Perhaps - but this is not what the PEP specifies (and intentionally so). Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> Your proposal says that utf-8b would be used for file systems, but then > you also say that it might be used for command line arguments and > environment variables. So, which specific APIs will it be used with on > Windows and on POSIX systems? On Windows, the Wide APIs are already used throughout the code base, e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the specific API for a specific functionality, please read the source code. > Or will utf-8b simply not be available > on Windows at all? It will be available, but it won't be used automatically for anything. > What happens if I create a Python version of tar, > utf-8b strings slip in there, and I try to use them on Windows? No need to create it - the tarfile module is already there. By "in there", do you mean on the file system, or in the tarfile? > You also assume that all Windows file system functions strictly conform > to UTF-16 in practice (not just on paper). Have you verified that? No, I don't assume that. I assume that all functions are strictly available in a Wide character version, and have verified that they are. > What's the situation on Windows CE? I can't see how this question is relevant to the PEP. The PEP says this: # On Windows, Python uses the wide character APIs to access # character-oriented APIs, allowing direct conversion of the # environmental data to Python str objects. This is what it already does, and this is what it will continue to do. > Another question on Linux: what happens when I decode a file system path > with utf-8b and then pass the resulting unicode string to Gnome? To > Qt? You probably get moji-bake, or an error, I didn't try. > To windows.forms? To Java? How do you do that, on Linux? > To a unicode regular expression library? You mean, SRE? SRE will match the code points as individual characters, class Cs. You should have been able to find out that for yourself. > To wprintf? Depends on the wprintf implementation. > AFAIK, the behavior of most libraries is > undefined for the kinds of unicode strings you construct, and it may be > undefined in a bad way (crash, buffer overflow, whatever). Indeed so. This is intentional. If you can crash Python that way, nothing gets worse by this PEP - you can then *already* crash Python in that way. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
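Martin's SRE remark is easy to verify: lone surrogates carry Unicode category Cs, and SRE simply treats them as one more code point (session run on a current CPython; 2009-era builds are assumed to behave the same):

>>> import unicodedata, re
>>> unicodedata.category('\udcff')
'Cs'
>>> re.findall('.', 'a\udcffb')
['a', '\udcff', 'b']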
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Apr 28, 2009, at 13:01 PM, Thomas Breuel wrote: (2) Should the default UTF-8 encoder for file system operations be allowed to generate illegal byte sequences? I think that's a definite no; if I set the encoding for a device to UTF-8, I never want Python to try to write illegal UTF-8 strings to my device. ... If people really want the option of (3c), then I think encoders related to the file system should by default reject those strings as illegal because the potential problems from writing them are just too serious. Printing routines and UI routines could display them without error (but some clear indication), of course. For what it is worth, sometimes we have to write bytes to a POSIX filesystem even though those bytes are not the encoding of any string in the filesystem's "alleged encoding". The reason is that it is common for there to be filenames which are not the encodings of anything in the filesystem's alleged encoding, and the user expects my tool (Tahoe-LAFS [1]) to copy that name to a distributed storage grid and then copy it back unchanged. Even though, I re-iterate, that name is *not* a valid encoding of anything in the current encoding. This doesn't argue that this behavior has to be the *default* behavior, but it is sometimes necessary. It's too bad that POSIX is so far behind Mac OS X in this respect. (Also so far behind Windows, but I use Mac as the example to show how it is possible to build a better system on top of POSIX.) Hopefully David Wheeler's proposals to tighten the requirements in Linux filesystems will catch on: [2]. Regards, Zooko [1] http://allmydata.org [2] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 20:45, "Martin v. Löwis" wrote: > > Furthermore, I don't believe that PEP 383 works consistently on Windows, > > What makes you say that? PEP 383 will have no effect on Windows, > compared to the status quo, whatsoever. > That's what you believe, but it's not clear to me that that follows from your proposal. Your proposal says that utf-8b would be used for file systems, but then you also say that it might be used for command line arguments and environment variables. So, which specific APIs will it be used with on Windows and on POSIX systems? Or will utf-8b simply not be available on Windows at all? What happens if I create a Python version of tar, utf-8b strings slip in there, and I try to use them on Windows? You also assume that all Windows file system functions strictly conform to UTF-16 in practice (not just on paper). Have you verified that? It certainly isn't true across all versions of Windows (since NT originally used UCS-2). What's the situation on Windows CE? Another question on Linux: what happens when I decode a file system path with utf-8b and then pass the resulting unicode string to Gnome? To Qt? To windows.forms? To Java? To a unicode regular expression library? To wprintf? AFAIK, the behavior of most libraries is undefined for the kinds of unicode strings you construct, and it may be undefined in a bad way (crash, buffer overflow, whatever). Tom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents). A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. 
On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify. Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to: If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified. iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline).
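A minimal sketch of the handler James describes; the registered name is invented, the encode leg needs a Python where an encode error handler may return bytes (a change PEP 383 itself introduces), and the pass-through of bytes 0x00-0x7F is James's extension, not something the PEP proposes:

import codecs

def _escape(exc):
    if isinstance(exc, UnicodeDecodeError):
        # Decode leg: 0x80-0xFF -> U+DC80-U+DCFF; 0x00-0x7F pass through
        # as U+0000-U+007F (James's rule for overlapping trail bytes).
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(0xDC00 + b) if b >= 0x80 else chr(b)
                       for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encode leg: only U+DC80-U+DCFF may turn back into raw bytes.
        chunk = exc.object[exc.start:exc.end]
        if all(0xDC80 <= ord(c) <= 0xDCFF for c in chunk):
            return bytes(ord(c) - 0xDC00 for c in chunk), exc.end
    raise exc

codecs.register_error('python-escape-sketch', _escape)

assert b'ab\xff'.decode('utf-8', 'python-escape-sketch') == 'ab\udcff'
assert 'ab\udcff'.encode('utf-8', 'python-escape-sketch') == b'ab\xff'

Keeping 0x00-0x7F (in particular 0x2F, '/') out of the surrogate mapping on the encode side is exactly the security point James and Martin settle above.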
Re: [Python-Dev] PEP 383 (again)
Martin v. Löwis wrote: Furthermore, I don't believe that PEP 383 works consistently on Windows, What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever. You could argue that if Windows is actually returning UTF-16 with half surrogates that they should be altered to conform to what UTF-8 would have returned. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? For what it is worth, what we have previously planned to do for the Tahoe project is the second of these -- decode using some 1-byte encoding such as iso-8859-1, iso-8859-15, or windows-1252 only in the case that attempting to decode the bytes using the local alleged encoding failed. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here:
>>> '\xff'.decode('iso-8859-15')
u'\xff'
>>> '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'
>>> '\xff'.decode('cp1252')
u'\xff'
>>> '\xc3\xbf'.decode('cp1252')
u'\xc3\xbf'
Regards, Zooko ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
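The collision Hrvoje means only appears when the two names are decoded by *different* codecs, which is what the fall-back strategy does; Zooko's session decodes both names with the same 1-byte codec and so never exercises the fallback. With the fallback in play (Python 2 syntax, to match the session above):

>>> '\xc3\xbf'.decode('utf-8')       # valid UTF-8, fallback never reached
u'\xff'
>>> '\xff'.decode('iso-8859-15')     # undecodable as UTF-8, falls back
u'\xff'

Two different byte strings, one resulting text u'\xff': handing that text back to open() can reach only one of the two files.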
[Python-Dev] a suggestion ... Re: PEP 383 (again)
I think we should break up this problem into several parts: (1) Should the default UTF-8 decoder fail if it gets an illegal byte sequence? It's probably OK for the default decoder to be lenient in some way (see below). (2) Should the default UTF-8 encoder for file system operations be allowed to generate illegal byte sequences? I think that's a definite no; if I set the encoding for a device to UTF-8, I never want Python to try to write illegal UTF-8 strings to my device. (3) What kind of representation should the UTF-8 decoder return for illegal inputs? There are actually several choices: (a) it could guess what the actual encoding is and use that, (b) it could return a valid unicode string that indicates the illegal characters but does not re-encode to the original byte sequence, or (c) it could return some kind of non-standard representation that encodes back into the original byte sequence. PEP 383 violated (2), and I think that's a bad thing. I think the best solution would be to use (3a) and fall back to (3b) if that doesn't work. If people try to write those strings, they will always get written as correctly encoded UTF-8 strings. If people really want the option of (3c), then I think encoders related to the file system should by default reject those strings as illegal because the potential problems from writing them are just too serious. Printing routines and UI routines could display them without error (but some clear indication), of course. There is yet another option, which is arguably the "right" one: make the results of os.listdir() subclasses of string that keep track of where they came from. If you write back to the same device, it just writes the same byte sequence. But if you write to other devices and the byte sequence is illegal according to its encoding, you get an error. Tom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
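Tom's last option can be sketched as a str subclass; every name here (PathString, raw, device, encode_for) is an invented illustration, not an existing API:

class PathString(str):
    # What a hypothetical origin-tracking os.listdir() might return.
    def __new__(cls, text, raw, device):
        self = super().__new__(cls, text)
        self.raw = raw          # exact byte sequence read from the device
        self.device = device    # identifies the filesystem it came from
        return self

    def encode_for(self, device, encoding):
        if device == self.device:
            return self.raw     # same device: write the original bytes back
        # Different device: the name must be legal in the target encoding,
        # so encode strictly and let an illegal name fail loudly here.
        return str(self).encode(encoding)

A copy from an ISO8859-15 disk to a UTF-8 disk would then raise in encode_for() instead of silently writing an illegal byte sequence.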
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. 
In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. I think I've covered all the possibilities. :-) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:00 AM, came the following characters from the keyboard of Martin v. Löwis: An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... perhaps slightly less likely in practice, due to the use of Unicode-illegal characters, but exactly the same theoretical likelihood in the space of Python-acceptable character codes. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> Furthermore, I don't believe that PEP 383 works consistently on Windows, What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> > However, it is "mission creep": Martin didn't volunteer to > write a PEP for it, he volunteered to write a PEP to solve the > "roundtrip the value of os.listdir()" problem. And he succeeded, up > to some minor details. Yes, it solves that problem. But that doesn't come without cost. Most importantly, now Python writes illegal UTF-8 strings even if the user chose a UTF-8 encoding. That means that illegal UTF-8 encodings can propagate anywhere, without warning. Furthermore, I don't believe that PEP 383 works consistently on Windows, and it causes programs to behave differently in unintuitive ways on Windows and Linux. I'll suggest an alternative in a separate message. Tom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. 
In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. James ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> If the PEP depends on this being changed, it should be mentioned in the > PEP. The PEP says that the utf-8b codec decodes invalid bytes into low surrogates. I have now clarified that a strict definition of UTF-8 is assumed for utf-8b. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Since the serialization of the Unicode string is likely to use UTF-8, > and the string for such a file will include half surrogates, the > application may raise an exception when encoding the names for a > configuration file. These encoding exceptions will be as rare as the > unusual names (which the careful I18N aware developer has probably > eradicated from his system), and thus will appear late. There are trade-offs to any solution; if there was a solution without trade-offs, it would be implemented already. The Python UTF-8 codec will happily encode half-surrogates; people argue that it is a bug that it does so, however, it would help in this specific case. An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. > Or say de/serialization succeeds. Since the resulting Unicode string > differs depending on the encoding (which is a good thing; it is > supposed to make most cases mostly readable), when the filesystem > encoding changes (say from legacy to UTF-8), the "name" changes, and > deserialized references to it become stale. That problem has nothing to do with the PEP. If the encoding changes, LRU entries may get stale even if there were no encoding errors at all. Suppose the old encoding was Latin-1, and the new encoding is KOI8-R, then all file names are decodable before and afterwards, yet the string representation changes. Applications that want to protect themselves against that happening need to store byte representations of the file names, not character representations. Depending on the configuration file format, that may or may not be possible. I find the case pretty artificial, though: if the locale encoding changes, all file names will look incorrect to the user, so he'll quickly switch back, or rename all the files. As an application supporting a LRU list, I would remove/hide all entries that don't correlate to existing files - after all, the user may have as well deleted the file in the LRU list. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
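One way to follow Martin's "store byte representations" advice even in a text-based configuration format; a sketch assuming a UTF-8 locale and the handler the PEP proposes ('surrogateescape' in released Pythons), with invented helper names:

import base64

def mru_entry(filename):
    # Persist the exact on-disk bytes, not the locale-dependent text.
    raw = filename.encode('utf-8', 'surrogateescape')
    return base64.b64encode(raw).decode('ascii')

def mru_filename(entry):
    # Re-decode with whatever codec the current locale dictates; UTF-8 here.
    return base64.b64decode(entry).decode('utf-8', 'surrogateescape')

The base64 leg keeps the configuration file valid text no matter which bytes the name contains, and a locale switch changes only how the recovered bytes are rendered, never which file they refer to.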
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is > not a valid Unicode character (not a character at all, really) and the > only way you can put this in a POSIX filename is if you use a very > lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. > > Since this byte sequence doesn't represent a valid character when > decoded with UTF-8, it should simply be considered an invalid UTF-8 > sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* > '\udcff'). > > Martin: maybe the PEP should say this explicitly? Sure, will do. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> If we follow your approach, that ISO8859-15 string will get turned into > an escaped unicode string inside Python. If I understand your proposal > correctly, if it's a output file name and gets passed to Python's open > function, Python will then decode that string and end up with an > ISO8859-15 byte sequence, which it will write to disk literally, even if > the encoding for the system is UTF-8. That's the wrong thing to do. I don't think anything can, or should be, done about that. If you had byte-oriented interfaces (as you do in 2.x), exactly the same thing will happen: the name of the file will be the very same byte sequence as the one passed on the command line. Most Unix users here agree that this is the right thing to happen. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
Thomas Breuel writes: > PEP 383 doesn't make it any easier; it just turns one set of > problems into another. That's false. There is an interesting class of problems of the form "get a list of names from the OS and allow the user to select from it, and retrieve corresponding content." People are *very* often able to decode complete gibberish, as long as it's the only gibberish in a list. Ditto partial gibberish. In that case, PEP 383 allows the content retrieval operation to complete. There are probably other problems that this PEP solves. > Actually, it makes it worse, Again, it gives you different problems, which may be better and may be worse according to the user's requirements. Currently, you often get an exception, and running the program again is no help. The user must clean up the list to make progress. This may or may not be within the user's capacity (eg, read-only media). > since any problems that show up now show up far from the source of > the problem, and since it can lead to security problems and/or data > loss. Yes. This is a point I have been at pains to argue elsewhere in this thread. However, it is "mission creep": Martin didn't volunteer to write a PEP for it, he volunteered to write a PEP to solve the "roundtrip the value of os.listdir()" problem. And he succeeded, up to some minor details. > The problem may well be with the program using the wrong encodings or > incorrectly ignoring encoding information. Furthermore, even if it is user > error, the program needs to validate its inputs and put up a meaningful > error message, not mangle the disk. To detect such program bugs, it's > important that when Python detects an incorrect encoding that it doesn't > quietly continue with an incorrect string. I agree. Guido, however, responded that "Practicality beats purity" to a similar point in the PEP 263 discussion. Be aware that you're fighting an uphill battle here. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore writes: > But it seems to me that there is an assumption that problems will > arise when code gets a potentially funny-decoded string and doesn't > know where it came from. > > Is that a real concern? Yes, it's a real concern. I don't think it's possible to show a small piece of code one could point at and say "without a better API I bet you can't write this correctly," though. Rather, my experience with Emacs and various mail packages is that without type information it is impossible to keep track of the myriad bits and pieces of text that are recombining like pig flu, and eventually one breaks out and causes an error. It's usually easy to fix, but so are the next hundred similar regressions, and in the meantime a hundred users have suffered more or less damage or at least annoyance. There's no question that dealing with escapes of funny-decoded strings to uprepared code paths is mission creep compared to Martin's stated purpose for PEP 383, but it is also a real problem. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull wrote: > Nobody said we were at the stage of *saving* the [attachment]! But speaking of saving files, I think that's the biggest hole in this that has been nagging at the back of my mind. This PEP intends to allow easy access to filenames and other environment strings which are not restricted to known encodings. What happens if the detected encoding changes? There may be difficulties de/serializing these names, such as for an MRU list. Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the unusual names (which the careful I18N aware developer has probably eradicated from his system), and thus will appear late. Or say de/serialization succeeds. Since the resulting Unicode string differs depending on the encoding (which is a good thing; it is supposed to make most cases mostly readable), when the filesystem encoding changes (say from legacy to UTF-8), the "name" changes, and deserialized references to it become stale. This can probably be handled through careful use of the same encoding/decoding scheme, if relevant, but that sounds like we've just moved the problem from fs/environment access to serialization. Is that good enough? For other uses the API knew whether it was environmentally aware, but serialization probably will not. Should this PEP make recommendations about how to save filenames in configuration files? -- Michael Urman ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
Hrvoje Niksic wrote: > Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8 > sequence, will be converted to the half-surrogate '\udcff'. However, > a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be > converted to '\udcff'. Those are quite different POSIX pathnames; how > will Python know which one it was when I later pass '\udcff' to > open()? > > > [1] > I'm assuming that it's valid UTF8 because it passes through Python > 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 > expert. I'm not a UTF-8 expert either, but I got bitten by this yesterday. I was uploading a file to a Google Search Appliance and it was rejected as invalid UTF-8 despite having been encoded into UTF-8 by Python. The cause was a byte sequence which decoded to a half surrogate similar to your example above. Python will happily decode and encode such sequences, but as I found to my cost other systems reject them. Reading wikipedia implies that Python is wrong to accept these sequences and I think (though I'm not a lawyer) that RFC 3629 also implies this: "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters." and "Implementations of the decoding algorithm above MUST protect against decoding invalid sequences." ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
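For reference, a codec that follows RFC 3629 rejects both directions; current CPython (with the leniency bug mentioned elsewhere in this thread as issue 3672 fixed) behaves just like the Search Appliance:

>>> '\udcff'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed
>>> b'\xed\xb3\xbf'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte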
Re: [Python-Dev] One more proposed formatting change for 3.1
2009/4/28 Mark Dickinson : > Here's one more proposed change, this time for formatting of floats using format() and the empty presentation type. To avoid repeating myself, here's the text from the issue I just opened: http://bugs.python.org/issue5864
> """
> In all versions of Python from 2.6 up, I get the following behaviour:
> >>> format(123.456, '.4')
> '123.5'
> >>> format(1234.56, '.4')
> '1235.0'
> >>> format(12345.6, '.4')
> '1.235e+04'
> The first and third results are as I expect, but the second is somewhat misleading: it gives 5 significant digits when only 4 were requested, and moreover the last digit is incorrect. I propose that Python 2.7 and Python 3.1 be changed so that the output for the second line above is '1.235e+03'.
> """
> This issue seems fairly clear cut to me, and I doubt that there's been enough uptake of 'format' yet for this to risk significant breakage. So unless there are objections I'll plan to make this change before this weekend's beta.
+1 ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Hrvoje Niksic : > Lino Mastrodomenico wrote: >> >> Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid >> character when >> decoded with UTF-8, it should simply be considered an invalid UTF-8 >> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* >> '\udcff'). > > "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder > happily accepts it and returns u'\udcff': > b'\xed\xb3\xbf'.decode('utf-8') > '\udcff' Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). -- Lino Mastrodomenico ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] lone surrogates in utf-8
Hrvoje Niksic writes: > > "Should be considered" or "will be considered"? Python 3.0's UTF-8 > decoder happily accepts it and returns u'\udcff': > > >>> b'\xed\xb3\xbf'.decode('utf-8') > '\udcff' Yes, there is already a bug entry for it: http://bugs.python.org/issue3672 I think we could happily fix it for 3.1 (perhaps leaving 2.7 unchanged for compatibility reasons - I don't know if some people may rely on the current behaviour). ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder happily accepts it and returns u'\udcff': >>> b'\xed\xb3\xbf'.decode('utf-8') '\udcff' If the PEP depends on this being changed, it should be mentioned in the PEP. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel writes: > > How can you bring up practical problems against something that hasn't been implemented? The PEP is simple enough that you can simulate its effect by manually computing the resulting unicode string for a hypothetical broken filename. Several people have already done so in this thread. > The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do. According to some messages, it seems Java and Mono actually use this kind of workaround. Though I haven't checked (I don't use those languages). > But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works. That doesn't work at all. With your proposal, any non-ASCII filename will be unreadable, not only the broken ones. Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
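Antoine's point in one line: under an unconditional iso8859-15 locale, every correctly UTF-8-encoded name turns into mojibake (the filename is illustrative):

>>> 'héllo'.encode('utf-8').decode('iso8859-15')
'hÃ©llo'

The broken names round-trip, but at the price of garbling all the valid ones.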
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman : > The switch from PUA to half-surrogates does not resolve the issues with the > encoding not being a 1-to-1 mapping, though. The very fact that you think > you can get away with use of lone surrogates means that other people might, > accidentally or intentionally, also use lone surrogates for some other > purpose. Even in file names. It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' -- Lino Mastrodomenico ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
On Tue, 28 Apr 2009 at 09:30, Thomas Breuel wrote: Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error. This is what happens currently, and users are quite unhappy about it. We need to keep "users" and "programmers" distinct here. Programmers may find it inconvenient that they have to spend time figuring out and deal with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life. And most programmers won't do it, because most programmers write for an English speaking audience and have no clue about unicode issues. That is probably slowly changing, but it is still true, I think. End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes. No, end users expect the gibberish, because they get it all the time (at least on Unix) when dealing with international filenames. They expect to be able to manipulate such files _despite_ the gibberish. (I speak here as an end user who does this!!) Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent. As will almost all unix programs, and the unix OS itself. On Unix, you can't make the file system inconsistent by doing this, because filenames are just byte strings with no NULLs. How _does_ Windows handle this? Would a Windows program complain, or would it happily record the gibberish? I suspect the latter, but I don't use Windows so I don't know. There is a lot of potential for major problems for end users with your proposals. In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly. What would actually happen is that the user would abandon the program that didn't work for one (not written in Python) that did. If the programmer was lucky they'd get a bug report, which they wouldn't be able to do anything about since Python wouldn't be providing the tools to let them fix it (ie: there are currently no bytes interfaces for environ or the command line in python3). Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do. Imagine you are on a unix system, and you have gotten from somewhere a file whose name is encoded in something other than UTF-8 (I have a number of those on my system). Now imagine that I want to run a python program against that file, passing the name in on the command line. I type the program name, the first few (non-mangled) characters, and hit tab for completion, and my shell automagically puts the escaped bytes onto the command line. Or perhaps I cut and paste from an 'ls' listing into a quoted string on the command line. 
Python is now getting the mangled filename passed in on the command line, and if the python program can't manipulate that file like any other file on my disk I am going to be mightily pissed. This is the _reality_ of current unix systems, like it or not. The same apparently applies to Windows, though in that case the mangled names may be fewer and you tend to pick them from a GUI interface rather than do cut-and-paste or tab completion. If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's a output file name and gets passed to Python's open function, Python will then decode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do. Right. Like I said, that's what most (almost all) Unix/Linux programs _do_. Now, in some future world where everyone (including Windows) acts like we are hearing OS/X does and rejects the garbled encoding _at the OS level_, then we'd be able to trust the file system encoding (FSDO trust) and there would be no need for this PEP or any similar solution. --David ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? If you unconditionally set encoding to iso8859-15, then you are effectively reverting to treating file names as bytes, regardless of the locale. You're also angering a lot of European users who expect iso8859-2, etc. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
Lino Mastrodomenico wrote: Let's suppose that I use Python 2.x or something else to create a file with name b'\xff'. My (Linux) system has a sane configuration and the filesystem encoding is UTF-8, so it's an invalid name but the kernel will blindly accept it anyway. With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'. One question that really bothers me about this proposal is the following: Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8 sequence, will be converted to the half-surrogate '\udcff'. However, a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be converted to '\udcff'. Those are quite different POSIX pathnames; how will Python know which one it was when I later pass '\udcff' to open()? A poster hinted at this question, but I haven't seen it answered, yet. [1] I'm assuming that it's valid UTF8 because it passes through Python 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 expert. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.

How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do.

But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes.

Tom
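The byte-level claim does hold, since iso8859-15, like any full single-byte charmap, assigns a character to every byte value; a quick check:

    # Arbitrary bytes round-trip losslessly through a full single-byte codec --
    # at the cost of reading every file name as Latin text, whatever its real encoding.
    data = bytes(range(256))
    assert data.decode("iso8859-15").encode("iso8859-15") == data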
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
For what it's worth, the OSX APIs seem to behave as follows:

* If you create a file with a non-UTF-8 name on an HFS+ filesystem, the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system.

* If you mount an NFS filesystem from a linux host and that directory contains a file named chr(255):

  - unix-level tools will see a file with the expected name (just like on linux)

  - Cocoa's NSFileManager returns u"?" as the filename; that is, when the filename cannot be decoded using UTF-8, the name returned by the high-level API is mangled. This is regardless of the setting of LANG.

  - I haven't found a way yet to access files whose names are not valid UTF-8 using the high-level Cocoa APIs.

The latter two are interesting because Cocoa has a unicode filesystem API on top of a POSIX C-API, just like Python 3.x. I guess the chosen behaviour works out on OSX (where users are unlikely to run into this issue), but could be more problematic on other POSIX systems.

Ronald

On 28 Apr, 2009, at 14:03, Michael Foord wrote:
> Paul Moore wrote:
>> 2009/4/28 Antoine Pitrou :
>>> Paul Moore gmail.com> writes:
>>>> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
>>> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).
>> In case it's not clear, I am also +1 on the PEP as it stands.
>> Paul.
> Me 2
> Michael
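A small probe along these lines (a sketch, assuming a Python 3 build with the bytes filesystem API; the scratch directory is illustrative) makes the platform difference visible:

    import os, tempfile

    d = tempfile.mkdtemp().encode()               # bytes path, bypasses all decoding
    open(os.path.join(d, b"\xff"), "wb").close()
    print(os.listdir(d))                          # Linux ext3: [b'\xff']; HFS+ reportedly [b'%FF']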
Re: [Python-Dev] PEP 383 (again)
2009/4/28 Thomas Breuel :
> If we follow PEP 383, you will get lots of errors anyway because those strings, when encoded in utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that the utf-8b encodes.

I'm not sure if when you say "write them on a Windows FS" you mean from within Windows itself or a filesystem mounted on another OS, so I'll cover both cases.

Let's suppose that I use Python 2.x or something else to create a file with name b'\xff'. My (Linux) system has a sane configuration and the filesystem encoding is UTF-8, so it's an invalid name but the kernel will blindly accept it anyway. With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'.

Now if this string somehow ends up in a Python 3.1 program running on Windows and it tries to create a file with this name, it will work (no exception will be raised). The Windows GUI will display the standard "invalid character" symbol (an empty box) when listing this file, but this seems reasonable, since the original file was displayed as "?" by the Linux console and with another invalid-character symbol by the GNOME file manager.

OTOH, if I write the same file on a Windows filesystem mounted on another OS, an automatic translation (probably done by the OS kernel) will be in place from the user-visible filesystem encoding (see e.g. the "iocharset" or "utf8" mount options for vfat on Linux) to UTF-16. Which means that the write will fail with something like:

IOError: [Errno 22] invalid filename: b'/media/windows_disk/\xff'

(The "problem" is that a vfat filesystem mounted with the "utf8" option on Linux will only accept byte sequences that are valid UTF-8, or at least reasonably similar: e.g. b'\xed\xb3\xbf' is accepted.)

Again this seems reasonable, since it already happens in Python 2 and with pretty much any other software, including GNU cp. I don't see how Martin can do better than this. Well, ok, I guess he could break into my house and rename the original file to something sane...

-- Lino Mastrodomenico
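The two halves of that failure are visible in isolation (a sketch; the mount point is illustrative):

    name = b"\xff".decode("utf-8", "surrogateescape")      # '\udcff'
    assert name.encode("utf-8", "surrogateescape") == b"\xff"
    # It is then the OS, not Python, that rejects the raw byte on a strictly
    # UTF-8 volume such as vfat mounted with "utf8":
    #     open(b"/media/windows_disk/\xff", "wb")
    #     IOError: [Errno 22] invalid filename: b'/media/windows_disk/\xff'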
Re: [Python-Dev] Can not run under python 2.6
OK, Thanks a lot.

On Tue, Apr 28, 2009 at 8:06 PM, Michael Foord wrote:
> Jianchun Zhou wrote:
>> Hi, there:
>> I am new to Python, and now I have run into trouble:
>> I have an application named canola. It is written under Python 2.5 and runs normally under Python 2.5.
>> But when it runs under Python 2.6, a problem comes up. It says:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
>>     classes = plg.load()
>>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
>>     mod = self._ldr.load()
>>   File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
>>     mod = __import__(modpath, fromlist=[mod_name])
>> ImportError: Import by filename is not supported.
>> Anybody have any idea what I should do?
> The Python-Dev mailing list is for the development of Python, not for development with Python. You will get a much better response asking on the comp.lang.python (python-list) or python-tutor newsgroups / mailing lists. comp.lang.python has both google groups and gmane gateways and so is easy to post to.
> For the particular problem you mention, it is an intentional change, and so the code in canola will need to be modified in order to run under Python 2.6.
> All the best,
> Michael Foord

-- Best Regards
Re: [Python-Dev] Can not run under python 2.6
Jianchun Zhou wrote:
> Hi, there:
> I am new to Python, and now I have run into trouble:
> I have an application named canola. It is written under Python 2.5 and runs normally under Python 2.5.
> But when it runs under Python 2.6, a problem comes up. It says:
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
>     classes = plg.load()
>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
>     mod = self._ldr.load()
>   File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
>     mod = __import__(modpath, fromlist=[mod_name])
> ImportError: Import by filename is not supported.
> Anybody have any idea what I should do?

The Python-Dev mailing list is for the development of Python, not for development with Python. You will get a much better response asking on the comp.lang.python (python-list) or python-tutor newsgroups / mailing lists. comp.lang.python has both google groups and gmane gateways and so is easy to post to.

For the particular problem you mention, it is an intentional change, and so the code in canola will need to be modified in order to run under Python 2.6.

All the best,

Michael Foord

-- http://www.ironpythoninaction.com/
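One possible shape for that modification (a hedged sketch, not from the thread; the helper and its names are made up) is to load by explicit path instead of handing a filename to __import__:

    import imp

    def load_module(modpath, mod_name):
        # Python 2.6: __import__ no longer accepts filenames, but imp can
        # still load a module from an explicit path.
        if modpath.endswith(".py"):
            return imp.load_source(mod_name, modpath)    # path-based load
        return __import__(modpath, fromlist=[mod_name])  # dotted-name import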
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore wrote:
> 2009/4/28 Antoine Pitrou :
>> Paul Moore gmail.com> writes:
>>> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
>> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.
>> For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).
> In case it's not clear, I am also +1 on the PEP as it stands.
> Paul.

Me 2

Michael

-- http://www.ironpythoninaction.com/
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Antoine Pitrou :
> Paul Moore gmail.com> writes:
>> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.
> For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).

In case it's not clear, I am also +1 on the PEP as it stands.

Paul.
[Python-Dev] One more proposed formatting change for 3.1
Here's one more proposed change, this time for formatting of floats using format() and the empty presentation type. To avoid repeating myself, here's the text from the issue I just opened: http://bugs.python.org/issue5864

"""
In all versions of Python from 2.6 up, I get the following behaviour:

>>> format(123.456, '.4')
'123.5'
>>> format(1234.56, '.4')
'1235.0'
>>> format(12345.6, '.4')
'1.235e+04'

The first and third results are as I expect, but the second is somewhat misleading: it gives 5 significant digits when only 4 were requested, and moreover the last digit is incorrect. I propose that Python 2.7 and Python 3.1 be changed so that the output for the second line above is '1.235e+03'.
"""

This issue seems fairly clear cut to me, and I doubt that there's been enough uptake of 'format' yet for this to risk significant breakage. So unless there are objections I'll plan to make this change before this weekend's beta.

Mark
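To spell out why '1235.0' misleads: five significant digits of 1234.56 round to '1234.6', so the trailing digit shown is wrong, not merely superfluous. A quick illustration (a sketch, not from the issue):

    x = 1234.56
    print('%.4g' % x)   # '1235'   -- the correct 4-significant-digit rounding
    print('%.5g' % x)   # '1234.6' -- what a correct 5-digit result looks like
    # format(x, '.4') == '1235.0' pads the 4-digit form with '.0', which reads as
    # five significant digits with a wrong last one; '1.235e+03' avoids this.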
[Python-Dev] Can not run under python 2.6
Hi, there:

I am new to Python, and now I have run into trouble:

I have an application named canola. It is written under Python 2.5 and runs normally under Python 2.5.

But when it runs under Python 2.6, a problem comes up. It says:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
    classes = plg.load()
  File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
    mod = self._ldr.load()
  File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
    mod = __import__(modpath, fromlist=[mod_name])
ImportError: Import by filename is not supported.

Anybody have any idea what I should do?

-- Best Regards
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore gmail.com> writes:
> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.

Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.

For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).

Regards

Antoine.
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 11:32:26AM +0200, Thomas Breuel wrote:
> On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann wrote:
>> I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!
> I don't know what it should do (ftplib needs to worry about that).

There is no ftplib there. The FTP server is ProFTPd, with ftp clients of all sorts - one, e.g., an ftp client built into an automatic web camera. I use Python programs to process files after they have been uploaded. The programs access the FTP directory as a part of the local filesystem.

> I do know what it shouldn't do, however: it should not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence of the remote machine; that's wrong.

That is certainly wrong. But at least the approach allows Python programs to list all files in a directory - currently AFAIU os.listdir() silently skips undecodable filenames. And after a program gets all the files, it can process them further - it can clean up filenames (base64-encode them, e.g.), but at least it can do something, where currently it cannot.

PS. It seems I started to argue for the PEP. Well, well...

Oleg.
-- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
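A sketch of that cleanup, using the bytes API that already bypasses decoding (the directory path is illustrative, and base64 stands in for any renaming policy):

    import base64, os

    incoming = b"/srv/ftp/incoming"
    for raw in os.listdir(incoming):              # bytes in, bytes out: nothing skipped
        try:
            raw.decode("utf-8")                   # already sane, leave it alone
        except UnicodeDecodeError:
            safe = base64.urlsafe_b64encode(raw)  # reversible ASCII name
            os.rename(os.path.join(incoming, raw),
                      os.path.join(incoming, safe))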
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann wrote:
> On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
>> Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.
> What is a "correct encoding"?
> I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!

I don't know what it should do (ftplib needs to worry about that). I do know what it shouldn't do, however: it should not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence of the remote machine; that's wrong.

> If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here?

If we follow PEP 383, you will get lots of errors anyway because those strings, when encoded in utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that the utf-8b encodes.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman :
> So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns.
> So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.

Sorry for picking on Glenn's comment - it's only one of many in this thread. But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from.

Is that a real concern? How many programs really don't know where their data came from? Maybe a general-purpose library routine *might* just need to document explicitly how it handles funny-encoded data (I can't actually imagine anything that would, but I'll concede it may be possible) but that's just a matter of documenting your assumptions - no better or worse than many other cases.

This all sounds similar to the idea of "tainted" data in security - if you lose track of untrusted data from the environment, you expose yourself to potential security issues. So the same techniques should be relevant here (including ignoring it if your application isn't such that it's a concern!)

I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. (NB, if such a claim has been made, feel free to point me to it - I admit I've been skimming this thread at times).

Paul.
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
> Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.

What is a "correct encoding"?

I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!

If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here?

Oleg.
-- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] PEP 383 (again)
> As long as it's hard there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.

PEP 383 doesn't make it any easier; it just turns one set of problems into another. Actually, it makes it worse, since any problems that show up now show up far from the source of the problem, and since it can lead to security problems and/or data loss.

> And the programmer answers "The program expects a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".

The problem may well be with the program using the wrong encodings or incorrectly ignoring encoding information. Furthermore, even if it is user error, the program needs to validate its inputs and put up a meaningful error message, not mangle the disk. To detect such program bugs, it's important that when Python detects an incorrect encoding it doesn't quietly continue with an incorrect string. Furthermore, if you don't provide clear error messages, it often takes a significant amount of time for each issue to determine that it is user error.

> I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.

Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.

Tom
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 09:30:01AM +0200, Thomas Breuel wrote:
> Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life.

As long as it's hard there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.

> the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.

And the programmer answers "The program expects a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".

I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.

Oleg.
-- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] PEP 383 (again)
>> Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.
> This is what happens currently, and users are quite unhappy about it.

We need to keep "users" and "programmers" distinct here. Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life.

End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes. Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent. There is a lot of potential for major problems for end users with your proposals. In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.

> Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments

As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do.

If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's an output file name and gets passed to Python's open function, Python will then encode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do.

> As is, these interfaces are incomplete - they don't support command line arguments, or environment variables. If you want to complete them, you should write a PEP.

There's no point in scratching when there's no itch.

Tom

PS:
>> Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing
> And indeed, the PEP stopped using PUA characters.

Let me rephrase this: "quietly escaping a bad UTF-8 encoding is unlikely to be the right thing"; it doesn't matter how you do it.
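The copy scenario is worth making concrete (a sketch; the file name is illustrative): under the PEP, the mis-decoded source name re-encodes to its original ISO8859-15 bytes, which then land unchanged on the UTF-8 target instead of raising an error.

    src = b"r\xe9sum\xe9.txt"                      # ISO8859-15 for 'résumé.txt'
    s = src.decode("utf-8", "surrogateescape")     # 'r\udce9sum\udce9.txt'
    assert s.encode("utf-8", "surrogateescape") == src
    # Writing these bytes on a UTF-8 volume silently creates a non-UTF-8 name;
    # a strict decode would instead have surfaced the misconfiguration:
    #     src.decode("utf-8")  ->  UnicodeDecodeError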