Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano writes:

[long example]

> Am I right so far? So the email package uses the surrogate-escape error handler and ends up with this Unicode string:
>
> 'Subject: \udce2\udc80\udc9cNOBODY expects the Spanish Inquisition!”'
>
> which can be encoded back to the bytes we started with.

Yes.

> Note that technically those three \u... code points are NOT classified as "noncharacters".

Very unpythonic terminology, easily confusing the nonspecialist. Or the specialist -- I used to know that Unicode gave "noncharacter" a technical definition but it seems I forgot. But then, Unicode isn't a PSF product, so I guess it's OK to be unpythonic. <wink/>

> They are actually surrogate code points:
>
> http://www.unicode.org/faq/private_use.html#nonchar4
> http://www.unicode.org/glossary/#surrogate_code_point
>
> and they're supposed to be reserved for UTF-16. I'm not sure of the implication of that.

It means that any Python program that invokes the surrogateescape handler is not a conforming Unicode process, at least not on the naive interpretation of that definition. A conforming process would interpret them as corrupt characters and raise as soon as they are detected.

A more sophisticated interpretation might argue that Python is multiple processes (in the sense of "process" used by Unicode), and that the Unicode standard only applies to characters. This is especially true of Pythons implementing PEP 393, since no surrogates should ever appear in text[1] at all. Then the smuggled bytes can be treated as noncharacters in practice, although technically it's a violation of the Unicode standard to do so.

Footnotes:
[1] Meaning, no fair using chr() to inject them into str!
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano st...@pearwood.info wrote:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com wrote:
> > > Basically, we are pretending that each smuggled byte is a single character for string parsing purposes...but they don't match any of our parsing constants. They are all "any character" matches in the regexes and what have you.
> >
> > This is slightly iffy, as you can't be sure that one byte represents one character, but as long as you don't much care about that, it's not going to be an issue.
>
> This discussion would probably be a lot easier to follow, with fewer miscommunications, if there were some examples. Here is my example, perhaps someone can tell me if I'm understanding it correctly.
>
> I want to send an email including the header line:
>
> 'Subject: “NOBODY expects the Spanish Inquisition!”'
>
> Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I do the right thing and encode it as UTF-8:
>
> b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

That won't work until email supports RFC 6532. Until then, you can only use ascii and encoded words successfully. So just having the curly quotes is a buggy enough program.

> but it's not up to Python's email package to throw those invalid bytes out or permanently replace them with something else. Also, we want to work with Unicode strings, not byte strings, so there has to be a way to smuggle those three bytes into Unicode, without ending up with the replacement bytes:
>
> # using the 'replace' error handler
> 'Subject: ���NOBODY expects the Spanish Inquisition!”'

What you'll get if you request a text copy of that header is

'Subject: ���NOBODY expects the Spanish Inquisition!���'

> Am I right so far?
>
> So the email package uses the surrogate-escape error handler and ends up with this Unicode string:
>
> 'Subject: \udce2\udc80\udc9cNOBODY expects the Spanish Inquisition!”'

Except that it encodes the closing quote, too :)

> which can be encoded back to the bytes we started with.

Right. If you serialize the message as bytes, the bytes are recovered and output when that header is output.

Now, once we support RFC 6532, you will be exactly right, as we will then have the option of handling utf-8 encoded headers, and we will do that using the utf-8 codec to ingest headers, and the surrogateescape error handler to handle exactly the kind of bad data you postulate.

--David
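To make the round trip discussed in this message concrete, here is a minimal sketch using only the built-in codecs. It assumes the header is ingested with the ascii codec and the surrogateescape handler; that is an illustration, not necessarily the email package's internal code path, and the variable names are invented for the example. Each undecodable byte 0xNN becomes the lone surrogate U+DCNN, in byte order.

```python
# The raw header from the example: UTF-8 bytes for the two curly quotes.
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising or substituting.
smuggled = raw.decode('ascii', errors='surrogateescape')

# Encoding with the same codec and the same handler restores the bytes.
restored = smuggled.encode('ascii', errors='surrogateescape')

# The 'replace' handler is lossy: every bad byte becomes U+FFFD.
lossy = raw.decode('ascii', errors='replace')

print(ascii(smuggled))   # ascii() avoids writing lone surrogates to stdout
print(restored == raw)
print(ascii(lossy))
```

The lossless round trip only holds if the same codec and the same error handler are used in both directions.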
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Sorry for the mojibake. I've not yet gotten around to actually using the email package to write a smarter replacement for nmh, which is what I use for email, and I always forget that I need to manually tell nmh when there is non-ascii in the message...

--David
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> Guido's mantra is something like "Python's str doesn't contain characters or even code points[1], it contains code units."

But is that true? If it were true, I would expect to be able to make Python text strings containing code units that aren't code points, e.g. something like \U1234 or chr(0x1234) should work, but neither do. As far as I can tell, there is no way to build a string containing items which aren't code points.

I don't think it is useful to say that strings *contain* code units, more that they *are made up from* code units. Code units are the implementation: 16-bit code units in narrow builds, 32-bit code units in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and beyond. (I don't know of any Python implementation which uses UTF-8 internally, but if there was one, it would use 8-bit code units.)

It isn't very useful to say that in Python 3.3 the string "A" *contains* the 8-bit code unit 0x41. That's conflating two different levels of explanation (the high-level interface and the underlying implementation) and potentially leads to user confusion like

# 8-bit code units are bytes, right?
assert b'\41' in "A"

which is Not Even Wrong.

http://rationalwiki.org/wiki/Not_even_wrong

I think it is correct to say that Python strings are sequences of Unicode code points U+0000 through U+10FFFF. There are no other restrictions, e.g. strings can contain surrogates, noncharacters, or nonsensical combinations of code points such as a U+0300 COMBINING GRAVE ACCENT combined with U+000A (newline).

> Implying that dealing with characters (or the grapheme globs that occasionally raise their ugly heads here) is an issue for higher-level facilities than str to deal with.

Agreed that Python doesn't offer a string type based on graphemes, and that such a facility belongs as a high-level library, not a built-in type.

Also agreed that talking about characters is sloppy. Nevertheless, for English speakers at least, "code point = character" isn't too awful a first approximation.

> The point being that
>
> > Basically, we are pretending that each smuggled byte is a single character
>
> is something of a misstatement (good enough for present purpose of discussing email, but not good enough for the general case of understanding how this is supposed to work when porting the construct to other Python implementations), while
>
> > for string parsing purposes...but they don't match any of our parsing constants.
>
> is precisely Pythonically correct. You might want to add "because all parsing constants contain only valid characters by construction."

I don't understand what you are trying to say here.

> > [*] I worried a lot that this was re-introducing the bytes/string problem from python2.
>
> It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text. That is not true here because the representations of characters vs. smuggled bytes in str are disjoint sets.

Nor am I sure what you are trying to say here either.

> Footnotes:
> [1] In Unicode terminology, a code unit is the smallest computer object that can represent a character (this is uniquely and sanely defined for all real Unicode transformation formats aka UTFs). A code point is an integer 0 - (17*256*256-1) that can represent a character, but many code points such as surrogates and 0x are defined to be non-characters.

Actually not quite. "Noncharacter" is concretely defined in Unicode, and there are only 66 of them, many fewer than the surrogate code points alone. Surrogates are reserved, not noncharacters.

http://www.unicode.org/glossary/#surrogate_code_point
http://www.unicode.org/faq/private_use.html#nonchar1

It is wrong to talk about "surrogate characters", but perhaps you mean to say that surrogates (by which I understand you to mean "surrogate code points") are "not human-meaningful characters", which is not the same thing as a Unicode noncharacter.

> Characters are those code points that may be assigned an interpretation as a character, including undefined characters (private space and reserved).

So characters are code points which are characters, including undefined characters? :-)

http://www.unicode.org/glossary/#character

--
Steven
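The claim that str accepts any code point in the Unicode code space, including surrogates and noncharacters, can be checked directly. The sketch below assumes CPython 3.3 or later; the specific code points and the two helper functions are chosen for illustration and do not come from the message.

```python
import unicodedata

lone_surrogate = '\udc80'   # a surrogate code point
noncharacter = '\uffff'     # one of the 66 Unicode noncharacters
odd_combo = '\n\u0300'      # newline followed by COMBINING GRAVE ACCENT

# str accepts all of them; each element is one code point.
s = lone_surrogate + noncharacter + odd_combo

# The Unicode database classifies the two differently:
# 'Cs' is "surrogate", 'Cn' is "not assigned" (which covers noncharacters).
surrogate_category = unicodedata.category(lone_surrogate)
nonchar_category = unicodedata.category(noncharacter)


def out_of_range_rejected():
    """Return True if chr() refuses a value just past U+10FFFF."""
    try:
        chr(0x110000)
    except ValueError:
        return True
    return False


def strict_utf8_rejects(text):
    """Return True if the strict UTF-8 codec refuses to encode text."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


print(len(s), surrogate_category, nonchar_category)
print(out_of_range_rejected())
print(strict_utf8_rejects(lone_surrogate), strict_utf8_rejects(noncharacter))
```

Under these assumptions the surrogate is storable in a str but not encodable with the strict UTF-8 codec, while the noncharacter is both storable and encodable.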
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Seriously, can this discussion move somewhere else? This has nothing to do on python-dev.

Thank you

Antoine.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 17.09.14 10:56, Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> > Guido's mantra is something like "Python's str doesn't contain characters or even code points[1], it contains code units."
>
> But is that true?

It used to be true, and stopped being so with PEP 393. In particular, Python 3.2 and before would expose UTF-16 in the narrow build, so the elements of a string would be code units. Since Python 3.3, the surrogate code points are no longer interpreted as UTF-16 code units.

Regards,
Martin
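A short sketch of the distinction drawn here, assuming CPython 3.3 or later (PEP 393). The example character is an arbitrary choice; the code-unit counts are computed by encoding explicitly, since str no longer exposes its internal representation.

```python
astral = '\U0001F40D'  # U+1F40D SNAKE, outside the Basic Multilingual Plane

# Elements of a str are code points, so this is a string of length one.
code_points = len(astral)

# The same text needs two 16-bit code units in UTF-16 ...
utf16_units = len(astral.encode('utf-16-le')) // 2

# ... and four 8-bit code units in UTF-8.
utf8_units = len(astral.encode('utf-8'))

print(code_points, utf16_units, utf8_units)
```

On a narrow build of Python 3.2 or earlier, len() of the same literal would have reported the two UTF-16 code units instead.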
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano writes:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> > Guido's mantra is something like "Python's str doesn't contain characters or even code points[1], it contains code units."
>
> But is that true?

It's not. That's why I wrote the slightly pejorative "mantra" and qualified it with "something like". The precise statement is something like "the array property is more important than preserving character boundaries, so slices etc are allowed to do unexpected or even evil things in the presence of astral characters in UTF-16 representations."

> I don't understand what you are trying to say here.
>
> Nor am I sure what you are trying to say here either.

We can discuss this off-list if you would like. The natives are getting restless.

> > non-characters.
>
> Actually not quite. "Noncharacter"

Note the hyphen! (Just kidding, I will avoid that terminology in the future. I knew, but forgot.)

> > Characters are those code points that may be assigned an interpretation as a character, including undefined characters (private space and reserved).
>
> So characters are code points which are characters, including undefined characters? :-)

No, there's a clear hierarchy here.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico ros...@gmail.com wrote:
> On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull step...@xemacs.org wrote:
> > Jim J. Jewett writes:
> > > In terms of best-effort, it is reasonable to treat the smuggled bytes as representing a character outside of your unicode repertoire
> >
> > I have to disagree. If you ever end up passing them to something that validates or tries to reencode them without surrogateescape, BOOM! These things are the text equivalent of IEEE NaNs. If all you know (as in the stdlib) is that you have "generic text", the only fairly safe things to do with them are (1) delete them, (2) substitute an appropriate replacement character for them, (3) pass the text containing them verbatim to other code, and (4) reencode them using the same codec they were read with.
>
> Don't forget, these are *errors*. These are bytes that cannot be correctly decoded. That's not something that has any meaning whatsoever in text; so by definition, the only things you can do are the four you list there (as long as "codec" means both the choice of encoding and the use of the surrogateescape flag). It's like dealing with control characters when you need to print something visually, except that they have an official solution [1] and surrogateescape is unofficial. They're not real text, so you have to deal with them somehow.

That isn't the case in the email package. The smuggled bytes are not errors[*], they are literally smuggled bytes. But, as Stephen said, the only things email does with them are the last three of the four he listed (if you read (3) as passing it between parts of the email package): the data comes in as text mixed with binary, and the email package parses it until it knows what the binary is supposed to be, turns it back into bytes, and decodes it properly.

The goal is to never let the smuggled bytes escape out the email APIs as surrogateescape encoded text; though, in practice, this being consenting-adults Python and code not being bug free, there are places where people have used the knowledge of how surrogateescape is used by email to work around both API and code bugs.

--David

[*] Some of the encoded bytes *are* errors (non-ascii in headers or undecodable bytes in whatever the CTE/charset is), and in that case email may just turn them back into error bytes in the output, but only *some* of the smuggled bytes are actually errors (and none are if the message is RFC compliant).
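The four "fairly safe" operations quoted in this message can be sketched as small helpers. This is an illustration under stated assumptions, not code from the email package: the function names are hypothetical, and the character class relies on surrogateescape mapping bytes 0x80-0xFF to U+DC80-U+DCFF, as PEP 383 specifies.

```python
import re

# Code points that the surrogateescape handler can produce (per PEP 383).
SMUGGLED = re.compile('[\udc80-\udcff]')


def delete_smuggled(text):
    """(1) Delete the smuggled bytes."""
    return SMUGGLED.sub('', text)


def replace_smuggled(text, replacement='\ufffd'):
    """(2) Substitute a replacement character for each smuggled byte."""
    return SMUGGLED.sub(replacement, text)


def pass_through(text):
    """(3) Hand the text to other code verbatim."""
    return text


def reencode(text, encoding='ascii'):
    """(4) Re-encode with the codec and handler used to read the text."""
    return text.encode(encoding, errors='surrogateescape')


# A hypothetical input: Latin-1 bytes read by code that expected ASCII.
sample = b'caf\xe9 au lait'.decode('ascii', errors='surrogateescape')

print(delete_smuggled(sample))
print(ascii(replace_smuggled(sample)))
print(reencode(pass_through(sample)))
```

Operations (1) and (2) lose information; only (3) followed eventually by (4) recovers the original bytes.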
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray rdmur...@bitdance.com wrote:
> That isn't the case in the email package. The smuggled bytes are not errors[*], they are literally smuggled bytes.

But they're not characters, which is what Stephen and I were saying - and contrary to what Jim said about treating them as characters. At best, they represent characters but in some encoding other than the one you're using, and you have no idea how many bytes form a character or anything. So you can't, for instance, word-wrap the text, because you can't know how wide these unknown bytes are, whether they represent spaces (wrap points), or newlines, or anything like that. You can't treat them as characters, so while you have them in your string, you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes.

ChrisA
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico ros...@gmail.com wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray rdmur...@bitdance.com wrote:
> > That isn't the case in the email package. The smuggled bytes are not errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen and I were saying - and contrary to what Jim said about treating them as characters. At best, they represent characters but in some encoding other than the one you're using, and you have no idea how many bytes form a character or anything. So you can't, for instance, word-wrap the text, because you can't know how wide these unknown bytes are, whether they represent spaces (wrap points), or newlines, or anything like that. You can't treat them as characters, so while you have them in your string, you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes.

Well, except that I do. The email header parsing algorithms all work fine if I treat the surrogate escaped bytes as 'unknown junk' and just parse based on the valid unicode. (Unless the header is so garbled that it can't be parsed, of course, at which point it becomes an invalid header).

You are right about the wrapping, though. If a header with invalid bytes (and in this scenario we *are* talking about errors) needs to be wrapped, we have to first decode the smuggled bytes and turn it into an 'unknown-8bit' encoded word before we can wrap the header.

--David
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray rdmur...@bitdance.com wrote:
> > You can't treat them as characters, so while you have them in your string, you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes.
>
> Well, except that I do. The email header parsing algorithms all work fine if I treat the surrogate escaped bytes as 'unknown junk' and just parse based on the valid unicode. (Unless the header is so garbled that it can't be parsed, of course, at which point it becomes an invalid header).

Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters.

If you, instead, represented the header as a list with some str elements and some bytes, it would be just as valid (though much harder to work with); all your manipulations are done on the str parts, and the bytes just tag along for the ride.

> You are right about the wrapping, though. If a header with invalid bytes (and in this scenario we *are* talking about errors) needs to be wrapped, we have to first decode the smuggled bytes and turn it into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters followed by 30 unknown bytes, where do you wrap it? Dare you wrap in the middle of the smuggled section?

ChrisA
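The "opaque junk that tags along" idea in this exchange can be illustrated with a toy header split. The regular expression below is a hypothetical stand-in, not the email package's grammar: the field name is matched against ASCII-only syntax, and the value is matched with "any character", so the smuggled code points ride along untouched.

```python
import re

# Hypothetical pattern: a field name of printable ASCII (no colon),
# a colon, optional whitespace, then an arbitrary value.
HEADER = re.compile(r'(?P<name>[!-9;-~]+):[ \t]*(?P<value>.*)')

raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
text = raw.decode('ascii', errors='surrogateescape')

match = HEADER.match(text)
name = match.group('name')
value = match.group('value')

# Smuggled bytes never match the ASCII syntax, only the "anything" part.
has_smuggled = any('\udc80' <= ch <= '\udcff' for ch in value)

# At the API boundary the value is turned back into the original bytes.
value_bytes = value.encode('ascii', errors='surrogateescape')

print(name, has_smuggled, value_bytes)
```

The parse succeeds without the code ever interpreting the smuggled bytes, which is the sense in which they are carried but not treated as characters.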
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Great points here - I especially like the concluding statement "you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes"

Given that Jython uses UTF-16 as its representation, it is frequently possible to smuggle isolated surrogates in it. A surrogate pair must be a high surrogate in range(D800, DC00), then a low surrogate in range(DC00, E000). So one can likely assign an interpretation that this is in fact the isolated surrogate, and not an actual codepoint.

Of course, if you do actually have a smuggled isolated high surrogate FOLLOWED by a smuggled isolated low surrogate - guess what, the only interpretation is a codepoint. Or perhaps more likely garbage. Of course it doesn't happen so often, so maybe we are fine with the occasional bug ;)

I personally suspect that we will resolve this by also supporting UCS-4 as a representation in Jython 3.x for such Unicode strings, albeit with the limitation that we have simply moved the problem to when we try to call Java methods taking java.lang.String objects.

- Jim
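The ambiguity described for a UTF-16 representation can be reproduced from CPython by round-tripping through UTF-16 explicitly. This is a sketch assuming CPython 3.4 or later, where the surrogatepass handler is available for the UTF-16 codecs; it only imitates the situation and says nothing about Jython's actual internals. The two surrogates were picked so that together they spell U+1F40D.

```python
high, low = '\ud83d', '\udc0d'   # lone surrogates, arbitrary illustration

# With code-point strings the two lone surrogates stay separate.
pair = high + low

# Written out as 16-bit code units they are indistinguishable from a
# genuine surrogate pair ...
utf16 = pair.encode('utf-16-le', errors='surrogatepass')

# ... so a UTF-16 reader can only see one astral code point.
merged = utf16.decode('utf-16-le')

print(len(pair), len(merged), ascii(merged))
```

Once the code units are adjacent in a UTF-16 buffer, nothing records whether they started life as two isolated surrogates or as one character.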
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker jim.ba...@python.org wrote:
> Of course, if you do actually have a smuggled isolated high surrogate FOLLOWED by a smuggled isolated low surrogate - guess what, the only interpretation is a codepoint. Or perhaps more likely garbage. Of course it doesn't happen so often, so maybe we are fine with the occasional bug ;)
>
> I personally suspect that we will resolve this by also supporting UCS-4 as a representation in Jython 3.x for such Unicode strings, albeit with the limitation that we have simply moved the problem to when we try to call Java methods taking java.lang.String objects.

That'll cost efficiency, of course, but it'll guarantee correctness. And maybe, just maybe, you'll be able to put some pressure on Java itself to start supporting UCS-4 natively... One can dream.

ChrisA
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico ros...@gmail.com wrote: On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray rdmur...@bitdance.com wrote: You can't treat them as characters, so while you have them in your string, you can't treat it as a pure Unicode string - it''s a Unicode string with smuggled bytes. Well, except that I do. The email header parsing algorithms all work fine if I treat the surrogate escaped bytes as 'unknown junk' and just parse based on the valid unicode. (Unless the header is so garbled that it can't be parsed, of course, at which point it becomes an invalid header). Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters. Yes. I thought you were saying that one could not treat the string with smuggled bytes as if it were a string. (It's a string that can't be encoded unless you use the surrogateescape error handler, but it is still a string from Python's POV, which is the point of the error handler). Or, to put it another way, your implication was that there were no string operations that could be usefully applied to a string containing smuggled bytes, but that is not the case. (I may well have read an implication that was not there; if so I apologize and you can ignore the rest of this :) Basically, we are pretending that the each smuggled byte is single character for string parsing purposes...but they don't match any of our parsing constants. They are all any character matches in the regexes and what have you. Of course, this only works in contexts where we can ignore or carry along the smuggled bytes as being components of arbitrary text portions of the syntax, and we must take care to either replace them with valid unicode error glyphs or turn the string of which the are a part into binary using the same codec and error handler as we used to ingest them to begin with before emitting them. 
And, of course, we can't *modify* the sections containing the smuggled bytes, only the syntax-matched sections that surround them; and things like line wrapping are just an invitation to ugliness and bugs even if you kept the smuggled bytes sections internally intact. Finally, to explain what I meant by except that I do: when I added back binary support to the email package in Python3, initially I *did not change the parsing algorithms* in the code. I just smuggled the bytes, and then dealt with the encoding/decoding at the API boundaries. This is the same principle used when dealing with filenames in the API of Python itself. *Except* at that boundary, I do not need to worry about whether a particular string contains smuggled bytes or not.[*] If you, instead, represented the header as a list with some str elements and some bytes, it would be just as valid (though much harder to work with); all your manipulations are done on the str parts, and the bytes just tag along for the ride. Quite a bit harder, which is why I don't do that. You are right about the wrapping, though. If a header with invalid bytes (and in this scenario we *are* talking about errors) needs to be wrapped, we have to first decode the smuggled bytes and turn it into an 'unknown-8bit' encoded word before we can wrap the header. Yeah, and that's going to be a bit messy. If you get 60 characters followed by 30 unknown bytes, where do you wrap it? Dare you wrap in the middle of the smuggled section? The point of RFC2047 encoded words is that they are an ASCII representation of binary data, so once the bytes are properly Content Transfer Encoded (as being in an unknown charset) the string contains no smuggled bytes and can be wrapped. --David [*] I worried a lot that this was re-introducing the bytes/string problem from python2. The difference is that if the smuggled bytes escape from the email API, that's a bug in the email package. 
So user code using the library is *not* in danger of getting mysterious encoding errors when one day the input is international where before it was all ASCII. (Absent bugs in the library.)
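The boundary principle described in this message can be sketched roughly as follows. This is only an illustration, not the email package's actual code: the helper names `ingest` and `emit`, the sample header, and the regex are all assumptions made for the example.

```python
import re

# Hypothetical helpers for illustration; not the email package's real API.
def ingest(raw: bytes) -> str:
    """Decode at the input boundary, smuggling undecodable bytes into str."""
    return raw.decode("ascii", errors="surrogateescape")

def emit(text: str) -> bytes:
    """Encode at the output boundary with the same codec and error handler."""
    return text.encode("ascii", errors="surrogateescape")

raw = b"Subject: caf\xe9 menu\r\n"   # 0xE9 is not valid ASCII
header = ingest(raw)

# The parsing constants are pure ASCII, so they can never match a smuggled
# byte; the smuggled byte is simply carried along inside the value.
match = re.match(r"([A-Za-z-]+): (.*)\r\n", header)
name, value = match.group(1), match.group(2)

round_tripped = emit(header)
```

Between `ingest` and `emit` the code only does ordinary str operations and never needs to ask whether a given string holds smuggled bytes.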
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jim Baker writes: Given that Jython uses UTF-16 as its representation, it is possible to frequently smuggle isolated surrogates in it. A surrogate pair must be a high surrogate in range(0xD800, 0xDC00), then a low surrogate in range(0xDC00, 0xE000). Of course, if you do actually have a smuggled isolated high surrogate FOLLOWED by a smuggled isolated low surrogate - guess what, the only interpretation is a codepoint. Or perhaps more likely garbage. Of course it doesn't happen so often, so maybe we are fine with the occasional bug ;)

The CPython representation uses trailing surrogates only[1], so it's never possible to interpret them as anything but non-characters -- as soon as you encounter them you know that it's a lone surrogate. Surely you can do the same. As long as the Java string manipulation functions don't check for surrogates, you should be fine with this representation. Of course I suppose your matching functions (etc) don't check for them either, so you will be somewhat vulnerable to bugs due to treating them as characters. But the same is true for CPython, AFAIK.

Footnotes: [1] Only 128 bytes are necessary since the 128 ASCII characters are embedded in Unicode as-is.
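As a quick check of the "trailing surrogates only" point, here is a small sketch. In Unicode terminology the high (leading) surrogates are U+D800..U+DBFF and the low (trailing) surrogates are U+DC00..U+DFFF; the variable names below are just for illustration.

```python
HIGH = range(0xD800, 0xDC00)   # leading (high) surrogates
LOW = range(0xDC00, 0xE000)    # trailing (low) surrogates

# Decode every non-ASCII byte value with the surrogateescape handler.
smuggled = bytes(range(0x80, 0x100)).decode("ascii", errors="surrogateescape")
code_points = [ord(c) for c in smuggled]

# Each byte b becomes the code point 0xDC00 + b, always a trailing surrogate.
all_trailing = all(cp in LOW for cp in code_points)
any_leading = any(cp in HIGH for cp in code_points)
```

Since the handler never produces a leading surrogate, a smuggled byte can never form the first half of a valid pair.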
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
R. David Murray writes: Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters. Yes. I thought you were saying that one could not treat the string with smuggled bytes as if it were a string. Guido's mantra is something like Python's str doesn't contain characters or even code points[1], it contains code units. Implying that dealing with characters (or the grapheme globs that occasionally raise their ugly heads here) is an issue for higher-level facilities than str to deal with. The point being that Basically, we are pretending that the each smuggled byte is single character is something of a misstatement (good enough for present purpose of discussing email, but not good enough for the general case of understanding how this is supposed to work when porting the construct to other Python implementations), while for string parsing purposes...but they don't match any of our parsing constants. is precisely Pythonically correct. You might want to add because all parsing constants contain only valid characters by construction. [*] I worried a lot that this was re-introducing the bytes/string problem from python2. It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text. That is not true here because the representations of characters vs. smuggled bytes in str are disjoint sets. Footnotes: [1] In Unicode terminology, a code unit is the smallest computer object that can represent a character (this is uniquely and sanely defined for all real Unicode transformation formats aka UTFs). A code point is an integer 0 - (17*256*256-1) that can represent a character, but many code points such as surrogates and 0x are defined to be non-characters. 
Characters are those code points that may be assigned an interpretation as a character, including undefined characters (private space and reserved).
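The arithmetic and terminology in the footnote can be checked from Python itself. This is a small sketch; the sample character (an emoji outside the BMP) and the variable names are chosen only for illustration.

```python
import sys
import unicodedata

# The code point range from the footnote: 17 planes of 256*256 code points.
max_code_point = 17 * 256 * 256 - 1

# One code point outside the BMP needs several code units in a UTF.
emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2   # 16-bit code units
utf8_units = len(emoji.encode("utf-8"))             # 8-bit code units

# General categories: "Cs" is surrogate, "Cn" covers unassigned code points
# and noncharacters such as U+FFFF.
surrogate_category = unicodedata.category("\udc80")
nonchar_category = unicodedata.category("\uffff")
```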
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 08:57:21 +0900, Stephen J. Turnbull step...@xemacs.org wrote: As long as the Java string manipulation functions don't check for surrogates, you should be fine with this representation. Of course I suppose your matching functions (etc) don't check for them either, so you will be somewhat vulnerable to bugs due to treating them as characters. But the same is true for CPython, AFAIK. From my point of view, the string function laxness is a feature, not a bug. But I get what you mean. --David
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com wrote: Yes. I thought you were saying that one could not treat the string with smuggled bytes as if it were a string. (It's a string that can't be encoded unless you use the surrogateescape error handler, but it is still a string from Python's POV, which is the point of the error handler). Or, to put it another way, your implication was that there were no string operations that could be usefully applied to a string containing smuggled bytes, but that is not the case. (I may well have read an implication that was not there; if so I apologize and you can ignore the rest of this :) Ahh, I see where we are getting confused. What I said was that you can't treat the string as a *pure* Unicode string. Parts of it are Unicode text, parts of it aren't. Basically, we are pretending that the each smuggled byte is single character for string parsing purposes...but they don't match any of our parsing constants. They are all any character matches in the regexes and what have you. This is slightly iffy, as you can't be sure that one byte represents one character, but as long as you don't much care about that, it's not going to be an issue. I'm fairly sure you're never going to find an encoding in which one unknown byte represents two characters, but there are cases where it takes more than one byte to make up a character (or the bytes are just shift codes or something). Does that ever throw off your regexes? It wouldn't be an issue to a .* between two character markers, but if you ever say .{5} then it might match incorrectly. I think we're in agreement here, just using different words. :) ChrisA ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
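The `.{5}` concern can be made concrete with a small sketch. It assumes a reader that decodes as ASCII with surrogateescape and is handed UTF-8 bytes it does not recognise; the sample word is arbitrary.

```python
import re

raw = "na\u00efve".encode("utf-8")    # "naive" with a diaeresis: 5 characters, 6 bytes
text = raw.decode("ascii", errors="surrogateescape")

# Each undecodable byte becomes one code point, so the count is off by one.
exactly_five = re.fullmatch(r".{5}", text)

# Matching *around* the smuggled section is unaffected.
framed = re.fullmatch(r"<(.*)>", "<" + text + ">")
```

So a counted match can be thrown off, while an open-ended match between two known markers still captures the smuggled section intact.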
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote: It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text. That is not true here because the representations of characters vs. smuggled bytes in str are disjoint sets. Actually, while it may be true that for the email headers case, all characters are characters, just the encoding is unknown, it is not necessarily true that they are in disjoint sets. Some bytes may decode into characters without needing to be smuggled... maybe not in text-protocols like email, but in the general case. So then some of the bytes that should be interpreted as binary data are not in a disjoint set from characters.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Glenn Linderman writes: Some bytes may decode into characters without needing to be smuggled... maybe not in text-protocols like email, but in the general case. So then some of the bytes that should be interpreted as binary data are not in a disjoint set from characters. True, but irrelevant. The point is that whoever chose the codec is responsible for getting it right, not only the right encoding, but for the assumption that the input data was pure encoded text. The rest of the program can now assume that choice was made correctly, and process text as text. The program cannot be blamed for assuming that the person who chose the codec knew what they were about, and so characters can be *assumed* to be decoded from bytes representing characters. This was not true in Python 2, where it was common practice to represent encoded text by itself internally, implicitly assuming that only one encoding would be encountered in each invocation of the program. This was never true, and with the spread of the Internet and then the WWW, it became a major issue. And that's why we invented Python 3, to let text be text without the encumbrance of always being aware of encodings and converting when different encodings collide, etc.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com wrote: Basically, we are pretending that the each smuggled byte is single character for string parsing purposes...but they don't match any of our parsing constants. They are all any character matches in the regexes and what have you. This is slightly iffy, as you can't be sure that one byte represents one character, but as long as you don't much care about that, it's not going to be an issue. This discussion would probably be a lot more easy to follow, with fewer miscommunications, if there were some examples. Here is my example, perhaps someone can tell me if I'm understanding it correctly. I want to send an email including the header line: 'Subject: “NOBODY expects the Spanish Inquisition!”' Note the curly quotes. I've read the manifesto UTF-8 Everywhere so I do the right thing and encode it as UTF-8: b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d' but my mail package, not being written in a language as awesome as Python, is just riddled with bugs, and somehow I end up with this corrupted byte-string instead: b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d' Note that the bytes from the first curly quote bytes are in the wrong order, but the second is okay. (Like I said, it's just *riddled* with bugs.) That means that trying to decode those bytes will fail in Python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 9: invalid start byte but it's not up to Python's email package to throw those invalid bytes out or permantly replace them with something else. 
Also, we want to work with Unicode strings, not byte strings, so there has to be a way to smuggle those three bytes into Unicode, without ending up with either the replacement bytes: # using the 'replace' error handler 'Subject: ���NOBODY expects the Spanish Inquisition!”' or incorrectly interpreting them as valid, but wrong, code points. (If we do the second, we end up with two control characters \x9c\x80 followed by â.) We want to be able to round-trip back to the same bytes we received. Am I right so far? So the email package uses the surrogate-escape error handler and ends up with this Unicode string: 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”' which can be encoded back to the bytes we started with. Note that technically those three \u... code points are NOT classified as noncharacters. They are actually surrogate code points: http://www.unicode.org/faq/private_use.html#nonchar4 http://www.unicode.org/glossary/#surrogate_code_point and they're supposed to be reserved for UTF-16. I'm not sure of the implication of that. I'm fairly sure you're never going to find an encoding in which one unknown byte represents two characters, There are encodings which use a shift mechanism, whereby a byte X represents one character by default, and a different character after the shift mechanism. But I don't think that matters, since we're not able to interpret those bytes. If we were, we'd just decode them to a text string and be done with it. but there are cases where it takes more than one byte to make up a character (or the bytes are just shift codes or something). Multi-byte encodings are very common. All the Unicode encodings are multi-byte. So are many East Asian encodings. Does that ever throw off your regexes? It wouldn't be an issue to a .* between two character markers, but if you ever say .{5} then it might match incorrectly. I don't think the idea is to match on these smuggled bytes specifically. I think the idea is to match *around* them. 
In the example above, we might match everything from Subject: to the end of the line. So long as we never end up with a situation where the smuggled bytes are replaced by something else, or shuffled around into different positions, we should be fine. David, is my understanding correct? -- Steven ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
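The example in this message can be reproduced with plain bytes and str, without the email package. This is a sketch of the behaviour of the error handlers only; how the email package applies them internally is not shown here.

```python
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Strict decoding fails on the first corrupted byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# 'replace' decodes, but the original bytes are lost for good.
replaced = raw.decode("utf-8", errors="replace")

# Latin-1 "succeeds" by producing valid but wrong code points.
wrong = raw.decode("latin-1")

# surrogateescape smuggles the three bad bytes and round-trips exactly.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Encoding without the handler refuses the lone surrogates.
try:
    text.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
```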
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano st...@pearwood.info writes: On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com wrote: Basically, we are pretending that the each smuggled byte is single character for string parsing purposes...but they don't match any of our parsing constants. They are all any character matches in the regexes and what have you. This is slightly iffy, as you can't be sure that one byte represents one character, but as long as you don't much care about that, it's not going to be an issue. This discussion would probably be a lot more easy to follow, with fewer miscommunications, if there were some examples. Here is my example, perhaps someone can tell me if I'm understanding it correctly. I want to send an email including the header line: 'Subject: “NOBODY expects the Spanish Inquisition!”'

  >>> from email.header import Header
  >>> h = Header('Subject: “NOBODY expects the Spanish Inquisition!”')
  >>> h.encode('utf-8')
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
  >>> h.encode()
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
  >>> h.encode('ascii')
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='

--
Akira
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Sat Sep 13 00:16:30 CEST 2014, Jeff Allen wrote: 1. Java does not really have a Unicode type, therefore not one that validates. It has a String type that is a sequence of UTF-16 code units. There are some String methods and Character methods that deal with code points represented as int. I can put any 16-bit values I like in a String. Including lone surrogates, and invalid characters in general? 2. With proper accounting for indices, and as long as surrogates appear in pairs, I believe operations like find or endswith give correct answers about the unicode, when applied to the UTF-16. This is an attractive implementation option, and mostly what we do. So use it. The fact that you're having to smuggle bytes already guarantees that your data is either invalid or misinterpreted, and bug-free isn't possible. In terms of best-effort, it is reasonable to treat the smuggled bytes as representing a character outside of your unicode repertoire -- so it won't ever match entirely valid strings, except perhaps via a wildcard. And it should still work for .endswith(the same invalid characters). 3. I'm fixing some bugs where we get it wrong beyond the BMP, and the fix involves banning lone surrogates (completely). At present you can't type them in literals but you can sneak them in from Java. So how will you ban them, and what will you do when some java class sends you an invalid sequence anyhow? That is exactly the use case for these smuggled bytes... If you distinguish between a fully constructed PyString and a code-unit-sequence-that-could-be-made-into-a-PyString-later, then you could always have your constructor return an InvalidPyString subclass on the rare occasions when one is needed. If you want to avoid invalid surrogates even then, just use the replacement character and keep a separate list of original characters that got replaced in this string -- a hassle, but no worse than tracking indices for surrogates. 4. 
I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against. If you allow direct write access to the underlying charsequence (as CPython does to C extensions), then you can't really ban invalid sequences. If callers have to go through an API -- even something as minimal as getBytes or getChars -- then you can use whatever internal representation you prefer. Hopefully, the vast majority of strings won't actually have smuggled bytes. -jJ -- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jim J. Jewett writes: In terms of best-effort, it is reasonable to treat the smuggled bytes as representing a character outside of your unicode repertoire I have to disagree. If you ever end up passing them to something that validates or tries to reencode them without surrogateescape, BOOM! These things are the text equivalent of IEEE NaNs. If all you know (as in the stdlib) is that you have generic text, the only fairly safe things to do with them are (1) delete them, (2) substitute an appropriate replacement character for them, (3) pass the text containing them verbatim to other code, and (4) reencode them using the same codec they were read with. -- so it won't ever match entirely valid strings, except perhaps via a wildcard. And it should still work for .endswith(the same invalid characters). Incorrect, I'm pretty sure, unless you know that both texts containing the same invalid code points were read with the same codec. Eg, consider two filenames encoded in ISO Cyrillic and ISO Hebrew, read with (encoding='ascii', errors='surrogateescape'). Apps that know the semantics of the text may DWIM/DTRT if they want to, but FWIW-IMHO-YMMV-and-any-other-4-letter-caveat-acronyms-that- may-apply Python and the stdlib shouldn't try to guess. Guessing may be unavoidable, of course. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
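The Cyrillic/Hebrew point can be illustrated with a single byte. This sketch assumes two one-letter "filenames"; the letters were picked because both encode to the byte 0xE0 in their respective ISO 8859 parts.

```python
# Cyrillic small letter er (U+0440) in ISO 8859-5 and Hebrew letter alef
# (U+05D0) in ISO 8859-8 are both stored as the single byte 0xE0.
cyrillic = "\u0440".encode("iso8859_5")
hebrew = "\u05d0".encode("iso8859_8")

a = cyrillic.decode("ascii", errors="surrogateescape")
b = hebrew.decode("ascii", errors="surrogateescape")

# The smuggled forms compare equal although the intended texts differ.
same_suffix = a.endswith(b)

# Passing them to something that encodes without the handler goes BOOM.
try:
    a.encode("utf-8")
    boom = None
except UnicodeEncodeError as exc:
    boom = exc
```

A match on smuggled code points therefore only says the bytes were equal, not that the characters were.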
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull step...@xemacs.org wrote: Jim J. Jewett writes: In terms of best-effort, it is reasonable to treat the smuggled bytes as representing a character outside of your unicode repertoire I have to disagree. If you ever end up passing them to something that validates or tries to reencode them without surrogateescape, BOOM! These things are the text equivalent of IEEE NaNs. If all you know (as in the stdlib) is that you have generic text, the only fairly safe things to do with them are (1) delete them, (2) substitute an appropriate replacement character for them, (3) pass the text containing them verbatim to other code, and (4) reencode them using the same codec they were read with. Don't forget, these are *errors*. These are bytes that cannot be correctly decoded. That's not something that has any meaning whatsoever in text; so by definition, the only things you can do are the four you list there (as long as codec means both the choice of encoding and the use of the surrogateescape flag). It's like dealing with control characters when you need to print something visually, except that they have an official solution [1] and surrogateescape is unofficial. They're not real text, so you have to deal with them somehow. The bytes might each represent one character. Several of them together might represent a single character. Or maybe they don't mean anything at all, and they're just part of a chunked data format... like I was finding in the .cwk files that I was reading this weekend (it's mostly MacRoman encoding, but the text is divided into chunks separated by \0\0 and two more bytes - turns out the bytes are chunk lengths, so they don't mean any sort of characters at all). You can't know. 
ChrisA [1] http://www.unicode.org/charts/PDF/U2400.pdf
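The four "fairly safe" operations listed above can each be written in a line or two. This is a sketch under the assumption that the text was read as ASCII with surrogateescape; the `SMUGGLED` pattern name and the sample bytes are made up for the example.

```python
import re

# Matches exactly the code points that surrogateescape can produce.
SMUGGLED = re.compile("[\udc80-\udcff]")

raw = b"caf\xe9 au lait"
text = raw.decode("ascii", errors="surrogateescape")

deleted = SMUGGLED.sub("", text)             # (1) delete them
replaced = SMUGGLED.sub("\ufffd", text)      # (2) substitute U+FFFD
passed_on = text                             # (3) pass the text on verbatim
reencoded = text.encode("ascii", errors="surrogateescape")  # (4) same codec
```

Only (3) and (4) preserve the original bytes; (1) and (2) give valid Unicode at the cost of losing them.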
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote: 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against. It may be worth trying *without* it (i.e. treat surrogateescape as equivalent to strict initially), and seeing how you go. The main purpose of surrogateescape in CPython 3 is to recreate the "arbitrary 8-bit data round trips work on POSIX" aspect of CPython 2, which doesn't apply in exactly the same way on Jython. Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2, "surrogateescape is equivalent to strict" seems like a relatively small discrepancy in behaviour. Cheers, Nick.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan ncogh...@gmail.com wrote: On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote: 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against. It may be worth trying *without* it (i.e. treat surrogateescape as equivalent to strict initially), and seeing how you go. The main purpose of surrogateescape in CPython 3 is to recreate the "arbitrary 8-bit data round trips work on POSIX" aspect of CPython 2, which doesn't apply in exactly the same way on Jython. Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2, "surrogateescape is equivalent to strict" seems like a relatively small discrepancy in behaviour. That would totally break the email package. It would of course be possible to rewrite email to not use surrogate escape, but it is a seriously non-trivial undertaking. --David
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Sat, Sep 13, 2014, 09:33 R. David Murray rdmur...@bitdance.com wrote: On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan ncogh...@gmail.com wrote: On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote: 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against. It may be worth trying *without* it (i.e. treat surrogateescape as equivalent to strict initially), and seeing how you go. The main purpose of surrogateescape in CPython 3 is to recreate the arbitrary 8-bit data round trips work on POSIX aspect of CPython 2, which doesn't apply in exactly the same way on Jython. Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2, surrogateescape is equivalent to strict seems like a relatively small discrepancy in behaviour. That would totally break the email package. It would of course be possible to rewrite email to not use surrogate escape, but it is a seriously non-trivial undertaking. --David ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ tlesher%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 14 Sep 2014 01:33, R. David Murray rdmur...@bitdance.com wrote: On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan ncogh...@gmail.com wrote: On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote: 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against. It may be worth trying *without* it (i.e. treat surrogateescape as equivalent to strict initially), and seeing how you go. The main purpose of surrogateescape in CPython 3 is to recreate the arbitrary 8-bit data round trips work on POSIX aspect of CPython 2, which doesn't apply in exactly the same way on Jython. Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2, surrogateescape is equivalent to strict seems like a relatively small discrepancy in behaviour. That would totally break the email package. It would of course be possible to rewrite email to not use surrogate escape, but it is a seriously non-trivial undertaking. That does indeed make for a compelling use case :) Cheers, Nick. --David ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 12/09/2014 04:28, Stephen J. Turnbull wrote: Jeff Allen writes: A welcome article. One correction should be made, I believe: the area of code point space used for the smuggling of bytes under PEP-383 is not a Unicode Private Use Area, but a portion of the trailing surrogate range. Nice catch. Note that the surrogate range was originally part of the Private Use Area, but it was carved out with the adoption of UTF-16 in about 1993. In practice, I doubt that there are any current implementations claiming compatibility with Unicode 1.0 (IIRC, UTF-16 was made mandatory in Unicode 1.1). That's a helpful bit of history that explains the uncharacteristic inaccuracy. Most I can do to keep the current position clear in my head. I've always thought that the right way to handle the private use area for platforms like Python and Emacs, which may need to use it for their own purposes (such as undecodable bytes) but want to respect its use by applications, is to create an auxiliary table mapping the private use area to objects describing the characters represented by the private use code points. These objects would have attributes such as external representation for text I/O, glyph (for GUI display), repr (for TTY display), various Unicode properties, etc. Simply having a block for private use seems to create an unmanaged space for conflict, reminiscent of the other 128 characters in bilingual programming. I wondered if the way to respect use by applications might be to make it private to a particular sub-class of str, idly however. Jeff Allen ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Fri, 12 Sep 2014 07:54:56 +0100 Jeff Allen ja...@farowl.co.uk wrote: Simply having a block for private use seems to create an unmanaged space for conflict, reminiscent of the other 128 characters in bilingual programming. I wondered if the way to respect use by applications might be to make it private to a particular sub-class of str, idly however. It's not private from Python's point of view, it's actually specified in a PEP. So all Python 3 code has to follow the rule, and there's no conflict internally. The characters shouldn't leak out to other applications, unless the user's code does its I/O very badly :-) Regards Antoine.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On September 11, 2014, Jeff Allen wrote:
> ... the area of code point space used for the smuggling of bytes under PEP-383 is not a Unicode Private Use Area, but a portion of the trailing surrogate range. This is a code violation, which I imagine is why surrogateescape is an error handler, not a codec.

True, but I believe that is a CPython implementation detail. Other implementations (including jython) should implement the surrogateescape API, but I don't think it is important to use the same internal representation for the invalid bytes. (Well, unless you want to communicate with external tools (GUIs?) that are trying to directly use (effectively bytes rather than strings) in that particular internal encoding when communicating with python.)

> lone surrogates preclude a naive use of the platform string library

Invalid input often causes problems. Are you saying that there are situations where the platform string library could easily handle invalid characters in general, but has a problem with the specific case of lone surrogates?

-jJ

--
If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jeff Allen writes:
> Simply having a block for private use seems to create an unmanaged space for conflict,

No. The uncharted range of human language (including recently-invented nonsense like emoticons and the annual "design a character" contest run by a newspaper in Taipei, with the grand prize being your character gets added to the national standard IIRC, but maybe it's just that newspaper's collection of private space characters) already contains those conflicts. Believe me, "private use space, manage it yourself" was the best they could do.

I've been working with the bureaucratic insanity of the Japanese national standard -- it took almost 3 decades before every Japanese citizen could store their names in a computer using government-approved codes -- and the chaos of the Taiwanese national standard -- which contains hordes of characters with one known use and no known meaning, many of them duplicates -- for twenty years now. Neither approach works as well as Unicode's, despite its design-by-committee flaws overlaid with national animosities that can flare into linguicidal vetoes and code-space-stuffing logrolling.

> reminiscent of the other 128 characters in bilingual programming. I wondered if the way to respect use by applications might be to make it private to a particular sub-class of str, idly however.

If I understand your suggestion, that's precisely the intent of PEP 383, to make undecodable bytes in a coded character stream private. But they need to be in the stream one way or another. So PEP 383 chose to use a non-Unicode encoding (based on the lone surrogate device invented by Markus Kuhn for utf-8b) to deal with that, and that does effectively make those elements private to Python (but of course not in the Unicode sense, as they're not even characters in Unicode). But I gather the native Unicode type in Java doesn't allow you to use that dodge because it checks for malformed Unicode internally (i.e., at a level not controllable by Jython).

So you have to embed such stream elements in the space of Unicode characters. You have the option of the private space or unallocated (reserved) space. The latter seems like asking for trouble, and the only way to avoid it would be to be prepared to move that data around in case of collision. But that's precisely what I'm suggesting doing in private space. Same issue, either way. Private space with a local registry seems saner.
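
To make the alternative Stephen describes concrete, here is a rough sketch of parking undecodable bytes in private space instead of in lone surrogates. This is not what PEP 383 does. The handler name "puaescape", the base U+F700 and the helper pua_unescape are all hypothetical, chosen only for illustration; codecs.register_error is the standard library hook used to install the handler.

```python
import codecs

# Hypothetical choice: park byte 0xNN at U+F700 + 0xNN, inside the BMP
# Private Use Area (U+E000..U+F8FF).
PUA_BASE = 0xF700

def pua_escape(exc):
    """Decoding error handler: map each undecodable byte into private space."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end

# The handler name is made up for this sketch.
codecs.register_error("puaescape", pua_escape)

def pua_unescape(text):
    """Encode to UTF-8, turning parked code points back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)

raw = b"caf\xe9 au lait"              # Latin-1 bytes, invalid as UTF-8
text = raw.decode("utf-8", errors="puaescape")
restored = pua_unescape(text)

# The collision: an application that legitimately uses U+F7E9 for its own
# character is indistinguishable from a parked byte on the way out.
app_text = "\uf7e9"
collided = pua_unescape(app_text) != app_text.encode("utf-8")
```

The last two lines show why a fixed mapping is not enough on its own: without a registry recording who owns which private code point, application data gets corrupted on output.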
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jim, Stephen:

It seems like we're off topic here, but to answer all as briefly as possible:

1. Java does not really have a Unicode type, therefore not one that validates. It has a String type that is a sequence of UTF-16 code units. There are some String methods and Character methods that deal with code points represented as int. I can put any 16-bit values I like in a String.

2. With proper accounting for indices, and as long as surrogates appear in pairs, I believe operations like find or endswith give correct answers about the unicode, when applied to the UTF-16. This is an attractive implementation option, and mostly what we do.

3. I'm fixing some bugs where we get it wrong beyond the BMP, and the fix involves banning lone surrogates (completely). At present you can't type them in literals but you can sneak them in from Java.

4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against.

Jeff Allen

On 12/09/2014 16:37, Jim J. Jewett wrote:
> On September 11, 2014, Jeff Allen wrote:
>> ... surrogateescape is an error handler, not a codec.
> True, but I believe that is a CPython implementation detail. Other implementations (including jython) should implement the surrogateescape API, but I don't think it is important to use the same internal representation for the invalid bytes.
>> lone surrogates preclude a naive use of the platform string library
> Invalid input often causes problems. Are you saying that there are situations where the platform string library could easily handle invalid characters in general, but has a problem with the specific case of lone surrogates?
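
Point 2 in Jeff's list (searching over UTF-16 code units and then correcting the index) can be sketched in Python as follows. This only illustrates the idea and is not Jython's actual code; the function names are hypothetical, and the sketch assumes every surrogate in the input is correctly paired.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def unit_index_to_code_point_index(units, unit_index):
    """Translate an index counted in code units into one counted in code points.

    Assumes surrogates are paired: every trailing surrogate before the
    index is the second half of a character that was counted twice.
    """
    trailing = sum(1 for u in units[:unit_index] if 0xDC00 <= u <= 0xDFFF)
    return unit_index - trailing

def find(haystack, needle):
    """Search over code units, then report the position in code points."""
    h, n = utf16_units(haystack), utf16_units(needle)
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return unit_index_to_code_point_index(h, i)
    return -1

haystack = "a\U0001F600bc"            # one character beyond the BMP
unit_count = len(utf16_units(haystack))
position = find(haystack, "b")
```

A lone surrogate breaks the pairing assumption in unit_index_to_code_point_index, which is one way to see why banning them (point 3) simplifies the implementation.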
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
A welcome article. One correction should be made, I believe: the area of code point space used for the smuggling of bytes under PEP-383 is not a Unicode Private Use Area, but a portion of the trailing surrogate range. This is a code violation, which I imagine is why surrogateescape is an error handler, not a codec.

http://www.unicode.org/faq/private_use.html

I believe the private use area was considered and rejected for PEP-383. In an implementation of the type unicode based on UTF-16 (Jython), lone surrogates preclude a naive use of the platform string library. This is on my mind at the moment as I'm working several bugs in Jython's unicode type, and can see why it has been too difficult.

Jeff

On 10/09/2014 08:17, Nick Coghlan wrote:
> Since it may come in handy when discussing "Why was Python 3 necessary?" with folks, I wanted to point out that my article on the transition to multilingual programming has now been reposted on the Red Hat developer blog: http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/
>
> I wouldn't normally bring the Red Hat brand into an upstream discussion like that, but this myth that Python 3 is killing the language, and that Python 2 could have continued as a viable development platform indefinitely if only Guido and the core development team hadn't decided to go ahead and create Python 3, is just plain wrong, and it really needs to die.
>
> I'm hoping that borrowing a bit of Red Hat's enterprise credibility will finally get people to understand that we really do have some idea what we're doing, which is why most of our redistributors and many of our key users are helping to push the migration forward, while we also continue to support existing Python 2 users :)
>
> Cheers,
> Nick.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jeff Allen writes:
> A welcome article. One correction should be made, I believe: the area of code point space used for the smuggling of bytes under PEP-383 is not a Unicode Private Use Area, but a portion of the trailing surrogate range.

Nice catch. Note that the surrogate range was originally part of the Private Use Area, but it was carved out with the adoption of UTF-16 in about 1993. In practice, I doubt that there are any current implementations claiming compatibility with Unicode 1.0 (IIRC, UTF-16 was made mandatory in Unicode 1.1).

> This is a code violation, which I imagine is why surrogateescape is an error handler, not a codec.

Yes.

> I believe the private use area was considered and rejected for PEP-383. In an implementation of the type unicode based on UTF-16 (Jython), lone surrogates preclude a naive use of the platform string library. This is on my mind at the moment as I'm working several bugs in Jython's unicode type, and can see why it has been too difficult.

I've always thought that the right way to handle the private use area for platforms like Python and Emacs, which may need to use it for their own purposes (such as undecodable bytes) but want to respect its use by applications, is to create an auxiliary table mapping the private use area to objects describing the characters represented by the private use code points. These objects would have attributes such as external representation for text I/O, glyph (for GUI display), repr (for TTY display), various Unicode properties, etc.
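
The auxiliary table Stephen describes might look roughly like the sketch below. Nothing like it exists in Python or Emacs as far as this thread says; the class names, the attributes and the allocation policy are assumptions made for illustration, and the glyph and Unicode-property attributes he mentions are left out to keep it short.

```python
from dataclasses import dataclass

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # BMP Private Use Area

@dataclass(frozen=True)
class PrivateChar:
    """Describes what one private use code point stands for."""
    external: bytes  # external representation for text I/O
    display: str     # repr for TTY display
    owner: str       # who registered it: the platform or an application

class PrivateUseRegistry:
    """Hypothetical table mapping private use code points to descriptions."""

    def __init__(self):
        self._by_code_point = {}
        self._next_free = PUA_FIRST

    def allocate(self, description):
        """Hand out the next unused private code point for description."""
        if self._next_free > PUA_LAST:
            raise OverflowError("private use area exhausted")
        code_point = self._next_free
        self._next_free += 1
        self._by_code_point[code_point] = description
        return chr(code_point)

    def describe(self, char):
        """Return the description for char, or None if it is not registered."""
        return self._by_code_point.get(ord(char))

    def render(self, text):
        """Replace registered private characters by their TTY display form."""
        return "".join(
            self._by_code_point[ord(c)].display if ord(c) in self._by_code_point else c
            for c in text
        )

registry = PrivateUseRegistry()
bad_byte = registry.allocate(PrivateChar(b"\xe9", "<E9>", "platform"))
app_char = registry.allocate(
    PrivateChar("\ue123".encode("utf-8"), "<app:E123>", "application")
)
shown = registry.render("caf" + bad_byte + app_char)
```

Because the registry assigns the internal code points itself, an undecodable byte and an application's own private character never share a slot, and each knows how to write itself back out through its external attribute.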
[Python-Dev] Multilingual programming article on the Red Hat Developer blog
Since it may come in handy when discussing "Why was Python 3 necessary?" with folks, I wanted to point out that my article on the transition to multilingual programming has now been reposted on the Red Hat developer blog: http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/

I wouldn't normally bring the Red Hat brand into an upstream discussion like that, but this myth that Python 3 is killing the language, and that Python 2 could have continued as a viable development platform indefinitely if only Guido and the core development team hadn't decided to go ahead and create Python 3, is just plain wrong, and it really needs to die.

I'm hoping that borrowing a bit of Red Hat's enterprise credibility will finally get people to understand that we really do have some idea what we're doing, which is why most of our redistributors and many of our key users are helping to push the migration forward, while we also continue to support existing Python 2 users :)

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 10, 2014 at 05:17:57PM +1000, Nick Coghlan wrote:
> Since it may come in handy when discussing "Why was Python 3 necessary?" with folks, I wanted to point out that my article on the transition to multilingual programming has now been reposted on the Red Hat developer blog: http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/

That's awesome! Thank you Nick.

--
Steven