Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
> > > Basically, we are pretending that each smuggled byte is a single
> > > character for string parsing purposes...but they don't match any
> > > of our parsing constants. They are all "any character" matches
> > > in the regexes and what have you.
> >
> > This is slightly iffy, as you can't be sure that one byte represents
> > one character, but as long as you don't much care about that, it's not
> > going to be an issue.
>
> This discussion would probably be a lot easier to follow, with fewer
> miscommunications, if there were some examples. Here is my example,
> perhaps someone can tell me if I'm understanding it correctly.
>
> I want to send an email including the header line:
>
> 'Subject: “NOBODY expects the Spanish Inquisition!”'
>
> Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I
> do the right thing and encode it as UTF-8:
>
> b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

That won't work until email supports RFC 6532. Until then, you can only
use ASCII and encoded words successfully. So a program that just sends
the curly quotes is buggy enough already.

> but it's not up to Python's email package to throw those invalid bytes
> out or permanently replace them with something else. Also, we want to work
> with Unicode strings, not byte strings, so there has to be a way to
> smuggle those three bytes into Unicode, without ending up with either
> the replacement bytes:
>
> # using the 'replace' error handler
> 'Subject: ���NOBODY expects the Spanish Inquisition!”'

What you'll get if you request a text copy of that header is

'Subject: ���NOBODY expects the Spanish Inquisition!���'

> Am I right so far?
>
> So the email package uses the surrogate-escape error handler and ends up
> with this Unicode string:
>
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

Except that it encodes the closing quote, too :)

> which can be encoded back to the bytes we started with.

Right. If you serialize the message as bytes, the bytes are recovered
and output when that header is output.

Now, once we support RFC 6532, you will be exactly right, as we will
then have the option of handling utf-8 encoded headers, and we will do
that using the utf-8 codec to ingest headers, and the surrogateescape
error handler to handle exactly the kind of bad data you postulate.

--David
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
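The error handlers discussed above can be tried directly on a bytes object. What follows is a minimal sketch, not the email package's actual code path. It assumes the header arrived with the three opening-quote bytes in a corrupted order (\x9c\x80\xe2, inferred from the surrogate string quoted above, since the passage that introduced the bad bytes is trimmed from the quote). The names raw, lenient, smuggled and as_ascii are illustrative only.

```python
# Hypothetical header: the opening quote's bytes are corrupted (assumed
# order \x9c\x80\xe2); the closing quote is valid UTF-8 for U+201D.
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# 'replace' substitutes U+FFFD for each undecodable byte; the original
# bytes cannot be recovered afterwards.
lenient = raw.decode('utf-8', errors='replace')

# 'surrogateescape' (PEP 383) maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN, so the data survives a round trip.
smuggled = raw.decode('utf-8', errors='surrogateescape')

# Decoding as ASCII instead escapes every non-ASCII byte, including the
# three bytes of the valid closing quote.
as_ascii = raw.decode('ascii', errors='surrogateescape')

# ascii() is used so that printing never trips over a lone surrogate.
print(ascii(lenient))
print(ascii(smuggled))
print(ascii(as_ascii))

# Encoding with the same codec and handler restores the original bytes.
assert smuggled.encode('utf-8', errors='surrogateescape') == raw
assert as_ascii.encode('ascii', errors='surrogateescape') == raw
```

The ASCII variant corresponds to the remark that the closing quote gets encoded too: with an ASCII-only decoder, a perfectly valid UTF-8 sequence is just three more bytes to smuggle.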
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Sorry for the mojibake. I've not yet gotten around to actually using the email package to write a smarter replacement for nmh, which is what I use for email, and I always forget that I need to manually tell nmh when there is non-ascii in the message...

On Wed, 17 Sep 2014 03:02:33 -0400, "R. David Murray" wrote:
> [...]
[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)
This feels like a jython-dev discussion. But anyway ...

On 17/09/2014 00:57, Stephen J. Turnbull wrote:
> The CPython representation uses trailing surrogates only[1], so it's
> never possible to interpret them as anything but non-characters -- as
> soon as you encounter them you know that it's a lone surrogate. Surely
> you can do the same.
>
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation. Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewhat vulnerable to bugs due to treating them as
> characters. But the same is true for CPython, AFAIK.

They don't check. I agree that since only the trailing surrogate code
points are allowed, you can tell that you have one, even in the UTF-16
form. The problem is that, if strings containing lone trailing
surrogates are allowed, then:

u'\udc83' in u'abc\U00010083xyz'
u'abc\U00010083xyz'.endswith(u'\udc83xyz')

are both True, if implemented in the obvious way on the UTF-16
representation. And this should not be so in Jython, which claims to be
a wide build. (I can't actually type the second one, but I can get the
same effect in Jython 2.7b3 via a java.lang.StringBuilder.)

I believe that the usual string operations work correctly on the UTF-16
version of the string, as long as indexes are adjusted correctly.

If we think it is ok that code using such methods gives the wrong answer
when fed strings containing smuggled bytes, then isolated (trailing)
surrogates could be allowed. It's the user's fault for calling the
method on that data. But I think it kinder that our implementation
defend users from these wrong answers. In the latest state of Jython, we
do this by rigorously preventing the construction of a PyUnicode
containing a lone surrogate, so we can just use UTF-16 operations
without further checks. I'm not sure that rigour will be universally
welcomed, and clearly it precludes PEP-383 byte smuggling.
Jeff
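The two expressions above can be reproduced without Jython. The sketch below is a model under stated assumptions, not Jython's implementation: it encodes each string to UTF-16 code units and compares those units naively, which is the "obvious way" the message describes. The helper names utf16_units, naive_contains and naive_endswith are made up for illustration, and CPython 3.4 or later is assumed (for the surrogatepass handler with UTF-16).

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode('utf-16-le', errors='surrogatepass')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def naive_contains(haystack, needle):
    """Substring test on code units, ignoring surrogate pairing."""
    h, n = utf16_units(haystack), utf16_units(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


def naive_endswith(haystack, suffix):
    """Suffix test on code units, ignoring surrogate pairing."""
    h, n = utf16_units(haystack), utf16_units(suffix)
    return len(n) <= len(h) and h[len(h) - len(n):] == n


text = 'abc\U00010083xyz'   # U+10083 is stored as the pair D800 DC83

# Code-point semantics (CPython 3.3+): the lone surrogate is not present.
print('\udc83' in text, text.endswith('\udc83xyz'))

# Code-unit semantics: the trailing half of the pair matches.
print(naive_contains(text, '\udc83'), naive_endswith(text, '\udc83xyz'))
```

The first print reports both tests as False and the second reports both as True, which is the discrepancy a UTF-16 based implementation has to guard against if it admits lone surrogates.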
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:

> Guido's mantra is something like "Python's str doesn't contain
> characters or even code points[1], it contains code units."

But is that true? If it were true, I would expect to be able to make
Python text strings containing code units that aren't code points, e.g.
something like "\U1234" or chr(0x1234) should work, but neither
do. As far as I can tell, there is no way to build a string containing
items which aren't code points.

I don't think it is useful to say that strings *contain* code units,
more that they *are made up from* code units. Code units are the
implementation: 16-bit code units in narrow builds, 32-bit code units
in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and
beyond. (I don't know of any Python implementation which uses UTF-8
internally, but if there was one, it would use 8-bit code units.)

It isn't very useful to say that in Python 3.3 the string "A" *contains*
the 8-bit code unit 0x41. That's conflating two different levels of
explanation (the high-level interface and the underlying implementation)
and potentially leads to user confusion like

# 8-bit code units are bytes, right?
assert b'\x41' in "A"

which is Not Even Wrong.
http://rationalwiki.org/wiki/Not_even_wrong

I think it is correct to say that Python strings are sequences of
Unicode code points U+0000 through U+10FFFF. There are no other
restrictions, e.g. strings can contain surrogates, noncharacters, or
nonsensical combinations of code points such as a U+0300 COMBINING GRAVE
ACCENT combined with U+000A (newline).

> Implying
> that dealing with characters (or the grapheme globs that occasionally
> raise their ugly heads here) is an issue for higher-level facilities
> than str to deal with.

Agreed that Python doesn't offer a string type based on graphemes, and
that such a facility belongs as a high-level library, not a built-in
type.

Also agreed that talking about characters is sloppy. Nevertheless, for
English speakers at least, "code point = character" isn't too awful a
first approximation.

> The point being that
>
> > Basically, we are pretending that each smuggled byte is a single
> > character
>
> is something of a misstatement (good enough for present purpose of
> discussing email, but not good enough for the general case of
> understanding how this is supposed to work when porting the construct
> to other Python implementations), while
>
> > for string parsing purposes...but they don't match any of our
> > parsing constants.
>
> is precisely Pythonically correct. You might want to add "because all
> parsing constants contain only valid characters by construction."

I don't understand what you are trying to say here.

> > [*] I worried a lot that this was re-introducing the bytes/string
> > problem from python2.
>
> It isn't, because the bytes/str problem was that given a str object
> out of context you could not tell whether it was a binary blob or
> text, and if text, you couldn't tell if it was external encoded text
> or internal abstract text.
>
> That is not true here because the representations of characters vs.
> smuggled bytes in str are disjoint sets.

Nor am I sure what you are trying to say here either.

> Footnotes:
> [1] In Unicode terminology, a code unit is the smallest computer
> object that can represent a character (this is uniquely and sanely
> defined for all real Unicode transformation formats aka UTFs). A code
> point is an integer 0 - (17*256*256-1) that can represent a character,
> but many code points such as surrogates and 0xFFFF are defined to be
> non-characters.

Actually not quite. "Noncharacter" is concretely defined in Unicode, and
there are only 66 of them, many fewer than the surrogate code points
alone. Surrogates are reserved, not noncharacters.

http://www.unicode.org/glossary/#surrogate_code_point
http://www.unicode.org/faq/private_use.html#nonchar1

It is wrong to talk about "surrogate characters", but perhaps you mean
to say that surrogates (by which I understand you to mean surrogate code
points) are "not human-meaningful characters", which is not the same
thing as a Unicode noncharacter.

> Characters are those code points that may be assigned
> an interpretation as a character, including undefined characters
> (private space and reserved).

So characters are code points which are characters, including undefined
characters? :-)

http://www.unicode.org/glossary/#character

--
Steven
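Several of the claims above can be checked interactively. The following sketch assumes CPython 3.3 or later. The value 0x110000 is chosen here simply as the first integer past the Unicode range (it is not taken from the message), and odd, noncharacters and surrogates are illustrative names.

```python
import sys

# A str may hold any code point from U+0000 to U+10FFFF, including
# surrogates and noncharacters.
assert sys.maxunicode == 0x10FFFF
odd = '\ud800' + '\uffff' + '\n\u0300'   # surrogate, noncharacter, newline + accent
print(len(odd))

# Nothing beyond the code point range can be constructed.
try:
    chr(0x110000)
except ValueError as exc:
    print('chr(0x110000):', exc)

# bytes and str are different types; the membership test is an error,
# not merely False.
try:
    b'\x41' in 'A'
except TypeError as exc:
    print('bytes in str:', exc)

# The 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of
# each of the 17 planes.
noncharacters = list(range(0xFDD0, 0xFDF0))
for plane in range(17):
    noncharacters += [plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF]

# The surrogate block alone is far larger.
surrogates = range(0xD800, 0xE000)
print(len(noncharacters), len(surrogates))
```

The last line prints 66 and 2048, which matches the point that noncharacters are many fewer than the surrogate code points.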
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Seriously, can this discussion move somewhere else? This has nothing to do on python-dev.

Thank you

Antoine.

On Wed, 17 Sep 2014 18:56:02 +1000 Steven D'Aprano wrote:
> [...]
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 17.09.14 10:56, Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
>
>> Guido's mantra is something like "Python's str doesn't contain
>> characters or even code points[1], it contains code units."
>
> But is that true?

It used to be true, and stopped being so with PEP 393. In particular,
Python 3.2 and before would expose UTF-16 in the narrow build, so the
elements of a string would be code units. Since Python 3.3, the
surrogate code points are no longer interpreted as UTF-16 code units.

Regards,
Martin
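The change described above is observable from Python code. A small sketch, assuming CPython 3.3 or later; the names astral and pair are illustrative. On a 3.2 narrow build the first print would have shown a length of 2 and a surrogate as the first element.

```python
import sys

astral = '\U00010083'          # one code point outside the BMP

# Since PEP 393, indexing and len() work on code points.
print(len(astral), hex(ord(astral[0])), hex(sys.maxunicode))

# The UTF-16 encoding still needs two code units, but that is now a
# property of the encoded bytes, not of the str object.
print(len(astral.encode('utf-16-le')) // 2)

# Two surrogate code points placed in a str stay two separate elements;
# they are not silently combined into the astral character.
pair = '\ud800\udc83'
print(len(pair), pair == astral)
```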
[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)
Jeff Allen writes:

> This feels like a jython-dev discussion. But anyway ...

Well, if the same representation could be used in Jython you could just
point to PEP 383 and be done with it.

> u'\udc83' in u'abc\U00010083xyz'

IMHO being able to type that is a bug. There should be no literal
notation for surrogates in Python (that is, if you type a literal that
looks like it refers to a surrogate, you should get a UnicodeError).
The "right way" (IMHO) to spell that is

chr(0xdc83) in u'abc\U00010083xyz'

I'm not Guido, and don't claim to channel him on this. But it seems
reasonable to me that Unicode literals should conform to Unicode in this
way. I might even extend that to noncharacters (the last two code points
in each plane and the 32-point "hole" in Arabic). I'll grant that chr()
is an unfortunate spelling, but I would imagine we could live with that
since chr() goes back forever with these semantics.

> u'abc\U00010083xyz'.endswith(u'\udc83xyz')
>
> are both True, if implemented in the obvious way on the UTF-16
> representation. And this should not be so in Jython, which claims to be
> a wide build. (I can't actually type the second one, but I can get the
> same effect in Jython 2.7b3 via a java.lang.StringBuilder.)

I agree that's very ugly, but AFAIK that's how things would work in
narrow CPython (which uses UTF-16 internally for the astral planes).
Personally I would document that explicit smuggled bytes are not
supported for comparison operations, and leave it at that.

> If we think it is ok that code using such methods give the wrong answer
> when fed strings containing smuggled bytes, then isolated (trailing)
> surrogates could be allowed. It's the user's fault for calling the
> method on that data. But I think it kinder that our implementation
> defend users from these wrong answers. In the latest state of Jython, we
> do this by rigorously preventing the construction of a PyUnicode
> containing a lone surrogate, so we can just use UTF-16 operations
> without further checks.

That seems like a reasonable approach.
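The kind of check that Jython is described as performing can be approximated in a few lines. This is a hypothetical sketch in CPython 3.3+ terms, where every surrogate code point in a str is by definition a lone one. has_lone_surrogate is an invented helper, not a Jython or CPython API, and the sample byte string is made up for illustration.

```python
def has_lone_surrogate(s):
    """True if s contains any surrogate code point (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


clean = 'abc\U00010083xyz'

# A Latin-1 byte in supposedly UTF-8 data gets smuggled as U+DCE9.
smuggled = b'caf\xe9'.decode('utf-8', errors='surrogateescape')

print(has_lone_surrogate(clean), has_lone_surrogate(smuggled))

# chr() builds the same code point as the escape in a literal.
assert chr(0xdc83) == '\udc83'

# A strict codec refuses such strings, which is one way an
# implementation can keep them out of well-formed text.
try:
    smuggled.encode('utf-8')
except UnicodeEncodeError as exc:
    print('strict encode failed:', exc.reason)
```

An implementation that rejects such strings at construction time gets simple UTF-16 operations in exchange for giving up PEP 383 smuggling, which is the trade-off under discussion.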
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano writes:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
>
> > Guido's mantra is something like "Python's str doesn't contain
> > characters or even code points[1], it contains code units."
>
> But is that true?

It's not. That's why I wrote the slightly pejorative "mantra" and
qualified it with "something like". The precise statement is "something
like" the array property is more important than preserving character
boundaries, so slices etc are allowed to do unexpected or even evil
things in the presence of astral characters in UTF-16 representations.

> I don't understand what you are trying to say here.

> Nor am I sure what you are trying to say here either.

We can discuss this off-list if you would like. The natives are getting
restless.

> > non-characters.
>
> Actually not quite. "Noncharacter"

Note the hyphen! (Just kidding, I will avoid that terminology in the
future. I knew, but forgot.)

> > Characters are those code points that may be assigned
> > an interpretation as a character, including undefined characters
> > (private space and reserved).
>
> So characters are code points which are characters, including undefined
> characters? :-)

No, there's a clear hierarchy here.
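What slicing by code unit does to an astral character can be modelled as follows. This is a sketch under assumptions: it imitates a UTF-16 representation by listing the code units of the encoded form and slicing that list, which is not how any particular implementation is written. utf16_units and ends_mid_pair are illustrative names, and CPython 3.4 or later is assumed.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode('utf-16-le', errors='surrogatepass')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def ends_mid_pair(units):
    """True if the last code unit is a leading surrogate with no partner."""
    return bool(units) and 0xD800 <= units[-1] <= 0xDBFF


text = 'a\U00010083b'
units = utf16_units(text)       # 0061 D800 DC83 0062
print([hex(u) for u in units])

# Slicing after three units keeps the pair whole; slicing after two
# units cuts the astral character in half.
print(ends_mid_pair(units[:3]), ends_mid_pair(units[:2]))

# The same slice on code points (CPython 3.3+) never splits the character.
print(ascii(text[:2]))
```

Keeping the array property means index 2 is a legal place to cut even though it falls inside a character, which is the "unexpected or even evil" behaviour referred to above.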