Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico wrote: > On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull > wrote: > > Jim J. Jewett writes: > > > > > In terms of best-effort, it is reasonable to treat the smuggled bytes > > > as representing a character outside of your unicode repertoire > > > > I have to disagree. If you ever end up passing them to something that > > validates or tries to reencode them without surrogateescape, BOOM! > > These things are the text equivalent of IEEE NaNs. If all you know > > (as in the stdlib) is that you have "generic text", the only fairly > > safe things to do with them are (1) delete them, (2) substitute an > > appropriate replacement character for them, (3) pass the text > > containing them verbatim to other code, and (4) reencode them using > > the same codec they were read with. > > Don't forget, these are *errors*. These are bytes that cannot be > correctly decoded. That's not something that has any meaning > whatsoever in text; so by definition, the only things you can do are > the four you list there (as long as "codec" means both the choice of > encoding and the use of the surrogateescape flag). It's like dealing > with control characters when you need to print something visually, > except that they have an official solution [1] and surrogateescape is > unofficial. They're not real text, so you have to deal with them > somehow. That isn't the case in the email package. The smuggled bytes are not errors[*], they are literally smuggled bytes. But, as Stephen said, the only things email does with them are the last three of the four he listed (if you read (3) as passing it between parts of the email package): the data comes in as text mixed with binary, and the email package parses it until it knows what the binary is supposed to be, turns it back into bytes, and decodes it properly. 
The goal is to never let the smuggled bytes escape out the email APIs as surrogateescape encoded text; though, in practice, this being consenting-adults Python and code not being bug free, there are places where people have used the knowledge of how surrogateescape is used by email to work around both API and code bugs. --David [*] Some of the encoded bytes *are* errors (non-ascii in headers or undecodable bytes in whatever the CTE/charset is), and in that case email may just turn them back into error bytes in the output, but only *some* of the smuggled bytes are actually errors (and none are if the message is RFC compliant). ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
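[Illustration] A minimal sketch of the four "fairly safe" operations Stephen lists for text containing smuggled bytes, plus the "BOOM" case of re-encoding without surrogateescape. The sample bytes and the helper name is_smuggled are made up for this example; the only assumption is that the bytes were decoded with the surrogateescape error handler.

```python
# Sample input: 0xE9 is not valid UTF-8 in this position, so it gets smuggled.
raw = b"caf\xe9 au lait"
text = raw.decode("utf-8", "surrogateescape")   # 'caf\udce9 au lait'


def is_smuggled(ch):
    """Hypothetical helper: True for code points produced by surrogateescape."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


# (1) delete the smuggled bytes
deleted = "".join(ch for ch in text if not is_smuggled(ch))

# (2) substitute a replacement character
replaced = "".join("\ufffd" if is_smuggled(ch) else ch for ch in text)

# (3) pass the text verbatim to other code (nothing to do)
verbatim = text

# (4) re-encode with the same codec *and* the same error handler
reencoded = text.encode("utf-8", "surrogateescape")

# Re-encoding without surrogateescape is the failure case.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Only option (4) recovers the original bytes; options (1) and (2) are lossy by design.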
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote:
> That isn't the case in the email package. The smuggled bytes are not
> errors[*], they are literally smuggled bytes.

But they're not characters, which is what Stephen and I were saying -
and contrary to what Jim said about treating them as characters. At
best, they represent characters, but in some encoding other than the
one you're using, and you have no idea how many bytes form a character
or anything. So you can't, for instance, word-wrap the text, because
you can't know how wide these unknown bytes are, whether they
represent spaces (wrap points), or newlines, or anything like that.
You can't treat them as characters, so while you have them in your
string, you can't treat it as a pure Unicode string - it's a Unicode
string with smuggled bytes.

ChrisA
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray > wrote: > > That isn't the case in the email package. The smuggled bytes are not > > errors[*], they are literally smuggled bytes. > > But they're not characters, which is what Stephen and I were saying - > and contrary to what Jim said about treating them as characters. At > best, they represent characters but in some encoding other than the > one you're using, and you have no idea how many bytes form a character > or anything. So you can't, for instance, word-wrap the text, because > you can't know how wide these unknown bytes are, whether they > represent spaces (wrap points), or newlines, or anything like that. > You can't treat them as characters, so while you have them in your > string, you can't treat it as a pure Unicode string - it''s a Unicode > string with smuggled bytes. Well, except that I do. The email header parsing algorithms all work fine if I treat the surrogate escaped bytes as 'unknown junk' and just parse based on the valid unicode. (Unless the header is so garbled that it can't be parsed, of course, at which point it becomes an invalid header). You are right about the wrapping, though. If a header with invalid bytes (and in this scenario we *are* talking about errors) needs to be wrapped, we have to first decode the smuggled bytes and turn it into an 'unknown-8bit' encoded word before we can wrap the header. --David ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote: >> You can't treat them as characters, so while you have them in your >> string, you can't treat it as a pure Unicode string - it''s a Unicode >> string with smuggled bytes. > > Well, except that I do. The email header parsing algorithms all work > fine if I treat the surrogate escaped bytes as 'unknown junk' and just > parse based on the valid unicode. (Unless the header is so garbled that > it can't be parsed, of course, at which point it becomes an invalid > header). Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters. If you, instead, represented the header as a list with some str elements and some bytes, it would be just as valid (though much harder to work with); all your manipulations are done on the str parts, and the bytes just tag along for the ride. > You are right about the wrapping, though. If a header with invalid > bytes (and in this scenario we *are* talking about errors) needs to > be wrapped, we have to first decode the smuggled bytes and turn it > into an 'unknown-8bit' encoded word before we can wrap the header. Yeah, and that's going to be a bit messy. If you get 60 characters followed by 30 unknown bytes, where do you wrap it? Dare you wrap in the middle of the smuggled section? ChrisA ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
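[Illustration] Chris's alternative representation, a list with some str elements and some bytes elements, can be sketched as below. This is not how the email package stores headers; split_runs and is_smuggled are hypothetical names, and the sketch assumes the text was produced by decoding with surrogateescape.

```python
from itertools import groupby


def is_smuggled(ch):
    """Hypothetical helper: True for code points produced by surrogateescape."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


def split_runs(text):
    """Split text into alternating str runs (real text) and bytes runs (smuggled)."""
    parts = []
    for smuggled, run in groupby(text, key=is_smuggled):
        chunk = "".join(run)
        if smuggled:
            # A run made only of escape surrogates maps straight back to bytes.
            parts.append(chunk.encode("ascii", "surrogateescape"))
        else:
            parts.append(chunk)
    return parts


parts = split_runs("caf\udce9 au lait")
```

All string manipulation would then happen on the str elements while the bytes elements tag along, which is equivalent in information content but, as David notes in his reply, harder to work with.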
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Great points here - I especially like the concluding statement "you
can't treat it as a pure Unicode string - it's a Unicode string with
smuggled bytes".

Given that Jython uses UTF-16 as its representation, it is frequently
possible to smuggle isolated surrogates in it. A surrogate pair must be
a high (leading) surrogate in range(0xD800, 0xDC00), followed by a low
(trailing) surrogate in range(0xDC00, 0xE000). So one can likely assign
an interpretation that a surrogate outside such a pair is in fact an
isolated surrogate, and not part of an actual codepoint.

Of course, if you do actually have a smuggled isolated high surrogate
FOLLOWED by a smuggled isolated low surrogate - guess what, the only
interpretation is a codepoint. Or perhaps more likely garbage. Of
course it doesn't happen so often, so maybe we are fine with the
occasional bug ;)

I personally suspect that we will resolve this by also supporting UCS-4
as a representation in Jython 3.x for such Unicode strings, albeit with
the limitation that we have simply moved the problem to when we try to
call Java methods taking java.lang.String objects.

- Jim

On Tue, Sep 16, 2014 at 9:27 AM, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote:
> > That isn't the case in the email package. The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen and I were saying -
> and contrary to what Jim said about treating them as characters. At
> best, they represent characters but in some encoding other than the
> one you're using, and you have no idea how many bytes form a character
> or anything. So you can't, for instance, word-wrap the text, because
> you can't know how wide these unknown bytes are, whether they
> represent spaces (wrap points), or newlines, or anything like that.
> You can't treat them as characters, so while you have them in your
> string, you can't treat it as a pure Unicode string - it's a Unicode
> string with smuggled bytes.
>
> ChrisA
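[Illustration] The ambiguity Jim describes for a UTF-16 representation can be reproduced from CPython by pushing two lone surrogates through a UTF-16 codec. The surrogate values here are arbitrary examples; the surrogatepass error handler is only used to get the lone surrogates into UTF-16 code units at all.

```python
# Two lone surrogates, held as two separate code points in a CPython str.
high = "\ud83d"   # leading (high) surrogate
low = "\ude00"    # trailing (low) surrogate
two_lone = high + low

# Serialise as UTF-16 code units; surrogatepass lets lone surrogates through.
utf16 = two_lone.encode("utf-16-le", "surrogatepass")

# Reading the same code units back as UTF-16 yields ONE code point:
# the information that they were two isolated surrogates is gone.
merged = utf16.decode("utf-16-le")
```

In a UTF-16 based runtime there is no way to tell these two readings apart, which is the motivation for the wider representation Jim mentions.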
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker wrote:
> Of course, if you do actually have a smuggled isolated high surrogate
> FOLLOWED by a smuggled isolated low surrogate - guess what, the only
> interpretation is a codepoint. Or perhaps more likely garbage. Of
> course it doesn't happen so often, so maybe we are fine with the
> occasional bug ;)
>
> I personally suspect that we will resolve this by also supporting
> UCS-4 as a representation in Jython 3.x for such Unicode strings,
> albeit with the limitation that we have simply moved the problem to
> when we try to call Java methods taking java.lang.String objects.

That'll cost efficiency, of course, but it'll guarantee correctness.
And maybe, just maybe, you'll be able to put some pressure on Java
itself to start supporting UCS-4 natively... One can dream.

ChrisA
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote:
> >> You can't treat them as characters, so while you have them in your
> >> string, you can't treat it as a pure Unicode string - it's a Unicode
> >> string with smuggled bytes.
> >
> > Well, except that I do. The email header parsing algorithms all work
> > fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> > parse based on the valid unicode. (Unless the header is so garbled that
> > it can't be parsed, of course, at which point it becomes an invalid
> > header).
>
> Do what, exactly? As I understand you, you treat the unknown bytes as
> completely opaque, not representing any characters at all. Which is
> what I'm saying: those are not characters.

Yes. I thought you were saying that one could not treat the string with
smuggled bytes as if it were a string. (It's a string that can't be
encoded unless you use the surrogateescape error handler, but it is
still a string from Python's POV, which is the point of the error
handler).

Or, to put it another way, your implication was that there were no
string operations that could be usefully applied to a string containing
smuggled bytes, but that is not the case. (I may well have read an
implication that was not there; if so I apologize and you can ignore the
rest of this :)

Basically, we are pretending that each smuggled byte is a single
character for string parsing purposes...but they don't match any of our
parsing constants. They are all "any character" matches in the regexes
and what have you.

Of course, this only works in contexts where we can ignore or "carry
along" the smuggled bytes as being components of "arbitrary text"
portions of the syntax. Before emitting them, we must take care either
to replace them with valid unicode error glyphs or to turn the string of
which they are a part back into binary, using the same codec and error
handler as we used to ingest them to begin with.
And, of course, we can't *modify* the sections containing the smuggled bytes, only the syntax-matched sections that surround them; and things like line wrapping are just an invitation to ugliness and bugs even if you kept the smuggled bytes sections internally intact. Finally, to explain what I meant by "except that I do": when I added back binary support to the email package in Python3, initially I *did not change the parsing algorithms* in the code. I just smuggled the bytes, and then dealt with the encoding/decoding at the API boundaries. This is the same principle used when dealing with filenames in the API of Python itself. *Except* at that boundary, I do not need to worry about whether a particular string contains smuggled bytes or not.[*] > If you, instead, represented the header as a list with some str > elements and some bytes, it would be just as valid (though much harder > to work with); all your manipulations are done on the str parts, and > the bytes just tag along for the ride. Quite a bit harder, which is why I don't do that. > > You are right about the wrapping, though. If a header with invalid > > bytes (and in this scenario we *are* talking about errors) needs to > > be wrapped, we have to first decode the smuggled bytes and turn it > > into an 'unknown-8bit' encoded word before we can wrap the header. > > Yeah, and that's going to be a bit messy. If you get 60 characters > followed by 30 unknown bytes, where do you wrap it? Dare you wrap in > the middle of the smuggled section? The point of RFC2047 encoded words is that they are an ASCII representation of binary data, so once the bytes are "properly" Content Transfer Encoded (as being in an unknown charset) the string contains no smuggled bytes and can be wrapped. --David [*] I worried a lot that this was re-introducing the bytes/string problem from python2. The difference is that if the smuggled bytes escape from the email API, that's a bug in the email package. 
So user code using the library is *not* in danger of getting mysterious encoding errors when one day the input is international where before it was all ASCII. (Absent bugs in the library.) ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
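[Illustration] A rough sketch of the boundary behaviour David describes, using the standard email package with its default (compat32) policy. The sample message is invented, and the exact form of the encoded word may differ between Python versions, so the sketch only checks for the charset label rather than a full expected string.

```python
import email

# A non-RFC-compliant message: a raw 8-bit byte in a header.
raw = b"Subject: caf\xe9\n\nbody\n"

# Parsing smuggles the byte into the header text via surrogateescape.
msg = email.message_from_bytes(raw)

# Flattening to text must not leak the smuggled byte, so the header is
# expected to come out as an ASCII encoded word in an unknown charset.
as_text = msg.as_string()

# Flattening to bytes turns the smuggled byte back into the original byte.
as_bytes = msg.as_bytes()
```

The point of the design is that user code only ever sees ASCII-safe text or the original bytes at the API boundary, never a str with escape surrogates in it.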
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Jim Baker writes:
> Given that Jython uses UTF-16 as its representation, it is frequently
> possible to smuggle isolated surrogates in it. A surrogate pair must
> be a high (leading) surrogate in range(0xD800, 0xDC00), followed by a
> low (trailing) surrogate in range(0xDC00, 0xE000).
>
> Of course, if you do actually have a smuggled isolated high
> surrogate FOLLOWED by a smuggled isolated low surrogate - guess
> what, the only interpretation is a codepoint. Or perhaps more
> likely garbage. Of course it doesn't happen so often, so maybe we
> are fine with the occasional bug ;)

The CPython representation uses trailing surrogates only[1], so it's
never possible to interpret them as anything but non-characters -- as
soon as you encounter them you know that it's a lone surrogate. Surely
you can do the same.

As long as the Java string manipulation functions don't check for
surrogates, you should be fine with this representation. Of course I
suppose your matching functions (etc) don't check for them either, so
you will be somewhat vulnerable to bugs due to treating them as
characters. But the same is true for CPython, AFAIK.

Footnotes:
[1] Only 128 bytes are necessary since the 128 ASCII characters are
embedded in Unicode as-is.
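[Illustration] Stephen's footnote can be checked directly: with the surrogateescape handler, the 128 non-ASCII byte values map onto 128 trailing surrogates, and none of them is a leading surrogate, so two smuggled bytes can never form a valid surrogate pair. Variable names are chosen for this sketch only.

```python
# Every byte value that ASCII cannot decode.
high_bytes = bytes(range(0x80, 0x100))
smuggled = high_bytes.decode("ascii", "surrogateescape")
code_points = [ord(ch) for ch in smuggled]

# All of them land in U+DC80..U+DCFF (trailing surrogates) ...
all_trailing = all(0xDC80 <= cp <= 0xDCFF for cp in code_points)
# ... and none in U+D800..U+DBFF (leading surrogates).
any_leading = any(0xD800 <= cp <= 0xDBFF for cp in code_points)

# The 128 ASCII bytes decode to themselves and need no smuggling.
ascii_text = bytes(range(0x80)).decode("ascii", "surrogateescape")
```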
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
R. David Murray writes:
> > Do what, exactly? As I understand you, you treat the unknown bytes as
> > completely opaque, not representing any characters at all. Which is
> > what I'm saying: those are not characters.
>
> Yes. I thought you were saying that one could not treat the string with
> smuggled bytes as if it were a string.

Guido's mantra is something like "Python's str doesn't contain
characters or even code points[1], it contains code units." Implying
that dealing with characters (or the grapheme globs that occasionally
raise their ugly heads here) is an issue for higher-level facilities
than str to deal with.

The point being that

> Basically, we are pretending that each smuggled byte is a single
> character

is something of a misstatement (good enough for present purpose of
discussing email, but not good enough for the general case of
understanding how this is supposed to work when porting the construct to
other Python implementations), while

> for string parsing purposes...but they don't match any of our
> parsing constants.

is precisely Pythonically correct. You might want to add "because all
parsing constants contain only valid characters by construction."

> [*] I worried a lot that this was re-introducing the bytes/string
> problem from python2.

It isn't, because the bytes/str problem was that given a str object out
of context you could not tell whether it was a binary blob or text, and
if text, you couldn't tell if it was external encoded text or internal
abstract text. That is not true here because the representations of
characters vs. smuggled bytes in str are disjoint sets.

Footnotes:
[1] In Unicode terminology, a code unit is the smallest computer object
that can represent a character (this is uniquely and sanely defined for
all real Unicode transformation formats aka UTFs). A code point is an
integer 0 - (17*256*256-1) that can represent a character, but many code
points such as surrogates and 0x are defined to be non-characters.
Characters are those code points that may be assigned an interpretation
as a character, including undefined characters (private space and
reserved).
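[Illustration] Two claims in Stephen's message lend themselves to a quick check: the size of the code point range given in the footnote, and the "disjoint sets" argument, which means a str can be tested for smuggled bytes without any outside context. The function name has_smuggled_bytes is hypothetical, and the test assumes nobody injected such code points by other means (for example with chr()).

```python
import sys

# 17 planes of 256*256 code points each, numbered from 0.
MAX_CODE_POINT = 17 * 256 * 256 - 1


def has_smuggled_bytes(text):
    """Hypothetical check: does text contain surrogateescape code points?"""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


clean = b"caf\xc3\xa9".decode("utf-8", "surrogateescape")   # valid UTF-8
dirty = b"caf\xe9".decode("utf-8", "surrogateescape")       # invalid UTF-8
```

Because real characters never decode into that range, the answer depends only on the string itself, unlike the Python 2 situation where a str gave no hint of what it held.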
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, 17 Sep 2014 08:57:21 +0900, "Stephen J. Turnbull" wrote:
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation. Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewhat vulnerable to bugs due to treating them as
> characters. But the same is true for CPython, AFAIK.

From my point of view, the string function laxness is a feature, not a
bug. But I get what you mean.

--David
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote: > Yes. I thought you were saying that one could not treat the string with > smuggled bytes as if it were a string. (It's a string that can't be > encoded unless you use the surrogateescape error handler, but it is > still a string from Python's POV, which is the point of the error > handler). > > Or, to put it another way, your implication was that there were no > string operations that could be usefully applied to a string containing > smuggled bytes, but that is not the case. (I may well have read an > implication that was not there; if so I apologize and you can ignore the > rest of this :) Ahh, I see where we are getting confused. What I said was that you can't treat the string as a *pure* Unicode string. Parts of it are Unicode text, parts of it aren't. > Basically, we are pretending that the each smuggled > byte is single character for string parsing purposes...but they don't > match any of our parsing constants. They are all "any character" matches > in the regexes and what have you. This is slightly iffy, as you can't be sure that one byte represents one character, but as long as you don't much care about that, it's not going to be an issue. I'm fairly sure you're never going to find an encoding in which one unknown byte represents two characters, but there are cases where it takes more than one byte to make up a character (or the bytes are just shift codes or something). Does that ever throw off your regexes? It wouldn't be an issue to a .* between two character markers, but if you ever say .{5} then it might match incorrectly. I think we're in agreement here, just using different words. :) ChrisA ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
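[Illustration] Chris's concern about counted matches such as .{5} can be shown with a small made-up string: the same bytes give a different "character" count depending on whether they were decoded properly or smuggled one byte at a time. The sample text and variable names are assumptions for this sketch.

```python
import re

subject = "\u201cHi\u201d"   # curly-quoted text, 4 characters

# Properly decoded: 4 code points.
as_unicode = subject

# Same bytes, but decoded as ASCII with surrogateescape: every non-ASCII
# byte becomes its own pseudo-character, so the string is 8 long.
as_smuggled = subject.encode("utf-8").decode("ascii", "surrogateescape")

counted_unicode = re.fullmatch(r".{4}", as_unicode)
counted_smuggled = re.fullmatch(r".{4}", as_smuggled)   # no longer matches
greedy_smuggled = re.fullmatch(r".*", as_smuggled)      # still matches
literal_smuggled = re.search(r"Hi", as_smuggled)        # still matches
```

Patterns that match around the smuggled region keep working; only patterns that count characters inside it are affected.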
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote:
> It isn't, because the bytes/str problem was that given a str object
> out of context you could not tell whether it was a binary blob or
> text, and if text, you couldn't tell if it was external encoded text
> or internal abstract text. That is not true here because the
> representations of characters vs. smuggled bytes in str are disjoint
> sets.

Actually, while it may be true that for the email headers case, all
characters are characters, just the encoding is unknown, it is not
necessarily true that they are in disjoint sets. Some bytes may decode
into characters without needing to be smuggled... maybe not in
text-protocols like email, but in the general case. So then some of the
bytes that should be interpreted as binary data are not in a disjoint
set from characters.
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Glenn Linderman writes: > Some bytes may decode into characters without needing to be > smuggled... maybe not in text-protocols like email, but in the > general case. So then some of the bytes that should be interpreted > as binary data are not in a disjoint set from characters. True, but irrelevant. The point is that whoever chose the codec is responsible for getting it right, not only the right encoding, but for the assumption that the input data was pure encoded text. The rest of the program can now assume that choice was made correctly, and process text as text. The program cannot be blamed for assuming that the person who chose the codec knew what they were about, and so characters can be *assumed* to be decoded from bytes representing characters. This was not true in Python 2, where it was common practice to represent encoded text by itself internally, implicitly assuming that only one encoding would be encountered in each invocation of the program. This was never true, and with the spread of the Internet and then the WWW, it became a major issue. And that's why we invented Python 3, to let text be text without the encumbrance of always being aware of encodings and converting when different encodings collide, etc. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
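[Illustration] Glenn's objection and Stephen's answer can both be seen in one small example. The input below is an invented binary record, deliberately decoded with a text codec: only the byte the codec rejects is smuggled, while the other bytes silently become ordinary characters.

```python
# A made-up binary record that is NOT text.
blob = b"\x00A\xffB"

# Decoding it as if it were ASCII text with surrogateescape.
decoded = blob.decode("ascii", "surrogateescape")

# Only 0xFF was smuggled; 0x00, 'A' and 'B' look like real characters.
smuggled_count = sum(0xDC80 <= ord(ch) <= 0xDCFF for ch in decoded)

# The round trip is still lossless with the same codec and handler.
round_trip = decoded.encode("ascii", "surrogateescape")
```

Nothing in the resulting str records that "A" was binary data, which is Glenn's point; Stephen's reply is that choosing a text codec for non-text input was the mistake, and the rest of the program is entitled to trust that choice.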
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
> > Basically, we are pretending that each smuggled byte is a single
> > character for string parsing purposes...but they don't match any of
> > our parsing constants. They are all "any character" matches in the
> > regexes and what have you.
>
> This is slightly iffy, as you can't be sure that one byte represents
> one character, but as long as you don't much care about that, it's not
> going to be an issue.

This discussion would probably be a lot easier to follow, with fewer
miscommunications, if there were some examples. Here is my example,
perhaps someone can tell me if I'm understanding it correctly.

I want to send an email including the header line:

    'Subject: “NOBODY expects the Spanish Inquisition!”'

Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I
do the right thing and encode it as UTF-8:

    b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

but my mail package, not being written in a language as awesome as
Python, is just riddled with bugs, and somehow I end up with this
corrupted byte-string instead:

    b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

Note that the bytes of the first curly quote are in the wrong order,
but the second is okay. (Like I said, it's just *riddled* with bugs.)
That means that trying to decode those bytes will fail in Python:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 9: invalid start byte

but it's not up to Python's email package to throw those invalid bytes
out or permanently replace them with something else. Also, we want to
work with Unicode strings, not byte strings, so there has to be a way to
smuggle those three bytes into Unicode, without ending up with either
the replacement characters:

    # using the 'replace' error handler
    'Subject: ���NOBODY expects the Spanish Inquisition!”'

or incorrectly interpreting them as valid, but wrong, code points. (If
we do the second, we end up with two control characters "\x9c\x80"
followed by "â".) We want to be able to round-trip back to the same
bytes we received.

Am I right so far?

So the email package uses the surrogate-escape error handler and ends up
with this Unicode string:

    'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

which can be encoded back to the bytes we started with.

Note that technically those three \u... code points are NOT classified
as "noncharacters". They are actually surrogate code points:

http://www.unicode.org/faq/private_use.html#nonchar4
http://www.unicode.org/glossary/#surrogate_code_point

and they're supposed to be reserved for UTF-16. I'm not sure of the
implication of that.

> I'm fairly sure you're never going to find an
> encoding in which one unknown byte represents two characters,

There are encodings which use a "shift" mechanism, whereby a byte X
represents one character by default, and a different character after the
shift mechanism. But I don't think that matters, since we're not able to
interpret those bytes. If we were, we'd just decode them to a text
string and be done with it.

> but
> there are cases where it takes more than one byte to make up a
> character (or the bytes are just shift codes or something).

Multi-byte encodings are very common. All the Unicode encodings are
multi-byte. So are many East Asian encodings.

> Does that
> ever throw off your regexes? It wouldn't be an issue to a .* between
> two character markers, but if you ever say .{5} then it might match
> incorrectly.

I don't think the idea is to match on these smuggled bytes specifically.
I think the idea is to match *around* them. In the example above, we
might match everything from "Subject: " to the end of the line. So long
as we never end up with a situation where the smuggled bytes are
replaced by something else, or shuffled around into different positions,
we should be fine.

David, is my understanding correct?

--
Steven
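[Illustration] Steven's example can be run more or less as written. The sketch below follows his description and decodes with the utf-8 codec plus surrogateescape; that codec choice is his example's assumption, not a statement about what the email package does internally. Variable names are invented for the sketch.

```python
import re

# The corrupted header from the example: first curly quote's bytes reversed.
corrupt = b"Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d"

# Strict decoding fails at the first bad byte.
try:
    corrupt.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

# 'replace' is lossy: the three bad bytes become U+FFFD.
replaced = corrupt.decode("utf-8", "replace")

# surrogateescape keeps them as three escape surrogates ...
smuggled = corrupt.decode("utf-8", "surrogateescape")
# ... so the original bytes can be recovered exactly.
round_trip = smuggled.encode("utf-8", "surrogateescape")

# Matching *around* the smuggled bytes: grab everything after the field name.
match = re.match(r"Subject: (.*)", smuggled)
value = match.group(1)
```

The regex never needs to know what the smuggled bytes mean; it only carries them along inside the captured value.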
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano writes:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
>> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
>>> Basically, we are pretending that each smuggled byte is a single
>>> character for string parsing purposes...but they don't match any of
>>> our parsing constants. They are all "any character" matches in the
>>> regexes and what have you.
>>
>> This is slightly iffy, as you can't be sure that one byte represents
>> one character, but as long as you don't much care about that, it's not
>> going to be an issue.
>
> This discussion would probably be a lot easier to follow, with fewer
> miscommunications, if there were some examples. Here is my example,
> perhaps someone can tell me if I'm understanding it correctly.
>
> I want to send an email including the header line:
>
> 'Subject: “NOBODY expects the Spanish Inquisition!”'

    >>> from email.header import Header
    >>> h = Header('Subject: “NOBODY expects the Spanish Inquisition!”')
    >>> h.encode('utf-8')
    '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
    >>> h.encode()
    '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
    >>> h.encode('ascii')
    '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='

--
Akira
Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog
Steven D'Aprano writes: [long example] > Am I right so far? > > So the email package uses the surrogate-escape error handler and ends up > with this Unicode string: > > 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”' > > which can be encoded back to the bytes we started with. Yes. > Note that technically those three \u... code points are NOT classified > as "noncharacters". Very unpythonic terminology, easily confusing the nonspecialist. Or the specialist -- I used to know that Unicode gave "noncharacter" a technical definition but it seems I forgot. But then, Unicode isn't a PSF product, so I guess it's OK to be unpythonic. > They are actually surrogate code points: > > http://www.unicode.org/faq/private_use.html#nonchar4 > http://www.unicode.org/glossary/#surrogate_code_point > > and they're supposed to be reserved for UTF-16. I'm not sure of the > implication of that. It means that any Python program that invokes the surrogateescape handler is not a "conforming Unicode process", at least not on the naive interpretation of that definition. A conforming process would interpret them as corrupt characters and raise as soon as detected. A more sophisticated interpretation might argue that Python is multiple processes (in the sense of "process" used by Unicode), and that the Unicode standard only applies to characters. This is especially true of Pythons implementing PEP 393, since no surrogates should ever appear in text[1] at all. Then the smuggled bytes can be treated as noncharacters in practice although technically it's a violation of the Unicode standard to do so. Footnotes: [1] Meaning, no fair using chr() to inject them into str! ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
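[Illustration] The terminology point (surrogate code points versus Unicode "noncharacters") and Stephen's footnote about chr() can both be checked with the standard unicodedata module. The byte 0x9c is taken from Steven's example; the other names are invented for this sketch.

```python
import unicodedata

# One smuggled byte, as produced by the surrogateescape handler.
smuggled = b"\x9c".decode("utf-8", "surrogateescape")

# Its general category is 'Cs' (surrogate) ...
category_smuggled = unicodedata.category(smuggled)
# ... whereas a true Unicode noncharacter such as U+FFFF is 'Cn'.
category_nonchar = unicodedata.category("\uffff")

# Footnote [1]: chr() can inject the very same code point by hand, and the
# result is indistinguishable from a genuinely smuggled byte.
injected = chr(0xDC9C)
injected_bytes = injected.encode("utf-8", "surrogateescape")
```

This is why the "disjoint sets" argument holds only as long as text is not built with hand-injected surrogates.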