Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico  wrote:
> On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull wrote:
> > Jim J. Jewett writes:
> >
> > > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > > as representing a character outside of your unicode repertoire
> >
> > I have to disagree. If you ever end up passing them to something that
> > validates or tries to reencode them without surrogateescape, BOOM!
> > These things are the text equivalent of IEEE NaNs.  If all you know
> > (as in the stdlib) is that you have "generic text", the only fairly
> > safe things to do with them are (1) delete them, (2) substitute an
> > appropriate replacement character for them, (3) pass the text
> > containing them verbatim to other code, and (4) reencode them using
> > the same codec they were read with.
> 
> Don't forget, these are *errors*. These are bytes that cannot be
> correctly decoded. That's not something that has any meaning
> whatsoever in text; so by definition, the only things you can do are
> the four you list there (as long as "codec" means both the choice of
> encoding and the use of the surrogateescape flag). It's like dealing
> with control characters when you need to print something visually,
> except that they have an official solution [1] and surrogateescape is
> unofficial. They're not real text, so you have to deal with them
> somehow.

That isn't the case in the email package.  The smuggled bytes are not
errors[*], they are literally smuggled bytes.  But, as Stephen said, the
only things email does with them are the last three of the four he
listed (if you read (3) as passing it between parts of the email
package): the data comes in as text mixed with binary, and the email
package parses it until it knows what the binary is supposed to be,
turns it back into bytes, and decodes it properly.  The goal is to never
let the smuggled bytes escape out the email APIs as surrogateescape
encoded text; though, in practice, this being consenting-adults Python
and code not being bug free, there are places where people have used the
knowledge of how surrogateescape is used by email to work around both
API and code bugs.
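
For illustration, the operations Stephen lists could look roughly like this in Python 3. This is only a sketch: the sample header and the helper name is_smuggled are made up here, and it does not show how the email package is actually implemented.

```python
# Hypothetical input: an ASCII header with one stray Latin-1 byte (0xE9).
raw = b"Subject: caf\xe9 menu"

# Decoding with surrogateescape smuggles the undecodable byte into the str
# as the lone surrogate U+DCE9 instead of raising.
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))  # 'Subject: caf\udce9 menu'


def is_smuggled(ch):
    """True for the code points surrogateescape uses for bytes 0x80-0xFF."""
    return "\udc80" <= ch <= "\udcff"


# (1) Delete the smuggled bytes.
deleted = "".join(ch for ch in text if not is_smuggled(ch))

# (2) Substitute a replacement character (U+FFFD here).
replaced = "".join("\ufffd" if is_smuggled(ch) else ch for ch in text)

# (3) Passing `text` on verbatim needs no code at all.

# (4) Re-encode with the same codec and the same error handler.
round_tripped = text.encode("ascii", "surrogateescape")
assert round_tripped == raw
```

Options (1) and (2) lose information; only (3) and (4) preserve the original bytes.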

--David

[*] Some of the encoded bytes *are* errors (non-ascii in headers or
undecodable bytes in whatever the CTE/charset is), and in that case
email may just turn them back into error bytes in the output, but only
*some* of the smuggled bytes are actually errors (and none are if the
message is RFC compliant).
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray  wrote:
> That isn't the case in the email package.  The smuggled bytes are not
> errors[*], they are literally smuggled bytes.

But they're not characters, which is what Stephen and I were saying -
and contrary to what Jim said about treating them as characters. At
best, they represent characters but in some encoding other than the
one you're using, and you have no idea how many bytes form a character
or anything. So you can't, for instance, word-wrap the text, because
you can't know how wide these unknown bytes are, whether they
represent spaces (wrap points), or newlines, or anything like that.
You can't treat them as characters, so while you have them in your
string, you can't treat it as a pure Unicode string - it's a Unicode
string with smuggled bytes.

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico  wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote:
> > That isn't the case in the email package.  The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
> 
> But they're not characters, which is what Stephen and I were saying -
> and contrary to what Jim said about treating them as characters. At
> best, they represent characters but in some encoding other than the
> one you're using, and you have no idea how many bytes form a character
> or anything. So you can't, for instance, word-wrap the text, because
> you can't know how wide these unknown bytes are, whether they
> represent spaces (wrap points), or newlines, or anything like that.
> You can't treat them as characters, so while you have them in your
> string, you can't treat it as a pure Unicode string - it's a Unicode
> string with smuggled bytes.

Well, except that I do.  The email header parsing algorithms all work
fine if I treat the surrogate escaped bytes as 'unknown junk' and just
parse based on the valid unicode.  (Unless the header is so garbled that
it can't be parsed, of course, at which point it becomes an invalid
header).

You are right about the wrapping, though.  If a header with invalid
bytes (and in this scenario we *are* talking about errors) needs to
be wrapped, we have to first decode the smuggled bytes and turn it
into an 'unknown-8bit' encoded word before we can wrap the header.

--David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray  wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it's a Unicode
>> string with smuggled bytes.
>
> Well, except that I do.  The email header parsing algorithms all work
> fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> parse based on the valid unicode.  (Unless the header is so garbled that
> it can't be parsed, of course, at which point it becomes an invalid
> header).

Do what, exactly? As I understand you, you treat the unknown bytes as
completely opaque, not representing any characters at all. Which is
what I'm saying: those are not characters.

If you, instead, represented the header as a list with some str
elements and some bytes, it would be just as valid (though much harder
to work with); all your manipulations are done on the str parts, and
the bytes just tag along for the ride.
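
A rough sketch of that alternative representation, for comparison. The helper name split_smuggled is hypothetical, and the sketch assumes the input was decoded with the ascii codec and the surrogateescape handler.

```python
import re

# Matches runs of the lone surrogates that surrogateescape produces.
SMUGGLED_RUN = re.compile("([\udc80-\udcff]+)")


def split_smuggled(text):
    """Split text into a list of str parts and bytes parts."""
    parts = []
    for i, chunk in enumerate(SMUGGLED_RUN.split(text)):
        if not chunk:
            continue
        if i % 2:  # odd indexes hold the captured runs of smuggled bytes
            parts.append(chunk.encode("ascii", "surrogateescape"))
        else:
            parts.append(chunk)
    return parts


header = b"Subject: \x9c\x80\xe2NOBODY".decode("ascii", "surrogateescape")
print(split_smuggled(header))  # ['Subject: ', b'\x9c\x80\xe2', 'NOBODY']
```

Every str method then has to be reimplemented over the mixed list, which is the "much harder to work with" part.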

> You are right about the wrapping, though.  If a header with invalid
> bytes (and in this scenario we *are* talking about errors) needs to
> be wrapped, we have to first decode the smuggled bytes and turn it
> into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters
followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
the middle of the smuggled section?

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Jim Baker
Great points here - I especially like the concluding statement "you can't
treat it as a pure Unicode string - it's a Unicode string with smuggled
bytes"

Given that Jython uses UTF-16 as its representation, it is frequently
possible to smuggle isolated surrogates in it. A surrogate pair must be a
high surrogate in range(D800, DC00), then a low surrogate in range(DC00,
E000). So one can likely assign an interpretation that this is in fact the
isolated surrogate, and not an actual codepoint.

Of course, if you do actually have a smuggled isolated high surrogate
FOLLOWED by a smuggled isolated low surrogate - guess what, the only
interpretation is a codepoint. Or perhaps more likely garbage. Of course it
doesn't happen so often, so maybe we are fine with the occasional bug ;)
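
The ambiguity can be reproduced from CPython by round-tripping through UTF-16, which is roughly what a UTF-16 based string type sees. This is a sketch using only standard codecs; the particular code points are arbitrary examples.

```python
# A lone trailing (low) surrogate, as produced by surrogateescape for 0x80.
smuggled = b"\x80".decode("ascii", "surrogateescape")
assert smuggled == "\udc80"

# In a CPython str, a leading surrogate followed by a trailing surrogate
# stays two separate code points...
two = "\ud83d" + "\ude00"
assert len(two) == 2

# ...but as UTF-16 code units the same sequence is indistinguishable from
# the surrogate pair that encodes U+1F600.
utf16 = two.encode("utf-16-le", "surrogatepass")
merged = utf16.decode("utf-16-le")
assert merged == "\U0001F600" and len(merged) == 1
```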

I personally suspect that we will resolve this by also supporting UCS-4 as
a representation in Jython 3.x for such Unicode strings, albeit with the
limitation that we have simply moved the problem to when we try to call
Java methods taking java.lang.String objects.

- Jim

On Tue, Sep 16, 2014 at 9:27 AM, Chris Angelico  wrote:

> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote:
> > That isn't the case in the email package.  The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen and I were saying -
> and contrary to what Jim said about treating them as characters. At
> best, they represent characters but in some encoding other than the
> one you're using, and you have no idea how many bytes form a character
> or anything. So you can't, for instance, word-wrap the text, because
> you can't know how wide these unknown bytes are, whether they
> represent spaces (wrap points), or newlines, or anything like that.
> You can't treat them as characters, so while you have them in your
> string, you can't treat it as a pure Unicode string - it's a Unicode
> string with smuggled bytes.
>
> ChrisA



-- 
- Jim

jim.baker@{colorado.edu|python.org|rackspace.com|zyasoft.com}
twitter.com/jimbaker
github.com/jimbaker
bitbucket.com/jimbaker


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker  wrote:
> Of course, if you do actually have a smuggled isolated high surrogate
> FOLLOWED by a smuggled isolated low surrogate - guess what, the only
> interpretation is a codepoint. Or perhaps more likely garbage. Of course it
> doesn't happen so often, so maybe we are fine with the occasional bug ;)
>
> I personally suspect that we will resolve this by also supporting UCS-4 as a
> representation in Jython 3.x for such Unicode strings, albeit with the
> limitation that we have simply moved the problem to when we try to call Java
> methods taking java.lang.String objects.
>

That'll cost efficiency, of course, but it'll guarantee correctness.
And maybe, just maybe, you'll be able to put some pressure on Java
itself to start supporting UCS-4 natively...

One can dream.

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico  wrote:
> On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote:
> >> You can't treat them as characters, so while you have them in your
> >> string, you can't treat it as a pure Unicode string - it's a Unicode
> >> string with smuggled bytes.
> >
> > Well, except that I do.  The email header parsing algorithms all work
> > fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> > parse based on the valid unicode.  (Unless the header is so garbled that
> > it can't be parsed, of course, at which point it becomes an invalid
> > header).
> 
> Do what, exactly? As I understand you, you treat the unknown bytes as
> completely opaque, not representing any characters at all. Which is
> what I'm saying: those are not characters.

Yes.  I thought you were saying that one could not treat the string with
smuggled bytes as if it were a string.  (It's a string that can't be
encoded unless you use the surrogateescape error handler, but it is
still a string from Python's POV, which is the point of the error
handler).

Or, to put it another way, your implication was that there were no
string operations that could be usefully applied to a string containing
smuggled bytes, but that is not the case.  (I may well have read an
implication that was not there; if so I apologize and you can ignore the
rest of this :)  Basically, we are pretending that each smuggled
byte is a single character for string parsing purposes...but they don't
match any of our parsing constants.  They are all "any character" matches
in the regexes and what have you.  Of course, this only works in
contexts where we can ignore or "carry along" the smuggled bytes as
being components of "arbitrary text" portions of the syntax, and we must
take care to either replace them with valid unicode error glyphs or turn
the string of which they are a part into binary using the same codec and
error handler as we used to ingest them to begin with before emitting
them.  And, of course, we can't *modify* the sections containing the
smuggled bytes, only the syntax-matched sections that surround them; and
things like line wrapping are just an invitation to ugliness and bugs
even if you kept the smuggled bytes sections internally intact.
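
A small sketch of parsing "around" the smuggled bytes. The regex is a deliberately simplified stand-in, not the grammar the email package really uses, and the sample header is invented.

```python
import re

raw = b"Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!"
text = raw.decode("ascii", "surrogateescape")

# The parsing constants (field-name syntax and the colon) are plain ASCII,
# so they can never match a smuggled byte; ".*" just carries them along.
HEADER = re.compile(r"(?P<name>[A-Za-z-]+):[ \t]*(?P<value>.*)")

m = HEADER.fullmatch(text)
name, value = m.group("name"), m.group("value")

# At the output boundary, use the same codec and error handler as on input.
out = (name + ": " + value).encode("ascii", "surrogateescape")
assert out == raw
```

The value is never modified, only carried from input to output, which is the condition under which this works.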

Finally, to explain what I meant by "except that I do": when I added
back binary support to the email package in Python3, initially I *did
not change the parsing algorithms* in the code.  I just smuggled the
bytes, and then dealt with the encoding/decoding at the API boundaries.
This is the same principle used when dealing with filenames in the API
of Python itself.  *Except* at that boundary, I do not need to worry
about whether a particular string contains smuggled bytes or not.[*]

> If you, instead, represented the header as a list with some str
> elements and some bytes, it would be just as valid (though much harder
> to work with); all your manipulations are done on the str parts, and
> the bytes just tag along for the ride.

Quite a bit harder, which is why I don't do that.

> > You are right about the wrapping, though.  If a header with invalid
> > bytes (and in this scenario we *are* talking about errors) needs to
> > be wrapped, we have to first decode the smuggled bytes and turn it
> > into an 'unknown-8bit' encoded word before we can wrap the header.
> 
> Yeah, and that's going to be a bit messy. If you get 60 characters
> followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
> the middle of the smuggled section?

The point of RFC2047 encoded words is that they are an ASCII
representation of binary data, so once the bytes are "properly" Content
Transfer Encoded (as being in an unknown charset) the string contains no
smuggled bytes and can be wrapped.
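
To make that concrete, here is a hand-rolled sketch that turns each run of smuggled bytes into a base64 encoded word labelled unknown-8bit. The helper name is hypothetical, the email package's real output may differ, and the sketch ignores the RFC 2047 rules about whitespace around encoded words and their maximum length.

```python
import base64
import re

# Matches runs of the lone surrogates that surrogateescape produces.
SMUGGLED_RUN = re.compile("[\udc80-\udcff]+")


def encode_unknown_8bit(match):
    """Replace one run of smuggled bytes with a base64 encoded word."""
    raw = match.group().encode("ascii", "surrogateescape")
    payload = base64.b64encode(raw).decode("ascii")
    return "=?unknown-8bit?b?" + payload + "?="


value = b"\x9c\x80\xe2NOBODY expects".decode("ascii", "surrogateescape")
safe = SMUGGLED_RUN.sub(encode_unknown_8bit, value)

# The result is pure ASCII, so ordinary text tools can handle it.
safe.encode("ascii")
print(safe)
```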

--David

[*] I worried a lot that this was re-introducing the bytes/string
problem from python2.  The difference is that if the smuggled bytes
escape from the email API, that's a bug in the email package.  So user
code using the library is *not* in danger of getting mysterious encoding
errors when one day the input is international where before it was all
ASCII.  (Absent bugs in the library.)


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Jim Baker writes:

 > Given that Jython uses UTF-16 as its representation, it is frequently
 > possible to smuggle isolated surrogates in it. A surrogate pair must be a
 > high surrogate in range(D800, DC00), then a low surrogate in range(DC00,
 > E000).
 > 
 > Of course, if you do actually have a smuggled isolated high
 > surrogate FOLLOWED by a smuggled isolated low surrogate - guess
 > what, the only interpretation is a codepoint. Or perhaps more
 > likely garbage. Of course it doesn't happen so often, so maybe we
 > are fine with the occasional bug ;)

The CPython representation uses trailing surrogates only[1], so it's
never possible to interpret them as anything but non-characters -- as
soon as you encounter them you know that it's a lone surrogate.
Surely you can do the same.

As long as the Java string manipulation functions don't check for
surrogates, you should be fine with this representation.  Of course I
suppose your matching functions (etc) don't check for them either, so
you will be somewhat vulnerable to bugs due to treating them as
characters.  But the same is true for CPython, AFAIK.

Footnotes: 
[1]  Only 128 bytes are necessary since the 128 ASCII characters are
embedded in Unicode as-is.
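
The mapping can be checked directly in CPython; this sketch uses only the built-in ascii codec with the surrogateescape handler.

```python
high_bytes = bytes(range(0x80, 0x100))
smuggled = high_bytes.decode("ascii", "surrogateescape")

# Each of the 128 non-ASCII byte values maps to one trailing surrogate
# in the range U+DC80..U+DCFF.
assert [ord(ch) for ch in smuggled] == list(range(0xDC80, 0xDD00))

# ASCII bytes decode normally and are never smuggled.
plain = bytes(range(0x80)).decode("ascii", "surrogateescape")
assert all(ord(ch) < 0x80 for ch in plain)
```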




Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
R. David Murray writes:

 > > Do what, exactly? As I understand you, you treat the unknown bytes as
 > > completely opaque, not representing any characters at all. Which is
 > > what I'm saying: those are not characters.
 > 
 > Yes.  I thought you were saying that one could not treat the string with
 > smuggled bytes as if it were a string.

Guido's mantra is something like "Python's str doesn't contain
characters or even code points[1], it contains code units."  Implying
that dealing with characters (or the grapheme globs that occasionally
raise their ugly heads here) is an issue for higher-level facilities
than str to deal with.

The point being that

 > Basically, we are pretending that each smuggled byte is a single
 > character

is something of a misstatement (good enough for the present purpose of
discussing email, but not good enough for the general case of
understanding how this is supposed to work when porting the construct
to other Python implementations), while

 > for string parsing purposes...but they don't match any of our
 > parsing constants.

is precisely Pythonically correct.  You might want to add "because all
parsing constants contain only valid characters by construction."

 > [*] I worried a lot that this was re-introducing the bytes/string
 > problem from python2.

It isn't, because the bytes/str problem was that given a str object
out of context you could not tell whether it was a binary blob or
text, and if text, you couldn't tell if it was external encoded text
or internal abstract text.

That is not true here because the representations of characters vs.
smuggled bytes in str are disjoint sets.
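
Because the sets are disjoint, a str can be inspected without any outside context. A minimal sketch follows; the function name is hypothetical, and it assumes nobody has injected lone surrogates by other means such as chr().

```python
def has_smuggled_bytes(text):
    """True if text contains any code point in the surrogateescape range."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


clean = "\u201cNOBODY\u201d"  # curly quotes are ordinary characters
dirty = b"\x9cNOBODY".decode("utf-8", "surrogateescape")

print(has_smuggled_bytes(clean))  # False
print(has_smuggled_bytes(dirty))  # True
```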

Footnotes: 
[1]  In Unicode terminology, a code unit is the smallest computer
object that can represent a character (this is uniquely and sanely
defined for all real Unicode transformation formats aka UTFs).  A code
point is an integer 0 - (17*256*256-1) that can represent a character,
but many code points such as surrogates and 0x are defined to be
non-characters.  Characters are those code points that may be assigned
an interpretation as a character, including undefined characters
(private space and reserved).
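
The code unit / code point distinction can be checked from CPython with standard codecs only. The sample character is an arbitrary choice from outside the Basic Multilingual Plane.

```python
import sys

ch = "\U0001F600"  # one code point outside the Basic Multilingual Plane

assert len(ch) == 1                            # CPython str: one code point
assert len(ch.encode("utf-8")) == 4            # four 8-bit code units
assert len(ch.encode("utf-16-le")) // 2 == 2   # two 16-bit code units
assert len(ch.encode("utf-32-le")) // 4 == 1   # one 32-bit code unit

# The upper bound of the code point range given in the footnote.
assert 17 * 256 * 256 - 1 == 0x10FFFF == sys.maxunicode
```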



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 08:57:21 +0900, "Stephen J. Turnbull" wrote:
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation.  Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewhat vulnerable to bugs due to treating them as
> characters.  But the same is true for CPython, AFAIK.

From my point of view, the string function laxness is a feature, not a
bug.  But I get what you mean.

--David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray  wrote:
> Yes.  I thought you were saying that one could not treat the string with
> smuggled bytes as if it were a string.  (It's a string that can't be
> encoded unless you use the surrogateescape error handler, but it is
> still a string from Python's POV, which is the point of the error
> handler).
>
> Or, to put it another way, your implication was that there were no
> string operations that could be usefully applied to a string containing
> smuggled bytes, but that is not the case.  (I may well have read an
> implication that was not there; if so I apologize and you can ignore the
> rest of this :)

Ahh, I see where we are getting confused. What I said was that you
can't treat the string as a *pure* Unicode string. Parts of it are
Unicode text, parts of it aren't.

> Basically, we are pretending that each smuggled
> byte is a single character for string parsing purposes...but they don't
> match any of our parsing constants.  They are all "any character" matches
> in the regexes and what have you.

This is slightly iffy, as you can't be sure that one byte represents
one character, but as long as you don't much care about that, it's not
going to be an issue. I'm fairly sure you're never going to find an
encoding in which one unknown byte represents two characters, but
there are cases where it takes more than one byte to make up a
character (or the bytes are just shift codes or something). Does that
ever throw off your regexes? It wouldn't be an issue to a .* between
two character markers, but if you ever say .{5} then it might match
incorrectly.
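
A sketch of the .{5} concern, assuming UTF-8 data that was read with the ascii codec plus surrogateescape. The sample text is invented.

```python
import re

# "hello world" with two accented letters, encoded as UTF-8.
raw = "h\u00e9llo w\u00f6rld".encode("utf-8")
text = raw.decode("ascii", "surrogateescape")

# Each smuggled byte counts as one "character" for the regex, so five
# matched items cover only four real characters here.
m = re.match(".{5}", text)
prefix_bytes = m.group().encode("ascii", "surrogateescape")
prefix = prefix_bytes.decode("utf-8")

print(len(m.group()), len(prefix))  # 5 4
```

Had the cut fallen between the two bytes of the accented letter, the final decode would have failed instead.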

I think we're in agreement here, just using different words. :)

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Glenn Linderman

On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote:
> It isn't, because the bytes/str problem was that given a str object
> out of context you could not tell whether it was a binary blob or
> text, and if text, you couldn't tell if it was external encoded text
> or internal abstract text.
>
> That is not true here because the representations of characters vs.
> smuggled bytes in str are disjoint sets.


Actually, while it may be true that for the email headers case, all
bytes are characters, just the encoding is unknown, it is not
necessarily true that they are in disjoint sets. Some bytes may decode 
into characters without needing to be smuggled... maybe not in 
text-protocols like email, but in the general case. So then some of the 
bytes that should be interpreted as binary data are not in a disjoint 
set from characters.


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Glenn Linderman writes:

 > Some bytes may decode into characters without needing to be
 > smuggled... maybe not in text-protocols like email, but in the
 > general case. So then some of the bytes that should be interpreted
 > as binary data are not in a disjoint set from characters.

True, but irrelevant.  The point is that whoever chose the codec is
responsible for getting it right, not only the right encoding, but for
the assumption that the input data was pure encoded text.  The rest of
the program can now assume that choice was made correctly, and process
text as text.  The program cannot be blamed for assuming that the
person who chose the codec knew what they were about, and so
characters can be *assumed* to be decoded from bytes representing
characters.

This was not true in Python 2, where it was common practice to
represent encoded text by itself internally, implicitly assuming that
only one encoding would be encountered in each invocation of the
program.  This was never true, and with the spread of the Internet and
then the WWW, it became a major issue.  And that's why we invented
Python 3, to let text be text without the encumbrance of always being
aware of encodings and converting when different encodings collide,
etc.


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:

> > Basically, we are pretending that each smuggled
> > byte is a single character for string parsing purposes...but they don't
> > match any of our parsing constants.  They are all "any character" matches
> > in the regexes and what have you.
> 
> This is slightly iffy, as you can't be sure that one byte represents
> one character, but as long as you don't much care about that, it's not
> going to be an issue.

This discussion would probably be a lot more easy to follow, with fewer 
miscommunications, if there were some examples. Here is my example, 
perhaps someone can tell me if I'm understanding it correctly.

I want to send an email including the header line:

'Subject: “NOBODY expects the Spanish Inquisition!”'

Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I 
do the right thing and encode it as UTF-8:

b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

but my mail package, not being written in a language as awesome as 
Python, is just riddled with bugs, and somehow I end up with this 
corrupted byte-string instead:

b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

Note that the bytes of the first curly quote are in the wrong
order, but the second is okay. (Like I said, it's just *riddled* with 
bugs.) That means that trying to decode those bytes will fail in Python:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 9: invalid start byte

but it's not up to Python's email package to throw those invalid bytes
out or permanently replace them with something else. Also, we want to work
with Unicode strings, not byte strings, so there has to be a way to
smuggle those three bytes into Unicode, without ending up with either
the replacement characters:

# using the 'replace' error handler
'Subject: ���NOBODY expects the Spanish Inquisition!”'

or incorrectly interpreting them as valid, but wrong, code points. (If 
we do the second, we end up with two control characters "\x9c\x80" 
followed by "â".) We want to be able to round-trip back to the same 
bytes we received.

Am I right so far?

So the email package uses the surrogate-escape error handler and ends up 
with this Unicode string:

'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

which can be encoded back to the bytes we started with.
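
The whole example can be replayed with plain bytes.decode() calls. This is a sketch of the codec behaviour only; it says nothing about which codec the email package picks internally.

```python
raw = (b"Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!"
       b"\xe2\x80\x9d")

# Strict decoding fails on the first corrupted byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' decodes, but the original bytes are lost for good.
replaced = raw.decode("utf-8", "replace")
assert replaced.encode("utf-8") != raw

# Latin-1 accepts every byte, so it silently yields the wrong characters.
wrong = raw.decode("latin-1")

# surrogateescape decodes the valid closing quote and smuggles the three
# bad bytes, so the original bytes can be recovered exactly.
smuggled = raw.decode("utf-8", "surrogateescape")
assert smuggled.encode("utf-8", "surrogateescape") == raw
```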

Note that technically those three \u... code points are NOT classified 
as "noncharacters". They are actually surrogate code points:

http://www.unicode.org/faq/private_use.html#nonchar4
http://www.unicode.org/glossary/#surrogate_code_point

and they're supposed to be reserved for UTF-16. I'm not sure of the 
implication of that.


> I'm fairly sure you're never going to find an
> encoding in which one unknown byte represents two characters,

There are encodings which use a "shift" mechanism, whereby a byte X 
represents one character by default, and a different character after the 
shift mechanism. But I don't think that matters, since we're not able to 
interpret those bytes. If we were, we'd just decode them to a text 
string and be done with it.


> but
> there are cases where it takes more than one byte to make up a
> character (or the bytes are just shift codes or something). 

Multi-byte encodings are very common. All the Unicode encodings are 
multi-byte. So are many East Asian encodings.


> Does that
> ever throw off your regexes? It wouldn't be an issue to a .* between
> two character markers, but if you ever say .{5} then it might match
> incorrectly.

I don't think the idea is to match on these smuggled bytes specifically. 
I think the idea is to match *around* them. In the example above, we 
might match everything from "Subject: " to the end of the line. So long 
as we never end up with a situation where the smuggled bytes are 
replaced by something else, or shuffled around into different positions, 
we should be fine.

David, is my understanding correct?



-- 
Steven


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Akira Li
Steven D'Aprano  writes:

> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
>> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
>
>> > Basically, we are pretending that each smuggled
>> > byte is a single character for string parsing purposes...but they don't
>> > match any of our parsing constants.  They are all "any character" matches
>> > in the regexes and what have you.
>> 
>> This is slightly iffy, as you can't be sure that one byte represents
>> one character, but as long as you don't much care about that, it's not
>> going to be an issue.
>
> This discussion would probably be a lot more easy to follow, with fewer 
> miscommunications, if there were some examples. Here is my example, 
> perhaps someone can tell me if I'm understanding it correctly.
>
> I want to send an email including the header line:
>
> 'Subject: “NOBODY expects the Spanish Inquisition!”'
>

  >>> from email.header import Header
  >>> h = Header('Subject: “NOBODY expects the Spanish Inquisition!”')
  >>> # Header.encode()'s first positional parameter is splitchars,
  >>> # not a charset.
  >>> h.encode('utf-8')
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
  >>> h.encode()
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
  >>> h.encode('ascii')
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='


--
Akira



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Steven D'Aprano writes:

[long example]

 > Am I right so far?
 > 
 > So the email package uses the surrogate-escape error handler and ends up 
 > with this Unicode string:
 > 
 > 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
 > 
 > which can be encoded back to the bytes we started with.

Yes.

 > Note that technically those three \u... code points are NOT classified 
 > as "noncharacters".

Very unpythonic terminology, easily confusing the nonspecialist.  Or
the specialist -- I used to know that Unicode gave "noncharacter" a
technical definition but it seems I forgot.  But then, Unicode isn't a
PSF product, so I guess it's OK to be unpythonic.

 > They are actually surrogate code points:
 > 
 > http://www.unicode.org/faq/private_use.html#nonchar4
 > http://www.unicode.org/glossary/#surrogate_code_point
 > 
 > and they're supposed to be reserved for UTF-16. I'm not sure of the 
 > implication of that.

It means that any Python program that invokes the surrogateescape
handler is not a "conforming Unicode process", at least not on the
naive interpretation of that definition.  A conforming process would
interpret them as corrupt characters and raise as soon as detected.

A more sophisticated interpretation might argue that Python is
multiple processes (in the sense of "process" used by Unicode), and
that the Unicode standard only applies to characters.  This is
especially true of Pythons implementing PEP 393, since no surrogates
should ever appear in text[1] at all.  Then the smuggled bytes can be
treated as noncharacters in practice although technically it's a
violation of the Unicode standard to do so.
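
In CPython terms, the strict codecs behave like the conforming process described above and reject lone surrogates, while the surrogateescape handler is the explicit opt-out. A short sketch with an invented sample string:

```python
text = "Subject: \udc9c\udc80\udce2NOBODY"

# A strict UTF-8 encode refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# The same handler that smuggled the bytes in can take them back out.
assert text.encode("utf-8", "surrogateescape") == b"Subject: \x9c\x80\xe2NOBODY"

# chr() will happily build a lone surrogate, which is why injecting one
# by hand makes it look exactly like a smuggled byte.
assert chr(0xDC9C) == "\udc9c"
```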

Footnotes: 
[1]  Meaning, no fair using chr() to inject them into str!
