Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Stephen J. Turnbull
Steven D'Aprano writes:

[long example]

  Am I right so far?
  
  So the email package uses the surrogate-escape error handler and ends up 
  with this Unicode string:
  
  'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
  
  which can be encoded back to the bytes we started with.

Yes.
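
For readers following along, here is a minimal sketch of that round trip using only the built-in codecs, not the email package itself. The corrupted byte order is the one implied by the surrogates shown above, and the variable names are illustrative.

```python
# Corrupted header: the opening quote's UTF-8 bytes arrive in the wrong order.
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Each undecodable byte 0xNN is smuggled into the str as U+DCNN.
text = raw.decode('utf-8', errors='surrogateescape')
print(ascii(text))

# Encoding with the same codec and handler recovers the original bytes.
assert text.encode('utf-8', errors='surrogateescape') == raw
```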

  Note that technically those three \u... code points are NOT classified 
  as noncharacters.

Very unpythonic terminology, easily confusing the nonspecialist.  Or
the specialist -- I used to know that Unicode gave noncharacter a
technical definition but it seems I forgot.  But then, Unicode isn't a
PSF product, so I guess it's OK to be unpythonic. <wink/>

  They are actually surrogate code points:
  
  http://www.unicode.org/faq/private_use.html#nonchar4
  http://www.unicode.org/glossary/#surrogate_code_point
  
  and they're supposed to be reserved for UTF-16. I'm not sure of the 
  implication of that.

It means that any Python program that invokes the surrogateescape
handler is not a conforming Unicode process, at least not on the
naive interpretation of that definition.  A conforming process would
interpret them as corrupt characters and raise as soon as detected.

A more sophisticated interpretation might argue that Python is
multiple processes (in the sense of "process" used by Unicode), and
that the Unicode standard only applies to characters.  This is
especially true of Pythons implementing PEP 393, since no surrogates
should ever appear in text[1] at all.  Then the smuggled bytes can be
treated as noncharacters in practice although technically it's a
violation of the Unicode standard to do so.

Footnotes: 
[1]  Meaning, no fair using chr() to inject them into str!
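
A small sketch of where the non-conformance shows up in practice, assuming CPython 3.3 or later: the str type accepts a lone surrogate, and it is the strict encoder at the boundary that raises.

```python
# chr() will happily build a str holding a lone surrogate code point...
smuggled = chr(0xDC80)
assert len(smuggled) == 1

# ...but a strict UTF-8 encode refuses it, since surrogates are not
# valid in any UTF.
try:
    smuggled.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the matching error handler turns it back into the byte 0x80.
recovered = smuggled.encode('utf-8', errors='surrogateescape')
```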

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread R. David Murray
On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano st...@pearwood.info wrote:
 On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
  On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com 
  wrote:
 
   Basically, we are pretending that each smuggled
   byte is a single character for string parsing purposes...but they don't
   match any of our parsing constants.  They are all "any character" matches
   in the regexes and what have you.
  
  This is slightly iffy, as you can't be sure that one byte represents
  one character, but as long as you don't much care about that, it's not
  going to be an issue.
 
 This discussion would probably be a lot more easy to follow, with fewer 
 miscommunications, if there were some examples. Here is my example, 
 perhaps someone can tell me if I'm understanding it correctly.
 
 I want to send an email including the header line:
 
 'Subject: “NOBODY expects the Spanish Inquisition!”'
 
 Note the curly quotes. I've read the manifesto UTF-8 Everywhere so I 
 do the right thing and encode it as UTF-8:
 
 b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

That won't work until email supports RFC 6532.  Until then, you can only
use ascii and encoded words successfully.  So just having the curly
quotes is a buggy enough program.

 but it's not up to Python's email package to throw those invalid bytes 
 out or permanently replace them with something else. Also, we want to work 
 with Unicode strings, not byte strings, so there has to be a way to 
 smuggle those three bytes into Unicode, without ending up with either 
 the replacement bytes:
 
 # using the 'replace' error handler
 'Subject: ���NOBODY expects the Spanish Inquisition!”'

What you'll get if you request a text copy of that header is

  'Subject: ���NOBODY expects the Spanish Inquisition!���'

 Am I right so far?
 
 So the email package uses the surrogate-escape error handler and ends up 
 with this Unicode string:
 
 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

Except that it encodes the closing quote, too :)

 which can be encoded back to the bytes we started with.

Right.  If you serialize the message as bytes, the bytes are recovered
and output when that header is output.

Now, once we support RFC 6532, you will be exactly right, as we will
then have the option of handling utf-8 encoded headers, and we will do
that using the utf-8 codec to ingest headers, and the surrogateescape
error handler to handle exactly the kind of bad data you postulate.
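
A hedged sketch of the bytes round trip described above, assuming the default (compat32) policy with its 8bit cte_type; the message content is made up for illustration.

```python
import email

raw = (b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d\n'
       b'\n'
       b'body\n')

# The bytes parser keeps the non-ASCII header bytes inside the message
# object instead of discarding or replacing them.
msg = email.message_from_bytes(raw)

# Serializing as bytes writes the header's original bytes back out.
out = msg.as_bytes()
```

Asking for the header as text is a different matter, as noted above; this sketch only exercises the bytes-in, bytes-out path.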

--David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread R. David Murray
Sorry for the mojibake.  I've not yet gotten around to actually using
the email package to write a smarter replacement for nmh, which is what
I use for email, and I always forget that I need to manually tell nmh
when there is non-ascii in the message...

On Wed, 17 Sep 2014 03:02:33 -0400, R. David Murray rdmur...@bitdance.com 
wrote:
 [full quote snipped]


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:

 Guido's mantra is something like "Python's str doesn't contain
 characters or even code points[1], it contains code units."

But is that true? If it were true, I would expect to be able to make 
Python text strings containing code units that aren't code points, e.g. 
something like "\U12345678" or chr(0x12345678) should work, but neither 
do. As far as I can tell, there is no way to build a string containing 
items which aren't code points.

I don't think it is useful to say that strings *contain* code units, 
more that they *are made up from* code units. Code units are the 
implementation: 16-bit code units in narrow builds, 32-bit code units 
in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and 
beyond. (I don't know of any Python implementation which uses UTF-8 
internally, but if there was one, it would use 8-bit code units.)

It isn't very useful to say that in Python 3.3 the string "A" *contains*
the 8-bit code unit 0x41. That's conflating two different levels of 
explanation (the high-level interface and the underlying implementation) 
and potentially leads to user confusion like

# 8-bit code units are bytes, right?
assert b'\x41' in "A"

which is Not Even Wrong.
http://rationalwiki.org/wiki/Not_even_wrong

I think it is correct to say that Python strings are sequences of 
Unicode code points U+0000 through U+10FFFF. There are no other 
restrictions, e.g. strings can contain surrogates, noncharacters, or 
nonsensical combinations of code points such as a U+0300 COMBINING GRAVE 
ACCENT combined with U+000A (newline).
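
A quick sketch of that claim, assuming CPython 3.3 or later:

```python
# An astral code point is a single item in the string (Python 3.3+).
assert len('\U0001F600') == 1

# Surrogates, noncharacters and odd combinations are all accepted.
odd = '\ud800' + '\uffff' + '\n\u0300'
assert len(odd) == 4

# Anything above U+10FFFF is not a code point and cannot be built.
try:
    chr(0x110000)
    built = True
except ValueError:
    built = False
```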


 Implying
 that dealing with characters (or the grapheme globs that occasionally
 raise their ugly heads here) is an issue for higher-level facilities
 than str to deal with.

Agreed that Python doesn't offer a string type based on graphemes, and 
that such a facility belongs as a high-level library, not a built-in 
type.

Also agreed that talking about characters is sloppy. Nevertheless, for 
English speakers at least, code point = character isn't too awful a 
first approximation.


 The point being that
 
   Basically, we are pretending that each smuggled byte is a single
   character
 
 is something of a misstatement (good enough for present purpose of
 discussing email, but not good enough for the general case of
 understanding how this is supposed to work when porting the construct
 to other Python implementations), while
 
   for string parsing purposes...but they don't match any of our
   parsing constants.
 
 is precisely Pythonically correct.  You might want to add "because all
 parsing constants contain only valid characters by construction."

I don't understand what you are trying to say here.


   [*] I worried a lot that this was re-introducing the bytes/string
   problem from python2.
 
 It isn't, because the bytes/str problem was that given a str object
 out of context you could not tell whether it was a binary blob or
 text, and if text, you couldn't tell if it was external encoded text
 or internal abstract text.
 
 That is not true here because the representations of characters vs.
 smuggled bytes in str are disjoint sets.

Nor am I sure what you are trying to say here either.
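
One possible reading of the "disjoint sets" remark, sketched in code: bytes smuggled by surrogateescape always land in U+DC80..U+DCFF, a range no validly decoded text occupies, so they can be picked out of a str without outside context. The helper name is hypothetical.

```python
def smuggled_bytes(text):
    """Return the bytes hidden in text by surrogateescape, in order."""
    # surrogateescape maps byte 0xNN (0x80..0xFF) to code point U+DCNN.
    return bytes(ord(ch) - 0xDC00 for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF)

header = b'Subject: caf\xe9'.decode('ascii', errors='surrogateescape')
print(smuggled_bytes(header))
```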


 Footnotes: 
 [1]  In Unicode terminology, a code unit is the smallest computer
 object that can represent a character (this is uniquely and sanely
 defined for all real Unicode transformation formats aka UTFs).  A code
 point is an integer 0 - (17*256*256-1) that can represent a character,
 but many code points such as surrogates and 0xFFFF are defined to be
 non-characters.

Actually not quite. Noncharacter is concretely defined in Unicode, and 
there are only 66 of them, many fewer than the surrogate code points 
alone. Surrogates are reserved, not noncharacters.

http://www.unicode.org/glossary/#surrogate_code_point
http://www.unicode.org/faq/private_use.html#nonchar1
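
The two counts can be checked with a few lines of arithmetic. This sketch builds the sets from the ranges given in the Unicode FAQ linked above.

```python
# Noncharacters: U+FDD0..U+FDEF plus the last two code points of each
# of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
noncharacters = set(range(0xFDD0, 0xFDF0))
for plane in range(17):
    noncharacters.update((plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF))

# Surrogate code points: U+D800..U+DFFF.
surrogates = range(0xD800, 0xE000)

print(len(noncharacters), len(surrogates))
```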

It is wrong to talk about surrogate characters, but perhaps you mean 
to say that surrogates (by which I understand you to mean surrogate code 
points) are not human-meaningful characters, which is not the same 
thing as a Unicode noncharacter.


 Characters are those code points that may be assigned
 an interpretation as a character, including undefined characters
 (private space and reserved).

So characters are code points which are characters, including undefined 
characters? :-)

http://www.unicode.org/glossary/#character



-- 
Steven


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Antoine Pitrou

Seriously, can this discussion move somewhere else?
This has nothing to do on python-dev.

Thank you

Antoine.



On Wed, 17 Sep 2014 18:56:02 +1000
Steven D'Aprano st...@pearwood.info wrote:
 [full quote snipped]

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Martin v. Löwis
Am 17.09.14 10:56, schrieb Steven D'Aprano:
 On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
 
 Guido's mantra is something like "Python's str doesn't contain
 characters or even code points[1], it contains code units."
 
 But is that true?

It used to be true, and stopped being so with PEP 393. In particular,
Python 3.2 and before would expose UTF-16 in the narrow build, so the
elements of a string would be code units. Since Python 3.3, the
surrogate code points are no longer interpreted as UTF-16 code units.
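
A minimal sketch of the difference, assuming Python 3.3 or later. The UTF-16 encoding step stands in for what a narrow build used to expose directly.

```python
ch = '\U0001F600'   # one astral code point

# Python 3.3+ (PEP 393): indexing and len() count code points.
assert len(ch) == 1

# In UTF-16 the same code point takes two 16-bit code units, a
# surrogate pair; a narrow build exposed these as two str items.
data = ch.encode('utf-16-be')
units = [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]
print([hex(u) for u in units])
```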

Regards,
Martin



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Stephen J. Turnbull
Steven D'Aprano writes:
  On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
  
   Guido's mantra is something like "Python's str doesn't contain
   characters or even code points[1], it contains code units."
  
  But is that true?

It's not.  That's why I wrote the slightly pejorative "mantra" and
qualified it with "something like".  The precise statement is
something like "the array property is more important than preserving
character boundaries, so slices etc. are allowed to do unexpected or
even evil things in the presence of astral characters in UTF-16
representations."

  I don't understand what you are trying to say here.

  Nor am I sure what you are trying to say here either.

We can discuss this off-list if you would like.  The natives are
getting restless.

   non-characters.
  
  Actually not quite. Noncharacter

Note the hyphen!  (Just kidding, I will avoid that terminology in the
future.  I knew, but forgot.)

   Characters are those code points that may be assigned
   an interpretation as a character, including undefined characters
   (private space and reserved).
  
  So characters are code points which are characters, including undefined 
  characters? :-)

No, there's a clear hierarchy here.



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico ros...@gmail.com wrote:
 On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull step...@xemacs.org 
 wrote:
  Jim J. Jewett writes:
 
   In terms of best-effort, it is reasonable to treat the smuggled bytes
   as representing a character outside of your unicode repertoire
 
  I have to disagree. If you ever end up passing them to something that
  validates or tries to reencode them without surrogateescape, BOOM!
  These things are the text equivalent of IEEE NaNs.  If all you know
  (as in the stdlib) is that you have generic text, the only fairly
  safe things to do with them are (1) delete them, (2) substitute an
  appropriate replacement character for them, (3) pass the text
  containing them verbatim to other code, and (4) reencode them using
  the same codec they were read with.
 
 Don't forget, these are *errors*. These are bytes that cannot be
 correctly decoded. That's not something that has any meaning
 whatsoever in text; so by definition, the only things you can do are
 the four you list there (as long as codec means both the choice of
 encoding and the use of the surrogateescape flag). It's like dealing
 with control characters when you need to print something visually,
 except that they have an official solution [1] and surrogateescape is
 unofficial. They're not real text, so you have to deal with them
 somehow.

That isn't the case in the email package.  The smuggled bytes are not
errors[*], they are literally smuggled bytes.  But, as Stephen said, the
only things email does with them are the last three of the four he
listed (if you read (3) as passing it between parts of the email
package): the data comes in as text mixed with binary, and the email
package parses it until it knows what the binary is supposed to be,
turns it back into bytes, and decodes it properly.  The goal is to never
let the smuggled bytes escape out the email APIs as surrogateescape
encoded text; though, in practice, this being consenting-adults Python
and code not being bug free, there are places where people have used the
knowledge of how surrogateescape is used by email to work around both
API and code bugs.

--David

[*] Some of the encoded bytes *are* errors (non-ascii in headers or
undecodable bytes in whatever the CTE/charset is), and in that case
email may just turn them back into error bytes in the output, but only
*some* of the smuggled bytes are actually errors (and none are if the
message is RFC compliant).
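
The four "fairly safe" operations listed in the quoted text can be sketched with the built-in codecs and a regex. The sample bytes and the name SMUGGLED are illustrative, not anything the email package defines.

```python
import re

# Code points produced by surrogateescape for bytes 0x80..0xFF.
SMUGGLED = re.compile('[\udc80-\udcff]')

text = b'caf\xe9 au lait'.decode('ascii', errors='surrogateescape')

deleted = SMUGGLED.sub('', text)            # (1) delete them
replaced = SMUGGLED.sub('\ufffd', text)     # (2) substitute U+FFFD
passed_on = text                            # (3) pass verbatim to other code
reencoded = text.encode('ascii', errors='surrogateescape')  # (4) same codec
```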


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray rdmur...@bitdance.com wrote:
 That isn't the case in the email package.  The smuggled bytes are not
 errors[*], they are literally smuggled bytes.

But they're not characters, which is what Stephen and I were saying -
and contrary to what Jim said about treating them as characters. At
best, they represent characters but in some encoding other than the
one you're using, and you have no idea how many bytes form a character
or anything. So you can't, for instance, word-wrap the text, because
you can't know how wide these unknown bytes are, whether they
represent spaces (wrap points), or newlines, or anything like that.
You can't treat them as characters, so while you have them in your
string, you can't treat it as a pure Unicode string - it's a Unicode
string with smuggled bytes.
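
A short sketch of the counting problem: one real character can show up as several smuggled items, so the length of the str says nothing about the width of the text.

```python
quote = '\u201c'                     # LEFT DOUBLE QUOTATION MARK, one character
raw = quote.encode('utf-8')          # three bytes: e2 80 9c

# Decoded with the wrong codec, every byte becomes its own escaped
# code point, so the str is three items long for one real character.
smuggled = raw.decode('ascii', errors='surrogateescape')
print(len(quote), len(raw), len(smuggled))
```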

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico ros...@gmail.com wrote:
 On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray rdmur...@bitdance.com 
 wrote:
  That isn't the case in the email package.  The smuggled bytes are not
  errors[*], they are literally smuggled bytes.
 
 But they're not characters, which is what Stephen and I were saying -
 and contrary to what Jim said about treating them as characters. At
 best, they represent characters but in some encoding other than the
 one you're using, and you have no idea how many bytes form a character
 or anything. So you can't, for instance, word-wrap the text, because
 you can't know how wide these unknown bytes are, whether they
 represent spaces (wrap points), or newlines, or anything like that.
 You can't treat them as characters, so while you have them in your
 string, you can't treat it as a pure Unicode string - it's a Unicode
 string with smuggled bytes.

Well, except that I do.  The email header parsing algorithms all work
fine if I treat the surrogate escaped bytes as 'unknown junk' and just
parse based on the valid unicode.  (Unless the header is so garbled that
it can't be parsed, of course, at which point it becomes an invalid
header).

You are right about the wrapping, though.  If a header with invalid
bytes (and in this scenario we *are* talking about errors) needs to
be wrapped, we have to first decode the smuggled bytes and turn it
into an 'unknown-8bit' encoded word before we can wrap the header.

--David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray rdmur...@bitdance.com wrote:
 You can't treat them as characters, so while you have them in your
 string, you can't treat it as a pure Unicode string - it's a Unicode
 string with smuggled bytes.

 Well, except that I do.  The email header parsing algorithms all work
 fine if I treat the surrogate escaped bytes as 'unknown junk' and just
 parse based on the valid unicode.  (Unless the header is so garbled that
 it can't be parsed, of course, at which point it becomes an invalid
 header).

Do what, exactly? As I understand you, you treat the unknown bytes as
completely opaque, not representing any characters at all. Which is
what I'm saying: those are not characters.

If you, instead, represented the header as a list with some str
elements and some bytes, it would be just as valid (though much harder
to work with); all your manipulations are done on the str parts, and
the bytes just tag along for the ride.
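
A sketch of that alternative representation, to show the two views carry the same information. The helper split_smuggled is hypothetical and not part of the email package.

```python
import re

def split_smuggled(text):
    """Split text into str runs and bytes runs (hypothetical helper)."""
    parts = []
    # The capturing group makes re.split keep the smuggled runs too.
    for piece in re.split('([\udc80-\udcff]+)', text):
        if not piece:
            continue
        if '\udc80' <= piece[0] <= '\udcff':
            parts.append(piece.encode('ascii', errors='surrogateescape'))
        else:
            parts.append(piece)
    return parts

header = b'Subject: \xe2\x80\x9cHi\xe2\x80\x9d'.decode('ascii', errors='surrogateescape')
print(split_smuggled(header))
```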

 You are right about the wrapping, though.  If a header with invalid
 bytes (and in this scenario we *are* talking about errors) needs to
 be wrapped, we have to first decode the smuggled bytes and turn it
 into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters
followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
the middle of the smuggled section?

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Jim Baker
Great points here - I especially like the concluding statement "you can't
treat it as a pure Unicode string - it's a Unicode string with smuggled
bytes".

Given that Jython uses UTF-16 as its representation, it is possible to
frequently smuggle isolated surrogates in it. A surrogate pair must be a
high surrogate in range(0xD800, 0xDC00), then a low surrogate in
range(0xDC00, 0xE000). So one can likely assign an interpretation that this
is in fact the isolated surrogate, and not an actual codepoint.

Of course, if you do actually have a smuggled isolated high surrogate
FOLLOWED by a smuggled isolated low surrogate - guess what, the only
interpretation is a codepoint. Or perhaps more likely garbage. Of course it
doesn't happen so often, so maybe we are fine with the occasional bug ;)
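
To illustrate the ambiguity in a UTF-16 representation, here is the standard surrogate-pair arithmetic as a small sketch. The function name is made up for the example.

```python
def combine(high, low):
    """Combine a UTF-16 surrogate pair into one code point."""
    assert 0xD800 <= high < 0xDC00 and 0xDC00 <= low < 0xE000
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Two 16-bit units that are each an isolated surrogate on their own
# become indistinguishable from one astral code point when adjacent.
print(hex(combine(0xD83D, 0xDE00)))
```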

I personally suspect that we will resolve this by also supporting UCS-4 as
a representation in Jython 3.x for such Unicode strings, albeit with the
limitation that we have simply moved the problem to when we try to call
Java methods taking java.lang.String objects.

- Jim

On Tue, Sep 16, 2014 at 9:27 AM, Chris Angelico ros...@gmail.com wrote:

 [full quote snipped]

-- 
- Jim

jim.baker@{colorado.edu|python.org|rackspace.com|zyasoft.com}
twitter.com/jimbaker
github.com/jimbaker
bitbucket.com/jimbaker


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker jim.ba...@python.org wrote:
 Of course, if you do actually have a smuggled isolated high surrogate
 FOLLOWED by a smuggled isolated low surrogate - guess what, the only
 interpretation is a codepoint. Or perhaps more likely garbage. Of course it
 doesn't happen so often, so maybe we are fine with the occasional bug ;)

 I personally suspect that we will resolve this by also supporting UCS-4 as a
 representation in Jython 3.x for such Unicode strings, albeit with the
 limitation that we have simply moved the problem to when we try to call Java
 methods taking java.lang.String objects.


That'll cost efficiency, of course, but it'll guarantee correctness.
And maybe, just maybe, you'll be able to put some pressure on Java
itself to start supporting UCS-4 natively...

One can dream.

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico ros...@gmail.com wrote:
 On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray rdmur...@bitdance.com 
 wrote:
  You can't treat them as characters, so while you have them in your
  string, you can't treat it as a pure Unicode string - it's a Unicode
  string with smuggled bytes.
 
  Well, except that I do.  The email header parsing algorithms all work
  fine if I treat the surrogate escaped bytes as 'unknown junk' and just
  parse based on the valid unicode.  (Unless the header is so garbled that
  it can't be parsed, of course, at which point it becomes an invalid
  header).
 
 Do what, exactly? As I understand you, you treat the unknown bytes as
 completely opaque, not representing any characters at all. Which is
 what I'm saying: those are not characters.

Yes.  I thought you were saying that one could not treat the string with
smuggled bytes as if it were a string.  (It's a string that can't be
encoded unless you use the surrogateescape error handler, but it is
still a string from Python's POV, which is the point of the error
handler).

Or, to put it another way, your implication was that there were no
string operations that could be usefully applied to a string containing
smuggled bytes, but that is not the case.  (I may well have read an
implication that was not there; if so I apologize and you can ignore the
rest of this :)  Basically, we are pretending that each smuggled
byte is a single character for string parsing purposes...but they don't
match any of our parsing constants.  They are all "any character" matches
in the regexes and what have you.  Of course, this only works in
contexts where we can ignore or carry along the smuggled bytes as
being components of arbitrary text portions of the syntax, and we must
take care to either replace them with valid unicode error glyphs or turn
the string of which they are a part into binary using the same codec and
error handler as we used to ingest them to begin with before emitting
them.  And, of course, we can't *modify* the sections containing the
smuggled bytes, only the syntax-matched sections that surround them; and
things like line wrapping are just an invitation to ugliness and bugs
even if you kept the smuggled bytes sections internally intact.
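
A minimal sketch of that parse-around-the-junk idea follows. It is not the email package's actual code: the HEADER pattern and the sample bytes are made up for illustration, and a real header parser handles far more syntax than this.

```python
import re

# Hypothetical wire data: a Subject header containing three undecodable bytes.
raw = b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\r\n'

# Ingest: each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
text = raw.decode('ascii', 'surrogateescape')

# Parse on the valid syntax only. The smuggled bytes match no parsing
# constant; they are simply swallowed by the "any character" part.
HEADER = re.compile(r'(?P<name>[!-9;-~]+):[ \t]*(?P<value>.*?)\r?\n\Z',
                    re.DOTALL)
m = HEADER.match(text)
name, value = m.group('name'), m.group('value')

# Emit: same codec and same error handler as were used on ingest.
rebuilt = (name + ': ' + value + '\r\n').encode('ascii', 'surrogateescape')
```

The value is carried along untouched between ingest and emit, so the original bytes come back out unchanged.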

Finally, to explain what I meant by "except that I do": when I added
back binary support to the email package in Python3, initially I *did
not change the parsing algorithms* in the code.  I just smuggled the
bytes, and then dealt with the encoding/decoding at the API boundaries.
This is the same principle used when dealing with filenames in the API
of Python itself.  *Except* at that boundary, I do not need to worry
about whether a particular string contains smuggled bytes or not.[*]

 If you, instead, represented the header as a list with some str
 elements and some bytes, it would be just as valid (though much harder
 to work with); all your manipulations are done on the str parts, and
 the bytes just tag along for the ride.

Quite a bit harder, which is why I don't do that.

  You are right about the wrapping, though.  If a header with invalid
  bytes (and in this scenario we *are* talking about errors) needs to
  be wrapped, we have to first decode the smuggled bytes and turn it
  into an 'unknown-8bit' encoded word before we can wrap the header.
 
 Yeah, and that's going to be a bit messy. If you get 60 characters
 followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
 the middle of the smuggled section?

The point of RFC2047 encoded words is that they are an ASCII
representation of binary data, so once the bytes are properly Content
Transfer Encoded (as being in an unknown charset) the string contains no
smuggled bytes and can be wrapped.
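
As a rough illustration of that step (a hand-rolled sketch, not what the email package does internally), the smuggled bytes can be recovered and re-expressed as a base64 encoded word. The helper name as_unknown_8bit is hypothetical, and the sketch assumes the fragment is ASCII apart from the smuggled bytes.

```python
import base64

PREFIX = '=?unknown-8bit?b?'

def as_unknown_8bit(fragment):
    """Return an ASCII-only RFC 2047 encoded word for `fragment`.

    Assumes `fragment` contains only ASCII characters and bytes that
    were smuggled in with the surrogateescape error handler.
    """
    raw = fragment.encode('ascii', 'surrogateescape')   # recover the bytes
    payload = base64.b64encode(raw).decode('ascii')     # ASCII representation
    return PREFIX + payload + '?='

smuggled = b'\x9c\x80\xe2NOBODY'.decode('ascii', 'surrogateescape')
word = as_unknown_8bit(smuggled)
```

The result contains no surrogates at all, so it can be folded like any other ASCII token. RFC 2047 also caps an encoded word at 75 characters, so a longer fragment would have to be split into several words; the sketch does not do that.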

--David

[*] I worried a lot that this was re-introducing the bytes/string
problem from python2.  The difference is that if the smuggled bytes
escape from the email API, that's a bug in the email package.  So user
code using the library is *not* in danger of getting mysterious encoding
errors when one day the input is international where before it was all
ASCII.  (Absent bugs in the library.)


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Jim Baker writes:

  Given that Jython uses UTF-16 as its representation, it is possible to
  frequently smuggle isolated surrogates in it. A surrogate pair must be a
  high (leading) surrogate in range(D800, DC00), then a low (trailing)
  surrogate in range(DC00, E000).
  
  Of course, if you do actually have a smuggled isolated high
  surrogate FOLLOWED by a smuggled isolated low surrogate - guess
  what, the only interpretation is a codepoint. Or perhaps more
  likely garbage. Of course it doesn't happen so often, so maybe we
  are fine with the occasional bug ;)

The CPython representation uses trailing surrogates only[1], so it's
never possible to interpret them as anything but non-characters -- as
soon as you encounter them you know that it's a lone surrogate.
Surely you can do the same.

As long as the Java string manipulation functions don't check for
surrogates, you should be fine with this representation.  Of course I
suppose your matching functions (etc) don't check for them either, so
you will be somewhat vulnerable to bugs due to treating them as
characters.  But the same is true for CPython, AFAIK.
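
The pairing hazard and the reason trailing-only smuggling avoids it can be seen from CPython itself. This is only a sketch: it uses the 'surrogatepass' handler on 'utf-16-le' to mimic what a UTF-16 based string type would store, and the particular code points are arbitrary examples.

```python
# A leading surrogate followed by a trailing one is a valid UTF-16 pair,
# so once stored as UTF-16 code units the two read back as ONE code point.
lead, trail = '\ud83d', '\ude00'
utf16 = (lead + trail).encode('utf-16-le', 'surrogatepass')
merged = utf16.decode('utf-16-le')      # a single astral code point

# surrogateescape only ever produces U+DC80..U+DCFF, all of them trailing
# surrogates, so two smuggled bytes side by side can never form a pair.
smuggled = b'\xed\xa0'.decode('ascii', 'surrogateescape')
only_trailing = all('\udc80' <= ch <= '\udcff' for ch in smuggled)
```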

Footnotes: 
[1]  Only 128 code points are needed for this, since the 128 ASCII byte
values are embedded in Unicode as-is.




Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
R. David Murray writes:

   Do what, exactly? As I understand you, you treat the unknown bytes as
   completely opaque, not representing any characters at all. Which is
   what I'm saying: those are not characters.
  
  Yes.  I thought you were saying that one could not treat the string with
  smuggled bytes as if it were a string.

Guido's mantra is something like "Python's str doesn't contain
characters or even code points[1], it contains code units."  Implying
that dealing with characters (or the grapheme globs that occasionally
raise their ugly heads here) is an issue for higher-level facilities
than str to deal with.

The point being that

  Basically, we are pretending that each smuggled byte is a single
  character

is something of a misstatement (good enough for present purpose of
discussing email, but not good enough for the general case of
understanding how this is supposed to work when porting the construct
to other Python implementations), while

  for string parsing purposes...but they don't match any of our
  parsing constants.

is precisely Pythonically correct.  You might want to add "because all
parsing constants contain only valid characters by construction."

  [*] I worried a lot that this was re-introducing the bytes/string
  problem from python2.

It isn't, because the bytes/str problem was that given a str object
out of context you could not tell whether it was a binary blob or
text, and if text, you couldn't tell if it was external encoded text
or internal abstract text.

That is not true here because the representations of characters vs.
smuggled bytes in str are disjoint sets.
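
Because the sets are disjoint, a str can be inspected for smuggled bytes without any outside context. A small sketch, with hypothetical helper names, assuming nobody has injected lone surrogates by other means such as chr():

```python
def has_smuggled_bytes(s):
    """True if s contains any code point in the PEP 383 range."""
    return any('\udc80' <= ch <= '\udcff' for ch in s)

def smuggled_positions(s):
    """Return (index, original_byte) for every smuggled byte in s."""
    return [(i, ord(ch) - 0xdc00)
            for i, ch in enumerate(s)
            if '\udc80' <= ch <= '\udcff']

clean = 'Subject: \u201cNOBODY\u201d'
dirty = b'Subject: \x9c\x80\xe2NOBODY'.decode('utf-8', 'surrogateescape')
```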

Footnotes: 
[1]  In Unicode terminology, a code unit is the smallest computer
object that can represent a character (this is uniquely and sanely
defined for all real Unicode transformation formats aka UTFs).  A code
point is an integer 0 - (17*256*256-1) that can represent a character,
but many code points such as surrogates and 0x are defined to be
non-characters.  Characters are those code points that may be assigned
an interpretation as a character, including undefined characters
(private space and reserved).



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 08:57:21 +0900, Stephen J. Turnbull step...@xemacs.org 
wrote:
 As long as the Java string manipulation functions don't check for
 surrogates, you should be fine with this representation.  Of course I
 suppose your matching functions (etc) don't check for them either, so
 you will be somewhat vulnerable to bugs due to treating them as
 characters.  But the same is true for CPython, AFAIK.

From my point of view, the string function laxness is a feature, not a
bug.  But I get what you mean.

--David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com wrote:
 Yes.  I thought you were saying that one could not treat the string with
 smuggled bytes as if it were a string.  (It's a string that can't be
 encoded unless you use the surrogateescape error handler, but it is
 still a string from Python's POV, which is the point of the error
 handler).

 Or, to put it another way, your implication was that there were no
 string operations that could be usefully applied to a string containing
 smuggled bytes, but that is not the case.  (I may well have read an
 implication that was not there; if so I apologize and you can ignore the
 rest of this :)

Ahh, I see where we are getting confused. What I said was that you
can't treat the string as a *pure* Unicode string. Parts of it are
Unicode text, parts of it aren't.

 Basically, we are pretending that each smuggled
 byte is a single character for string parsing purposes...but they don't
 match any of our parsing constants.  They are all "any character" matches
 in the regexes and what have you.

This is slightly iffy, as you can't be sure that one byte represents
one character, but as long as you don't much care about that, it's not
going to be an issue. I'm fairly sure you're never going to find an
encoding in which one unknown byte represents two characters, but
there are cases where it takes more than one byte to make up a
character (or the bytes are just shift codes or something). Does that
ever throw off your regexes? It wouldn't be an issue to a .* between
two character markers, but if you ever say .{5} then it might match
incorrectly.
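
A small sketch of the counting problem, using a made-up five-character word whose UTF-8 form is six bytes. It is read once with the right codec and once as ASCII with surrogateescape:

```python
import re

raw = 'na\u00efve'.encode('utf-8')       # 5 characters, 6 bytes

decoded = raw.decode('utf-8')                       # 5 code points
smuggled = raw.decode('ascii', 'surrogateescape')   # 6 code points

# A counted wildcard is thrown off by the smuggled bytes...
counted_ok = re.fullmatch(r'.{5}', decoded) is not None
counted_smuggled = re.fullmatch(r'.{5}', smuggled) is not None

# ...but an uncounted wildcard between two known markers is not.
between = re.fullmatch(r'na.*ve', smuggled) is not None
```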

I think we're in agreement here, just using different words. :)

ChrisA


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Glenn Linderman

On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote:

It isn't, because the bytes/str problem was that given a str object
out of context you could not tell whether it was a binary blob or
text, and if text, you couldn't tell if it was external encoded text
or internal abstract text.

That is not true here because the representations of characters vs.
smuggled bytes in str are disjoint sets.


Actually, while it may be true that for the email headers case, all 
characters are characters, just the encoding is unknown, it is not 
necessarily true that they are in disjoint sets. Some bytes may decode 
into characters without needing to be smuggled... maybe not in 
text-protocols like email, but in the general case. So then some of the 
bytes that should be interpreted as binary data are not in a disjoint 
set from characters.
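
A sketch of that general case, with an arbitrary four-byte blob standing in for binary data. Whether anything gets smuggled depends entirely on the codec chosen, and a byte that happens to be decodable is never marked as binary:

```python
blob = bytes([0x00, 0x41, 0xe9, 0xff])   # arbitrary binary data, not text

# latin-1 maps every byte to a character: nothing is ever smuggled.
as_latin1 = blob.decode('latin-1')

# utf-8 with surrogateescape smuggles only what it cannot decode.
as_utf8 = blob.decode('utf-8', 'surrogateescape')

def smuggled_count(s):
    return sum('\udc80' <= ch <= '\udcff' for ch in s)
```

In both results the byte 0x41 comes out as the letter A, indistinguishable from real text.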


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Glenn Linderman writes:

  Some bytes may decode into characters without needing to be
  smuggled... maybe not in text-protocols like email, but in the
  general case. So then some of the bytes that should be interpreted
  as binary data are not in a disjoint set from characters.

True, but irrelevant.  The point is that whoever chose the codec is
responsible for getting it right, not only the right encoding, but for
the assumption that the input data was pure encoded text.  The rest of
the program can now assume that choice was made correctly, and process
text as text.  The program cannot be blamed for assuming that the
person who chose the codec knew what they were about, and so
characters can be *assumed* to be decoded from bytes representing
characters.

This was not true in Python 2, where it was common practice to
represent encoded text by itself internally, implicitly assuming that
only one encoding would be encountered in each invocation of the
program.  This was never true, and with the spread of the Internet and
then the WWW, it became a major issue.  And that's why we invented
Python 3, to let text be text without the encumbrance of always being
aware of encodings and converting when different encodings collide,
etc.


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
 On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com 
 wrote:

  Basically, we are pretending that each smuggled
  byte is a single character for string parsing purposes...but they don't
  match any of our parsing constants.  They are all "any character" matches
  in the regexes and what have you.
 
 This is slightly iffy, as you can't be sure that one byte represents
 one character, but as long as you don't much care about that, it's not
 going to be an issue.

This discussion would probably be much easier to follow, with fewer 
miscommunications, if there were some examples. Here is my example; 
perhaps someone can tell me if I'm understanding it correctly.

I want to send an email including the header line:

'Subject: “NOBODY expects the Spanish Inquisition!”'

Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I 
do the right thing and encode it as UTF-8:

b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

but my mail package, not being written in a language as awesome as 
Python, is just riddled with bugs, and somehow I end up with this 
corrupted byte-string instead:

b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

Note that the bytes of the first curly quote are in the wrong 
order, but the second is okay. (Like I said, it's just *riddled* with 
bugs.) That means that trying to decode those bytes will fail in Python:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 9: 
invalid start byte

but it's not up to Python's email package to throw those invalid bytes 
out or permanently replace them with something else. Also, we want to work 
with Unicode strings, not byte strings, so there has to be a way to 
smuggle those three bytes into Unicode, without ending up with either 
the replacement characters:

# using the 'replace' error handler
'Subject: ���NOBODY expects the Spanish Inquisition!”'

or incorrectly interpreting them as valid, but wrong, code points. (If 
we do the second, we end up with two control characters \x9c\x80 
followed by â.) We want to be able to round-trip back to the same 
bytes we received.

Am I right so far?

So the email package uses the surrogate-escape error handler and ends up 
with this Unicode string:

'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

which can be encoded back to the bytes we started with.
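
The whole example can be replayed in a few lines. This is just a sketch of the scenario described above, using plain bytes.decode() rather than the email package:

```python
raw = (b'Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!'
       b'\xe2\x80\x9d')

# Strict decoding fails at the first corrupted byte.
try:
    raw.decode('utf-8')
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# 'replace' decodes, but the original bytes are gone for good.
replaced = raw.decode('utf-8', 'replace')

# 'surrogateescape' smuggles the bad bytes and can give them back.
smuggled = raw.decode('utf-8', 'surrogateescape')
round_trip = smuggled.encode('utf-8', 'surrogateescape')
```

Only the last variant keeps enough information to reproduce the input exactly.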

Note that technically those three \u... code points are NOT classified 
as noncharacters. They are actually surrogate code points:

http://www.unicode.org/faq/private_use.html#nonchar4
http://www.unicode.org/glossary/#surrogate_code_point

and they're supposed to be reserved for UTF-16. I'm not sure of the 
implication of that.


 I'm fairly sure you're never going to find an
 encoding in which one unknown byte represents two characters,

There are encodings which use a shift mechanism, whereby a byte X 
represents one character by default, and a different character after the 
shift mechanism. But I don't think that matters, since we're not able to 
interpret those bytes. If we were, we'd just decode them to a text 
string and be done with it.


 but
 there are cases where it takes more than one byte to make up a
 character (or the bytes are just shift codes or something). 

Multi-byte encodings are very common. All the Unicode encodings are 
multi-byte. So are many East Asian encodings.


 Does that
 ever throw off your regexes? It wouldn't be an issue to a .* between
 two character markers, but if you ever say .{5} then it might match
 incorrectly.

I don't think the idea is to match on these smuggled bytes specifically. 
I think the idea is to match *around* them. In the example above, we 
might match everything from "Subject: " to the end of the line. So long 
as we never end up with a situation where the smuggled bytes are 
replaced by something else, or shuffled around into different positions, 
we should be fine.

David, is my understanding correct?



-- 
Steven


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Akira Li
Steven D'Aprano st...@pearwood.info writes:

 On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
 On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray rdmur...@bitdance.com 
 wrote:

  Basically, we are pretending that each smuggled
  byte is a single character for string parsing purposes...but they don't
  match any of our parsing constants.  They are all "any character" matches
  in the regexes and what have you.
 
 This is slightly iffy, as you can't be sure that one byte represents
 one character, but as long as you don't much care about that, it's not
 going to be an issue.

 This discussion would probably be much easier to follow, with fewer 
 miscommunications, if there were some examples. Here is my example; 
 perhaps someone can tell me if I'm understanding it correctly.

 I want to send an email including the header line:

 'Subject: “NOBODY expects the Spanish Inquisition!”'


  >>> from email.header import Header
  >>> h = Header('Subject: “NOBODY expects the Spanish Inquisition!”')
  >>> h.encode('utf-8')
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
  >>> h.encode()
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='
  >>> h.encode('ascii')
  '=?utf-8?q?Subject=3A_=E2=80=9CNOBODY_expects_the_Spanish_Inquisition!?=\n =?utf-8?q?=E2=80=9D?='


--
Akira



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-15 Thread Jim J. Jewett



On Sat Sep 13 00:16:30 CEST 2014, Jeff Allen wrote:

 1. Java does not really have a Unicode type, therefore not one that 
 validates. It has a String type that is a sequence of UTF-16 code units. 
 There are some String methods and Character methods that deal with code 
 points represented as int. I can put any 16-bit values I like in a String.

Including lone surrogates, and invalid characters in general?

 2. With proper accounting for indices, and as long as surrogates appear 
 in pairs, I believe operations like find or endswith give correct 
 answers about the unicode, when applied to the UTF-16. This is an 
 attractive implementation option, and mostly what we do.

So use it.  The fact that you're having to smuggle bytes already
guarantees that your data is either invalid or misinterpreted, and
bug-free isn't possible.

In terms of best-effort, it is reasonable to treat the smuggled bytes
as representing a character outside of your unicode repertoire -- so
it won't ever match entirely valid strings, except perhaps via a
wildcard.  And it should still work for
   .endswith(the same invalid characters).

 3. I'm fixing some bugs where we get it wrong beyond the BMP, and the 
 fix involves banning lone surrogates (completely).  At present you can't 
 type them in literals but you can sneak them in from Java.

So how will you ban them, and what will you do when some java class
sends you an invalid sequence anyhow?  That is exactly the use case
for these smuggled bytes... 

If you distinguish between a fully constructed PyString and a 
code-unit-sequence-that-could-be-made-into-a-PyString-later,
then you could always have your constructor return an InvalidPyString
subclass on the rare occasions when one is needed.

If you want to avoid invalid surrogates even then, just use the
replacement character and keep a separate list of original
characters that got replaced in this string -- a hassle, but no
worse than tracking indices for surrogates.
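
A rough Python sketch of that side-list idea, applied here to PEP 383 smuggled bytes rather than to arbitrary invalid Java surrogates. The function names are hypothetical and this is not how any existing implementation works.

```python
REPLACEMENT = '\ufffd'

def strip_smuggled(s):
    """Return (clean_text, originals).

    clean_text has U+FFFD wherever s had a smuggled byte; originals
    lists (index, byte) so the bytes can be put back later.
    """
    clean, originals = [], []
    for i, ch in enumerate(s):
        if '\udc80' <= ch <= '\udcff':
            originals.append((i, ord(ch) - 0xdc00))
            clean.append(REPLACEMENT)
        else:
            clean.append(ch)
    return ''.join(clean), originals

def restore(clean, originals):
    """Inverse of strip_smuggled, valid only if clean was not edited."""
    chars = list(clean)
    for i, byte in originals:
        chars[i] = chr(0xdc00 + byte)
    return ''.join(chars)

s = b'ab\xffcd'.decode('ascii', 'surrogateescape')
clean, originals = strip_smuggled(s)
```

The hassle mentioned above shows up as soon as the clean text is sliced or concatenated, because every index in the side list then has to be adjusted by hand.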

 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it 
 would have to do it the same way as CPython, as it is visible. It's not 
 impossible (I think), but is messy. Some are strongly against.

If you allow direct write access to the underlying charsequence
(as CPython does to C extensions), then you can't really ban
invalid sequences.  If callers have to go through an API -- even
something as minimal as  getBytes or getChars -- then you can use
whatever internal representation you prefer.  Hopefully, the vast
majority of strings won't actually have smuggled bytes.


-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-15 Thread Stephen J. Turnbull
Jim J. Jewett writes:

  In terms of best-effort, it is reasonable to treat the smuggled bytes
  as representing a character outside of your unicode repertoire

I have to disagree.  If you ever end up passing them to something that
validates or tries to reencode them without surrogateescape, BOOM!
These things are the text equivalent of IEEE NaNs.  If all you know
(as in the stdlib) is that you have generic text, the only fairly
safe things to do with them are (1) delete them, (2) substitute an
appropriate replacement character for them, (3) pass the text
containing them verbatim to other code, and (4) reencode them using
the same codec they were read with.
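
The "BOOM" and three of the four safe operations can be sketched as follows (the fourth, passing the text along verbatim, needs no code). The helper names are hypothetical.

```python
import re

# Matches exactly the code points that surrogateescape can produce.
SMUGGLED = re.compile('[\udc80-\udcff]')

def delete_smuggled(s):
    return SMUGGLED.sub('', s)

def replace_smuggled(s, replacement='\ufffd'):
    return SMUGGLED.sub(replacement, s)

def reencode(s, encoding):
    # Must be the same codec the text was read with.
    return s.encode(encoding, 'surrogateescape')

s = b'caf\xe9'.decode('utf-8', 'surrogateescape')

# Re-encoding without surrogateescape is where things blow up.
try:
    s.encode('utf-8')
    boom = False
except UnicodeEncodeError:
    boom = True
```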

  -- so it won't ever match entirely valid strings, except perhaps
  via a wildcard.  And it should still work for .endswith(the same
  invalid characters).

Incorrect, I'm pretty sure, unless you know that both texts containing
the same invalid code points were read with the same codec.  Eg,
consider two filenames encoded in ISO Cyrillic and ISO Hebrew, read
with (encoding='ascii', errors='surrogateescape').
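
A concrete instance of that trap, as a sketch: the byte 0xE0 is a Cyrillic letter in ISO 8859-5 and a Hebrew letter in ISO 8859-8, yet both smuggle to the same code point when read as ASCII.

```python
cyrillic = '\u0440'.encode('iso8859_5')   # CYRILLIC SMALL LETTER ER
hebrew = '\u05d0'.encode('iso8859_8')     # HEBREW LETTER ALEF

# Read both "filenames" with encoding='ascii', errors='surrogateescape'.
a = cyrillic.decode('ascii', 'surrogateescape')
b = hebrew.decode('ascii', 'surrogateescape')

# The smuggled forms compare equal although the intended characters differ.
looks_equal = a.endswith(b)
really_equal = cyrillic.decode('iso8859_5') == hebrew.decode('iso8859_8')
```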

Apps that know the semantics of the text may DWIM/DTRT if they want
to, but FWIW-IMHO-YMMV-and-any-other-4-letter-caveat-acronyms-that-
may-apply Python and the stdlib shouldn't try to guess.

Guessing may be unavoidable, of course.



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-15 Thread Chris Angelico
On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull step...@xemacs.org wrote:
 Jim J. Jewett writes:

  In terms of best-effort, it is reasonable to treat the smuggled bytes
  as representing a character outside of your unicode repertoire

 I have to disagree. If you ever end up passing them to something that
 validates or tries to reencode them without surrogateescape, BOOM!
 These things are the text equivalent of IEEE NaNs.  If all you know
 (as in the stdlib) is that you have generic text, the only fairly
 safe things to do with them are (1) delete them, (2) substitute an
 appropriate replacement character for them, (3) pass the text
 containing them verbatim to other code, and (4) reencode them using
 the same codec they were read with.

Don't forget, these are *errors*. These are bytes that cannot be
correctly decoded. That's not something that has any meaning
whatsoever in text; so by definition, the only things you can do are
the four you list there (as long as "codec" means both the choice of
encoding and the use of the surrogateescape flag). It's like dealing
with control characters when you need to print something visually,
except that they have an official solution [1] and surrogateescape is
unofficial. They're not real text, so you have to deal with them
somehow.

The bytes might each represent one character. Several of them together
might represent a single character. Or maybe they don't mean anything
at all, and they're just part of a chunked data format... like I was
finding in the .cwk files that I was reading this weekend (it's mostly
MacRoman encoding, but the text is divided into chunks separated by
\0\0 and two more bytes - turns out the bytes are chunk lengths, so
they don't mean any sort of characters at all). You can't know.

ChrisA

[1] http://www.unicode.org/charts/PDF/U2400.pdf


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread Nick Coghlan
On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote:
 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it
would have to do it the same way as CPython, as it is visible. It's not
impossible (I think), but is messy. Some are strongly against.

It may be worth trying *without* it (i.e. treat "surrogateescape" as
equivalent to "strict" initially), and seeing how you go. The main purpose
of surrogateescape in CPython 3 is to recreate the "arbitrary 8-bit data
round trips work on POSIX" aspect of CPython 2, which doesn't apply in
exactly the same way on Jython.

Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2,
"surrogateescape is equivalent to strict" seems like a relatively small
discrepancy in behaviour.

Cheers,
Nick.


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread R. David Murray
On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan ncogh...@gmail.com wrote:
 On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote:
  4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it
 would have to do it the same way as CPython, as it is visible. It's not
 impossible (I think), but is messy. Some are strongly against.
 
 It may be worth trying *without* it (i.e. treat surrogateescape as
 equivalent to strict initially), and seeing how you go. The main purpose
 of surrogateescape in CPython 3 is to recreate the arbitrary 8-bit data
 round trips work on POSIX aspect of CPython 2, which doesn't apply in
 exactly the same way on Jython.
 
 Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2,
 surrogateescape is equivalent to strict seems like a relatively small
 discrepancy in behaviour.

That would totally break the email package.

It would of course be possible to rewrite email to not use surrogate
escape, but it is a seriously non-trivial undertaking.

--David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread Tim Lesher
On Sat, Sep 13, 2014, 09:33 R. David Murray rdmur...@bitdance.com wrote:

 On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan ncogh...@gmail.com
 wrote:
  On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote:
   4. I think (with Antoine) if Jython supported PEP-383 byte smuggling,
 it
  would have to do it the same way as CPython, as it is visible. It's not
  impossible (I think), but is messy. Some are strongly against.
 
  It may be worth trying *without* it (i.e. treat surrogateescape as
  equivalent to strict initially), and seeing how you go. The main
 purpose
  of surrogateescape in CPython 3 is to recreate the arbitrary 8-bit data
  round trips work on POSIX aspect of CPython 2, which doesn't apply in
  exactly the same way on Jython.
 
  Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2,
  surrogateescape is equivalent to strict seems like a relatively small
  discrepancy in behaviour.

 That would totally break the email package.

 It would of course be possible to rewrite email to not use surrogate
 escape, but it is a seriously non-trivial undertaking.

 --David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread Nick Coghlan
On 14 Sep 2014 01:33, R. David Murray rdmur...@bitdance.com wrote:

 On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan ncogh...@gmail.com
wrote:
  On 13 Sep 2014 10:18, Jeff Allen ja...@farowl.co.uk wrote:
   4. I think (with Antoine) if Jython supported PEP-383 byte smuggling,
it
  would have to do it the same way as CPython, as it is visible. It's not
  impossible (I think), but is messy. Some are strongly against.
 
  It may be worth trying *without* it (i.e. treat surrogateescape as
  equivalent to strict initially), and seeing how you go. The main
purpose
  of surrogateescape in CPython 3 is to recreate the arbitrary 8-bit data
  round trips work on POSIX aspect of CPython 2, which doesn't apply in
  exactly the same way on Jython.
 
  Compared to the 8-bit vs 16-bit str discrepancy that exists in Python 2,
  surrogateescape is equivalent to strict seems like a relatively small
  discrepancy in behaviour.

 That would totally break the email package.

 It would of course be possible to rewrite email to not use surrogate
 escape, but it is a seriously non-trivial undertaking.

That does indeed make for a compelling use case :)

Cheers,
Nick.


 --David


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Jeff Allen


On 12/09/2014 04:28, Stephen J. Turnbull wrote:

Jeff Allen writes:

   A welcome article. One correction should be made, I believe: the area of
   code point space used for the smuggling of bytes under PEP-383 is not a
   Unicode Private Use Area, but a portion of the trailing surrogate
   range.

Nice catch.  Note that the surrogate range was originally part of the
Private Use Area, but it was carved out with the adoption of UTF-16 in
about 1993.  In practice, I doubt that there are any current
implementations claiming compatibility with Unicode 1.0 (IIRC, UTF-16
was made mandatory in Unicode 1.1).

That's a helpful bit of history that explains the uncharacteristic 
inaccuracy. Most I can do to keep the current position clear in my head.



I've always thought that the right way to handle the private use
area for platforms like Python and Emacs, which may need to use it
for their own purposes (such as undecodable bytes) but want to
respect its use by applications, is to create an auxiliary table
mapping the private use area to objects describing the characters
represented by the private use code points.  These objects would have
attributes such as external representation for text I/O, glyph (for
GUI display), repr (for TTY display), various Unicode properties, etc.

Simply having a block for private use seems to create an unmanaged 
space for conflict, reminiscent of the other 128 characters in 
bilingual programming. I wondered if the way to respect use by 
applications might be to make it private to a particular sub-class of 
str, idly however.


Jeff Allen



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Antoine Pitrou
On Fri, 12 Sep 2014 07:54:56 +0100
Jeff Allen ja...@farowl.co.uk wrote:
 Simply having a block for private use seems to create an unmanaged 
 space for conflict, reminiscent of the other 128 characters in 
 bilingual programming. I wondered if the way to respect use by 
 applications might be to make it private to a particular sub-class of 
 str, idly however.

It's not private from Python's point of view, it's actually specified
in a PEP. So all Python 3 code has to follow the rule, and there's no
conflict internally.

The characters shouldn't leak out to other applications, unless the
user's code does its I/O very badly :-)

Regards

Antoine.




Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Jim J. Jewett



On September 11, 2014, Jeff Allen wrote:

 ... the area of code point
 space used for the smuggling of bytes under PEP-383 is not a 
 Unicode Private Use Area, but a portion of the trailing surrogate 
 range. This is a code violation, which I imagine is why 
 surrogateescape is an error handler, not a codec.

True, but I believe that is a CPython implementation detail.

Other implementations (including Jython) should implement the
surrogateescape API, but I don't think it is important to use the
same internal representation for the invalid bytes.

(Well, unless you want to communicate with external tools (GUIs?)
that are trying to use that particular internal encoding directly
(effectively bytes rather than strings) when communicating with
Python.)

 lone surrogates preclude a naive use of the platform string library

Invalid input often causes problems.  Are you saying that there are
situations where the platform string library could easily handle
invalid characters in general, but has a problem with the specific
case of lone surrogates?

-jJ

--

If there are still threading problems with my replies, please
email me with details, so that I can try to resolve them.  -jJ



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Stephen J. Turnbull
Jeff Allen writes:

  Simply having a block for private use seems to create an unmanaged 
  space for conflict,

No.  The uncharted range of human language (including recently-invented
nonsense like emoticons and the annual "design a character" contest
run by a newspaper in Taipei, with the grand prize being "your
character gets added to the national standard" IIRC, but maybe it's
just that newspaper's collection of private space characters) already
contains those conflicts.  Believe me, "private use space, manage it
yourself" was the best they could do.

I've been working with the bureaucratic insanity of the Japanese
national standard -- it took almost 3 decades before every Japanese
citizen could store their names in a computer using
government-approved codes -- and the chaos of the Taiwanese national standard --
which contains hordes of characters with one known use and no known
meaning, many of them duplicates -- for twenty years now.  Neither
approach works as well as Unicode's, despite its design-by-committee
flaws overlaid with national animosities that can flare into
linguicidal vetoes and code-space-stuffing logrolling.

  reminiscent of the other 128 characters in bilingual
  programming. I wondered if the way to respect use by applications
  might be to make it private to a particular sub-class of str, idly
  however.

If I understand your suggestion, that's precisely the intent of PEP
383, to make undecodable bytes in a coded character stream private.
But they need to be in the stream one way or another.  So PEP 383
chose to use a non-Unicode encoding (based on the lone surrogate
device invented by Markus Kuhn for utf-8b) to deal with that, and that
does effectively make those elements private to Python (but of course
not in the Unicode sense, as they're not even characters in Unicode).
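
A small check of the encoding described above, assuming CPython 3 (a sketch
of observable behaviour, not of the interpreter's internals): each
undecodable byte 0x80..0xFF lands on U+DC80..U+DCFF, inside the trailing
surrogate range, and a plain UTF-8 encode refuses the result.

```python
# Decode every non-ASCII byte value with the PEP 383 error handler.
escaped = bytes(range(0x80, 0x100)).decode("ascii", errors="surrogateescape")
code_points = [ord(ch) for ch in escaped]

# Each byte b is represented by the lone surrogate U+DC00 + b.
expected = list(range(0xDC80, 0xDD00))

# Lone surrogates are not valid Unicode text, so strict UTF-8 rejects them.
try:
    escaped.encode("utf-8")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

The rejection is what keeps these elements from leaking into output unless
the writer explicitly asks for surrogateescape again.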

But I gather the native Unicode type in Java doesn't allow you to
use that dodge because it checks for malformed Unicode internally (ie,
at a level not controllable by Jython).  So you have to embed such
stream elements in the space of Unicode characters.  You have the
option of the private space or unallocated (reserved) space.  The
latter seems like asking for trouble, and the only way to avoid it
would be to be prepared to move that data around in case of collision.
But that's precisely what I'm suggesting doing in private space.  Same
issue, either way.  Private space with a local registry seems saner.
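
A rough sketch of the private-space alternative, using CPython's
codecs.register_error. The handler name "puaescape", the helper
pua_unescape and the base code point U+F700 are all hypothetical choices
made for illustration; nothing here is what Jython or CPython actually does.

```python
import codecs

PUA_BASE = 0xF700  # assumption: an arbitrary slice of the BMP private use area


def pua_escape(exc):
    """Decode error handler: map each undecodable byte b to U+F700 + b."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("puaescape", pua_escape)


def pua_unescape(text, encoding="utf-8"):
    """Reverse step: turn escaped code points back into the raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode(encoding)
    return bytes(out)


raw = b"caf\xe9 au lait"
text = raw.decode("utf-8", errors="puaescape")
restored = pua_unescape(text)

# The collision: a genuine U+F7E9 in the input is mistaken for byte 0xE9.
collided = pua_unescape("\uf7e9")
```

The last line shows why some registry or relocation scheme would be needed:
without one, an application's own use of U+F7E9 cannot be told apart from a
smuggled byte.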





Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Jeff Allen

Jim, Stephen:

It seems like we're off topic here, but to answer all as briefly as 
possible:


1. Java does not really have a Unicode type, therefore not one that 
validates. It has a String type that is a sequence of UTF-16 code units. 
There are some String methods and Character methods that deal with code 
points represented as int. I can put any 16-bit values I like in a String.
2. With proper accounting for indices, and as long as surrogates appear 
in pairs, I believe operations like find or endswith give correct 
answers about the unicode, when applied to the UTF-16. This is an 
attractive implementation option, and mostly what we do.
3. I'm fixing some bugs where we get it wrong beyond the BMP, and the 
fix involves banning lone surrogates (completely). At present you can't 
type them in literals but you can sneak them in from Java.
4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it 
would have to do it the same way as CPython, as it is visible. It's not 
impossible (I think), but is messy. Some are strongly against.
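
The index accounting in points 1 to 3 can be sketched as follows. This is
written in Python rather than Java, and the helper names utf16_code_units
and has_lone_surrogate are made up for illustration; it only models a
string as a list of 16-bit code units.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    # surrogatepass lets lone surrogates through instead of raising.
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def has_lone_surrogate(units):
    """True if any surrogate code unit is not part of a lead/trail pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:            # leading surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                          # well-formed pair
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:            # trailing surrogate alone
            return True
        i += 1
    return False


paired = utf16_code_units("a\U0001F600b")   # one character beyond the BMP
smuggled = utf16_code_units("\udce9")       # a PEP 383 escaped byte
```

In the first list the character at code point index 2 sits at code unit
index 3, which is the accounting point 2 refers to; the second list is the
kind of input that a ban on lone surrogates would reject.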


Jeff Allen

On 12/09/2014 16:37, Jim J. Jewett wrote:



On September 11, 2014, Jeff Allen wrote:


... surrogateescape is an error handler, not a codec.

True, but I believe that is a CPython implementation detail.

Other implementations (including Jython) should implement the
surrogateescape API, but I don't think it is important to use the
same internal representation for the invalid bytes.


lone surrogates preclude a naive use of the platform string library

Invalid input often causes problems.  Are you saying that there are
situations where the platform string library could easily handle
invalid characters in general, but has a problem with the specific
case of lone surrogates?





Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-11 Thread Jeff Allen
A welcome article. One correction should be made, I believe: the area of 
code point space used for the smuggling of bytes under PEP-383 is not a 
Unicode Private Use Area, but a portion of the trailing surrogate 
range. This is a code violation, which I imagine is why 
surrogateescape is an error handler, not a codec.


http://www.unicode.org/faq/private_use.html

I believe the private use area was considered and rejected for PEP-383. 
In an implementation of the type unicode based on UTF-16 (Jython), lone 
surrogates preclude a naive use of the platform string library. This is 
on my mind at the moment as I'm working several bugs in Jython's unicode 
type, and can see why it has been too difficult.


Jeff

On 10/09/2014 08:17, Nick Coghlan wrote:

Since it may come in handy when discussing "Why was Python 3
necessary?" with folks, I wanted to point out that my article on the
transition to multilingual programming has now been reposted on the
Red Hat developer blog:
http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/

I wouldn't normally bring the Red Hat brand into an upstream
discussion like that, but this myth that Python 3 is killing the
language, and that Python 2 could have continued as a viable
development platform indefinitely if only Guido and the core
development team hadn't decided to go ahead and create Python 3, is
just plain wrong, and it really needs to die.

I'm hoping that borrowing a bit of Red Hat's enterprise credibility
will finally get people to understand that we really do have some idea
what we're doing, which is why most of our redistributors and many of
our key users are helping to push the migration forward, while we also
continue to support existing Python 2 users :)

Cheers,
Nick.





Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-11 Thread Stephen J. Turnbull
Jeff Allen writes:

  A welcome article. One correction should be made, I believe: the area of 
  code point space used for the smuggling of bytes under PEP-383 is not a 
  Unicode Private Use Area, but a portion of the trailing surrogate 
  range.

Nice catch.  Note that the surrogate range was originally part of the
Private Use Area, but it was carved out with the adoption of UTF-16 in
about 1993.  In practice, I doubt that there are any current
implementations claiming compatibility with Unicode 1.0 (IIRC, UTF-16
was made mandatory in Unicode 1.1).

  This is a code violation, which I imagine is why 
  surrogateescape is an error handler, not a codec.

Yes.

  I believe the private use area was considered and rejected for PEP-383. 
  In an implementation of the type unicode based on UTF-16 (Jython), lone 
  surrogates preclude a naive use of the platform string library. This is 
  on my mind at the moment as I'm working several bugs in Jython's unicode 
  type, and can see why it has been too difficult.

I've always thought that the right way to handle the private use
area for platforms like Python and Emacs, which may need to use it
for their own purposes (such as undecodable bytes) but want to
respect its use by applications, is to create an auxiliary table
mapping the private use area to objects describing the characters
represented by the private use code points.  These objects would have
attributes such as external representation for text I/O, glyph (for
GUI display), repr (for TTY display), various Unicode properties, etc.
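
One way such an auxiliary table might look, as a sketch only: the class
names PrivateChar and PrivateUseRegistry, the attribute names and the
allocation policy are all hypothetical, chosen to mirror the attributes
listed above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PrivateChar:
    """Hypothetical descriptor for one private use code point."""
    external: bytes      # external representation for text I/O
    glyph: str           # name of a glyph for GUI display
    tty_repr: str        # repr for TTY display
    properties: dict = field(default_factory=dict)  # Unicode-style properties


class PrivateUseRegistry:
    """Hands out BMP Private Use Area code points (U+E000..U+F8FF)."""
    FIRST, LAST = 0xE000, 0xF8FF

    def __init__(self):
        self._table = {}
        self._next = self.FIRST

    def allocate(self, desc):
        """Reserve the next free code point for desc and return it as a str."""
        if self._next > self.LAST:
            raise RuntimeError("private use area exhausted")
        code_point = self._next
        self._next += 1
        self._table[code_point] = desc
        return chr(code_point)

    def describe(self, ch):
        """Return the descriptor for ch, or None if it is not registered."""
        return self._table.get(ord(ch))


registry = PrivateUseRegistry()
stray = registry.allocate(
    PrivateChar(b"\xe9", "hex-box-E9", "\\xe9", {"category": "Co"}))
```

An application wanting its own private characters would go through the same
registry, which is where conflicts with the platform's uses could be
detected.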



[Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-10 Thread Nick Coghlan
Since it may come in handy when discussing "Why was Python 3
necessary?" with folks, I wanted to point out that my article on the
transition to multilingual programming has now been reposted on the
Red Hat developer blog:
http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/

I wouldn't normally bring the Red Hat brand into an upstream
discussion like that, but this myth that Python 3 is killing the
language, and that Python 2 could have continued as a viable
development platform indefinitely if only Guido and the core
development team hadn't decided to go ahead and create Python 3, is
just plain wrong, and it really needs to die.

I'm hoping that borrowing a bit of Red Hat's enterprise credibility
will finally get people to understand that we really do have some idea
what we're doing, which is why most of our redistributors and many of
our key users are helping to push the migration forward, while we also
continue to support existing Python 2 users :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-10 Thread Steven D'Aprano
On Wed, Sep 10, 2014 at 05:17:57PM +1000, Nick Coghlan wrote:
 Since it may come in handy when discussing "Why was Python 3
 necessary?" with folks, I wanted to point out that my article on the
 transition to multilingual programming has now been reposted on the
 Red Hat developer blog:
 http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-programming-python/

That's awesome! Thank you Nick.

-- 
Steven