Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread R. David Murray
On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
> 
> > > Basically, we are pretending that each smuggled
> > > byte is a single character for string parsing purposes...but they don't
> > > match any of our parsing constants.  They are all "any character" matches
> > > in the regexes and what have you.
> > 
> > This is slightly iffy, as you can't be sure that one byte represents
> > one character, but as long as you don't much care about that, it's not
> > going to be an issue.
> 
> This discussion would probably be a lot easier to follow, with fewer 
> miscommunications, if there were some examples. Here is my example; 
> perhaps someone can tell me if I'm understanding it correctly.
> 
> I want to send an email including the header line:
> 
> 'Subject: “NOBODY expects the Spanish Inquisition!”'
> 
> Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I 
> do the right thing and encode it as UTF-8:
> 
> b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

That won't work until email supports RFC 6532.  Until then, you can only
use ASCII and encoded words successfully.  So a program that emits the
raw curly quotes in a header is already buggy.
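
For anyone unsure what "encoded words" looks like in practice, here is a
minimal sketch using the standard library's email.header module. The
variable names are mine, and I deliberately don't show the encoded form,
since Header picks the encoding (base64 or quoted-printable) and any line
folding itself.

```python
from email.header import Header, decode_header, make_header

subject = "\u201cNOBODY expects the Spanish Inquisition!\u201d"

# RFC 2047 encoded words carry non-ASCII text using only ASCII bytes.
encoded = Header(subject, charset="utf-8").encode()

# The encoded header is pure ASCII, so this does not raise.
wire_bytes = encoded.encode("ascii")

# Decoding the encoded words recovers the original text.
decoded = str(make_header(decode_header(encoded)))
```

The point is only that the curly quotes never appear as raw bytes on the
wire; they travel inside an ASCII-only wrapper.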

> but it's not up to Python's email package to throw those invalid bytes 
> out or permanently replace them with something else. Also, we want to work 
> with Unicode strings, not byte strings, so there has to be a way to 
> smuggle those three bytes into Unicode, without ending up with either 
> the replacement bytes:
> 
> # using the 'replace' error handler
> 'Subject: ���NOBODY expects the Spanish Inquisition!”'

What you'll get if you request a text copy of that header is

  'Subject: ���NOBODY expects the Spanish Inquisition!���'
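
A small sketch of why both quotes end up as replacement characters,
assuming the header bytes are decoded as ASCII with the 'replace' error
handler (the variable names are only for illustration):

```python
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Every non-ASCII byte becomes U+FFFD REPLACEMENT CHARACTER.
text = raw.decode('ascii', errors='replace')
replaced = text.count('\ufffd')

# The original bytes are gone: re-encoding cannot bring them back.
lossy = text.encode('ascii', errors='replace')
```

Three bytes per curly quote, two quotes, so six replacement characters,
and the conversion is one-way.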

> Am I right so far?
> 
> So the email package uses the surrogate-escape error handler and ends up 
> with this Unicode string:
> 
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'

Except that it encodes the closing quote, too :)

> which can be encoded back to the bytes we started with.

Right.  If you serialize the message as bytes, the bytes are recovered
and output when that header is output.
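
Here is a minimal sketch of that round trip, assuming the bytes are
decoded as ASCII with the surrogateescape handler from PEP 383 (the
variable names are mine). Note that the escapes come out in byte order,
so the opening quote is \udce2\udc80\udc9c.

```python
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# Each undecodable byte 0xNN is smuggled in as the lone surrogate U+DCNN.
smuggled = raw.decode('ascii', errors='surrogateescape')

opening = smuggled[9:12]    # the three escapes right after 'Subject: '
closing = smuggled[-3:]     # the closing quote is escaped as well

# Encoding with the same handler restores the original bytes exactly.
recovered = smuggled.encode('ascii', errors='surrogateescape')
```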

Now, once we support RFC 6532, you will be exactly right, as we will
then have the option of handling utf-8 encoded headers, and we will do
that using the utf-8 codec to ingest headers, and the surrogateescape
error handler to handle exactly the kind of bad data you postulate.
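
As a rough illustration of that future behaviour (this uses the plain
codec machinery, not the email package API, and the sample headers are
made up): with the utf-8 codec, well-formed input decodes to real
characters, and only genuinely bad bytes get smuggled.

```python
good = b'Subject: \xe2\x80\x9cNOBODY\xe2\x80\x9d'
bad = b'Subject: \xe2\x80\x9cNOBODY\xff'   # 0xff can never occur in UTF-8

# Valid UTF-8 becomes ordinary text, curly quotes included.
good_text = good.decode('utf-8', errors='surrogateescape')

# The stray byte is escaped as U+DCFF; everything else decodes normally.
bad_text = bad.decode('utf-8', errors='surrogateescape')

# Serializing with the same handler gives back the original bytes.
bad_roundtrip = bad_text.encode('utf-8', errors='surrogateescape')
```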

--David
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread R. David Murray
Sorry for the mojibake.  I've not yet gotten around to actually using
the email package to write a smarter replacement for nmh, which is what
I use for email, and I always forget that I need to manually tell nmh
when there is non-ASCII in the message...



[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)

2014-09-17 Thread Jeff Allen

This feels like a jython-dev discussion. But anyway ...

On 17/09/2014 00:57, Stephen J. Turnbull wrote:

> The CPython representation uses trailing surrogates only[1], so it's
> never possible to interpret them as anything but non-characters -- as
> soon as you encounter them you know that it's a lone surrogate.
> Surely you can do the same.
>
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation.  Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewhat vulnerable to bugs due to treating them as
> characters.  But the same is true for CPython, AFAIK.

They don't check. I agree that since only the trailing surrogate code 
points are allowed, you can tell that you have one, even in the UTF-16 
form. The problem is that, if strings containing lone trailing 
surrogates are allowed, then:


u'\udc83' in u'abc\U00010083xyz'
u'abc\U00010083xyz'.endswith(u'\udc83xyz')

are both True, if implemented in the obvious way on the UTF-16 
representation. And this should not be so in Jython, which claims to be 
a wide build. (I can't actually type the second one, but I can get the 
same effect in Jython 2.7b3 via a java.lang.StringBuilder.) I believe 
that the usual string operations work correctly on the UTF-16 version of 
the string, as long as indexes are adjusted correctly.
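
The effect can be simulated on CPython 3.4 or later by comparing a
code-point search with a naive search over UTF-16 code units. This is
only a sketch: utf16_units is a hypothetical helper of my own, not
anything from Jython, and it relies on the 'surrogatepass' handler to
encode a lone surrogate.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode('utf-16-le', errors='surrogatepass')
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]

haystack = 'abc\U00010083xyz'

# Code-point semantics (CPython 3.3+): the lone surrogate is not there.
in_by_code_point = '\udc83' in haystack
ends_by_code_point = haystack.endswith('\udc83xyz')

# Naive code-unit semantics: U+10083 is stored as D800 DC83, so the
# trailing half matches a lone U+DC83.
units = utf16_units(haystack)
in_by_code_unit = utf16_units('\udc83')[0] in units
tail = utf16_units('\udc83xyz')
ends_by_code_unit = units[-len(tail):] == tail
```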


If we think it is ok that code using such methods gives the wrong answer 
when fed strings containing smuggled bytes, then isolated (trailing) 
surrogates could be allowed. It's the user's fault for calling the 
method on that data.  But I think it kinder that our implementation 
defend users from these wrong answers. In the latest state of Jython, we 
do this by rigorously preventing the construction of a PyUnicode 
containing a lone surrogate, so we can just use UTF-16 operations 
without further checks.


I'm not sure that rigour will be universally welcomed, and clearly it 
precludes PEP-383 byte smuggling.
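
To make the trade-off concrete, here is a sketch of that kind of check
written in Python for CPython 3.3+, where every element of a str is a
code point and so any surrogate found in it is necessarily a lone one.
The function name is hypothetical; this is not Jython's actual code.

```python
def reject_lone_surrogates(s):
    """Return s unchanged, or raise ValueError if it holds a surrogate."""
    for index, ch in enumerate(s):
        if 0xD800 <= ord(ch) <= 0xDFFF:
            raise ValueError(
                'lone surrogate U+%04X at index %d' % (ord(ch), index))
    return s

# Astral characters are fine: they are single code points here.
accepted = reject_lone_surrogates('abc\U00010083xyz')

# A PEP 383 smuggled byte is exactly what such a check refuses.
smuggled = b'\x83'.decode('ascii', errors='surrogateescape')
try:
    reject_lone_surrogates(smuggled)
    smuggled_rejected = False
except ValueError:
    smuggled_rejected = True
```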


Jeff


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:

> Guido's mantra is something like "Python's str doesn't contain
> characters or even code points[1], it contains code units."

But is that true? If it were true, I would expect to be able to make 
Python text strings containing code units that aren't code points, e.g. 
something like "\U12345678" or chr(0x12345678) should work, but neither 
does. As far as I can tell, there is no way to build a string containing 
items which aren't code points.
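
A quick check of that claim on Python 3, as a sketch: values above
U+10FFFF are rejected both by chr() and by the \U escape in a literal.
The literal is wrapped in eval() only so that the failure can be caught
at run time.

```python
# chr() accepts every code point up to 0x10FFFF and nothing beyond.
top = chr(0x10FFFF)
try:
    chr(0x110000)
    chr_outcome = 'accepted'
except ValueError:
    chr_outcome = 'ValueError'

# An out-of-range \U escape is rejected when the literal is compiled.
try:
    eval(r'"\U12345678"')
    literal_outcome = 'accepted'
except SyntaxError:
    literal_outcome = 'SyntaxError'
```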

I don't think it is useful to say that strings *contain* code units, 
more that they *are made up from* code units. Code units are the 
implementation: 16-bit code units in narrow builds, 32-bit code units 
in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and 
beyond. (I don't know of any Python implementation which uses UTF-8 
internally, but if there was one, it would use 8-bit code units.)

It isn't very useful to say that in Python 3.3 the string "A" *contains*
the 8-bit code unit 0x41. That's conflating two different levels of 
explanation (the high-level interface and the underlying implementation) 
and potentially leads to user confusion like

# 8-bit code units are bytes, right?
assert b'\x41' in "A"

which is Not Even Wrong.
http://rationalwiki.org/wiki/Not_even_wrong
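
For the record, here is what Python 3 actually does with that
expression, as a small sketch: it is a type error rather than a false
assertion, and the relationship between "A" and 0x41 has to be spelled
out through ord() or an explicit encode.

```python
try:
    b'\x41' in "A"
    outcome = 'evaluated'
except TypeError:
    # str containment only accepts str operands.
    outcome = 'TypeError'

# The meaningful questions are asked at one level at a time.
code_point = ord("A")            # the code point, an int
encoded = "A".encode('ascii')    # the bytes, via an explicit encoding
```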

I think it is correct to say that Python strings are sequences of 
Unicode code points U+0000 through U+10FFFF. There are no other 
restrictions, e.g. strings can contain surrogates, noncharacters, or 
nonsensical combinations of code points such as a U+0300 COMBINING GRAVE 
ACCENT combined with U+000A (newline).
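
A sketch showing that such a string really can be built in Python 3; the
only place it gets into trouble is at the encoding boundary, where the
surrogate is refused by the strict utf-8 codec.

```python
import unicodedata

# A surrogate, a noncharacter, and a combining accent after a newline.
odd = '\ud800' + '\uffff' + '\n' + '\u0300'

length = len(odd)
categories = [unicodedata.category(ch) for ch in odd]

try:
    odd.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```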


> Implying
> that dealing with characters (or the grapheme globs that occasionally
> raise their ugly heads here) is an issue for higher-level facilities
> than str to deal with.

Agreed that Python doesn't offer a string type based on graphemes, and 
that such a facility belongs as a high-level library, not a built-in 
type.

Also agreed that talking about characters is sloppy. Nevertheless, for 
English speakers at least, "code point = character" isn't too awful a 
first approximation.


> The point being that
> 
>  > Basically, we are pretending that each smuggled byte is a single
>  > character
> 
> is something of a misstatement (good enough for present purpose of
> discussing email, but not good enough for the general case of
> understanding how this is supposed to work when porting the construct
> to other Python implementations), while
> 
>  > for string parsing purposes...but they don't match any of our
>  > parsing constants.
> 
> is precisely Pythonically correct.  You might want to add "because all
> parsing constants contain only valid characters by construction."

I don't understand what you are trying to say here.


>  > [*] I worried a lot that this was re-introducing the bytes/string
>  > problem from python2.
> 
> It isn't, because the bytes/str problem was that given a str object
> out of context you could not tell whether it was a binary blob or
> text, and if text, you couldn't tell if it was external encoded text
> or internal abstract text.
> 
> That is not true here because the representations of characters vs.
> smuggled bytes in str are disjoint sets.

Nor am I sure what you are trying to say here either.


> Footnotes: 
> [1]  In Unicode terminology, a code unit is the smallest computer
> object that can represent a character (this is uniquely and sanely
> defined for all real Unicode transformation formats aka UTFs).  A code
> point is an integer 0 - (17*256*256-1) that can represent a character,
> but many code points such as surrogates and 0xFFFF are defined to be
> non-characters.

Actually not quite. "Noncharacter" is concretely defined in Unicode, and 
there are only 66 of them, many fewer than the surrogate code points 
alone. Surrogates are reserved, not noncharacters.

http://www.unicode.org/glossary/#surrogate_code_point
http://www.unicode.org/faq/private_use.html#nonchar1
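
The counts are easy to verify. The sketch below enumerates the
noncharacters from their definition (U+FDD0..U+FDEF plus the last two
code points of each of the 17 planes) and compares them with the
surrogate range; unicodedata confirms the two belong to different
general categories.

```python
import unicodedata

noncharacters = [
    cp for cp in range(0x110000)
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
]
surrogates = range(0xD800, 0xE000)

noncharacter_count = len(noncharacters)
surrogate_count = len(surrogates)

# Surrogates are category Cs; noncharacters are reported as Cn.
surrogate_category = unicodedata.category(chr(0xD800))
noncharacter_category = unicodedata.category(chr(0xFFFF))
```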

It is wrong to talk about "surrogate characters", but perhaps you mean 
to say that surrogates (by which I understand you to mean surrogate code 
points) are "not human-meaningful characters", which is not the same 
thing as a Unicode noncharacter.


> Characters are those code points that may be assigned
> an interpretation as a character, including undefined characters
> (private space and reserved).

So characters are code points which are characters, including undefined 
characters? :-)

http://www.unicode.org/glossary/#character



-- 
Steven


Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Antoine Pitrou

Seriously, can this discussion move somewhere else?
This has nothing to do on python-dev.

Thank you

Antoine.




Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Martin v. Löwis
On 17.09.14 10:56, Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> 
>> Guido's mantra is something like "Python's str doesn't contain
>> characters or even code points[1], it contains code units."
> 
> But is that true?

It used to be true, and stopped being so with PEP 393. In particular,
Python 3.2 and before would expose UTF-16 in the narrow build, so the
elements of a string would be code units. Since Python 3.3, the
surrogate code points are no longer interpreted as UTF-16 code units.
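
A short sketch of the difference as seen from Python 3.3 or later: an
astral character is one element of the string, even though its UTF-16
encoding needs two code units (which is what a narrow build used to
expose).

```python
import sys

astral = '\U00010083'

# One code point, one string element, on every build since PEP 393.
length = len(astral)
max_code_point = sys.maxunicode

# The UTF-16 form still takes two 16-bit code units.
utf16_unit_count = len(astral.encode('utf-16-le')) // 2
```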

Regards,
Martin



[Python-Dev] Smuggling bytes in a UTF-16 implementation of str/unicode (was: Multilingual programming article on the Red Hat Developer blog)

2014-09-17 Thread Stephen J. Turnbull
Jeff Allen writes:

 > This feels like a jython-dev discussion. But anyway ...

Well, if the same representation could be used in Jython you could
just point to PEP 383 and be done with it.

 > u'\udc83' in u'abc\U00010083xyz'

IMHO being able to type that is a bug.  There should be no literal
notation for surrogates in Python (that is, if you type a literal that
looks like it refers to a surrogate, you should get a UnicodeError).
The "right way" (IMHO) to spell that is

chr(0xdc83) in u'abc\U00010083xyz'

I'm not Guido, and don't claim to channel him on this.  But it seems
reasonable to me that Unicode literals should conform to Unicode in
this way.  I might even extend that to noncharacters (the last two
code points in each plane and the 32-point "hole" in Arabic).

I'll grant that chr() is an unfortunate spelling, but I would imagine
we could live with that since chr() goes back forever with these
semantics.

 > u'abc\U00010083xyz'.endswith(u'\udc83xyz')
 > 
 > are both True, if implemented in the obvious way on the UTF-16 
 > representation. And this should not be so in Jython, which claims to be 
 > a wide build. (I can't actually type the second one, but I can get the 
 > same effect in Jython 2.7b3 via a java.lang.StringBuilder.)

I agree that's very ugly, but AFAIK that's how things would work in
narrow CPython (which uses UTF-16 internally for the astral planes).

Personally I would document that explicit smuggled bytes are not
supported for comparison operations, and leave it at that.

 > If we think it is ok that code using such methods gives the wrong answer 
 > when fed strings containing smuggled bytes, then isolated (trailing) 
 > surrogates could be allowed. It's the user's fault for calling the 
 > method on that data.  But I think it kinder that our implementation 
 > defend users from these wrong answers. In the latest state of Jython, we 
 > do this by rigorously preventing the construction of a PyUnicode 
 > containing a lone surrogate, so we can just use UTF-16 operations 
 > without further checks.

That seems like a reasonable approach.



Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Stephen J. Turnbull
Steven D'Aprano writes:
 > On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
 > 
 > > Guido's mantra is something like "Python's str doesn't contain
 > > characters or even code points[1], it contains code units."
 > 
 > But is that true?

It's not.  That's why I wrote the slightly pejorative "mantra" and
qualified it with "something like".  The precise statement is
"something like" the array property is more important than preserving
character boundaries, so slices etc are allowed to do unexpected or
even evil things in the presence of astral characters in UTF-16
representations.
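
To illustrate the kind of "evil" meant here, the sketch below slices a
string at the level of UTF-16 code units, which is roughly what a narrow
build did, and compares that with slicing by code point. It assumes
CPython 3.4+ for the 'surrogatepass' handler on utf-16.

```python
text = 'a\U00010083b'

# Code-unit view: 'a', D800, DC83, 'b' (two bytes each, little-endian).
data = text.encode('utf-16-le')

# Taking the first two code units cuts the surrogate pair in half.
broken = data[:4]
try:
    broken.decode('utf-16-le')
    strictly_decodable = True
except UnicodeDecodeError:
    strictly_decodable = False

# What a code-unit slice leaves behind: a lone leading surrogate.
unit_slice = broken.decode('utf-16-le', errors='surrogatepass')

# Slicing by code point keeps the astral character whole.
code_point_slice = text[:2]
```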

 > I don't understand what you are trying to say here.

 > Nor am I sure what you are trying to say here either.

We can discuss this off-list if you would like.  The natives are
getting restless.

 > > non-characters.
 > 
 > Actually not quite. "Noncharacter"

Note the hyphen!  (Just kidding, I will avoid that terminology in the
future.  I knew, but forgot.)

 > > Characters are those code points that may be assigned
 > > an interpretation as a character, including undefined characters
 > > (private space and reserved).
 > 
 > So characters are code points which are characters, including undefined 
 > characters? :-)

No, there's a clear hierarchy here.
