Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Stephen J. Turnbull
Steven D'Aprano writes: > On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote: > > > Guido's mantra is something like "Python's str doesn't contain > > characters or even code points[1], it contains code units." > > But is that true? It's not. That's why I wrote the slight

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Martin v. Löwis
Am 17.09.14 10:56, schrieb Steven D'Aprano: > On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote: > >> Guido's mantra is something like "Python's str doesn't contain >> characters or even code points[1], it contains code units." > > But is that true? It used to be true, and stop

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Antoine Pitrou
Seriously, can this discussion move somewhere else? This has nothing to do on python-dev. Thank you Antoine. On Wed, 17 Sep 2014 18:56:02 +1000 Steven D'Aprano wrote: > On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote: > > > Guido's mantra is something like "Python's str

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote: > Guido's mantra is something like "Python's str doesn't contain > characters or even code points[1], it contains code units." But is that true? If it were true, I would expect to be able to make Python text strings containing
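
One way to probe the quoted claim is to compare what a str exposes with the UTF-16 code units of the same text. The sketch below is an illustration added to this listing, not code from the thread; it assumes CPython 3.3 or later, and the sample character U+1F600 is an arbitrary choice.

```python
# Illustration only; assumes CPython 3.3+ and an arbitrary non-BMP sample.
s = "a\U0001F600"

# What str indexing and iteration expose: one item per code point.
code_points = [ord(ch) for ch in s]

# The same text as UTF-16 code units, for comparison.
utf16 = s.encode("utf-16-le")
code_units = [
    int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)
]

print(len(s), [hex(cp) for cp in code_points])
print(len(code_units), [hex(cu) for cu in code_units])
```

On such an interpreter the str has two items while the UTF-16 form has three code units, which suggests that indexing works on code points there.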

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread R. David Murray
Sorry for the mojibake. I've not yet gotten around to actually using the email package to write a smarter replacement for nmh, which is what I use for email, and I always forget that I need to manually tell nmh when there is non-ASCII in the message... On Wed, 17 Sep 2014 03:02:33 -0400, "R. David M

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-17 Thread R. David Murray
On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano wrote: > On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: > > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray > > wrote: > > > > Basically, we are pretending that the each smuggled > > > byte is single character for string pars
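
The quoted point, that each smuggled byte behaves as a single character while parsing yet matches none of the parsing constants, can be sketched as follows. This is an illustration added to this listing, not code from the email package; the header bytes and the SPECIALS set are hypothetical stand-ins.

```python
# Hypothetical set of parsing constants, loosely modelled on header "specials".
SPECIALS = set('()<>@,:;.\\"[]')

# Hypothetical header holding one undeclared non-ASCII byte (0xF6).
raw = b"To: J\xf6rg <j@example.org>"
header = raw.decode("ascii", errors="surrogateescape")

# Ordinary str operations still work on the result.
name, _, value = header.partition(":")

# Smuggled bytes land in U+DC80..U+DCFF, one code point per byte.
smuggled = [ch for ch in header if "\udc80" <= ch <= "\udcff"]
matches_constant = any(ch in SPECIALS or ch.isspace() for ch in smuggled)

print(name, len(smuggled), matches_constant)
```

Because the escaped code points fall outside every constant a parser tests against, they pass through as opaque payload.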

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Steven D'Aprano writes: [long example] > Am I right so far? > > So the email package uses the surrogate-escape error handler and ends up > with this Unicode string: > > 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”' > > which can be encoded back to the bytes we
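
The round trip described in this snippet can be sketched as below. The input bytes are an assumption, reconstructed so that decoding yields the string quoted above; the elided long example may have used a different path to get there.

```python
# Assumed input: the UTF-8 bytes of an opening curly quote in reversed order
# (9C 80 E2, invalid as UTF-8), followed by valid text and a valid closing quote.
raw = b"Subject: \x9c\x80\xe2NOBODY expects the Spanish Inquisition!\xe2\x80\x9d"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same error handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(round_tripped == raw)
```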

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Akira Li
Steven D'Aprano writes: > On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: >> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray >> wrote: > >> > Basically, we are pretending that the each smuggled >> > byte is single character for string parsing purposes...but they don't >> > matc

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Steven D'Aprano
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray > wrote: > > Basically, we are pretending that the each smuggled > > byte is single character for string parsing purposes...but they don't > > match any of our parsing constants. T

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Glenn Linderman writes: > Some bytes may decode into characters without needing to be > smuggled... maybe not in text-protocols like email, but in the > general case. So then some of the bytes that should be interpreted > as binary data are not in a disjoint set from characters. True, but irr

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Glenn Linderman
On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote: It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text. That is not true

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote: > Yes. I thought you were saying that one could not treat the string with > smuggled bytes as if it were a string. (It's a string that can't be > encoded unless you use the surrogateescape error handler, but it is > still a string from Pyth

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 08:57:21 +0900, "Stephen J. Turnbull" wrote: > As long as the Java string manipulation functions don't check for > surrogates, you should be fine with this representation. Of course I > suppose your matching functions (etc) don't check for them either, so > you will be somewh

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
R. David Murray writes: > > Do what, exactly? As I understand you, you treat the unknown bytes as > > completely opaque, not representing any characters at all. Which is > > what I'm saying: those are not characters. > > Yes. I thought you were saying that one could not treat the string wit

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Stephen J. Turnbull
Jim Baker writes: > Given that Jython uses UTF-16 as its representation, it is possible to > frequently smuggle isolated surrogates in it. A surrogate pair must be a > low surrogate in range (D800, DC00), then a high surrogate in range(DC00, > E000). > > Of course, if you do actually have a
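
For reference when reading the ranges quoted above: in UTF-16 as the Unicode Standard defines it, the first unit of a pair is the high (lead) surrogate in U+D800..U+DBFF and the second is the low (trail) surrogate in U+DC00..U+DFFF. A minimal sketch of the arithmetic follows; the function name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into (lead, trail) UTF-16 units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000
    lead = 0xD800 + (offset >> 10)     # high surrogate: U+D800..U+DBFF
    trail = 0xDC00 + (offset & 0x3FF)  # low surrogate: U+DC00..U+DFFF
    return lead, trail


lead, trail = to_surrogate_pair(0x1F600)
print(hex(lead), hex(trail))
```

PEP 383 smuggles bytes 0x80..0xFF as U+DC80..U+DCFF, which sits inside the low (trail) range.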

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray > wrote: > >> You can't treat them as characters, so while you have them in your > >> string, you can't treat it as a pure Unicode string - it's a Unicode > >> string with smuggled bytes

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker wrote: > Of course, if you do actually have a smuggled isolated low surrogate > FOLLOWED by a smuggled isolated high surrogate - guess what, the only > interpretation is a codepoint. Or perhaps more likely garbage. Of course it > doesn't happen so often,

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Jim Baker
Great points here - I especially like the concluding statement "you can't treat it as a pure Unicode string - it's a Unicode string with smuggled bytes" Given that Jython uses UTF-16 as its representation, it is possible to frequently smuggle isolated surrogates in it. A surrogate pair must be a l

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote: >> You can't treat them as characters, so while you have them in your >> string, you can't treat it as a pure Unicode string - it's a Unicode >> string with smuggled bytes. > > Well, except that I do. The email header parsing algorithms all

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico wrote: > On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray > wrote: > > That isn't the case in the email package. The smuggled bytes are not > > errors[*], they are literally smuggled bytes. > > But they're not characters, which is what Stephen

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread Chris Angelico
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote: > That isn't the case in the email package. The smuggled bytes are not > errors[*], they are literally smuggled bytes. But they're not characters, which is what Stephen and I were saying - and contrary to what Jim said about treating them a

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-16 Thread R. David Murray
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico wrote: > On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull > wrote: > > Jim J. Jewett writes: > > > > > In terms of best-effort, it is reasonable to treat the smuggled bytes > > > as representing a character outside of your unicode repertoire

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-15 Thread Chris Angelico
On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull wrote: > Jim J. Jewett writes: > > > In terms of best-effort, it is reasonable to treat the smuggled bytes > > as representing a character outside of your unicode repertoire > > I have to disagree. If you ever end up passing them to something th

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-15 Thread Stephen J. Turnbull
Jim J. Jewett writes: > In terms of best-effort, it is reasonable to treat the smuggled bytes > as representing a character outside of your unicode repertoire I have to disagree. If you ever end up passing them to something that validates or tries to reencode them without surrogateescape, BOOM
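
The failure mode described here can be reproduced with a short sketch. It is an illustration added to this listing, and the sample bytes are hypothetical.

```python
# Hypothetical Latin-1 bytes decoded as UTF-8 with byte smuggling enabled.
raw = b"caf\xe9"
smuggled = raw.decode("utf-8", errors="surrogateescape")

# A strict encoder rejects the lone surrogate that carries the smuggled byte.
try:
    smuggled.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Only the matching error handler turns it back into the original bytes.
restored = smuggled.encode("utf-8", errors="surrogateescape")

print(ascii(smuggled), strict_ok, restored)
```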

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-15 Thread Jim J. Jewett
On Sat Sep 13 00:16:30 CEST 2014, Jeff Allen wrote: > 1. Java does not really have a Unicode type, therefore not one that > validates. It has a String type that is a sequence of UTF-16 code units. > There are some String methods and Character methods that deal with code > points represented

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread Nick Coghlan
On 14 Sep 2014 01:33, "R. David Murray" wrote: > > On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan wrote: > > On 13 Sep 2014 10:18, "Jeff Allen" wrote: > > > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it > > would have to do it the same way as CPython, as it is visibl

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread Tim Lesher
On Sat, Sep 13, 2014, 09:33 R. David Murray wrote: > On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan > wrote: > > On 13 Sep 2014 10:18, "Jeff Allen" wrote: > > > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, > it > > would have to do it the same way as CPython, as it is

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread R. David Murray
On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan wrote: > On 13 Sep 2014 10:18, "Jeff Allen" wrote: > > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it > would have to do it the same way as CPython, as it is visible. It's not > impossible (I think), but is messy. Some are

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-13 Thread Nick Coghlan
On 13 Sep 2014 10:18, "Jeff Allen" wrote: > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it would have to do it the same way as CPython, as it is visible. It's not impossible (I think), but is messy. Some are strongly against. It may be worth trying *without* it (i.e. tre

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Jeff Allen
Jim, Stephen: It seems like we're off topic here, but to answer all as briefly as possible: 1. Java does not really have a Unicode type, therefore not one that validates. It has a String type that is a sequence of UTF-16 code units. There are some String methods and Character methods that de

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Stephen J. Turnbull
Jeff Allen writes: > Simply having a block "for private use" seems to create an unmanaged > space for conflict, No. The uncharted range of human language (including recently-invented nonsense like "emoticons" and the annual "design a character" contest run by a newspaper in Taipei, with the g

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Jim J. Jewett
On September 11, 2014, Jeff Allen wrote: > ... the area of code point > space used for the smuggling of bytes under PEP-383 is not a > "Unicode Private Use Area", but a portion of the trailing surrogate > range. This is a code violation, which I imagine is why > "surrogateescape" is an error

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-12 Thread Antoine Pitrou
On Fri, 12 Sep 2014 07:54:56 +0100 Jeff Allen wrote: > Simply having a block "for private use" seems to create an unmanaged > space for conflict, reminiscent of the "other 128 characters" in > bilingual programming. I wondered if the way to respect use by > applications might be to make it priv

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-11 Thread Jeff Allen
On 12/09/2014 04:28, Stephen J. Turnbull wrote: Jeff Allen writes: > A welcome article. One correction should be made, I believe: the area of > code point space used for the smuggling of bytes under PEP-383 is not a > "Unicode Private Use Area", but a portion of the trailing surrogate >

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-11 Thread Stephen J. Turnbull
Jeff Allen writes: > A welcome article. One correction should be made, I believe: the area of > code point space used for the smuggling of bytes under PEP-383 is not a > "Unicode Private Use Area", but a portion of the trailing surrogate > range. Nice catch. Note that the surrogate range

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-11 Thread Jeff Allen
A welcome article. One correction should be made, I believe: the area of code point space used for the smuggling of bytes under PEP-383 is not a "Unicode Private Use Area", but a portion of the trailing surrogate range. This is a code violation, which I imagine is why "surrogateescape" is an er

Re: [Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-10 Thread Steven D'Aprano
On Wed, Sep 10, 2014 at 05:17:57PM +1000, Nick Coghlan wrote: > Since it may come in handy when discussing "Why was Python 3 > necessary?" with folks, I wanted to point out that my article on the > transition to multilingual programming has now been reposted on the > Red Hat developer blog: > http:

[Python-Dev] Multilingual programming article on the Red Hat Developer blog

2014-09-10 Thread Nick Coghlan
Since it may come in handy when discussing "Why was Python 3 necessary?" with folks, I wanted to point out that my article on the transition to multilingual programming has now been reposted on the Red Hat developer blog: http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-program