Steven D'Aprano writes:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
>
> > Guido's mantra is something like "Python's str doesn't contain
> > characters or even code points[1], it contains code units."
>
> But is that true?
It's not. That's why I wrote the slight
Am 17.09.14 10:56, schrieb Steven D'Aprano:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
>
>> Guido's mantra is something like "Python's str doesn't contain
>> characters or even code points[1], it contains code units."
>
> But is that true?
It used to be true, and stop
Seriously, can this discussion move somewhere else?
This has nothing to do on python-dev.
Thank you
Antoine.
On Wed, 17 Sep 2014 18:56:02 +1000
Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
>
> > Guido's mantra is something like "Python's str
On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> Guido's mantra is something like "Python's str doesn't contain
> characters or even code points[1], it contains code units."
But is that true? If it were true, I would expect to be able to make
Python text strings containing
Sorry for the mojibake. I've not yet gotten around to actually using
the email package to write a smarter replacement for nmh, which is what
I use for email, and I always forget that I need to manually tell nmh
when there is non-ASCII in the message...
On Wed, 17 Sep 2014 03:02:33 -0400, "R. David M
On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano wrote:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray
> > wrote:
>
> > > Basically, we are pretending that each smuggled
> > > byte is a single character for string pars
Steven D'Aprano writes:
[long example]
> Am I right so far?
>
> So the email package uses the surrogate-escape error handler and ends up
> with this Unicode string:
>
> 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
>
> which can be encoded back to the bytes we
Steven D'Aprano writes:
> On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
>> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray
>> wrote:
>
>> > Basically, we are pretending that each smuggled
>> > byte is a single character for string parsing purposes...but they don't
>> > matc
On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray
> wrote:
> > Basically, we are pretending that each smuggled
> > byte is a single character for string parsing purposes...but they don't
> > match any of our parsing constants. T
Glenn Linderman writes:
> Some bytes may decode into characters without needing to be
> smuggled... maybe not in text-protocols like email, but in the
> general case. So then some of the bytes that should be interpreted
> as binary data are not in a disjoint set from characters.
True, but irr
On 9/16/2014 5:21 PM, Stephen J. Turnbull wrote:
It isn't, because the bytes/str problem was that given a str object
out of context you could not tell whether it was a binary blob or
text, and if text, you couldn't tell if it was external encoded text
or internal abstract text.
That is not true
On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray wrote:
> Yes. I thought you were saying that one could not treat the string with
> smuggled bytes as if it were a string. (It's a string that can't be
> encoded unless you use the surrogateescape error handler, but it is
> still a string from Pyth
On Wed, 17 Sep 2014 08:57:21 +0900, "Stephen J. Turnbull"
wrote:
> As long as the Java string manipulation functions don't check for
> surrogates, you should be fine with this representation. Of course I
> suppose your matching functions (etc) don't check for them either, so
> you will be somewh
R. David Murray writes:
> > Do what, exactly? As I understand you, you treat the unknown bytes as
> > completely opaque, not representing any characters at all. Which is
> > what I'm saying: those are not characters.
>
> Yes. I thought you were saying that one could not treat the string wit
Jim Baker writes:
> Given that Jython uses UTF-16 as its representation, it is possible to
> frequently smuggle isolated surrogates in it. A surrogate pair must be a
> low surrogate in range (D800, DC00), then a high surrogate in range(DC00,
> E000).
>
> Of course, if you do actually have a
On Wed, 17 Sep 2014 04:02:11 +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray
> wrote:
> >> You can't treat them as characters, so while you have them in your
> >> string, you can't treat it as a pure Unicode string - it's a Unicode
> >> string with smuggled bytes
On Wed, Sep 17, 2014 at 3:55 AM, Jim Baker wrote:
> Of course, if you do actually have a smuggled isolated low surrogate
> FOLLOWED by a smuggled isolated high surrogate - guess what, the only
> interpretation is a codepoint. Or perhaps more likely garbage. Of course it
> doesn't happen so often,
Great points here - I especially like the concluding statement "you can't
treat it as a pure Unicode string - it's a Unicode string with smuggled
bytes"
Given that Jython uses UTF-16 as its representation, it is possible to
frequently smuggle isolated surrogates in it. A surrogate pair must be a
l
On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it's a Unicode
>> string with smuggled bytes.
>
> Well, except that I do. The email header parsing algorithms all
On Wed, 17 Sep 2014 01:27:44 +1000, Chris Angelico wrote:
> On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray
> wrote:
> > That isn't the case in the email package. The smuggled bytes are not
> > errors[*], they are literally smuggled bytes.
>
> But they're not characters, which is what Stephen
On Wed, Sep 17, 2014 at 1:00 AM, R. David Murray wrote:
> That isn't the case in the email package. The smuggled bytes are not
> errors[*], they are literally smuggled bytes.
But they're not characters, which is what Stephen and I were saying -
and contrary to what Jim said about treating them a
On Tue, 16 Sep 2014 13:51:23 +1000, Chris Angelico wrote:
> On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull
> wrote:
> > Jim J. Jewett writes:
> >
> > > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > > as representing a character outside of your unicode repertoire
On Tue, Sep 16, 2014 at 1:34 PM, Stephen J. Turnbull wrote:
> Jim J. Jewett writes:
>
> > In terms of best-effort, it is reasonable to treat the smuggled bytes
> > as representing a character outside of your unicode repertoire
>
> I have to disagree. If you ever end up passing them to something th
Jim J. Jewett writes:
> In terms of best-effort, it is reasonable to treat the smuggled bytes
> as representing a character outside of your unicode repertoire
I have to disagree. If you ever end up passing them to something that
validates or tries to reencode them without surrogateescape, BOOM
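
A minimal sketch of the failure mode described above, assuming CPython 3 and
a made-up header value (the bytes and variable names are illustrative only,
not taken from the email package):

```python
# Hypothetical header with one Latin-1 byte (0xE9) that is not valid ASCII.
raw = b"Subject: caf\xe9"

# PEP 383: surrogateescape maps each undecodable byte 0x80..0xFF to a lone
# surrogate in U+DC80..U+DCFF, so 0xE9 is smuggled as U+DCE9.
text = raw.decode("ascii", errors="surrogateescape")
smuggled = text[-1]

# Encoding with the same error handler restores the original bytes.
round_trip = text.encode("ascii", errors="surrogateescape")

# Encoding without surrogateescape is the "BOOM" case: strict UTF-8
# refuses to encode a lone surrogate.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The string behaves like any other str for slicing and searching; the
smuggled byte only causes trouble once something validates or re-encodes
it without the surrogateescape handler.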
On Sat Sep 13 00:16:30 CEST 2014, Jeff Allen wrote:
> 1. Java does not really have a Unicode type, therefore not one that
> validates. It has a String type that is a sequence of UTF-16 code units.
> There are some String methods and Character methods that deal with code
> points represented
On 14 Sep 2014 01:33, "R. David Murray" wrote:
>
> On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan
wrote:
> > On 13 Sep 2014 10:18, "Jeff Allen" wrote:
> > > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it
> > > would have to do it the same way as CPython, as it is visibl
On Sat, Sep 13, 2014, 09:33 R. David Murray wrote:
> On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan
> wrote:
> > On 13 Sep 2014 10:18, "Jeff Allen" wrote:
> > > > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it
> > > > would have to do it the same way as CPython, as it is
On Sat, 13 Sep 2014 21:06:21 +1200, Nick Coghlan wrote:
> On 13 Sep 2014 10:18, "Jeff Allen" wrote:
> > 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it
> > would have to do it the same way as CPython, as it is visible. It's not
> > impossible (I think), but is messy. Some are
On 13 Sep 2014 10:18, "Jeff Allen" wrote:
> 4. I think (with Antoine) if Jython supported PEP-383 byte smuggling, it
> would have to do it the same way as CPython, as it is visible. It's not
> impossible (I think), but is messy. Some are strongly against.
It may be worth trying *without* it (i.e. tre
Jim, Stephen:
It seems like we're off topic here, but to answer all as briefly as
possible:
1. Java does not really have a Unicode type, therefore not one that
validates. It has a String type that is a sequence of UTF-16 code units.
There are some String methods and Character methods that de
Jeff Allen writes:
> Simply having a block "for private use" seems to create an unmanaged
> space for conflict,
No. The uncharted range of human language (including recently-
invented nonsense like "emoticons" and the annual "design a character"
contest run by a newspaper in Taipei, with the g
On September 11, 2014, Jeff Allen wrote:
> ... the area of code point
> space used for the smuggling of bytes under PEP-383 is not a
> "Unicode Private Use Area", but a portion of the trailing surrogate
> range. This is a code violation, which I imagine is why
> "surrogateescape" is an error
On Fri, 12 Sep 2014 07:54:56 +0100
Jeff Allen wrote:
> Simply having a block "for private use" seems to create an unmanaged
> space for conflict, reminiscent of the "other 128 characters" in
> bilingual programming. I wondered if the way to respect use by
> applications might be to make it priv
On 12/09/2014 04:28, Stephen J. Turnbull wrote:
Jeff Allen writes:
> A welcome article. One correction should be made, I believe: the area of
> code point space used for the smuggling of bytes under PEP-383 is not a
> "Unicode Private Use Area", but a portion of the trailing surrogate
>
Jeff Allen writes:
> A welcome article. One correction should be made, I believe: the area of
> code point space used for the smuggling of bytes under PEP-383 is not a
> "Unicode Private Use Area", but a portion of the trailing surrogate
> range.
Nice catch. Note that the surrogate range
A welcome article. One correction should be made, I believe: the area of
code point space used for the smuggling of bytes under PEP-383 is not a
"Unicode Private Use Area", but a portion of the trailing surrogate
range. This is a code violation, which I imagine is why
"surrogateescape" is an er
On Wed, Sep 10, 2014 at 05:17:57PM +1000, Nick Coghlan wrote:
> Since it may come in handy when discussing "Why was Python 3
> necessary?" with folks, I wanted to point out that my article on the
> transition to multilingual programming has now been reposted on the
> Red Hat developer blog:
> http:
Since it may come in handy when discussing "Why was Python 3
necessary?" with folks, I wanted to point out that my article on the
transition to multilingual programming has now been reposted on the
Red Hat developer blog:
http://developerblog.redhat.com/2014/09/09/transition-to-multilingual-program