subject:"flaming vs accuracy \[was Re\: Performance of int\/long in Python 3\]"

On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:


 More seriously Ive never seen anyone -- cause or person -- aided by
 the use of excessively strong language.

Of course not. By definition, if it helps, it wasn't *excessively* strong 
language.


 IOW I repeat my support for Ned's request: Ad hominiem attacks are not
 welcome, irrespective of the context/provocation.

Insults are not ad hominem attacks.

You sir, are a bounder and a cad. Furthermore, your 
argument is wrong, because of reasons.

may very well be an insult, but it also may be correct, and the reasons 
logically valid.

Your argument is wrong, because you are a bounder 
and a cad.

is an ad hominem fallacy, because even bounders and cads may tell the 
truth occasionally, or be correct by accident.

I find it interesting that nobody has yet felt the need to defend JMF, 
and tell me I was factually incorrect about him (as opposed to merely 
impolite or mean-spirited).

In any case, I don't want this to be specifically about any one person, 
so let's move away from JMF. I disagree that hostile language is *always* 
inappropriate, although I agree that it is *usually* inappropriate.

Although even that depends on what you define as hostile -- I would 
much prefer that people confronted me for being (supposedly) dishonest 
than silently shunning me without giving me any way to respond or correct 
either my behaviour or their (mis)apprehensions. Quite frankly, I think 
that the passive-aggressive silent treatment (kill-filing) is MUCH more 
hostile and mean-spirited[1] than honest, respectful, direct criticism, 
even when that criticism is about character (you sir are a lying 
scoundrel).

I treat people the way I hope to be treated. As galling as it would be to 
be accused of lying, I would rather that you called me a liar to my face 
and gave me the opportunity to respond, than for you to ignore everything 
I said.

I hope that we all agree that we want a nice, friendly, productive 
community where everyone is welcome. But some people simply cannot or 
will not behave in ways that are compatible with those community values. 
There are some people whom we *do not want here* -- spoilers and messers, 
vandals and spammers and cheats and liars and trolls and crackpots of all 
sorts. We only disagree as to the best way to make it clear to them that 
they are not welcome so long as they continue their behaviour.



[1] Although sadly, given the reality of communication on the Internet, 
sometimes kill-filing is the least-worst option.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 07:12, Ethan Furman et...@stoneleaf.us wrote:
 On 03/27/2013 08:49 PM, rusi wrote:

  In particular You are a liar is as bad as You are an idiot
  The same statement can be made non-abusively thus: ... is not true
  because ...

 I don't agree.  With all the posts and micro benchmarks and other drivel that 
 jmf has inflicted on us, I find it /very/
 hard to believe that he forgot -- which means he was deliberately lying.

 At some point we have to stop being gentle / polite / politically correct and 
 call a shovel a shovel... er, spade.

 --
 ~Ethan~

---

The problem is elsewhere. Nobody understand the examples
I gave on this list, because nobody understand Unicode.
These examples are not random examples, they are well
thought.

If you were understanding the coding of the characters,
Unicode and what this flexible representation does, it
would not be a problem for you to create analog examples.

So, we are turning into circles.

This flexible representation succeeds to cumulate in one
shoot all the design mistakes it is possible to do, when
one wishes to implements Unicode.

Example of a good Unicode understanding.
If you wish 1) to preserve memory, 2) to cover the whole range
of Unicode, 3) to keep maximum performance while preserving the
good work Unicode.org as done (normalization, sorting), there
is only one solution: utf-8. For this you have to understand,
what is really a unicode transformation format.

Why all the actors, active in the text field, like MicroSoft,
Apple, Adobe, the unicode compliant TeX engines, the foundries,
the organisation in charge of the OpenType font specifications,
are able to handle all this stuff correctly (understanding +
implementation) and Python not?, I should say this is going
beyond my understanding.

Python has certainly and definitvely not revolutionize
Unicode.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Ian Foote


On 28/03/13 09:03, jmfauth wrote:

The problem is elsewhere. Nobody understand the examples
I gave on this list, because nobody understand Unicode.
These examples are not random examples, they are well
thought.

If you were understanding the coding of the characters,
Unicode and what this flexible representation does, it
would not be a problem for you to create analog examples.

So, we are turning into circles.

This flexible representation succeeds to cumulate in one
shoot all the design mistakes it is possible to do, when
one wishes to implements Unicode.

Example of a good Unicode understanding.
If you wish 1) to preserve memory, 2) to cover the whole range
of Unicode, 3) to keep maximum performance while preserving the
good work Unicode.org as done (normalization, sorting), there
is only one solution: utf-8. For this you have to understand,
what is really a unicode transformation format.

Why all the actors, active in the text field, like MicroSoft,
Apple, Adobe, the unicode compliant TeX engines, the foundries,
the organisation in charge of the OpenType font specifications,
are able to handle all this stuff correctly (understanding +
implementation) and Python not?, I should say this is going
beyond my understanding.

Python has certainly and definitvely not revolutionize
Unicode.

jmf



You're confusing python's choice of internal string representation with 
the programmer's choice of encoding for communicating with other programs.


I think most people agree that utf-8 is usually the best encoding to use 
for interoperating with other unicode aware software, but as a 
variable-length encoding it has disadvantages that make it unsuitable 
for use as an internal representation.


Specifically, indexing a variable-length encoding like utf-8 is not as 
efficient as indexing a fixed-length encoding.


Regards,
Ian F
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Oscar Benjamin

On 28 March 2013 09:03, jmfauth wxjmfa...@gmail.com wrote:

 The problem is elsewhere. Nobody understand the examples
 I gave on this list, because nobody understand Unicode.
 These examples are not random examples, they are well
 thought.

There are many people here and among the Python devs who understand
unicode. Similarly they have understood the examples that you have
given. It has been accepted that there are a handful of cases where
performance has been reduced as a result of the change. There are also
many cases where the performance has improved. It is certainly not
clear that there is an *overall* performance reduction for people
using non latin-1 characters as you have often suggested.

The reason your initial posts received a poor reception is that they
were accompanied with pointless rants and arrogant claims that no one
understood the problem. Had you simply reported the timing differences
without the rants then I imagine that you would have received a
response like Okay, there might be a few regressions. Can you open an
issue on the tracker please?.

Since then you have been relentlessly hijacking unrelated threads and
this is clearly just trolling.


 If you were understanding the coding of the characters,
 Unicode and what this flexible representation does, it
 would not be a problem for you to create analog examples.

 So, we are turning into circles.

 This flexible representation succeeds to cumulate in one
 shoot all the design mistakes it is possible to do, when
 one wishes to implements Unicode.

This is clearly untrue.The most significant design mistakes are the
ones that lead to incorrect handling of unicode characters. This new
implementation in Python 3.3 has been designed in a way that makes it
possible to handle all unicode characters correctly.


 Example of a good Unicode understanding.
 If you wish 1) to preserve memory, 2) to cover the whole range
 of Unicode, 3) to keep maximum performance while preserving the
 good work Unicode.org as done (normalization, sorting), there
 is only one solution: utf-8. For this you have to understand,
 what is really a unicode transformation format.

Again you pretend that others here don't understand. Most people here
are well aware of utf-8 is. Your suggestion that maximum performance
would be achieved if Python use utf-8 internally ignores the fact that
it would have many negative performance implications for slicing and
indexing and so on.


 Why all the actors, active in the text field, like MicroSoft,
 Apple, Adobe, the unicode compliant TeX engines, the foundries,
 the organisation in charge of the OpenType font specifications,
 are able to handle all this stuff correctly (understanding +
 implementation) and Python not?, I should say this is going
 beyond my understanding.

 Python has certainly and definitvely not revolutionize
 Unicode.

Perhaps not, but it does now correctly handle all unicode characters
(unlike many other languages and pieces of software).


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Thu, Mar 28, 2013 at 4:20 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Wed, 27 Mar 2013 20:49:20 -0700, rusi wrote:

 In particular You are a liar is as bad as You are an idiot The same
 statement can be made non-abusively thus: ... is not true because ...

 I accept that criticism, even if I disagree with it. Does that make
 sense? I mean it in the sense that I accept that your opinion differs
 from mine.

 Politeness does not always trump honesty, and stating that somebody's
 statement is not true because... is not the same as stating that they
 are deliberately telling lies (rather than merely being mistaken or
 confused).

There comes a time when a bit of rudeness is a small cost to pay for
forum maintenance. Before you criticize someone for nit-picking, think
what happens when someone reads the thread archive. Of course, that
particular example can be done courteously too - cf the def vs
class nit from a recent thread. But it'd still be of value even if
done rudely, so the hundreds of subsequent readers would have a chance
to know what's going on. I was researching a problem with ALSA a
couple of weeks ago, and came across a forum thread that discussed
exactly what I needed to know. A dozen or so courteous posts delivered
misinformation; finally someone had the guts to be rude and call
people out for posting incorrect points (and got criticized for doing
so), and that one post was the most useful in the whole thread.

I'd rather this list have some vinegar than it devolve into
uselessness. Or, worse, if there's a hard-and-fast rule about
courtesy, devolve into aspartame... everyone's courteous in words but
hates each other underneath. Or am I taking the analogy too far? :)

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Thu, Mar 28, 2013 at 8:03 PM, jmfauth wxjmfa...@gmail.com wrote:
 Example of a good Unicode understanding.
 If you wish 1) to preserve memory, 2) to cover the whole range
 of Unicode, 3) to keep maximum performance while preserving the
 good work Unicode.org as done (normalization, sorting), there
 is only one solution: utf-8. For this you have to understand,
 what is really a unicode transformation format.

You really REALLY need to sort out in your head the difference between
correctness and performance. I still haven't seen one single piece of
evidence from you that Python 3.3 fails on any point of Unicode
correctness. Covering the whole range of Unicode has never been a
problem.

In terms of memory usage and performance, though, there's one obvious
solution. Fork CPython 3.3 (or the current branch head[1]), change the
internal representation of a string to be UTF-8 (by the way, that's
the official spelling), and run the string benchmarks. Then post your
code and benchmark figures so other people can replicate your results.

 Python has certainly and definitvely not revolutionize
 Unicode.

This is one place where you're actually correct, though, because PEP
393 isn't the first instance of this kind of format - Pike's had it
for years. Funny though, I don't think that was your point :)

[1] Apologies if my terminology is wrong, I'm a git user and did one
quick Google search to see if hg uses the same term.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]


Ian Foote:


Specifically, indexing a variable-length encoding like utf-8 is not as
efficient as indexing a fixed-length encoding.


   Many common string operations do not require indexing by character 
which reduces the impact of this inefficiency. UTF-8 seems like a 
reasonable choice for an internal representation to me. One benefit of 
UTF-8 over Python's flexible representation is that it is, on average, 
more compact over a wide set of samples.


   Neil
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Mark Lawrence


On 28/03/2013 03:18, Ethan Furman wrote:


I wouldn't call it unproductive -- a half-dozen amusing posts followed
because of Mark's initial post, and they were a great relief from the
tedium and (dare I say it?) idiocy of jmf's posts.

--
~Ethan~


Thanks for those words.  They're a tonic as I've just clawed my way out 
of bed at 12:00 GMT having slept for 15 hours.


Once the PEP393 unicode debacle has been sorted, does anyone have a cure 
for Chronic Fatigue Syndrome? :)


--
If you're using GoogleCrap™ please read this 
http://wiki.python.org/moin/GoogleGroupsPython.


Mark Lawrence

--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:

 Ian Foote:
 
 Specifically, indexing a variable-length encoding like utf-8 is not as
 efficient as indexing a fixed-length encoding.
 
 Many common string operations do not require indexing by character
 which reduces the impact of this inefficiency. 

Which common string operations do you have in mind?

Specifically in Python's case, the most obvious counter-example is the 
length of a string. But that's only because Python strings are immutable 
objects, and include a field that records the length. So once the string 
is created, checking its length takes constant time.

Some string operations need to inspect every character, e.g. str.upper(). 
Even for them, the increased complexity of a variable-width encoding 
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or 
4 bytes per character. You have to walk the string grabbing 1 byte at a 
time, and then decide whether you need another 1, 2 or 3 bytes. Even 
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost. 

Any string method that takes a starting offset requires the method to 
walk the string byte-by-byte. I've even seen languages put responsibility 
for dealing with that onto the programmer: the start offset is given in 
*bytes*, not characters. I don't remember what language this was... it 
might have been Haskell? Whatever it was, it horrified me.


 UTF-8 seems like a
 reasonable choice for an internal representation to me.

It's not. Haskell, for example, uses UTF-8 internally, and it warns that 
this makes string operations O(N) instead of O(1) precisely because of 
the need to walk the string inspecting every byte.

Remember, when your string primitives are O(N), it is very easy to write 
code that becomes O(N**2). Using UTF-8 internally is just begging for 
user-written code to be O(N**2).


 One benefit of
 UTF-8 over Python's flexible representation is that it is, on average,
 more compact over a wide set of samples.

Sure. And over a different set of samples, it is less compact. If you 
write a lot of Latin-1, Python will use one byte per character, while 
UTF-8 will use two bytes per character.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 11:30, Chris Angelico ros...@gmail.com wrote:
 On Thu, Mar 28, 2013 at 8:03 PM, jmfauth wxjmfa...@gmail.com wrote:

-

 You really REALLY need to sort out in your head the difference between
 correctness and performance. I still haven't seen one single piece of
 evidence from you that Python 3.3 fails on any point of Unicode
 correctness.

That's because you are not understanding unicode. Unicode takes
you from the character to the unicoded transformed fomat via
the code point, working with a unique set of characters with
a contigoous range of code points.
Then it is up to the implementors (languages, compilers, ...)
to implement this utf.

 Covering the whole range of Unicode has never been a
 problem.

... for all those, who are following the scheme explained above.
And it magically works smoothly. Of course, there are some variations
due to the Character Encoding Form wich is later influenced by the
Character Encoding Scheme (the serialization of the character Encoding
Scheme).

Rough explanation in other words.
I does not matter if you are using utf-8, -16, -32, ucs2 or ucs4.
All the single characters are handled in the same way with the same
algorithm.

---

The flexible string representation takes the problem from the
other side, it attempts to work with the characters by using
their representations and it (can only) fails...

PS I never propose to use utf-8. I only spoke about utf-8
as an example. If you start to discuss indexing, you are off-topic.

jmf


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 14:01, Steven D'Aprano steve
+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 28 Mar 2013 23:11:55 +1100, Neil Hodgson wrote:
  Ian Foote:


  One benefit of
  UTF-8 over Python's flexible representation is that it is, on average,
  more compact over a wide set of samples.

 Sure. And over a different set of samples, it is less compact. If you
 write a lot of Latin-1, Python will use one byte per character, while
 UTF-8 will use two bytes per character.


This flexible string representation is so absurd that not only
it does not know you can not write Western European Languages
with latin-1, it penalizes you by just attempting to optimize
latin-1. Shown in my multiple examples.

(This is a similar case of the long and short int question/dicussion
Chris Angelico opened).


PS1: I received plenty of private mails. I'm suprise, how the dev
do not understand unicode.

PS2: Question I received once from a registrated French Python
Developper (in another context). What are those French characters
you can handle with cp1252 and not with latin-1?

jmf


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 1:12 AM, jmfauth wxjmfa...@gmail.com wrote:
 This flexible string representation is so absurd that not only
 it does not know you can not write Western European Languages
 with latin-1, it penalizes you by just attempting to optimize
 latin-1. Shown in my multiple examples.

PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?) Both are
optimized down to a single byte per character.

Option 2 is optimized to two bytes per character.

Option 3 is stored in UTF-32.

Once again, jmf, you are forgetting that option 2 is a safe and
bug-free optimization.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread MRAB


On 28/03/2013 12:11, Neil Hodgson wrote:

Ian Foote:


Specifically, indexing a variable-length encoding like utf-8 is not
as efficient as indexing a fixed-length encoding.


Many common string operations do not require indexing by character
which reduces the impact of this inefficiency. UTF-8 seems like a
reasonable choice for an internal representation to me. One benefit
of UTF-8 over Python's flexible representation is that it is, on
average, more compact over a wide set of samples.


Implementing the regex module (http://pypi.python.org/pypi/regex) would
have been more difficult if the internal representation had been UTF-8,
because of the need to decode, and the implementation would also have
been slower for that reason.
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 1:51 AM, MRAB pyt...@mrabarnett.plus.com wrote:
 On 28/03/2013 12:11, Neil Hodgson wrote:

 Ian Foote:

 Specifically, indexing a variable-length encoding like utf-8 is not
 as efficient as indexing a fixed-length encoding.


 Many common string operations do not require indexing by character
 which reduces the impact of this inefficiency. UTF-8 seems like a
 reasonable choice for an internal representation to me. One benefit
 of UTF-8 over Python's flexible representation is that it is, on
 average, more compact over a wide set of samples.

 Implementing the regex module (http://pypi.python.org/pypi/regex) would
 have been more difficult if the internal representation had been UTF-8,
 because of the need to decode, and the implementation would also have
 been slower for that reason.

In fact, nearly ALL string parsing operations would need to be done
differently. The only method that I can think of that wouldn't be
impacted is a linear state-machine parser - something that could be
written inside a for character in string loop.

text = []

def initial(c):
global state
if c=='': state=tag
else: text.append(c)

def tag(c):
global state
if c=='': state=initial

state = initial
for character in string:
state(character)

print(''.join(text))


I'm pretty sure this will run in O(N) time, even with UTF-8 strings.
But it's an *extremely* simple parser.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 15:38, Chris Angelico ros...@gmail.com wrote:
 On Fri, Mar 29, 2013 at 1:12 AM, jmfauth wxjmfa...@gmail.com wrote:
  This flexible string representation is so absurd that not only
  it does not know you can not write Western European Languages
  with latin-1, it penalizes you by just attempting to optimize
  latin-1. Shown in my multiple examples.

 PEP393 strings have two optimizations, or kinda three:

 1a) ASCII-only strings
 1b) Latin1-only strings
 2) BMP-only strings
 3) Everything else

 Options 1a and 1b are almost identical - I'm not sure what the detail
 is, but there's something flagging those strings that fit inside seven
 bits. (Something to do with optimizing encodings later?) Both are
 optimized down to a single byte per character.

 Option 2 is optimized to two bytes per character.

 Option 3 is stored in UTF-32.

 Once again, jmf, you are forgetting that option 2 is a safe and
 bug-free optimization.

 ChrisA

As long as you are attempting to devide a set of characters in
chunks and try to handle them seperately, it will never work.

Read my previous post about the unicode transformation format.
I know what pep393 does.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 2:14 AM, jmfauth wxjmfa...@gmail.com wrote:
 As long as you are attempting to devide a set of characters in
 chunks and try to handle them seperately, it will never work.

Okay. Let's look at integers. To properly represent the Python 3 'int'
type (or the Python 2 'long'), we need to be able to encode ANY
integer. And of course, any attempt to divide them up into chunks will
never work. So we need a single representation that will cover ANY
integer, right?

Perfect. We already have one of those, detailed in RFC 2795. (It's
coming up to its thirteenth anniversary in a day or two. Very
appropriate.)

http://tools.ietf.org/html/rfc2795#section-4

Are you saying Python's integers should be stored as I-TAGs?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 16:14, jmfauth wxjmfa...@gmail.com wrote:
 On 28 mar, 15:38, Chris Angelico ros...@gmail.com wrote:









  On Fri, Mar 29, 2013 at 1:12 AM, jmfauth wxjmfa...@gmail.com wrote:
   This flexible string representation is so absurd that not only
   it does not know you can not write Western European Languages
   with latin-1, it penalizes you by just attempting to optimize
   latin-1. Shown in my multiple examples.

  PEP393 strings have two optimizations, or kinda three:

  1a) ASCII-only strings
  1b) Latin1-only strings
  2) BMP-only strings
  3) Everything else

  Options 1a and 1b are almost identical - I'm not sure what the detail
  is, but there's something flagging those strings that fit inside seven
  bits. (Something to do with optimizing encodings later?) Both are
  optimized down to a single byte per character.

  Option 2 is optimized to two bytes per character.

  Option 3 is stored in UTF-32.

  Once again, jmf, you are forgetting that option 2 is a safe and
  bug-free optimization.

  ChrisA

 As long as you are attempting to devide a set of characters in
 chunks and try to handle them seperately, it will never work.

 Read my previous post about the unicode transformation format.
 I know what pep393 does.

 jmf

Addendum.

This was you correctly percieved in one another thread.
You qualified it as a switch. Now you have to understand
from where this switch is coming from.

jmf

by toy with
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Ian Kelly

On Thu, Mar 28, 2013 at 7:01 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Any string method that takes a starting offset requires the method to
 walk the string byte-by-byte. I've even seen languages put responsibility
 for dealing with that onto the programmer: the start offset is given in
 *bytes*, not characters. I don't remember what language this was... it
 might have been Haskell? Whatever it was, it horrified me.

Go does this.  I remember because it came up in one of these threads,
where jmf (or was it Ranting Rick?) was praising Go for just getting
Unicode right.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Terry Reedy


On 3/28/2013 10:38 AM, Chris Angelico wrote:


PEP393 strings have two optimizations, or kinda three:

1a) ASCII-only strings
1b) Latin1-only strings
2) BMP-only strings
3) Everything else

Options 1a and 1b are almost identical - I'm not sure what the detail
is, but there's something flagging those strings that fit inside seven
bits. (Something to do with optimizing encodings later?)


Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding 
amounts to a simple copy of the internal bytes. I do not know if *all* 
the codecs for such encodings are 393-aware, but I do know that the 
utf-8 and latin-1 group are. This is one operation that 3.3+ does much 
faster than 3.2-



--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Ian Kelly

On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico ros...@gmail.com wrote:
 PEP393 strings have two optimizations, or kinda three:

 1a) ASCII-only strings
 1b) Latin1-only strings
 2) BMP-only strings
 3) Everything else

 Options 1a and 1b are almost identical - I'm not sure what the detail
 is, but there's something flagging those strings that fit inside seven
 bits. (Something to do with optimizing encodings later?) Both are
 optimized down to a single byte per character.

The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header.  The smaller header omits the utf8
pointer (which optionally points to an additional UTF-8 representation
of the string) and its associated length variable.  These are not
needed for ASCII-only strings because an ASCII string can be directly
interpreted as a UTF-8 string for the same result.  The smaller header
also omits the wstr_length field which, according to the PEP,
differs from length only if there are surrogate pairs in the
representation.  For an ASCII string, of course there would not be
any surrogate pairs.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 3:01 AM, Terry Reedy tjre...@udel.edu wrote:
 On 3/28/2013 10:38 AM, Chris Angelico wrote:

 PEP393 strings have two optimizations, or kinda three:

 1a) ASCII-only strings
 1b) Latin1-only strings
 2) BMP-only strings
 3) Everything else

 Options 1a and 1b are almost identical - I'm not sure what the detail
 is, but there's something flagging those strings that fit inside seven
 bits. (Something to do with optimizing encodings later?)


 Yes. 'Encoding' an ascii-only string to any ascii-compatible encoding
 amounts to a simple copy of the internal bytes. I do not know if *all* the
 codecs for such encodings are 393-aware, but I do know that the utf-8 and
 latin-1 group are. This is one operation that 3.3+ does much faster than
 3.2-

Thanks Terry. So that's not so much a representation difference as a
flag that costs little or nothing to retain, and can improve
performance in the encode later on. Sounds like a useful tweak to the
basics of flexible string representation, without being particularly
germane to jmf's complaints.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Ian Kelly

On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
 The flexible string representation takes the problem from the
 other side, it attempts to work with the characters by using
 their representations and it (can only) fails...

This is false.  As I've pointed out to you before, the FSR does not
divide characters up by representation.  It divides them up by
codepoint -- more specifically, by the *bit-width* of the codepoint.
We call the internal format of the string ASCII or Latin-1 or
UCS-2 for conciseness and a point of reference, but fundamentally
all of the FSR formats are simply byte arrays of *codepoints* -- you
know, those things you keep harping on.  The major optimization
performed by the FSR is to consistently truncate the leading zero
bytes from each codepoint when it is possible to do so safely.  But
regardless of to what extent this truncation is applied, the string is
*always* internally just an array of codepoints, and the same
algorithms apply for all representations.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

Chris,

Your problem with int/long, the start of this thread, is
very intersting.

This is not a demonstration, a proof, rather an illustration.

Assume you have a set of integers {0...9} and an operator,
let say, the addition.

Idea.
Just devide this set in two chunks, {0...4} and {5...9}
and work hardly to optimize the addition of 2 operands in
the sets {0...4}.

The problems.
- When optimizing {0...4}, your algorithm will most probably
weaken {5...9}.
- When using {5...9}, you do not benefit from your algorithm, you
will be penalized just by the fact you has optimized {0...4}
- And the first mistake, you are just penalized and impacted by the
fact you have to select in which subset you operands are when
working with {0...9}.

Very interestingly, working with the representation (bytes) of
these integers will not help. You have to consider conceptually
{0..9} as numbers.

Now, replace numbers by characters, bytes by encoded code points,
and you have qualitatively the flexible string representation.

In Unicode, there is one more level of abstraction: one conceptually
neither works with characters, nor with encoded code points, but
with unicode transformed formated entities. (see my previous post).

That means you can work very hardly on the bytes levels,
you will never solves the problem which is one level higher
in the unicode hierarchy:
character - code point - utf - bytes (implementation)
with the important fact that this construct can only go
from left to right.

---

In fact, by proposing a flexible representation of ints, you may
just fall in the same trap the flexible string representation
presents.



All this stuff is explained in good books about the coding of the
characters and/or unicode.
The unicode.org documention explains it too. It is a little
bit harder to discover, because the doc is presenting always
this stuff from a technical perspective.
You get it when reading a large part of the Unicode doc.

jmf



-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 3:55 AM, jmfauth wxjmfa...@gmail.com wrote:
 Assume you have a set of integers {0...9} and an operator,
 let say, the addition.

 Idea.
 Just devide this set in two chunks, {0...4} and {5...9}
 and work hardly to optimize the addition of 2 operands in
 the sets {0...4}.

 The problems.
 - When optimizing {0...4}, your algorithm will most probably
 weaken {5...9}.
 - When using {5...9}, you do not benefit from your algorithm, you
 will be penalized just by the fact you has optimized {0...4}
 - And the first mistake, you are just penalized and impacted by the
 fact you have to select in which subset you operands are when
 working with {0...9}.

 Very interestingly, working with the representation (bytes) of
 these integers will not help. You have to consider conceptually
 {0..9} as numbers.

Yeah, and there's an easy representation of those numbers. But let's
look at Python's representations of integers. I have a sneaking
suspicion something takes note of how large the number is before
deciding how to represent it. Look!

 sys.getsizeof(1)
14
 sys.getsizeof(12)
14
 sys.getsizeof(14)
14
 sys.getsizeof(18)
14
 sys.getsizeof(131)
18
 sys.getsizeof(130)
18
 sys.getsizeof(116)
16
 sys.getsizeof(112345)
1660
 sys.getsizeof(1123456)
16474

Small numbers are represented more compactly than large ones! And it's
not like in REXX, where all numbers are stored as strings.

Go fork CPython and make the changes you suggest. Then run real-world
code on it and see how it performs. Or at very least, run plausible
benchmarks like the strings benchmark from the standard tests.

My original post about integers was based on two comparisons: Python 2
and Pike. Both languages have an optimization for small integers
(where small is within machine word - on rechecking some of my
stats, I find that I perhaps should have used a larger offset, as the
64-bit Linux Python I used appeared to be a lot faster than it should
have been), which Python 3 doesn't have. Real examples, real
statistics, real discussion. (I didn't include Pike stats in what I
posted, for two reasons: firstly, it would require a reworking of the
code, rather than simply run this under both interpreters; and
secondly, Pike performance is completely different from CPython
performance, and is non-comparable. Pike is more similar to PyPy, able
to compile - in certain circumstances - to machine code. So the
comparisons were Py2 vs Py3.)

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 17:33, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
  The flexible string representation takes the problem from the
  other side, it attempts to work with the characters by using
  their representations and it (can only) fails...

 This is false.  As I've pointed out to you before, the FSR does not
 divide characters up by representation.  It divides them up by
 codepoint -- more specifically, by the *bit-width* of the codepoint.
 We call the internal format of the string ASCII or Latin-1 or
 UCS-2 for conciseness and a point of reference, but fundamentally
 all of the FSR formats are simply byte arrays of *codepoints* -- you
 know, those things you keep harping on.  The major optimization
 performed by the FSR is to consistently truncate the leading zero
 bytes from each codepoint when it is possible to do so safely.  But
 regardless of to what extent this truncation is applied, the string is
 *always* internally just an array of codepoints, and the same
 algorithms apply for all representations.

-

You know, we can discuss this ad nauseam. What is important
is Unicode.

You have transformed Python back in an ascii oriented product.

If Python had imlemented Unicode correctly, there would
be no difference in using an a, é, € or any character,
what the narrow builds did.

If I am practically the only one, who speakes /discusses about
this, I can ensure you, this has been noticed.

Now, it's time to prepare the Asparagus, the jambon cru
and a good bottle a dry white wine.

jmf




-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 4:48 AM, jmfauth wxjmfa...@gmail.com wrote:
 If Python had imlemented Unicode correctly, there would
 be no difference in using an a, é, € or any character,
 what the narrow builds did.

I'm not following your grammar perfectly here, but if Python were
implementing Unicode correctly, there would be no difference between
any of those characters, which is the way a *wide* build works. With a
narrow build, there is a difference between BMP and non-BMP
characters.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread rurpy

On 03/28/2013 01:48 AM, Steven D'Aprano wrote:
 On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:
 More seriously Ive never seen anyone -- cause or person -- aided by
 the use of excessively strong language.
 
 Of course not. By definition, if it helps, it wasn't *excessively* strong 
 language.

For someone who delights in pointing out the logical errors 
of others you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.

 IOW I repeat my support for Ned's request: Ad hominiem attacks are not
 welcome, irrespective of the context/provocation.
 
 Insults are not ad hominem attacks.

Insults may or may not be ad hominem attacks.  There is nothing 
mutually exclusive about those terms.

 You sir, are a bounder and a cad. Furthermore, your 
 argument is wrong, because of reasons.
 
 may very well be an insult, but it also may be correct, and the reasons 
 logically valid.

Those are two different statements.  The first is an ad hominem 
attack and is not welcome here.  The second is an acceptable 
response.

 Your argument is wrong, because you are a bounder 
 and a cad.
 
 is an ad hominem fallacy, because even bounders and cads may tell the 
 truth occasionally, or be correct by accident.

That it is a fallacy does not mean it is not also an attack.

 I find it interesting that nobody has yet felt the need to defend JMF, 
 and tell me I was factually incorrect about him (as opposed to merely 
 impolite or mean-spirited).

Nothing interesting about it at all.  Most of us (perhaps
unlike you) are not interested in discussing the personal
characteristics of posters here (in contrast to discussing
the technical opinions they post).

Further, liar is both so non-objective and so pejoratively 
emotive that it is a word much more likely to be used by 
someone interested in trolling than in a serious discussion, 
so most sensible people here likely would not bite.

[...] 
 I would rather that you called me a liar to my face 
 and gave me the opportunity to respond, than for you to ignore everything 
 I said.

Even if you personally would prefer someone to respond by 
calling you a liar, your personal preferences do not form 
a basis for desirable posting behavior here.

But again you're creating a false dichotomy.  Those are not 
the only two choices.  A third choice is neither ignore you 
nor call you a liar but to factually point out where you are 
wrong, or (if it is a matter of opinion) why one holds a 
different opinion.  That was the point Ned Deily was making 
I believe.

 I hope that we all agree that we want a nice, friendly, productive 
 community where everyone is welcome. 

I hope so too but it is likely that some people want a place 
to develop and assert some sense of influence, engage in verbal 
duels, instigate arguments, etc.  That can be true of regulars
here as well as drive-by posters.

 But some people simply cannot or 
 will not behave in ways that are compatible with those community values. 
 There are some people whom we *do not want here* 

In other words, everyone is NOT welcome.

 -- spoilers and messers, 
 vandals and spammers and cheats and liars and trolls and crackpots of all 
 sorts. 

Where those terms are defined by you and a handful of other 
voracious posters.  Troll in particular is often used to 
mean someone who disagrees with the borg mind here, or who 
says anything negative about Python, or who due attitude or
lack of full English fluency do not express themselves in 
a sufficiently submissive way.

 We only disagree as to the best way to make it clear to them that 
 they are not welcome so long as they continue their behaviour.

No, we disagree on who fits those definitions and even 
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious that the original 
posts are.

 [1] Although sadly, given the reality of communication on the Internet, 
 sometimes kill-filing is the least-worst option.

Please, please, killfile jmfauth, ranting rick, xaw lee and 
anyone else you don't like so that the rest of us can be spared 
the orders of magnitude larger, more disruptive and more offensive
posts generated by your (plural) responses to them.

Believe or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think. 
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Ned Deily

In article 
captjjmozdhsmuqx7vcpuii2bwrcnzcx76pm-6unb1duq4do...@mail.gmail.com,
 Chris Angelico ros...@gmail.com wrote:
 I'd rather this list have some vinegar than it devolve into
 uselessness. Or, worse, if there's a hard-and-fast rule about
 courtesy, devolve into aspartame... everyone's courteous in words but
 hates each other underneath. Or am I taking the analogy too far? :)

I think you are positing false choices.  No one - at least I'm not - is 
advocating to avoid challenging false or misleading statements in the 
interests of maintaining some false see how well we all get along 
facade.  The point is we can have meaningful, hard-nosed discussions 
without resorting to personal insults, i.e. flaming.  I think the 
discussion in this topic over the past 24 hours or so demonstrates that.

-- 
 Ned Deily,
 n...@acm.org

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 18:55, Chris Angelico ros...@gmail.com wrote:
 On Fri, Mar 29, 2013 at 4:48 AM, jmfauth wxjmfa...@gmail.com wrote:
  If Python had imlemented Unicode correctly, there would
  be no difference in using an a, é, € or any character,
  what the narrow builds did.

 I'm not following your grammar perfectly here, but if Python were
 implementing Unicode correctly, there would be no difference between
 any of those characters, which is the way a *wide* build works. With a
 narrow build, there is a difference between BMP and non-BMP
 characters.

 ChrisA



The wide build (I never used) is in my mind as correct as
the narrow build. It just covers a different range in unicode
(the whole range).

Claiming that the narrow build is buggy, because it does not
cover the whole unicode is not correct.

Unicode does not stipulate, one has to cover the whole range.
Unicode expects that every character in a range behaves the same
way. This is clearly not realized with the flexible string
representation. An user should not be somehow penalized
simply because it not an ascii user.

If you take the fonts in consideration (btw a problem nobody
is speaking about) and you ensure your application, toolkit, ...
is MES-X or WGL4 compliant, your are also deliberately (and
correctly) working with a restriced unicode range.

jmf


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Benjamin Kaplan

On Thu, Mar 28, 2013 at 10:48 AM, jmfauth wxjmfa...@gmail.com wrote:
 On 28 mar, 17:33, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
  The flexible string representation takes the problem from the
  other side, it attempts to work with the characters by using
  their representations and it (can only) fails...

 This is false.  As I've pointed out to you before, the FSR does not
 divide characters up by representation.  It divides them up by
 codepoint -- more specifically, by the *bit-width* of the codepoint.
 We call the internal format of the string ASCII or Latin-1 or
 UCS-2 for conciseness and a point of reference, but fundamentally
 all of the FSR formats are simply byte arrays of *codepoints* -- you
 know, those things you keep harping on.  The major optimization
 performed by the FSR is to consistently truncate the leading zero
 bytes from each codepoint when it is possible to do so safely.  But
 regardless of to what extent this truncation is applied, the string is
 *always* internally just an array of codepoints, and the same
 algorithms apply for all representations.

 -

 You know, we can discuss this ad nauseam. What is important
 is Unicode.

 You have transformed Python back in an ascii oriented product.

 If Python had imlemented Unicode correctly, there would
 be no difference in using an a, é, € or any character,
 what the narrow builds did.

 If I am practically the only one, who speakes /discusses about
 this, I can ensure you, this has been noticed.

 Now, it's time to prepare the Asparagus, the jambon cru
 and a good bottle a dry white wine.

 jmf


You still have yet to explain how Python's string representation is
wrong. Just how it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

2) Encodings  are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that is defined for all characters in the Unicode
standard (the others being the little and big endian variants of
UTF-16 and UTF-32).

4) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code-points in binary-coded
decimal and it would be a Unicode character string.

5) The String type that .NET and Java (and unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code point. I know this
is the case because in all of these, the length of \u1f435 is 2 even
though it only consists of one character.

6) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Ethan Furman

On 03/28/2013 12:54 PM, ru...@yahoo.com wrote:

On 03/28/2013 01:48 AM, Steven D'Aprano wrote:

On Wed, 27 Mar 2013 22:42:18 -0700, rusi wrote:

For someone who delights in pointing out the logical errors
of others you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.

It can also be the case when language less strong would be useless.

Further, liar is both so non-objective and so pejoratively
emotive that it is a word much more likely to be used by
someone interested in trolling than in a serious discussion,
so most sensible people here likely would not bite.

Non-objective? If today poster B says X, and tomorrow poster B says s/he was unaware of X until just now, is not liar
a reasonable conclusion?

I hope that we all agree that we want a nice, friendly, productive
community where everyone is welcome.

I hope so too but it is likely that some people want a place
to develop and assert some sense of influence, engage in verbal
duels, instigate arguments, etc. That can be true of regulars
here as well as drive-by posters.

But some people simply cannot or
will not behave in ways that are compatible with those community values.
There are some people whom we *do not want here*

In other words, everyone is NOT welcome.

Correct. Do you not agree?

-- spoilers and messers,
vandals and spammers and cheats and liars and trolls and crackpots of all
sorts.

Where those terms are defined by you and a handful of other
voracious posters. Troll in particular is often used to
mean someone who disagrees with the borg mind here, or who
says anything negative about Python, or who due attitude or
lack of full English fluency do not express themselves in
a sufficiently submissive way.

I cannot speak for the borg mind, but for myself a troll is anyone who continually posts rants (such as RR XL) or who
continuously hijacks threads to talk about their pet peeve (such as jmf).

We only disagree as to the best way to make it clear to them that
they are not welcome so long as they continue their behaviour.

No, we disagree on who fits those definitions and even
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious that the original
posts are.

I completely disagree, and I am grateful to those who bother to take the time to continually point out the errors from
those posters and to warn newcomers that those posters should not be believed.

Believe or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think.

If one of my first few posts on c.l.p netted a response from a troll I would greatly appreciate a reply from one of the
regulars saying that was a troll so I didn't waste time trying to use whatever they said, or be concerned that the
language I was trying to use and learn was horribly flawed.

If the truthsquad posts are so offensive to you, why don't you kill-file them?

--
~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 21:29, Benjamin Kaplan benjamin.kap...@case.edu wrote:
 On Thu, Mar 28, 2013 at 10:48 AM, jmfauth wxjmfa...@gmail.com wrote:
  On 28 mar, 17:33, Ian Kelly ian.g.ke...@gmail.com wrote:
  On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
   The flexible string representation takes the problem from the
   other side, it attempts to work with the characters by using
   their representations and it (can only) fails...

  This is false.  As I've pointed out to you before, the FSR does not
  divide characters up by representation.  It divides them up by
  codepoint -- more specifically, by the *bit-width* of the codepoint.
  We call the internal format of the string ASCII or Latin-1 or
  UCS-2 for conciseness and a point of reference, but fundamentally
  all of the FSR formats are simply byte arrays of *codepoints* -- you
  know, those things you keep harping on.  The major optimization
  performed by the FSR is to consistently truncate the leading zero
  bytes from each codepoint when it is possible to do so safely.  But
  regardless of to what extent this truncation is applied, the string is
  *always* internally just an array of codepoints, and the same
  algorithms apply for all representations.

  -

  You know, we can discuss this ad nauseam. What is important
  is Unicode.

  You have transformed Python back in an ascii oriented product.

  If Python had imlemented Unicode correctly, there would
  be no difference in using an a, é, € or any character,
  what the narrow builds did.

  If I am practically the only one, who speakes /discusses about
  this, I can ensure you, this has been noticed.

  Now, it's time to prepare the Asparagus, the jambon cru
  and a good bottle a dry white wine.

  jmf

 You still have yet to explain how Python's string representation is
 wrong. Just how it isn't optimal for one specific case. Here's how I
 understand it:

 1) Strings are sequences of stuff. Generally, we talk about strings as
 either sequences of bytes or sequences of characters.

 2) Unicode is a format used to represent characters. Therefore,
 Unicode strings are character strings, not byte strings.

 2) Encodings  are functions that map characters to bytes. They
 typically also define an inverse function that converts from bytes
 back to characters.

 3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
 mentioned in the previous point. It happens to be one of the five
 standard encodings that is defined for all characters in the Unicode
 standard (the others being the little and big endian variants of
 UTF-16 and UTF-32).

 4) The internal representation of a character string DOES NOT MATTER.
 All that matters is that the API represents it as a string of
 characters, regardless of the representation. We could implement
 character strings by putting the Unicode code-points in binary-coded
 decimal and it would be a Unicode character string.

 5) The String type that .NET and Java (and unicode type in Python
 narrow builds) use is not a character string. It is a string of
 shorts, each of which corresponds to a UTF-16 code point. I know this
 is the case because in all of these, the length of \u1f435 is 2 even
 though it only consists of one character.

 6) The new string representation in Python 3.3 can successfully
 represent all characters in the Unicode standard. The actual number of
 bytes that each character consumes is invisible to the user.

--


I shew enough examples. As soon as you are using non latin-1 chars
your optimization just became irrelevant and not only this, you
are penalized.

I'm sorry, saying Python now is just covering the whole unicode
range is not a valuable excuse. I prefer a correct version with
a narrower range of chars, especially if this range represents
the daily used chars.

I can go a step further, if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all thesee languages/scripts. What about cp1252 [*]?
Does this not remind somthing?

Python can do better, it only succeeds to do worth!

[*] yes, I kwnow, internally 

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On 28 mar, 22:11, jmfauth wxjmfa...@gmail.com wrote:
 On 28 mar, 21:29, Benjamin Kaplan benjamin.kap...@case.edu wrote:









  On Thu, Mar 28, 2013 at 10:48 AM, jmfauth wxjmfa...@gmail.com wrote:
   On 28 mar, 17:33, Ian Kelly ian.g.ke...@gmail.com wrote:
   On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
The flexible string representation takes the problem from the
other side, it attempts to work with the characters by using
their representations and it (can only) fails...

   This is false.  As I've pointed out to you before, the FSR does not
   divide characters up by representation.  It divides them up by
   codepoint -- more specifically, by the *bit-width* of the codepoint.
   We call the internal format of the string ASCII or Latin-1 or
   UCS-2 for conciseness and a point of reference, but fundamentally
   all of the FSR formats are simply byte arrays of *codepoints* -- you
   know, those things you keep harping on.  The major optimization
   performed by the FSR is to consistently truncate the leading zero
   bytes from each codepoint when it is possible to do so safely.  But
   regardless of to what extent this truncation is applied, the string is
   *always* internally just an array of codepoints, and the same
   algorithms apply for all representations.

   -

   You know, we can discuss this ad nauseam. What is important
   is Unicode.

   You have transformed Python back in an ascii oriented product.

   If Python had imlemented Unicode correctly, there would
   be no difference in using an a, é, € or any character,
   what the narrow builds did.

   If I am practically the only one, who speakes /discusses about
   this, I can ensure you, this has been noticed.

   Now, it's time to prepare the Asparagus, the jambon cru
   and a good bottle a dry white wine.

   jmf

  You still have yet to explain how Python's string representation is
  wrong. Just how it isn't optimal for one specific case. Here's how I
  understand it:

  1) Strings are sequences of stuff. Generally, we talk about strings as
  either sequences of bytes or sequences of characters.

  2) Unicode is a format used to represent characters. Therefore,
  Unicode strings are character strings, not byte strings.

  2) Encodings  are functions that map characters to bytes. They
  typically also define an inverse function that converts from bytes
  back to characters.

  3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
  mentioned in the previous point. It happens to be one of the five
  standard encodings that is defined for all characters in the Unicode
  standard (the others being the little and big endian variants of
  UTF-16 and UTF-32).

  4) The internal representation of a character string DOES NOT MATTER.
  All that matters is that the API represents it as a string of
  characters, regardless of the representation. We could implement
  character strings by putting the Unicode code-points in binary-coded
  decimal and it would be a Unicode character string.

  5) The String type that .NET and Java (and unicode type in Python
  narrow builds) use is not a character string. It is a string of
  shorts, each of which corresponds to a UTF-16 code point. I know this
  is the case because in all of these, the length of \u1f435 is 2 even
  though it only consists of one character.

  6) The new string representation in Python 3.3 can successfully
  represent all characters in the Unicode standard. The actual number of
  bytes that each character consumes is invisible to the user.

 --

 I shew enough examples. As soon as you are using non latin-1 chars
 your optimization just became irrelevant and not only this, you
 are penalized.

 I'm sorry, saying Python now is just covering the whole unicode
 range is not a valuable excuse. I prefer a correct version with
 a narrower range of chars, especially if this range represents
 the daily used chars.

 I can go a step further, if I wish to write an application for
 Western European users, I'm better served if I'm using a coding
 scheme covering all thesee languages/scripts. What about cp1252 [*]?
 Does this not remind somthing?

 Python can do better, it only succeeds to do worth!

 [*] yes, I kwnow, internally 

 jmf

-

Addendum.

And you kwow what? Py34 will suffer from the same desease.
You are spending your time in improving chunks of bytes,
when the problem is elsewhere.
In fact you are working for peanuts, eg the replacing method.


If you are not satisfied with my examples, just pick up
the examples of GvR (ascii-string) on the bug tracker, timeit
them and you will see there is already a problem.

Better, timeit them afeter having replaced his ascii-strings
with non ascii characters...

jmf

and you will see, there is
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 7:26 AM, jmfauth wxjmfa...@gmail.com wrote:
 The wide build (I never used) is in my mind as correct as
 the narrow build. It just covers a different range in unicode
 (the whole range).

Actually it does; it covers all of the Unicode range, by using
(effectively) UTF-16. Characters that cannot be represented in one
16-bit number are represented in two. That's not just covering a
different range. It's being buggy. And it's creating a way for code to
unexpectedly behave fundamentally differently on Windows and Linux
(since the most common builds for Windows were narrow and for Linux
were wide). This is a Bad Thing for Python.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread MRAB


On 28/03/2013 21:11, jmfauth wrote:

On 28 mar, 21:29, Benjamin Kaplan benjamin.kap...@case.edu wrote:

On Thu, Mar 28, 2013 at 10:48 AM, jmfauth wxjmfa...@gmail.com wrote:
 On 28 mar, 17:33, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
  The flexible string representation takes the problem from the
  other side, it attempts to work with the characters by using
  their representations and it (can only) fails...

 This is false.  As I've pointed out to you before, the FSR does not
 divide characters up by representation.  It divides them up by
 codepoint -- more specifically, by the *bit-width* of the codepoint.
 We call the internal format of the string ASCII or Latin-1 or
 UCS-2 for conciseness and a point of reference, but fundamentally
 all of the FSR formats are simply byte arrays of *codepoints* -- you
 know, those things you keep harping on.  The major optimization
 performed by the FSR is to consistently truncate the leading zero
 bytes from each codepoint when it is possible to do so safely.  But
 regardless of to what extent this truncation is applied, the string is
 *always* internally just an array of codepoints, and the same
 algorithms apply for all representations.

 -

 You know, we can discuss this ad nauseam. What is important
 is Unicode.

 You have transformed Python back in an ascii oriented product.

 If Python had imlemented Unicode correctly, there would
 be no difference in using an a, é, € or any character,
 what the narrow builds did.

 If I am practically the only one, who speakes /discusses about
 this, I can ensure you, this has been noticed.

 Now, it's time to prepare the Asparagus, the jambon cru
 and a good bottle a dry white wine.

 jmf

You still have yet to explain how Python's string representation is
wrong. Just how it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

2) Encodings  are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that is defined for all characters in the Unicode
standard (the others being the little and big endian variants of
UTF-16 and UTF-32).

4) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code-points in binary-coded
decimal and it would be a Unicode character string.

5) The String type that .NET and Java (and unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code point. I know this
is the case because in all of these, the length of \u1f435 is 2 even
though it only consists of one character.

6) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.


--


I shew enough examples. As soon as you are using non latin-1 chars
your optimization just became irrelevant and not only this, you
are penalized.

I'm sorry, saying Python now is just covering the whole unicode
range is not a valuable excuse. I prefer a correct version with
a narrower range of chars, especially if this range represents
the daily used chars.

I can go a step further, if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all thesee languages/scripts. What about cp1252 [*]?
Does this not remind somthing?

Python can do better, it only succeeds to do worth!

[*] yes, I kwnow, internally 


If you're that concerned about it, why don't you modify the source code so
that the string representation chooses between only 2 bytes and 4 bytes per
codepoint, and then see whether that you prefer that situation. How do
the memory usage and speed compare?
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Benjamin Kaplan

On Thu, Mar 28, 2013 at 2:11 PM, jmfauth wxjmfa...@gmail.com wrote:
 On 28 mar, 21:29, Benjamin Kaplan benjamin.kap...@case.edu wrote:
 On Thu, Mar 28, 2013 at 10:48 AM, jmfauth wxjmfa...@gmail.com wrote:
  On 28 mar, 17:33, Ian Kelly ian.g.ke...@gmail.com wrote:
  On Thu, Mar 28, 2013 at 7:34 AM, jmfauth wxjmfa...@gmail.com wrote:
   The flexible string representation takes the problem from the
   other side, it attempts to work with the characters by using
   their representations and it (can only) fails...

  This is false.  As I've pointed out to you before, the FSR does not
  divide characters up by representation.  It divides them up by
  codepoint -- more specifically, by the *bit-width* of the codepoint.
  We call the internal format of the string ASCII or Latin-1 or
  UCS-2 for conciseness and a point of reference, but fundamentally
  all of the FSR formats are simply byte arrays of *codepoints* -- you
  know, those things you keep harping on.  The major optimization
  performed by the FSR is to consistently truncate the leading zero
  bytes from each codepoint when it is possible to do so safely.  But
  regardless of to what extent this truncation is applied, the string is
  *always* internally just an array of codepoints, and the same
  algorithms apply for all representations.

  -

  You know, we can discuss this ad nauseam. What is important
  is Unicode.

  You have transformed Python back in an ascii oriented product.

  If Python had imlemented Unicode correctly, there would
  be no difference in using an a, é, € or any character,
  what the narrow builds did.

  If I am practically the only one, who speakes /discusses about
  this, I can ensure you, this has been noticed.

  Now, it's time to prepare the Asparagus, the jambon cru
  and a good bottle a dry white wine.

  jmf

 You still have yet to explain how Python's string representation is
 wrong. Just how it isn't optimal for one specific case. Here's how I
 understand it:

 1) Strings are sequences of stuff. Generally, we talk about strings as
 either sequences of bytes or sequences of characters.

 2) Unicode is a format used to represent characters. Therefore,
 Unicode strings are character strings, not byte strings.

 2) Encodings  are functions that map characters to bytes. They
 typically also define an inverse function that converts from bytes
 back to characters.

 3) UTF-8 IS NOT UNICODE. It is an encoding- one of those functions I
 mentioned in the previous point. It happens to be one of the five
 standard encodings that is defined for all characters in the Unicode
 standard (the others being the little and big endian variants of
 UTF-16 and UTF-32).

 4) The internal representation of a character string DOES NOT MATTER.
 All that matters is that the API represents it as a string of
 characters, regardless of the representation. We could implement
 character strings by putting the Unicode code-points in binary-coded
 decimal and it would be a Unicode character string.

 5) The String type that .NET and Java (and unicode type in Python
 narrow builds) use is not a character string. It is a string of
 shorts, each of which corresponds to a UTF-16 code point. I know this
 is the case because in all of these, the length of \u1f435 is 2 even
 though it only consists of one character.

 6) The new string representation in Python 3.3 can successfully
 represent all characters in the Unicode standard. The actual number of
 bytes that each character consumes is invisible to the user.

 --


 I shew enough examples. As soon as you are using non latin-1 chars
 your optimization just became irrelevant and not only this, you
 are penalized.

 I'm sorry, saying Python now is just covering the whole unicode
 range is not a valuable excuse. I prefer a correct version with
 a narrower range of chars, especially if this range represents
 the daily used chars.

 I can go a step further, if I wish to write an application for
 Western European users, I'm better served if I'm using a coding
 scheme covering all thesee languages/scripts. What about cp1252 [*]?
 Does this not remind somthing?

 Python can do better, it only succeeds to do worth!

 [*] yes, I kwnow, internally 

 jmf

By that logic, we should all be using ASCII because it's correct for
the 127 characters that I (as an English speaker) use, and therefore
it's all that we should care about. I don't care if é counts as two
characters, it's faster and more memory efficient for all of my
strings to just count bytes. There are certain domains where
characters outside the basic multilingual plane are used. Python's job
is to be correct in all of those circumstances, not just the ones you
care about.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread 88888 Dihedral

Chris Angelico於 2013年3月28日星期四UTC+8上午11時40分17秒寫道：
 On Thu, Mar 28, 2013 at 2:18 PM, Ethan Furman et...@stoneleaf.us wrote:
 
  Has anybody else thought that [jmf's] last few responses are starting to 
  sound
 
  bot'ish?
 
 
 
 Yes, I did wonder. It's like he and Dihedral have been trading
 
 accounts sometimes. Hey, Dihedral, I hear there's a discussion of
 
 Unicode and PEP 393 and Python 3.3 and Unicode and lots of keywords
 
 for you to trigger on and Python and bots are funny and this text is
 
 almost grammatical!
 
 
 
 There. Let's see if he takes the bait.
 
 
 
 ChrisA

Well, we need some cheap ram to hold 4 bytes per character 

in a text segment to be observed. 

For those not to be observed or shown, the old way still works.

Windows got this job done right to collect taxes in areas 
of different languages.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Terry Reedy


On 3/28/2013 4:26 PM, jmfauth wrote:

Please provide references for your assertions. I have read the unicode 
standard, parts more than once, and your assertions contradict my memory.



Unicode does not stipulate, one has to cover the whole range.


I believe it does. As I remember, the recognized encodings all encode 
the entire unicode codepoint range



Unicode expects that every character in a range behaves the same
way.


I have no idea what you mean by 'same way'. Each codepoint is supposed 
to behave differently in some way. That is the reason for having 
multiple codepoints. One causes an 'a' to appear, another a 'b'. Indeed, 
the standard define multiple categories of codepoints and chars in 
different categories are supposed to act differently (or be treated 
differently). Glyphic chars versus control chars are one example.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 10:53 AM, Dennis Lee Bieber
wlfr...@ix.netcom.com wrote:
 On Wed, 27 Mar 2013 23:12:21 -0700, Ethan Furman et...@stoneleaf.us
 declaimed the following in gmane.comp.python.general:


 At some point we have to stop being gentle / polite / politically correct 
 and call a shovel a shovel... er, spade.

 Call it an Instrument For the Transplantation of Dirt

 (Is an antique Steam Shovel ever a Steam Spade?)

I don't know, but I'm pretty sure there's a private detective who
wouldn't appreciate being called Sam Shovel.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

2013-03-28 Thread Mark Lawrence


On 28/03/2013 23:53, Dennis Lee Bieber wrote:

On Wed, 27 Mar 2013 23:12:21 -0700, Ethan Furman et...@stoneleaf.us
declaimed the following in gmane.comp.python.general:



At some point we have to stop being gentle / polite / politically correct and 
call a shovel a shovel... er, spade.


Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)



Surely you can spade a lot more things than dirt?

--
If you're using GoogleCrap™ please read this 
http://wiki.python.org/moin/GoogleGroupsPython.


Mark Lawrence

--
http://mail.python.org/mailman/listinfo/python-list

Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

On Thu, 28 Mar 2013 10:11:59 -0600, Ian Kelly wrote:

 On Thu, Mar 28, 2013 at 8:38 AM, Chris Angelico ros...@gmail.com
 wrote:
 PEP393 strings have two optimizations, or kinda three:

 1a) ASCII-only strings
 1b) Latin1-only strings
 2) BMP-only strings
 3) Everything else

 Options 1a and 1b are almost identical - I'm not sure what the detail
 is, but there's something flagging those strings that fit inside seven
 bits. (Something to do with optimizing encodings later?) Both are
 optimized down to a single byte per character.
 
 The only difference for ASCII-only strings is that they are kept in a
 struct with a smaller header.  The smaller header omits the utf8 pointer
 (which optionally points to an additional UTF-8 representation of the
 string) and its associated length variable.  These are not needed for
 ASCII-only strings because an ASCII string can be directly interpreted
 as a UTF-8 string for the same result.  The smaller header also omits
 the wstr_length field which, according to the PEP, differs from
 length only if there are surrogate pairs in the representation.  For an
 ASCII string, of course there would not be any surrogate pairs.


I wonder why they need care about surrogate pairs? 

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only 
strings. It's only strings in the SMPs that could need surrogate pairs, 
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?

I also wonder why the implementation bothers keeping a UTF-8 
representation. That sounds like premature optimization to me. Surely you 
only need it when writing to a file with UTF-8 encoding? For most 
strings, that will never happen.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Thu, 28 Mar 2013 12:54:20 -0700, rurpy wrote:

 Even if you personally would prefer someone to respond by calling you a
 liar, your personal preferences do not form a basis for desirable
 posting behavior here.

Whereas yours apparently are.

Thanks for the feedback, I'll take it under advisement.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
 strings. It's only strings in the SMPs that could need surrogate pairs,
 and they don't need them in Python's implementation since it's a full 32-
 bit implementation. So where do the surrogate pairs come into this?

PEP 393 says:

wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.


If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...

 I also wonder why the implementation bothers keeping a UTF-8
 representation. That sounds like premature optimization to me. Surely you
 only need it when writing to a file with UTF-8 encoding? For most
 strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not else. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?

Minor nitpick, btw:
 (in which cast wstr_length differs form length)
Should be in which case and from. Who has the power to correct
typos in PEPs?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

2013-03-28 Thread Mark Lawrence


On 29/03/2013 00:54, Chris Angelico wrote:


Minor nitpick, btw:

(in which cast wstr_length differs form length)

Should be in which case and from. Who has the power to correct
typos in PEPs?

ChrisA



Sneak it in here? http://bugs.python.org/issue13604

--
If you're using GoogleCrap™ please read this 
http://wiki.python.org/moin/GoogleGroupsPython.


Mark Lawrence

--
http://mail.python.org/mailman/listinfo/python-list

Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

On Fri, Mar 29, 2013 at 12:03 PM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 On 29/03/2013 00:54, Chris Angelico wrote:
 Minor nitpick, btw:

 (in which cast wstr_length differs form length)

 Should be in which case and from. Who has the power to correct
 typos in PEPs?

 Sneak it in here? http://bugs.python.org/issue13604

Ah! Turns out it's already been fixed; a reword of that section, as
shown in the attached files, no longer has the parenthesis, and thus
its typos.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

2013-03-28 Thread MRAB


On 29/03/2013 00:54, Chris Angelico wrote:

On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?


PEP 393 says:

wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.


If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...


I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.


... the UTF-8 version. It'll keep it if it has it, and not else. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?


You could ask the same question about any encoding.

It's only an issue if it's passed to a C function which expects a
null-terminated string.


Minor nitpick, btw:

(in which cast wstr_length differs form length)

Should be in which case and from. Who has the power to correct
typos in PEPs?



--
http://mail.python.org/mailman/listinfo/python-list

Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

On Fri, 29 Mar 2013 11:54:41 +1100, Chris Angelico wrote:

 On Fri, Mar 29, 2013 at 11:39 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
 strings. It's only strings in the SMPs that could need surrogate pairs,
 and they don't need them in Python's implementation since it's a full
 32- bit implementation. So where do the surrogate pairs come into this?
 
 PEP 393 says:
 
 wstr_length, wstr: representation in platform's wchar_t
 (null-terminated). If wchar_t is 16-bit, this form may use surrogate
 pairs (in which cast wstr_length differs form length). wstr_length
 differs from length only if there are surrogate pairs in the
 representation.
 
 utf8_length, utf8: UTF-8 representation (null-terminated).
 
 data: shortest-form representation of the unicode string. The string is
 null-terminated (in its respective representation).
 
 All three representations are optional, although the data form is
 considered the canonical representation which can be absent only while
 the string is being created. If the representation is absent, the
 pointer is NULL, and the corresponding length field may contain
 arbitrary data.
 

All the words are in English (well, most of them...) but what does it 
mean?

 If the string was created from a wchar_t string, that string will be
 retained, and presumably can be used to re-output the original for a
 clean and fast round-trip.

Under what circumstances will a string be created from a wchar_t string? 
How, and why, would such a string be created? Why would Python still 
support strings containing surrogates when it now has a nice, shiny, 
surrogate-free flexible representation?



 I also wonder why the implementation bothers keeping a UTF-8
 representation. That sounds like premature optimization to me. Surely
 you only need it when writing to a file with UTF-8 encoding? For most
 strings, that will never happen.
 
 ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
 of content will go out in the same encoding it came in in, so it makes
 sense to hang onto it where possible.

Not to me. That almost doubles the size of the string, on the off-chance 
that you'll need the UTF-8 encoding. Which for many uses, you don't, and 
even if you do, it seems like premature optimization to keep it around 
just in case. Encoding to UTF-8 will be fast for small N, and for large 
N, why carry around (potentially) multiple megabytes of duplicated data 
just in case the encoded version is needed some time?


 Though, from the same quote: The UTF-8 representation is
 null-terminated. Does this mean that it can't be used if there might be
 a \0 in the string?
 
 Minor nitpick, btw:
 (in which cast wstr_length differs form length)
 Should be in which case and from. Who has the power to correct typos
 in PEPs?
 
 ChrisA

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Surrogate pairs in new flexible string representation [was Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

On Fri, Mar 29, 2013 at 1:37 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Under what circumstances will a string be created from a wchar_t string?
 How, and why, would such a string be created? Why would Python still
 support strings containing surrogates when it now has a nice, shiny,
 surrogate-free flexible representation?

Strings are created from some form of content. If not from another
Python string, then - most likely - it's from a stream of bytes. If
from a C API that returns wchar_t, then it'd make sense to have that
form around.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]


Steven D'Aprano:


Some string operations need to inspect every character, e.g. str.upper().
Even for them, the increased complexity of a variable-width encoding
costs. It's not sufficient to walk the string inspecting a fixed 1, 2 or
4 bytes per character. You have to walk the string grabbing 1 byte at a
time, and then decide whether you need another 1, 2 or 3 bytes. Even
though it's still O(N), the added bit-masking and overhead of variable-
width encoding adds to the overall cost.


   It does add to implementation complexity but should only add a small 
amount of time.


   To compare costs, I am using the text of the web site 
http://www.mofa.go.jp/mofaj/ since it has a reasonable amount (10%) of 
multi-byte characters. Since the document fits in the the BMP, Python 
would choose a 2-byte wide implementation so I am emulating that choice 
with a very simple 16-bit table-based upper-caser. Real Unicode case 
conversion code is more concerned with edge cases like Turkic and 
Lithuanian locales and Greek combining characters and also allowing for 
measurement/reallocation for the cases where the result is 
smaller/larger. See, for example, glib's real_toupper in 
https://git.gnome.org/browse/glib/tree/glib/guniprop.c


   Here is some simplified example code that implements upper-casing 
over 16-bit wide (utf16_up) and UTF-8 (utf8_up) buffers:

http://www.scintilla.org/UTF8Up.cxx
   Since I didn't want to spend too much time writing code it only 
handles the BMP and doesn't have upper-case table entries outside ASCII 
for now. If this was going to be worked on further to be made 
maintainable, most of the masking and so forth would be in macros 
similar to UTF8_COMPUTE/UTF8_GET in glib.
   The UTF-8 case ranges from around 5% slower on average in a 32 bit 
release build (VC2012 on an i7 870) to averaging a little faster in a 
64-bit build. They're both around a billion characters per-second.


C:\u\hg\UpUTF\UpUTF..\x64\Release\UpUTF.exe
Time taken for UTF8 of 80449=0.006528
Time taken for UTF16 of 71525=0.006610
Relative time taken UTF8/UTF16 0.987581


Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the start offset is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.


   It doesn't horrify me - I've been working this way for over 10 years 
and it seems completely natural. You can wrap access in iterators that 
hide the byte offsets if you like. This then ensures that all operations 
on those iterators are safe only allowing the iterator to point at the 
start/end of valid characters.



Sure. And over a different set of samples, it is less compact. If you
write a lot of Latin-1, Python will use one byte per character, while
UTF-8 will use two bytes per character.


   I think you mean writing a lot of Latin-1 characters outside ASCII. 
However, even people writing texts in, say, French will find that only a 
small proportion of their text is outside ASCII and so the cost of UTF-8 
is correspondingly small.


   The counter-problem is that a French document that needs to include 
one mathematical symbol (or emoji) outside Latin-1 will double in size 
as a Python string.


   Neil
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]


MRAB:


Implementing the regex module (http://pypi.python.org/pypi/regex) would
have been more difficult if the internal representation had been UTF-8,
because of the need to decode, and the implementation would also have
been slower for that reason.


   One way to build regex support for UTF-8 is to build a fixed width 
version of the regex code and then interpose an object that converts 
between the UTF-8 representation and that code.


   The C++11 standard library contains a regex template that can be 
instantiated over a UTF-8 representation in this way.


   Neil

--
http://mail.python.org/mailman/listinfo/python-list

unicode and the FSR [was: Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]]

2013-03-28 Thread Ethan Furman


On 03/28/2013 08:34 PM, Neil Hodgson wrote:

Steven D'Aprano:


Any string method that takes a starting offset requires the method to
walk the string byte-by-byte. I've even seen languages put responsibility
for dealing with that onto the programmer: the start offset is given in
*bytes*, not characters. I don't remember what language this was... it
might have been Haskell? Whatever it was, it horrified me.


It doesn't horrify me - I've been working this way for over 10 years and it 
seems completely natural.


Horrifying or not, I am willing to give up a small amount of speed for correctness.  Heck, I'm willing to give up a lot 
of speed for correctness.  Once I have my slow but correct prototype going I can recode in a faster language (if needed) 
and compare it's blazingly fast output with my slowly-generated but known-good output.



 You can wrap
access in iterators that hide the byte offsets if you like. This then ensures 
that all operations on those iterators are
safe only allowing the iterator to point at the start/end of valid characters.


Sure.  Or I can let Python handle it for me.



The counter-problem is that a French document that needs to include one 
mathematical symbol (or emoji) outside
Latin-1 will double in size as a Python string.


True.  But how often do you have the entire document as a single string?  Use readlines() instead of read().  Besides, 
memory is cheap.


--
~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]

On Fri, Mar 29, 2013 at 2:34 PM, Neil Hodgson nhodg...@iinet.net.au wrote:
It doesn't horrify me - I've been working this way for over 10 years and
 it seems completely natural. You can wrap access in iterators that hide the
 byte offsets if you like. This then ensures that all operations on those
 iterators are safe only allowing the iterator to point at the start/end of
 valid characters.

But both this and your example of case conversion are, fundamentally,
iterating over the string. What if you aren't doing that? What if you
want to parse and process?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flaming vs accuracy [was Re: Performance of int/long in Python 3]