Re: RE Module Performance

2013-07-31 Thread Chris Angelico
On Wed, Jul 31, 2013 at 6:45 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 if you care about minimizing every possible byte, you should
 use a low-level language like C. Then you can give every character 21
 bits, and be happy that you don't waste even one bit.

It could go even better! Since not every character has been assigned, and some
are specifically banned (e.g. U+FFFE and U+D800-U+DFFF), you could cut
them out of your representation system and save memory!

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread Antoon Pardon
On 31-07-13 05:30, Michael Torrie wrote:
 On 07/30/2013 12:19 PM, Antoon Pardon wrote:
 So? Why are you making this a point of discussion? I was not aware that
 the pros and cons of various editor buffer implementations were relevant
 to the point I was trying to make.
 
 I for one found it very interesting.  In fact this thread caused me to
 wonder how one actually does create an efficient editor.  Off the
 original topic true, but still very interesting.
 

Yes, it can be interesting. But I really think if that is what you want
to discuss, it deserves its own subject thread.

-- 
Antoon Pardon

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread Antoon Pardon
On 30-07-13 21:09, wxjmfa...@gmail.com wrote:
 Mutable, immutable, copying + xxx, buffering, O(n) ...
 Yes, but conceptually the reencoding happens sometime, somewhere.

Which is a far cry from your previous claim that it happened
every time you enter a char.

This of course makes your case harder to argue, because the
impact of something that happens sometime, somewhere is
vastly less than that of something that happens every time you enter
a char.

 The internal ucs-2 will never automagically be transformed
 into ucs-4 (eg).

It will just start producing wrong results when someone starts
using characters that don't fit into ucs-2.


 timeit.timeit("'a'*1 +'€'")
 7.087220684719967
 timeit.timeit("'a'*1 +'z'")
 1.5685214234430873
 timeit.timeit("z = 'a'*1; z = z +'€'")
 7.169538866162213
 timeit.timeit("z = 'a'*1; z = z +'z'")
 1.5815893830557286
 timeit.timeit("z = 'a'*1; z += 'z'")
 1.606955741596181
 timeit.timeit("z = 'a'*1; z += '€'")
 7.160483334521416
 
 
 And do not forget, in a pure utf coding scheme, your
 char or a char will *never* be larger than 4 bytes.
 
 sys.getsizeof('a')
 26
 sys.getsizeof('\U000101000')
 48

Nonsense.

 sys.getsizeof('a'.encode('utf-8'))
18




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread wxjmfauth
FSR:
===

The 'a' in 'a€' and 'a\U0001d11e':

 ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
 ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']

Has to be done.

sys.getsizeof('a€')
42
sys.getsizeof('a\U0001d11e')
48
sys.getsizeof('aa')
27


Unicode/utf*


i) (primary key) Create and use a unique set of encoded
code points.
ii) (secondary key) Depending on the wish,
memory/performance: utf-8/16/32

Two advantages in the light of the above example:
iii) The 'a' never has to be reencoded.
iv) An 'a' never exceeds 4 bytes in size.

Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread Antoon Pardon
On 31-07-13 10:32, wxjmfa...@gmail.com wrote:
 Unicode/utf*
 

 i) (primary key) Create and use a unique set of encoded
 code points.

FSR does this.

 st1 = 'a€'
 st2 = 'aa'
 ord(st1[0])
97
 ord(st2[0])
97


 ii) (secondary key) Depending on the wish,
 memory/performance: utf-8/16/32

Whose wish? I don't know any language that allows the
programmer to choose the internal representation of its
strings. If it is the designer's choice, FSR does this;
if it is the programmer's choice, I don't see why
this is necessary for compliance.

 Two advantages in the light of the above example:
 iii) The 'a' never has to be reencoded.

FSR: check. Using a container with wider slots is not a reëncoding.
If such widening is reencoding, then your 'choice' between utf-8/16/32
implies that it will also have to reencode when it changes from
utf-8 to utf-16 or utf-32.

 iv) An 'a' never exceeds 4 bytes in size.

FSR: check.

 Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
 Is it possible? ;-) The solution is in the problem.

Maybe you should use bytes or bytearrays if that is really what you want.

-- 
Antoon Pardon
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread Michael Torrie
On 07/31/2013 01:23 AM, Antoon Pardon wrote:
 On 31-07-13 05:30, Michael Torrie wrote:
 On 07/30/2013 12:19 PM, Antoon Pardon wrote:
 So? Why are you making this a point of discussion? I was not aware that
 the pros and cons of various editor buffer implementations were relevant
 to the point I was trying to make.

 I for one found it very interesting.  In fact this thread caused me to
 wonder how one actually does create an efficient editor.  Off the
 original topic true, but still very interesting.

 
 Yes, it can be interesting. But I really think if that is what you want
 to discuss, it deserves its own subject thread.

Subject lines can and should be changed to reflect the ebbs and flows of
the discussion.

In fact this thread's subject should have been changed a long time ago
since the original topic was RE module performance!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread Michael Torrie
On 07/31/2013 02:32 AM, wxjmfa...@gmail.com wrote:
 Unicode/utf*

Why do you keep using the terms utf and Unicode interchangeably?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread wxjmfauth
On Wednesday, July 31, 2013 at 07:45:18 UTC+2, Steven D'Aprano wrote:
 On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote:
 
  And do not forget, in a pure utf coding scheme, your char or a char will
  *never* be larger than 4 bytes.
 
  sys.getsizeof('a')
  26
  sys.getsizeof('\U000101000')
  48
 
 Neither character above is larger than 4 bytes. You forgot to deduct the
 size of the object header. Python is a high-level object-oriented
 language, if you care about minimizing every possible byte, you should
 use a low-level language like C. Then you can give every character 21
 bits, and be happy that you don't waste even one bit.
 
 --
 Steven

... char never consumes or requires more than 4 bytes ...

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-31 Thread Chris Angelico
On Wed, Jul 31, 2013 at 9:15 PM,  wxjmfa...@gmail.com wrote:
 ... char never consumes or requires more than 4 bytes ...


The integer 5 should be able to be stored in 3 bits.

 sys.getsizeof(5)
14

Clearly Python is doing something really horribly wrong here. In fact,
sys.getsizeof needs to be changed to return a float, to allow it to
more properly reflect these important facts.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread wxjmfauth
On Sunday, July 28, 2013 at 05:53:22 UTC+2, Ian wrote:
 On Sat, Jul 27, 2013 at 12:21 PM,  wxjmfa...@gmail.com wrote:
 
  Back to utf. utfs are not only elements of a unique set of encoded
  code points. They have an interesting feature. Each utf chunk
  holds intrinsically the character (in fact the code point) it is
  supposed to represent. In utf-32, the obvious case, it is just
  the code point. In utf-8, that's the first chunk which helps and
  utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
  implementation using bytes, for any pointer position it is always
  possible to find the corresponding encoded code point and from this
  the corresponding character without any programmed information. See
  my editor example, how to find the char under the caret? In fact,
  a silly example, how can the caret be positioned or moved, if
  the underlying corresponding encoded code point can not be
  discerned!
 
 Yes, given a pointer location into a utf-8 or utf-16 string, it is
 easy to determine the identity of the code point at that location.
 But this is not often a useful operation, save for resynchronization
 in the case that the string data is corrupted.  The caret of an editor
 does not conceptually correspond to a pointer location, but to a
 character index.  Given a particular character index (e.g. 127504), an
 editor must be able to determine the identity and/or the memory
 location of the character at that index, and for UTF-8 and UTF-16
 without an auxiliary data structure that is a O(n) operation.
 
--

Same conceptual mistake as Steven's example with its buffers,
the buffer does not know it holds characters.
This is not the point to discuss.

-

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy the buffer of your
editor does not waste time in reencoding the buffer as
soon as you enter an €, the 125505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies twice a reencoding.

Somebody wrote FSR is just an optimization. Yes, but in the case
of an editor à la FSR, this optimization takes place every time you
enter a char. Your poor editor, in fact the FSR, is finally
spending its time in optimizing and finally it optimizes nothing.
(It is even worse.)

If you correctly type a z instead of an €, it is not necessary
to reencode the buffer. Problem: how do you know that you do
not have to reencode? Simple, just check it; but by just checking,
you waste time testing whether you have to optimize or not, and hurt
a little bit more what is supposed to be an optimization.

Do not confuse the process of optimisation and the result of
optimization (funny, it's like the utf's).

There is a trick to make the editor know if it has
to be optimized. Just put some flag somewhere. Then
you fall into the Houston syndrome. Houston, we've got a
problem, our buffer consumes many more bytes than expected.

 sys.getsizeof('€')
40
 sys.getsizeof('a')
26

Now the good news. In an editor à la FSR, the
composition is not so important. You know,
practicality beats purity. The hard job
is the text rendering engine and the handling
of the font (even in a raw unicode editor).
And as these tools are luckily not working à la FSR
(probably because they understand the coding
of the characters), your editor is still working
not so badly.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Antoon Pardon
On 30-07-13 16:01, wxjmfa...@gmail.com wrote:
 
 I am pretty sure that once you have typed your 127504
 ascii characters, you are very happy the buffer of your
 editor does not waste time in reencoding the buffer as
 soon as you enter an €, the 125505th char. Sorry, I wanted
 to say z instead of euro, just to show that backspacing the
 last char and reentering a new char implies twice a reencoding.

Using a single string as an editor buffer is a bad idea in python
for the simple reason that strings are immutable. So adding
characters would mean continuously copying the string buffer
into a new string with the next character added. Copying
127504 characters into a new string will not make that much
of a difference whether the octets are just copied to octets
or are unpacked into 32 bit words.

 Somebody wrote FSR is just an optimization. Yes, but in the case
 of an editor à la FSR, this optimization takes place every time you
 enter a char. Your poor editor, in fact the FSR, is finally
 spending its time in optimizing and finally it optimizes nothing.
 (It is even worse.)

Even if you would do it this way, it would *not* take place
every time you enter a char. Once your buffer would contain
a wide character, it would just need to convert the single
character that is added after each keystroke. It would not
need to convert the whole buffer after each key stroke.

 If you correctly type a z instead of an €, it is not necessary
 to reencode the buffer. Problem: how do you know that you do
 not have to reencode? Simple, just check it; but by just checking,
 you waste time testing whether you have to optimize or not, and hurt
 a little bit more what is supposed to be an optimization.

Your scenario is totally unrealistic. First of all because of
the immutable nature of python strings, second because you
suggest that real time usage would result in frequent conversions
which is highly unlikely.

-- 
Antoon Pardon
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Chris Angelico
On Tue, Jul 30, 2013 at 3:01 PM,  wxjmfa...@gmail.com wrote:
 I am pretty sure that once you have typed your 127504
 ascii characters, you are very happy the buffer of your
 editor does not waste time in reencoding the buffer as
 soon as you enter an €, the 125505th char. Sorry, I wanted
 to say z instead of euro, just to show that backspacing the
 last char and reentering a new char implies twice a reencoding.

You're still thinking that the editor's buffer is a Python string. As
I've shown earlier, this is a really bad idea, and that has nothing to
do with FSR/PEP 393. An immutable string is *horribly* inefficient at
this; if you want to keep concatenating onto a string, the recommended
method is a list of strings that gets join()d at the end, and the same
technique works well here. Here's a little demo class that could make
the basis for such a system:

class EditorBuffer:
    def __init__(self,fn):
        self.fn=fn
        self.buffer=[open(fn).read()]
    def insert(self,pos,char):
        if pos==0:
            # Special case: insertion at beginning of buffer
            if len(self.buffer[0])>1024: self.buffer.insert(0,char)
            else: self.buffer[0]=char+self.buffer[0]
            return
        for idx,part in enumerate(self.buffer):
            l=len(part)
            if pos>l:
                pos-=l
                continue
            if pos<l:
                # Cursor is somewhere inside this string
                splitme=self.buffer[idx]
                self.buffer[idx:idx+1]=splitme[:pos],splitme[pos:]
                l=pos
            # Cursor is now at the end of this string
            if l>1024: self.buffer[idx:idx+1]=self.buffer[idx],char
            else: self.buffer[idx]+=char
            return
        raise ValueError("Cannot insert past end of buffer")
    def __str__(self):
        return ''.join(self.buffer)
    def save(self):
        open(self.fn,"w").write(str(self))

It guarantees that inserts will never need to resize more than 1KB of
text. As a real basis for an editor, it still sucks, but it's purely
to prove this one point.
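
A quick, hypothetical session with the class above (demo.txt and its contents
are made up for illustration; they are not from the original post):

# Illustrative use of the EditorBuffer sketch; demo.txt is an assumed
# throwaway file, not anything referenced in the thread.
with open("demo.txt", "w") as f:
    f.write("hello world")

buf = EditorBuffer("demo.txt")
buf.insert(5, ",")              # lands inside the only chunk, which gets split
buf.insert(len(str(buf)), "!")  # append at the very end of the buffer
print(str(buf))                 # -> hello, world!
print(buf.buffer)               # -> ['hello,', ' world!']  (the chunk list)
buf.save()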

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread MRAB

On 30/07/2013 15:38, Antoon Pardon wrote:

On 30-07-13 16:01, wxjmfa...@gmail.com wrote:


I am pretty sure that once you have typed your 127504 ascii
characters, you are very happy the buffer of your editor does not
waste time in reencoding the buffer as soon as you enter an €, the
125505th char. Sorry, I wanted to say z instead of euro, just to
show that backspacing the last char and reentering a new char
implies twice a reencoding.


Using a single string as an editor buffer is a bad idea in python for
the simple reason that strings are immutable.


Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.


So adding characters would mean continuously copying the string
buffer into a new string with the next character added. Copying
127504 characters into a new string will not make that much of a
difference whether the octets are just copied to octets or are
unpacked into 32 bit words.


Somebody wrote FSR is just an optimization. Yes, but in the case of
an editor à la FSR, this optimization takes place every time you
enter a char. Your poor editor, in fact the FSR, is finally
spending its time in optimizing and finally it optimizes nothing.
(It is even worse.)


Even if you would do it this way, it would *not* take place every
time you enter a char. Once your buffer would contain a wide
character, it would just need to convert the single character that is
added after each keystroke. It would not need to convert the whole
buffer after each key stroke.


If you correctly type a z instead of an €, it is not necessary to
reencode the buffer. Problem: how do you know that you do not have
to reencode? Simple, just check it; but by just checking, you waste
time testing whether you have to optimize or not, and hurt a little bit
more what is supposed to be an optimization.


Your scenario is totally unrealistic. First of all because of the
immutable nature of python strings, second because you suggest that
real time usage would result in frequent conversions which is highly
unlikely.


What you would have is a list of mutable chunks.

Inserting into a chunk would be fast, and a chunk would be split if
it's already full. Also, small adjacent chunks would be joined together.

Finally, a chunk could use FSR to reduce memory usage.
--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Antoon Pardon

On 30-07-13 18:13, MRAB wrote:

On 30/07/2013 15:38, Antoon Pardon wrote:

On 30-07-13 16:01, wxjmfa...@gmail.com wrote:


I am pretty sure that once you have typed your 127504 ascii
characters, you are very happy the buffer of your editor does not
waste time in reencoding the buffer as soon as you enter an €, the
125505th char. Sorry, I wanted to say z instead of euro, just to
show that backspacing the last char and reentering a new char
implies twice a reencoding.


Using a single string as an editor buffer is a bad idea in python for
the simple reason that strings are immutable.


Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.


Not if you use a gap buffer.

--
Antoon Pardon.

--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread MRAB

On 30/07/2013 17:39, Antoon Pardon wrote:

On 30-07-13 18:13, MRAB wrote:

On 30/07/2013 15:38, Antoon Pardon wrote:

On 30-07-13 16:01, wxjmfa...@gmail.com wrote:


I am pretty sure that once you have typed your 127504 ascii
characters, you are very happy the buffer of your editor does not
waste time in reencoding the buffer as soon as you enter an €, the
125505th char. Sorry, I wanted to say z instead of euro, just to
show that backspacing the last char and reentering a new char
implies twice a reencoding.


Using a single string as an editor buffer is a bad idea in python for
the simple reason that strings are immutable.


Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.


Not if you use a gap buffer.


The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.

--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Tim Delaney
On 31 July 2013 00:01, wxjmfa...@gmail.com wrote:


 I am pretty sure that once you have typed your 127504
 ascii characters, you are very happy the buffer of your
 editor does not waste time in reencoding the buffer as
 soon as you enter an €, the 125505th char. Sorry, I wanted
 to say z instead of euro, just to show that backspacing the
 last char and reentering a new char implies twice a reencoding.


And here we come to the root of your complete misunderstanding and
mischaracterisation of the FSR. You don't appear to understand that
strings in Python are immutable and that to add a character to an
existing string requires copying the entire string + new character. In
your hypothetical situation above, you have already performed 127504
copy + new character operations before you ever get to a single widening
operation. The overhead of the copy + new character repeated 127504
times dwarfs the overhead of a single widening operation.
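
A rough sketch of the difference being described (illustrative only: the
function names are made up, and CPython can sometimes grow a string in place
when it holds the only reference, so measured numbers will vary):

import timeit

def build_by_concat(n=100000):
    s = ''
    for _ in range(n):
        s += 'a'          # conceptually: copy the whole string, then add one char
    return s

def build_by_join(n=100000):
    parts = []            # the usual idiom: collect pieces, join once at the end
    for _ in range(n):
        parts.append('a')
    return ''.join(parts)

print(timeit.timeit(build_by_concat, number=20))
print(timeit.timeit(build_by_join, number=20))

Either way, an editor buffer held as one immutable string pays that
whole-string copy on every keystroke, which is the point being made above.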

Given your misunderstanding, it's no surprise that you are focused on
microbenchmarks that demonstrate that copying entire strings and adding
a character can be slower in some situations than others. When the only
use case you have is implementing the buffer of an editor using an
immutable string I can fully understand why you would be concerned about
the performance of adding and removing individual characters. However,
in that case *you're focused on the wrong problem*.

Until you can demonstrate an understanding that doing the above in any
language which has immutable strings is completely insane you will have
no credibility and the only interest anyone will pay to your posts is
refuting your FUD so that people new to the language are not driven off
by you.

Tim Delaney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Joshua Landau
On 30 July 2013 17:39, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:

 On 30-07-13 18:13, MRAB wrote:

  On 30/07/2013 15:38, Antoon Pardon wrote:

 On 30-07-13 16:01, wxjmfa...@gmail.com wrote:


 I am pretty sure that once you have typed your 127504 ascii
 characters, you are very happy the buffer of your editor does not
 waste time in reencoding the buffer as soon as you enter an €, the
 125505th char. Sorry, I wanted to say z instead of euro, just to
 show that backspacing the last char and reentering a new char
 implies twice a reencoding.


 Using a single string as an editor buffer is a bad idea in python for
 the simple reason that strings are immutable.


 Using a single string as an editor buffer is a bad idea in _any_
 language because an insertion would require all the following
 characters to be moved.


 Not if you use a gap buffer.


Additionally, who says a language couldn't use, say, B-Trees for all of its
list-like types, including strings?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Antoon Pardon

On 30-07-13 19:14, MRAB wrote:

On 30/07/2013 17:39, Antoon Pardon wrote:

On 30-07-13 18:13, MRAB wrote:

On 30/07/2013 15:38, Antoon Pardon wrote:

On 30-07-13 16:01, wxjmfa...@gmail.com wrote:


I am pretty sure that once you have typed your 127504 ascii
characters, you are very happy the buffer of your editor does not
waste time in reencoding the buffer as soon as you enter an €, the
125505th char. Sorry, I wanted to say z instead of euro, just to
show that backspacing the last char and reentering a new char
implies twice a reencoding.


Using a single string as an editor buffer is a bad idea in python for
the simple reason that strings are immutable.


Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.


Not if you use a gap buffer.


The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.


So? Why are you making this a point of discussion? I was not aware that
the pros and cons of various editor buffer implementations were relevant
to the point I was trying to make.

If you prefer another data structure in the editor you are working on,
I will not dissuade you.

--
Antoon Pardon
--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread wxjmfauth
Mutable, immutable, copying + xxx, buffering, O(n) ...
Yes, but conceptually the reencoding happens sometime, somewhere.
The internal ucs-2 will never automagically be transformed
into ucs-4 (eg).

 timeit.timeit("'a'*1 +'€'")
7.087220684719967
 timeit.timeit("'a'*1 +'z'")
1.5685214234430873
 timeit.timeit("z = 'a'*1; z = z +'€'")
7.169538866162213
 timeit.timeit("z = 'a'*1; z = z +'z'")
1.5815893830557286
 timeit.timeit("z = 'a'*1; z += 'z'")
1.606955741596181
 timeit.timeit("z = 'a'*1; z += '€'")
7.160483334521416


And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.

 sys.getsizeof('a')
26
 sys.getsizeof('\U000101000')
48


jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Chris Angelico
On Tue, Jul 30, 2013 at 8:09 PM,  wxjmfa...@gmail.com wrote:
 Mutable, immutable, copying + xxx, buffering, O(n) ...
 Yes, but conceptually the reencoding happens sometime, somewhere.
 The internal ucs-2 will never automagically be transformed
 into ucs-4 (eg).

But probably not on the entire document. With even a brainless scheme
like I posted code for, no more than 1024 bytes will need to be
recoded at a time (except in some odd edge cases, and even then, no
more than once for any given file).

 And do not forget, in a pure utf coding scheme, your
 char or a char will *never* be larger than 4 bytes.

 sys.getsizeof('a')
 26
 sys.getsizeof('\U000101000')
 48

Yeah, you have a few odd issues like, oh, I dunno, GC overhead,
reference count, object class, and string length, all stored somewhere
there. Honestly jmf, if you want raw assembly you know where to get
it.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Terry Reedy

On 7/30/2013 1:40 PM, Joshua Landau wrote:


Additionally, who says a language couldn't use, say, B-Trees for all of
its list-like types, including strings?


Tk apparently uses a B-tree in its text widget.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Neil Hodgson

MRAB:


The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.


   The normal technique is to only move the gap when text is added or 
removed, not when the cursor moves. Code that reads the contents, such 
as for display, handles the gap by checking the requested position and 
using a different offset when the position is after the gap.


   Gap buffers work well because changes are generally close to the 
previous change, so require moving only a relatively small amount of 
text. Even an occasional move of the whole contents won't cause too much 
trouble for interactivity with current processors moving multiple 
megabytes per millisecond.
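
For illustration, a minimal sketch of the technique described above (the names
and sizes are made up; this is not code from the thread):

# Minimal gap-buffer sketch: the text lives in one list with a "gap" of unused
# slots; the gap is only moved when an edit happens, never on cursor movement.
class GapBuffer:
    def __init__(self, text='', gap=64):
        self.data = list(text) + [None] * gap
        self.gap_start = len(text)        # first unused slot
        self.gap_end = len(self.data)     # one past the last unused slot

    def _move_gap(self, pos):
        # Slide the gap so that it starts at character position pos.
        while self.gap_start > pos:       # moving left: copy chars over the gap
            self.gap_start -= 1
            self.gap_end -= 1
            self.data[self.gap_end] = self.data[self.gap_start]
        while self.gap_start < pos:       # moving right: copy chars back
            self.data[self.gap_start] = self.data[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, pos, char):
        if self.gap_start == self.gap_end:        # gap used up: open a new one
            self.data[self.gap_end:self.gap_end] = [None] * 64
            self.gap_end += 64
        self._move_gap(pos)
        self.data[self.gap_start] = char
        self.gap_start += 1

    def __str__(self):
        # Reads simply skip the gap, using a different offset after it.
        return ''.join(self.data[:self.gap_start] + self.data[self.gap_end:])

buf = GapBuffer("helloworld")
buf.insert(5, ",")     # the gap slides to position 5 and absorbs the comma
buf.insert(6, " ")     # an adjacent edit: the gap barely has to move
print(buf)             # -> hello, world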


   Neil
--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Michael Torrie
On 07/30/2013 12:19 PM, Antoon Pardon wrote:
 So? Why are you making this a point of discussion? I was not aware that
 the pros and cons of various editor buffer implementations were relevant
 to the point I was trying to make.

I for one found it very interesting.  In fact this thread caused me to
wonder how one actually does create an efficient editor.  Off the
original topic true, but still very interesting.



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Michael Torrie
On 07/30/2013 01:09 PM, wxjmfa...@gmail.com wrote:
 Mutable, immutable, copying + xxx, buffering, O(n) ...
 Yes, but conceptually the reencoding happens sometime, somewhere.
 The internal ucs-2 will never automagically be transformed
 into ucs-4 (eg).

So what major python project are you working on where you've found FSR
in general to be a problem?  Maybe we can help you work out a more
appropriate data structure and algorithm to use.

But if you're not developing something, and not developing in Python,
perhaps you should withdraw and let us use our horrible FSR in peace,
because it doesn't seem to bother the vast majority of python
programmers, and does not bother some large python projects out there.
In fact I think most of us welcome integrated, correct, full unicode.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-30 Thread Steven D'Aprano
On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote:

 And do not forget, in a pure utf coding scheme, your char or a char will
 *never* be larger than 4 bytes.
 
 sys.getsizeof('a')
 26
 sys.getsizeof('\U000101000')
 48

Neither character above is larger than 4 bytes. You forgot to deduct the 
size of the object header. Python is a high-level object-oriented 
language, if you care about minimizing every possible byte, you should 
use a low-level language like C. Then you can give every character 21 
bits, and be happy that you don't waste even one bit.
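
One way to see that (an illustrative sketch, not from the original post):
compare strings that differ by a single character of the same width class, so
the fixed object header cancels out of the subtraction.

import sys

# What is left after the header cancels is the per-character cost of the
# storage Python 3.3+ picked for that string: 1, 2 or 4 bytes.
print(sys.getsizeof('aa') - sys.getsizeof('a'))      # -> 1 (ascii)
print(sys.getsizeof('a€€') - sys.getsizeof('a€'))    # -> 2 (BMP, non-latin-1)
print(sys.getsizeof('a\U0001d11e\U0001d11e')
      - sys.getsizeof('a\U0001d11e'))                # -> 4 (non-BMP)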


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-29 Thread Antoon Pardon

On 26-07-13 15:21, wxjmfa...@gmail.com wrote:


Hint: To understand Unicode (and every coding scheme), you should
understand utf. The how and the *why*.


No you don't. You are mixing up the information with how the information
is coded. utf is like base64, a way of coding the information that is
useful for storage or transfer. But once you have decoded the byte
stream, you no longer need any understanding of base64 to process your
information. Likewise, once you have decoded the bytestream into unicode
information you don't need knowledge of utf to process unicode strings.
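
A small illustration of that boundary (a sketch, not from the original post):

# Decode once at the edge of the program; from then on you work with code
# points and never think about the byte-level (utf) representation again.
raw = 'naïve café'.encode('utf-8')        # bytes as stored or transmitted
text = raw.decode('utf-8')                # the decode step at the boundary
print(len(text), text[3], text.upper())   # -> 10 v NAÏVE CAFÉ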

--
Antoon Pardon



--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-29 Thread Antoon Pardon

On 28-07-13 20:19, Joshua Landau wrote:

On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:

On 27-07-13 20:21, wxjmfa...@gmail.com wrote:

utf-8 or any (utf) never need and never spend their time
in reencoding.


So? That python sometimes needs to do some kind of background
processing is not a problem, whether it is garbage collection,
allocating more memory, shufling around data blocks or reencoding a
string, that doesn't matter. If you've got a real world example where
one of those things noticeably slows your program down or makes the
program behave faulty then you have something that is worthy of
attention.


Somewhat off topic, but befitting of the triviality of this thread, do I
understand correctly that you are saying garbage collection never causes
any noticeable slowdown in real-world circumstances? That's not remotely
true.


No, that is not what I am saying. But if jmf were complaining about
garbage collection in an analogous way to how he is complaining about the FSR,
he wouldn't be complaining about real-world circumstances but about
theoretical possibilities and micro-benchmarks. In those circumstances
the garbage collection problem wouldn't be worthy of much attention.

--
Antoon Pardon
--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-29 Thread Antoon Pardon

On 28-07-13 21:30, wxjmfa...@gmail.com wrote:


To be short, this is *never* the FSR, always something
else.

Suggestion. Start by solving all these micro-benchmarks,
all the memory cases. It's a good start, no?



There is nothing to solve. Unicode doesn't force implementations
to use the same size of memory for strings of the same length.

So you pointing out examples of same length strings that don't
use the same size of memory doesn't point at something that
must be solved.

--
Antoon Pardon

--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-29 Thread Chris Angelico
On Sun, Jul 28, 2013 at 11:14 PM, Joshua Landau jos...@landau.ws wrote:
 GC does have sometimes severe impact in memory-constrained environments,
 though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/,
 about half-way down, specifically
 http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png.

 The best verification of these graphs I could find was
 https://blog.mozilla.org/nnethercote/category/garbage-collection/, although
 it's not immediately clear in Chrome's and Opera's case mainly due to none
 of the benchmarks pushing memory usage significantly.

 I also don't quite agree with the first post (sealedabstract) because I get
 by *fine* on 2GB memory, so I don't see why you can't on a phone. Maybe IOS
 is just really heavy. Nonetheless, the benchmarks aren't lying.

The ultimate in non-managed memory (the opposite of a GC) would have
to be the assembly language programming I did in my earlier days,
firing up DEBUG.EXE and writing a .COM file that lived inside a single
64KB segment for everything (256-byte Program Segment Prefix, then
code, then initialized data, then uninitialized data and stack),
crashing the computer with remarkable ease. Everything higher level
than that (even malloc/free) has its conveniences and its costs,
usually memory wastage. If you malloc random-sized blocks, free them
at random, and ensure that your total allocated size stays below some
limit, you'll still eventually run yourself out of memory. This is
unsurprising. The only question is, how bad is the wastage and how
much time gets devoted to it?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread wxjmfauth
On Sunday, July 28, 2013 at 22:52:16 UTC+2, Steven D'Aprano wrote:
 On Sun, 28 Jul 2013 12:23:04 -0700, wxjmfauth wrote:
 
  Do not forget that à la FSR mechanism for a non-ascii user is
  *irrelevant*.
 
 You have been told repeatedly, Python's internals are *full* of ASCII-
 only strings.
 
 py> dir(list)
 ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
 '__dir__', '__doc__', '__eq__', '__format__', '__ge__',
 '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__',
 '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
 '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
 '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy',
 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']
 
 There's 45 ASCII-only strings right there, in only one built-in type, out
 of dozens. There are dozens, hundreds of ASCII-only strings in Python:
 builtin functions and classes, attributes, exceptions, internal
 attributes, variable names, and so on.
 
 You already know this, and yet you persist in repeating nonsense.
 
 --
 Steven

3.2
 timeit.timeit("r = dir(list)")
22.300465007102908

3.3
 timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I do not quote your example to contradict
you. I was expecting such a result even before testing.

Now, if you do not understand why, you do not understand.
There is nothing wrong.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread Chris Angelico
On Mon, Jul 29, 2013 at 12:43 PM,  wxjmfa...@gmail.com wrote:
 On Sunday, July 28, 2013 at 22:52:16 UTC+2, Steven D'Aprano wrote:
 3.2
 timeit.timeit("r = dir(list)")
 22.300465007102908

 3.3
 timeit.timeit("r = dir(list)")
 27.13981129541519

3.2:
 len(dir(list))
42

3.3:
 len(dir(list))
45

Wonder if that might maybe have an impact on the timings.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread Heiko Wundram

On 29.07.2013 13:43, wxjmfa...@gmail.com wrote:

3.2
timeit.timeit("r = dir(list)")
22.300465007102908

3.3
timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I do not put your example to contradict
you. I was expecting such a result even before testing.

Now, if you do not understand why, you do not understand.
There nothing wrong.


Please give a single *proof* (not your gut feeling) that this is related 
to the FSR, and not rather due to other side-effects such as changes in 
how dir() works or (as Chris pointed out) due to more members on the 
list type in 3.3. If you can't or won't give that proof, there's no 
sense in continuing the discussion.


--
--- Heiko.
--
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread Devyn Collier Johnson


On 07/29/2013 08:06 AM, Heiko Wundram wrote:

On 29.07.2013 13:43, wxjmfa...@gmail.com wrote:

3.2
timeit.timeit("r = dir(list)")
22.300465007102908

3.3
timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I do not quote your example to contradict
you. I was expecting such a result even before testing.

Now, if you do not understand why, you do not understand.
There is nothing wrong.


Please give a single *proof* (not your gut feeling) that this is 
related to the FSR, and not rather due to other side-effects such as 
changes in how dir() works or (as Chris pointed out) due to more 
members on the list type in 3.3. If you can't or won't give that 
proof, there's no sense in continuing the discussion.


Wow! The RE Module thread I created is evolving into Unicode topics. 
That thread grew up so fast!


DCJ
--
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread wxjmfauth
On Monday, July 29, 2013 at 13:57:47 UTC+2, Chris Angelico wrote:
 On Mon, Jul 29, 2013 at 12:43 PM,  wxjmfa...@gmail.com wrote:
  On Sunday, July 28, 2013 at 22:52:16 UTC+2, Steven D'Aprano wrote:
 
  3.2
  timeit.timeit("r = dir(list)")
  22.300465007102908
 
  3.3
  timeit.timeit("r = dir(list)")
  27.13981129541519
 
 3.2:
  len(dir(list))
 42
 
 3.3:
  len(dir(list))
 45
 
 Wonder if that might maybe have an impact on the timings.
 
 ChrisA

Good point. I stupidly forgot this.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread wxjmfauth
On Sunday, July 28, 2013 at 19:36:00 UTC+2, Terry Reedy wrote:
 On 7/28/2013 11:52 AM, Michael Torrie wrote:
 
  3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
  slicing a string would be very very slow,
 
 Not necessarily so. See below.
 
  and that's unacceptable for
  the use cases of python strings.  I'm assuming you understand big O
  notation, as you talk of experience in many languages over the years.
  FSR and UTF-32 both are O(1) for slicing and lookups.
 
 Slicing is at least O(m) where m is the length of the slice.
 
  UTF-8, 16 and any variable-width encoding are always O(n).
 
 I posted about a week ago, in response to Chris A., a method by which
 lookup for UTF-16 can be made O(log2 k), or perhaps more accurately,
 O(1+log2(k+1)), where k is the number of non-BMP chars in the string.
 
 This uses an auxiliary array of k ints. An auxiliary array of n ints
 would make UTF-16 lookup O(1), but then one is using more space than
 with UTF-32. Similar comments apply to UTF-8.
 
 The unicode standard says that a single string should use exactly one
 coding scheme. It does *not* say that all strings in an application must
 use the same scheme. I just rechecked a few days ago. It also does not
 say that an application cannot associate additional data with a string
 to make processing of the string easier.
 
 --
 Terry Jan Reedy

To my knowledge, the Unicode doc always speaks about
the misc. utf* coding schemes in an exclusive-or way.

Having multiple encoded strings is one thing. Manipulating
multiple encoded strings is something else.

Maybe the mistake was to not emphasize the fact that
one has to work with a unique set of encoded code points
(utf-8 or utf-16 or utf-32) because it was considered
too obvious that one can not work properly with multiple
coding schemes.

You are also right in saying "...application cannot associate
additional data".
The doc does not specify it either. It is superfluous.


jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread wxjmfauth
On Monday, July 29, 2013 at 13:57:47 UTC+2, Chris Angelico wrote:
 On Mon, Jul 29, 2013 at 12:43 PM,  wxjmfa...@gmail.com wrote:
  On Sunday, July 28, 2013 at 22:52:16 UTC+2, Steven D'Aprano wrote:
 
  3.2
  timeit.timeit("r = dir(list)")
  22.300465007102908
 
  3.3
  timeit.timeit("r = dir(list)")
  27.13981129541519
 
 3.2:
  len(dir(list))
 42
 
 3.3:
  len(dir(list))
 45
 
 Wonder if that might maybe have an impact on the timings.
 
 ChrisA




class C:
    a = 'abc'
    b = 'def'
    def aaa(self):
        pass
    def bbb(self):
        pass
    def ccc(self):
        pass

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("r = dir(C)", setup="from __main__ import C"))



c:\python32\pythonw -u timitmod.py
15.258061416225663
Exit code: 0
c:\Python33\pythonw -u timitmod.py
17.052203122286194
Exit code: 0


jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread Chris Angelico
On Mon, Jul 29, 2013 at 3:20 PM,  wxjmfa...@gmail.com wrote:
c:\python32\pythonw -u timitmod.py
 15.258061416225663
Exit code: 0
c:\Python33\pythonw -u timitmod.py
 17.052203122286194
Exit code: 0

 len(dir(C))

Did you even think to check that before you posted timings?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-29 Thread wxjmfauth
On Monday, July 29, 2013 at 16:49:34 UTC+2, Chris Angelico wrote:
 On Mon, Jul 29, 2013 at 3:20 PM,  wxjmfa...@gmail.com wrote:
 
  c:\python32\pythonw -u timitmod.py
  15.258061416225663
  Exit code: 0
  c:\Python33\pythonw -u timitmod.py
  17.052203122286194
  Exit code: 0
 
  len(dir(C))
 
 Did you even think to check that before you posted timings?
 
 ChrisA

Boum, no! The diff is one.
I have however noticed that I can increase the number
of attributes (ascii); the timing difference
is very well marked.
I do not draw conclusions. Such a factor for one
unit...

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread Antoon Pardon

On 27-07-13 20:21, wxjmfa...@gmail.com wrote:

Quickly. sys.getsizeof() in the light of what I explained.

1) As this FSR works with multiple encodings, it has to keep
track of the encoding. It puts it in the overhead of the str
class (overhead = real overhead + encoding). In such
an absurd way that a

sys.getsizeof('€')
40

needs 14 bytes more than a

sys.getsizeof('z')
26

You may vary the length of the str. The problem is
still here. Not bad for a coding scheme.

2) Take a look at this. Get rid of the overhead.

sys.getsizeof('b'*100 + 'c')
126

sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to
reencode a str every time it is necessary because
it works with multiple codings.


So? The same effect can be seen with other datatypes.

 nr = 32767
 sys.getsizeof(nr)
14
 nr += 1
 sys.getsizeof(nr)
16




This FSR is not even a copy of the utf-8.

len(('b'*100 + '€').encode('utf-8'))

103


Why should it be? Why should a unicode string be a copy
of its utf-8 encoding? That makes as much sense as expecting
that a number would be a copy of its string representation.



utf-8 or any (utf) never need and never spend their time
in reencoding.


So? That python sometimes needs to do some kind of background
processing is not a problem, whether it is garbage collection,
allocating more memory, shufling around data blocks or reencoding a
string, that doesn't matter. If you've got a real world example where
one of those things noticeably slows your program down or makes the
program behave faulty then you have something that is worthy of
attention.

Until then you are merely harboring a pet peeve.

--
Antoon Pardon
--
http://mail.python.org/mailman/listinfo/python-list


FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread Michael Torrie
On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote:
 Good point. FSR, nice tool for those who wish to teach
 Unicode. It is not every day, one has such an opportunity.

I had a long e-mail composed, but decided to chop it down, but it's still too
long, so I ditched a lot of the context, which jmf also seems to do.
Apologies.

1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
is an official encoding.  FSR only differs from UTF-32 in that the
padding zeros are stripped off such that it is stored in the most
compact form that can handle all the characters in string, which is
always known at string creation time.  Now you can argue many things,
but to say FSR is not unicode compliant is quite a stretch!  What
unicode entities or characters cannot be stored in strings using FSR?
What sequences of bytes in FSR result in invalid Unicode entities?

2. strings in Python *never change*.  They are immutable.  The +
operator always copies strings character by character into a new string
object, even if Python had used UTF-8 internally.  If you're doing a lot
of string concatenations, perhaps you're using the wrong data type.  A
byte buffer might be better for you, where you can stuff utf-8 sequences
into it to your heart's content.

3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
slicing a string would be very very slow, and that's unacceptable for
the use cases of python strings.  I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 both are O(1) for slicing and lookups.  UTF-8, 16 and any
variable-width encoding are always O(n).  A lot slower!

4. Unicode is, well, unicode.  You seem to hop all over the place from
talking about code points to bytes to bits, using them all
interchangeably.  And now you seem to be claiming that a particular byte
encoding standard is by definition unicode (UTF-8).  Or at least that's
how it sounds.  And also claim FSR is not compliant with unicode
standards, which appears to me to be completely false.

Is my understanding of these things wrong?
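
A quick check of the compliance claim in point 1 above (a sketch; assumes
Python 3.3 or later and is not from the original post):

# Any code point, BMP or not, is stored and indexed correctly, whatever
# internal width (1, 2 or 4 bytes per character) the FSR picks for the string.
s = 'a\u00fc\u20ac\U0001d11e'          # ascii, latin-1, BMP and non-BMP chars
assert len(s) == 4                      # four code points, not "chunks"
assert s[3] == '\U0001d11e'             # O(1) indexing returns the right char
assert s.encode('utf-32-be').decode('utf-32-be') == s   # round-trips losslessly
print([hex(ord(c)) for c in s])         # ['0x61', '0xfc', '0x20ac', '0x1d11e']
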
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread Chris Angelico
On Sun, Jul 28, 2013 at 4:52 PM, Michael Torrie torr...@gmail.com wrote:
 Is my understanding of these things wrong?

No, your understanding of those matters is fine. There's just one area
you seem to be misunderstanding; you appear to think that jmf actually
cares about logical argument. I gave up on that theory a long time
ago, and now I respond for the benefit of those reading, rather than
jmf himself. I've also given up on trying to figure out what he
actually wants; the nearest I can come up with is that he's King
Gama-esque - that he just wants to complain.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread Terry Reedy

On 7/28/2013 11:52 AM, Michael Torrie wrote:


3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
slicing a string would be very very slow,


Not necessarily so. See below.


and that's unacceptable for
the use cases of python strings.  I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 both are O(1) for slicing and lookups.


Slicing is at least O(m) where m is the length of the slice.


UTF-8, 16 and any variable-width encoding are always O(n).


I posted about a week ago, in response to Chris A., a method by which 
lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, 
O(1+log2(k+1)), where k is the number of non-BMP chars in the string.


This uses an auxiliary array of k ints. An auxiliary array of n ints 
would make UTF-16 lookup O(1), but then one is using more space than 
with UTF-32. Similar comments apply to UTF-8.
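
A rough sketch of that auxiliary-array idea (an illustration with made-up
names, not the code from the earlier post): keep the character indices of the
non-BMP characters in a sorted list, and map a character index to a UTF-16
code-unit offset with a binary search.

import bisect

def utf16_offset(char_index, smp_indices):
    """Character index -> UTF-16 code-unit offset, in O(log2 k).

    smp_indices is a sorted list of the character indices of the non-BMP
    characters in the string; each of those takes two code units."""
    # Every non-BMP character *before* char_index adds one extra code unit.
    return char_index + bisect.bisect_left(smp_indices, char_index)

s = 'a\U0001d11ebc'                                    # 'a', U+1D11E, 'b', 'c'
units = s.encode('utf-16-le')                          # 2 bytes per code unit
smp = [i for i, c in enumerate(s) if ord(c) > 0xFFFF]  # auxiliary array: [1]
off = utf16_offset(2, smp)                             # char 2 ('b') -> unit 3
assert units[2*off:2*off+2].decode('utf-16-le') == s[2]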


The unicode standard says that a single string should use exactly one 
coding scheme. It does *not* say that all strings in an application must 
use the same scheme. I just rechecked a few days ago. It also does not 
say that an application cannot associate additional data with a string 
to make processing of the string easier.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread Chris Angelico
On Sun, Jul 28, 2013 at 6:36 PM, Terry Reedy tjre...@udel.edu wrote:
 I posted about a week ago, in response to Chris A., a method by which lookup
 for UTF-16 can be made O(log2 k), or perhaps more accurately,
 O(1+log2(k+1)), where k is the number of non-BMP chars in the string.


Which is an optimization choice that favours strings containing very
few non-BMP characters. To justify the extra complexity of out-of-band
storage, you would need to be working with almost exclusively the BMP.
That would drastically improve jmf's microbenchmarks which do exactly
that, but it would penalize strings that are almost exclusively
higher-codepoint characters. Its quality, then, would be based on a
major survey of string usage: are there enough strings with
mostly-BMP-but-a-few-SMP? Bearing in mind that pure BMP is handled
better by PEP 393, so this is only of value when there are actually
those mixed strings.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread wxjmfauth
On Sunday, July 28, 2013 at 05:53:22 UTC+2, Ian wrote:
 On Sat, Jul 27, 2013 at 12:21 PM,  wxjmfa...@gmail.com wrote:
 
  Back to utf. utfs are not only elements of a unique set of encoded
  code points. They have an interesting feature. Each utf chunk
  holds intrinsically the character (in fact the code point) it is
  supposed to represent. In utf-32, the obvious case, it is just
  the code point. In utf-8, that's the first chunk which helps and
  utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
  implementation using bytes, for any pointer position it is always
  possible to find the corresponding encoded code point and from this
  the corresponding character without any programmed information. See
  my editor example, how to find the char under the caret? In fact,
  a silly example, how can the caret be positioned or moved, if
  the underlying corresponding encoded code point can not be
  discerned!
 
 Yes, given a pointer location into a utf-8 or utf-16 string, it is
 easy to determine the identity of the code point at that location.
 But this is not often a useful operation, save for resynchronization
 in the case that the string data is corrupted.  The caret of an editor
 does not conceptually correspond to a pointer location, but to a
 character index.  Given a particular character index (e.g. 127504), an
 editor must be able to determine the identity and/or the memory
 location of the character at that index, and for UTF-8 and UTF-16
 without an auxiliary data structure that is a O(n) operation.
 
  2) Take a look at this. Get rid of the overhead.
 
  sys.getsizeof('b'*100 + 'c')
  126
  sys.getsizeof('b'*100 + '€')
  240
 
  What does it mean? It means that Python has to
  reencode a str every time it is necessary because
  it works with multiple codings.
 
 Large strings in practical usage do not need to be resized like this
 often.  Python 3.3 has been in production use for months now, and you
 still have yet to produce any real-world application code that
 demonstrates a performance regression.  If there is no real-world
 regression, then there is no problem.
 
  3) Unicode compliance. We know retrospectively, latin-1,
  it was a bad choice. Unusable for 17 European languages.
  Believe it or not. 20 years of Unicode of incubation is not
  long enough to learn it. When discussing once with a French
  Python core dev, one with commit access, he did not know one
  can not use latin-1 for the French language!
 
 Probably because for many French strings, one can.  As far as I am
 aware, the only characters that are missing from Latin-1 are the Euro
 sign (an unfortunate victim of history), the ligature œ (I have no
 doubt that many users just type oe anyway), and the rare capital Ÿ
 (the miniscule version is present in Latin-1).  All French strings
 that are fortunate enough to be absent these characters can be
 represented in Latin-1 and so will have a 1-byte width in the FSR.
--

latin-1? That's not even true.

 sys.getsizeof('a')
26
 sys.getsizeof('ü')
38
 sys.getsizeof('aa')
27
 sys.getsizeof('aü')
39


jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread Joshua Landau
On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:

 On 27-07-13 20:21, wxjmfa...@gmail.com wrote:

 utf-8 or any (utf) never need and never spend their time
 in reencoding.


 So? That python sometimes needs to do some kind of background
 processing is not a problem, whether it is garbage collection,
 allocating more memory, shufling around data blocks or reencoding a
 string, that doesn't matter. If you've got a real world example where
 one of those things noticeably slows your program down or makes the
 program behave faulty then you have something that is worthy of
 attention.


Somewhat off topic, but befitting of the triviality of this thread, do I
understand correctly that you are saying garbage collection never causes
any noticeable slowdown in real-world circumstances? That's not remotely
true.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread Chris Angelico
On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote:
 On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote:

 On 27-07-13 20:21, wxjmfa...@gmail.com wrote:

 utf-8 or any (utf) never need and never spend their time
 in reencoding.


 So? That python sometimes needs to do some kind of background
 processing is not a problem, whether it is garbage collection,
 allocating more memory, shufling around data blocks or reencoding a
 string, that doesn't matter. If you've got a real world example where
 one of those things noticeably slows your program down or makes the
 program behave faulty then you have something that is worthy of
 attention.


 Somewhat off topic, but befitting of the triviality of this thread, do I
 understand correctly that you are saying garbage collection never causes any
 noticeable slowdown in real-world circumstances? That's not remotely true.

If it's done properly, garbage collection shouldn't hurt the *overall*
performance of the app; most of the issues with GC timing are when one
operation gets unexpectedly delayed for a GC run (making performance
measurement hard, and such). It should certainly never cause your
program to behave faultily, though I have seen cases where the GC run
appears to cause the program to crash - something like this:

some_string = buggy_call()
...
gc()
...
print(some_string)

The buggy call mucked up the reference count, so the gc run actually
wiped the string from memory - resulting in a segfault on next usage.
But the GC wasn't at fault, the original call was. (Which, btw, was
quite a debugging search, especially since the function in question
wasn't my code.)

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread MRAB

On 28/07/2013 19:13, wxjmfa...@gmail.com wrote:

On Sunday, July 28, 2013 at 05:53:22 UTC+2, Ian wrote:

On Sat, Jul 27, 2013 at 12:21 PM,  wxjmfa...@gmail.com wrote:

 Back to utf. utfs are not only elements of a unique set of encoded
 code points. They have an interesting feature. Each utf chunk
 holds intrinsically the character (in fact the code point) it is
 supposed to represent. In utf-32, the obvious case, it is just
 the code point. In utf-8, that's the first chunk which helps and
 utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
 implementation using bytes, for any pointer position it is always
 possible to find the corresponding encoded code point and from this
 the corresponding character without any programmed information. See
 my editor example, how to find the char under the caret? In fact,
 a silly example, how can the caret be positioned or moved, if
 the underlying corresponding encoded code point can not be
 discerned!

Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted.  The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index.  Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
without an auxiliary data structure that is a O(n) operation.

 2) Take a look at this. Get rid of the overhead.

 sys.getsizeof('b'*100 + 'c')
 126
 sys.getsizeof('b'*100 + '€')
 240

 What does it mean? It means that Python has to
 reencode a str every time it is necessary because
 it works with multiple codings.

Large strings in practical usage do not need to be resized like this
often.  Python 3.3 has been in production use for months now, and you
still have yet to produce any real-world application code that
demonstrates a performance regression.  If there is no real-world
regression, then there is no problem.

 3) Unicode compliance. We know retrospectively, latin-1,
 it was a bad choice. Unusable for 17 European languages.
 Believe it or not. 20 years of Unicode of incubation is not
 long enough to learn it. When discussing once with a French
 Python core dev, one with commit access, he did not know one
 can not use latin-1 for the French language!

Probably because for many French strings, one can.  As far as I am
aware, the only characters that are missing from Latin-1 are the Euro
sign (an unfortunate victim of history), the ligature œ (I have no
doubt that many users just type oe anyway), and the rare capital Ÿ
(the miniscule version is present in Latin-1).  All French strings
that are fortunate enough to be absent these characters can be
represented in Latin-1 and so will have a 1-byte width in the FSR.

--

latin-1? That's not even true.

sys.getsizeof('a')
26
sys.getsizeof('ü')
38
sys.getsizeof('aa')
27
sys.getsizeof('aü')
39



 sys.getsizeof('aa') - sys.getsizeof('a')
1

One byte per codepoint.

 sys.getsizeof('üü') - sys.getsizeof('ü')
1

Also one byte per codepoint.

 sys.getsizeof('ü') - sys.getsizeof('a')
12

Clearly there's more going on here.

FSR is an optimisation. You'll always be able to find some
circumstances where an optimisation makes things worse, but what
matters is the overall result.

--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread Terry Reedy

On 7/28/2013 2:29 PM, Chris Angelico wrote:

On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote:



Somewhat off topic, but befitting of the triviality of this thread, do I
understand correctly that you are saying garbage collection never causes any
noticeable slowdown in real-world circumstances? That's not remotely true.


If it's done properly, garbage collection shouldn't hurt the *overall*
performance of the app;


There are situations, some discussed on this list, where doing gc 
'right' means turning off the cycle garbage collector. As I remember, an 
example is creating a list of a million tuples, which otherwise triggers 
a lot of useless background bookkeeping. The cyclic gc is tuned for 
'normal' use patterns.
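
A rough sketch of the kind of workaround Terry describes, assuming a workload
dominated by allocating many small tuples; the actual figures depend on gc
thresholds and the Python version:

import gc
import time

def build(n=1000000):
    # a million small tuples, the pattern that triggers the useless bookkeeping
    return [(i, str(i)) for i in range(n)]

t0 = time.perf_counter()
build()
t1 = time.perf_counter()

gc.disable()
try:
    t2 = time.perf_counter()
    build()
    t3 = time.perf_counter()
finally:
    gc.enable()

print("with cyclic gc:    %.2f s" % (t1 - t0))
print("without cyclic gc: %.2f s" % (t3 - t2))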


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread wxjmfauth
Le dimanche 28 juillet 2013 17:52:47 UTC+2, Michael Torrie a écrit :
 On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote:
  Good point. FSR, nice tool for those who wish to teach
  Unicode. It is not every day, one has such an opportunity.

 I had a long e-mail composed, but decided to chop it down, but still too
 long.  so I ditched a lot of the context, which jmf also seems to do.
 Apologies.

 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
 is an official encoding.  FSR only differs from UTF-32 in that the
 padding zeros are stripped off such that it is stored in the most
 compact form that can handle all the characters in string, which is
 always known at string creation time.  Now you can argue many things,
 but to say FSR is not unicode compliant is quite a stretch!  What
 unicode entities or characters cannot be stored in strings using FSR?
 What sequences of bytes in FSR result in invalid Unicode entities?

 2. strings in Python *never change*.  They are immutable.  The +
 operator always copies strings character by character into a new string
 object, even if Python had used UTF-8 internally.  If you're doing a lot
 of string concatenations, perhaps you're using the wrong data type.  A
 byte buffer might be better for you, where you can stuff utf-8 sequences
 into it to your heart's content.

 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
 slicing a string would be very very slow, and that's unacceptable for
 the use cases of python strings.  I'm assuming you understand big O
 notation, as you talk of experience in many languages over the years.
 FSR and UTF-32 both are O(1) for slicing and lookups.  UTF-8, 16 and any
 variable-width encoding are always O(n).  A lot slower!

 4. Unicode is, well, unicode.  You seem to hop all over the place from
 talking about code points to bytes to bits, using them all
 interchangeably.  And now you seem to be claiming that a particular byte
 encoding standard is by definition unicode (UTF-8).  Or at least that's
 how it sounds.  And also claim FSR is not compliant with unicode
 standards, which appears to me to be completely false.

 Is my understanding of these things wrong?

--

Compare these (a BDFL example, where I'm using a non-ascii char)

Py 3.2 (narrow build)

 timeit.timeit(a = 'hundred'; 'x' in a)
0.09897159682121348
 timeit.timeit(a = 'hundre€'; 'x' in a)
0.09079501961732461
 sys.getsizeof('d')
32
 sys.getsizeof('€')
32
 sys.getsizeof('dd')
34
 sys.getsizeof('d€')
34


Py3.3

 timeit.timeit(a = 'hundred'; 'x' in a)
0.12183182740848858
 timeit.timeit(a = 'hundre€'; 'x' in a)
0.2365732969632326
 sys.getsizeof('d')
26
 sys.getsizeof('€')
40
 sys.getsizeof('dd')
27
 sys.getsizeof('d€')
42

Tell me which one seems to be more unicode compliant?
The goal of Unicode is to handle every char equally.

Now, the problem: memory. Do not forget that an à-la-FSR
mechanism is *irrelevant* for a non-ascii user. As
soon as one uses one single non-ascii char, your ascii feature
is lost. (That is why we have all these dedicated coding
schemes, utfs included).

 sys.getsizeof('abc' * 1000 + 'z')
3026
 sys.getsizeof('abc' * 1000 + '\U00010010')
12044

A bit of a secret: the larger a repertoire of characters
is, the more bits you need.
Secret #2: you cannot escape from this.


jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread wxjmfauth
Le dimanche 28 juillet 2013 21:04:56 UTC+2, MRAB a écrit :
 On 28/07/2013 19:13, wxjmfa...@gmail.com wrote:
 [snip]

   sys.getsizeof('aa') - sys.getsizeof('a')
 1

 One byte per codepoint.

   sys.getsizeof('üü') - sys.getsizeof('ü')
 1

 Also one byte per codepoint.

   sys.getsizeof('ü') - sys.getsizeof('a')
 12

 Clearly there's more going on here.

 FSR is an optimisation. You'll always be able to find some
 circumstances where an optimisation makes things worse, but what
 matters is the overall result.



Yes, I know my examples are always wrong, never
real examples.

I can point long strings, I should point short strings.
I point a short string (char), it is not long enough.
Strings as dict keys, no the problem is in Python dict.
Performance? no that's a memory issue.
Memory? no, it's a question to keep perfomance.
I am using this char, no you should not, it's no common.
The nabla operator in TeX file, who is so stupid to use
that char?
Many times, I'm just mimicking 'BDFL' examples, just
by replacing his ascii chars with non-ascii chars ;-)
And so on.

To be short, this is *never* the FSR, always something
else.

Suggestion. Start by solving all these micro-benchmarks,
all the memory cases. It's a good start, no?


jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread MRAB

On 28/07/2013 20:23, wxjmfa...@gmail.com wrote:
[snip]


Compare these (a BDFL exemple, where I'using a non-ascii char)

Py 3.2 (narrow build)


Why are you using a narrow build of Python 3.2? It doesn't treat all
codepoints equally (those outside the BMP can't be stored in one code
unit) and, therefore, it isn't Unicode compliant!


[snip]



--
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread Antoon Pardon

Op 28-07-13 21:23, wxjmfa...@gmail.com schreef:

Le dimanche 28 juillet 2013 17:52:47 UTC+2, Michael Torrie a écrit :

On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote:
[snip]

--

Compare these (a BDFL exemple, where I'using a non-ascii char)

Py 3.2 (narrow build)

timeit.timeit(a = 'hundred'; 'x' in a)
0.09897159682121348
timeit.timeit(a = 'hundre€'; 'x' in a)
0.09079501961732461
sys.getsizeof('d')
32
sys.getsizeof('€')
32
sys.getsizeof('dd')
34
sys.getsizeof('d€')
34

Py3.3

timeit.timeit(a = 'hundred'; 'x' in a)
0.12183182740848858
timeit.timeit(a = 'hundre€'; 'x' in a)
0.2365732969632326
sys.getsizeof('d')
26
sys.getsizeof('€')
40
sys.getsizeof('dd')
27
sys.getsizeof('d€')
42

Tell me which one seems to be more unicode compliant?


Can't tell, you give no relevant information on which one can decide
this question.


The goal of Unicode is to handle every char equaly.


Not to this kind of detail, which is looking at irrelevant
implementation details.


Now, the problem: memory. Do not forget that à la FSR
mechanism for a non-ascii user is *irrelevant*. As
soon as one uses one single non-ascii, your ascii feature
is lost. (That why we have all these dedicated coding
schemes, utfs included).


So? Why should that trouble me? As far as I understand,
whether I have an ascii string or not is totally irrelevant
to the application programmer. Within the application I
just process strings and let the programming environment
keep track of these details in a transparent way, unless
you start looking at things like getsizeof, which gives
you implementation details that are mostly irrelevant
in deciding whether the behaviour is compliant or not.


sys.getsizeof('abc' * 1000 + 'z')

3026

sys.getsizeof('abc' * 1000 + '\U00010010')

12044

A bit secret. The larger a repertoire of characters
is, the more bits you needs.
Secret #2. You can not escape from this.


And totally unimportant for deciding compliance.

--
Antoon Pardon

--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread Lele Gaifax
wxjmfa...@gmail.com writes:

 Suggestion. Start by solving all these micro-benchmarks.
 all the memory cases. It a good start, no?

Since you seem the only one who has this dramatic problem with such
micro-benchmarks, that BTW have nothing to do with unicode compliance,
I'd suggest *you* should find a better implementation and propose it to
the core devs.

An even better suggestion, with due respect, is to get a life and find
something more interesting to do, or at least better arguments :-)

ciao, lele.
-- 
nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@metapensiero.it  | -- Fortunato Depero, 1929.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: FSR and unicode compliance - was Re: RE Module Performance

2013-07-28 Thread Steven D'Aprano
On Sun, 28 Jul 2013 12:23:04 -0700, wxjmfauth wrote:

 Do not forget that à la FSR mechanism for a non-ascii user is
 *irrelevant*.

You have been told repeatedly, Python's internals are *full* of ASCII-
only strings.

py dir(list)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', 
'__dir__', '__doc__', '__eq__', '__format__', '__ge__', 
'__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', 
'__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', 
'__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', 
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', 
'__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 
'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

There are 45 ASCII-only strings right there, in only one built-in type, out 
of dozens. There are dozens, hundreds of ASCII-only strings in Python: 
builtin functions and classes, attributes, exceptions, internal 
attributes, variable names, and so on.
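
A rough way to see this from a running interpreter (just a sketch; it only
looks at attribute names reachable from the builtins module):

import builtins

names = set()
for obj in vars(builtins).values():
    names.update(dir(obj))

ascii_only = [n for n in names if all(ord(c) < 128 for c in n)]
print(len(ascii_only), "of", len(names), "attribute names are pure ASCII")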

You already know this, and yet you persist in repeating nonsense.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-28 Thread Joshua Landau
On 28 July 2013 19:29, Chris Angelico ros...@gmail.com wrote:

 On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote:
  On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be
 wrote:
 
  Op 27-07-13 20:21, wxjmfa...@gmail.com schreef:
 
  utf-8 or any (utf) never need and never spend their time
  in reencoding.
 
 
  So? That python sometimes needs to do some kind of background
  processing is not a problem, whether it is garbage collection,
  allocating more memory, shufling around data blocks or reencoding a
  string, that doesn't matter. If you've got a real world example where
  one of those things noticeably slows your program down or makes the
  program behave faulty then you have something that is worthy of
  attention.
 
 
  Somewhat off topic, but befitting of the triviality of this thread, do I
  understand correctly that you are saying garbage collection never causes
 any
  noticeable slowdown in real-world circumstances? That's not remotely
 true.

 If it's done properly, garbage collection shouldn't hurt the *overall*
 performance of the app; most of the issues with GC timing are when one
 operation gets unexpectedly delayed for a GC run (making performance
 measurement hard, and such). It should certainly never cause your
 program to behave faultily, though I have seen cases where the GC run
 appears to cause the program to crash - something like this:

 some_string = buggy_call()
 ...
 gc()
 ...
 print(some_string)

 The buggy call mucked up the reference count, so the gc run actually
 wiped the string from memory - resulting in a segfault on next usage.
 But the GC wasn't at fault, the original call was. (Which, btw, was
 quite a debugging search, especially since the function in question
 wasn't my code.)


GC does have sometimes severe impact in memory-constrained environments,
though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/,
about half-way down, specifically
http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png
.

The best verification of these graphs I could find was
https://blog.mozilla.org/nnethercote/category/garbage-collection/, although
it's not immediately clear in Chrome's and Opera's case mainly due to none
of the benchmarks pushing memory usage significantly.

I also don't quite agree with the first post (sealedabstract) because I get
by *fine* on 2GB memory, so I don't see why you can't on a phone. Maybe iOS
is just really heavy. Nonetheless, the benchmarks aren't lying.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-27 Thread Steven D'Aprano
On Fri, 26 Jul 2013 08:46:58 -0700, wxjmfauth wrote:

 BTW, I'm pleased to read sequence of bits and not bytes. Again, utf
 transformers are producing sequence of bits, call Unicode Transformation
 Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32.
 UCS transformers are (were) producing bytes, from there the names
 ucs-2/4.


Not only does your distinction between bits and bytes make no practical 
difference on nearly all hardware in common use today[1], but the Unicode 
Consortium disagrees with you, and defines UTF in terms of bytes:

A Unicode transformation format (UTF) is an algorithmic mapping from 
every Unicode code point (except surrogate code points) to a unique byte 
sequence.

http://www.unicode.org/faq/utf_bom.html#gen2




[1] There may still be some old supercomputers where a byte is more than 
8 bits in use, but they're unlikely to support Unicode.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-27 Thread wxjmfauth
Le samedi 27 juillet 2013 04:05:03 UTC+2, Michael Torrie a écrit :
 On 07/26/2013 07:21 AM, wxjmfa...@gmail.com wrote:
  sys.getsizeof('––') - sys.getsizeof('–')
  I have already explained / commented this.

 Maybe it got lost in translation, but I don't understand your point with
 that.

  Hint: To understand Unicode (and every coding scheme), you should
  understand utf. The how and the *why*.

 Hmm, so if python used utf-8 internally to represent unicode strings
 would not that punish *all* users (not just non-ascii users) since
 searching a string for a certain character position requires an O(n)
 operation?  UTF-32 I could see (and indeed that's essentially what FSR
 uses when necessary does it not?), but not utf-8 or utf-16.

--

Did you read my previous link? Unicode Character Encoding Model.
Did you understand it?

Unicode only - No FSR (I skip some points and I still attempt to
be still correct.)

Unicode is a four-steps process.
[ {unique set of characters}  -- {unique set of code points, the
labels} --  {unique set of encoded code points} ] -- implementation
(bytes)

First point to notice. pure unicode, [...], is different from
the implementation. *This is a deliberate choice*.

The critical step is the path {unique set of characters} ---
{unique set of encoded code points} in such a way so that
the implementation can work comfortably with this *unique* set
of encoded code points. Conceptually, the implementation works
with a unique set of already prepared encoded code points.
This is a very critical step. To explain it in a dirty way:
in the above chain, this problem is already eliminated and
solved. Like a byte/char coding schemes where this step is
a no-op.

Now, and if you wish this is a seperated/different problem.
To create this unique set of encoded code points, Unicode
uses these utf(s). I repeat again, a confusing name, for the
process and the result of the process. (I neglect ucs).
What are these? Chunks of bits, group of 8/16/32 bits, words.
It is up to the implementation to convert these sequences
of bits into bytes, ***if you wish to convert these in bytes!***.
Surprise! Why not put two of the 32-bit words in a 64-bit
machine? (see golang / rune / int32).

Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature. Each utf chunk
holds intrinsically the character (in fact the code point) it is
supposed to represent. In utf-32, the obvious case, it is just
the code point. In utf-8, that's the first chunk which helps and
utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
implementation using bytes, for any pointer position it is always
possible to find the corresponding encoded code point and from this
the corresponding character without any programmed information. See
my editor example, how to find the char under the caret? In fact,
a silly example, how can the caret be positioned or moved, if
the underlying corresponding encoded code point cannot be
discerned!

Next step and one another separated problem.
Why all these utf versions? It is always the
same story. Some prefer the universality (utf-32) and
some prefer, well, some kind of conservatism. utf-8 is
more complicated, it demands more work and logically,
in an expected way, some performance regression.
utf-8 is more suited to produce bytes, utf16/32 for
internal processing. utf-8 had no choice to lose the
indexing. And so on.
Fact: all these coding schemes are working with a unique
set of encoded code points (surprise again, it's like a byte
string!). The loss of performance of utf-8 is very minimal
compared to the loss of performance one can get compare to
a multiple coding scheme. This kind of work has been done,
and if my informations are correct, even by the creators
of utf-8. (There are sometimes good scientists).

There are plenty of advantages in using utf instead of
something else and advantages in other fields than just
the pure coding.
utf-16/32 schemes have the advantages to ditch ascii
for ever. The ascii concept is no more existing.

One should also understand that all this stuff has
not been created from scratch. It was a balance between
existing technologies. MS stuck with the idea, no more
ascii, let's use ucs-2, and the *x world resisted the unicode
adoption as much as possible. utf-8 is one of the compromises for
the adoption of Unicode. Retrospectively, a not so good
compromise.

Computer scientists are funny scientists. They do love
to solve the problems they created themselves.

-

Quickly. sys.getsizeof() at the light of what I explained.

1) As this FSR works with multiple encoding, it has to keep
track of the encoding. it puts is in the overhead of str
class (overhead = real overhead + encoding). In such
a absurd way, that a 

 sys.getsizeof('€')
40

needs 14 bytes more than a

 sys.getsizeof('z')
26

You may vary the length of the str. The problem is
still here. Not bad for a coding scheme.

2) Take 

Re: RE Module Performance

2013-07-27 Thread Ian Kelly
On Sat, Jul 27, 2013 at 12:21 PM,  wxjmfa...@gmail.com wrote:
 Back to utf. utfs are not only elements of a unique set of encoded
 code points. They have an interesting feature. Each utf chunk
 holds intrisically the character (in fact the code point) it is
 supposed to represent. In utf-32, the obvious case, it is just
 the code point. In utf-8, that's the first chunk which helps and
 utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
 implementation using bytes, for any pointer position it is always
 possible to find the corresponding encoded code point and from this
 the corresponding character without any programmed information. See
 my editor example, how to find the char under the caret? In fact,
 a silly example, how can the caret can be positioned or moved, if
 the underlying corresponding encoded code point can not be
 dicerned!

Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted.  The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index.  Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
without an auxiliary data structure that is a O(n) operation.
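
For instance, a sketch of that O(n) scan in Python, mapping a character index
to a byte offset in a UTF-8 buffer by counting lead bytes (byte_offset is just
an illustrative helper, not a real API):

def byte_offset(buf, char_index):
    # linear scan: every byte that is not a continuation byte starts a character
    seen = 0
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:
            if seen == char_index:
                return i
            seen += 1
    raise IndexError(char_index)

buf = 'aé€\U0001d11e!'.encode('utf-8')
i = byte_offset(buf, 3)
print(i, buf[i:].decode('utf-8')[0])   # 6, the fourth character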

 2) Take a look at this. Get rid of the overhead.

 sys.getsizeof('b'*100 + 'c')
 126
 sys.getsizeof('b'*100 + '€')
 240

 What does it mean? It means that Python has to
 reencode a str every time it is necessary because
 it works with multiple codings.

Large strings in practical usage do not need to be resized like this
often.  Python 3.3 has been in production use for months now, and you
still have yet to produce any real-world application code that
demonstrates a performance regression.  If there is no real-world
regression, then there is no problem.

 3) Unicode compliance. We know retrospectively, latin-1,
 is was a bad choice. Unusable for 17 European languages.
 Believe of not. 20 years of Unicode of incubation is not
 long enough to learn it. When discussing once with a French
 Python core dev, one with commit access, he did not know one
 can not use latin-1 for the French language!

Probably because for many French strings, one can.  As far as I am
aware, the only characters that are missing from Latin-1 are the Euro
sign (an unfortunate victim of history), the ligature œ (I have no
doubt that many users just type oe anyway), and the rare capital Ÿ
(the minuscule version is present in Latin-1).  All French strings
that are fortunate enough to lack these characters can be
represented in Latin-1 and so will have a 1-byte width in the FSR.
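
That is easy to check (a quick sketch; the characters tested are the ones
mentioned above, the capital ligature, and two ordinary accented letters):

for ch in ['€', 'œ', 'Œ', 'Ÿ', 'é', 'ç']:
    try:
        ch.encode('latin-1')
        print(ch, 'fits in Latin-1')
    except UnicodeEncodeError:
        print(ch, 'does not fit in Latin-1')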
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread wxjmfauth
Le jeudi 25 juillet 2013 22:45:38 UTC+2, Ian a écrit :
 On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 [snip]
  JMF has explained that it is impossible, impossible I say!, to write an
  editor using a flexible string representation. Since Emacs uses such a
  flexible string representation, Emacs is impossible, and therefore Emacs
  doesn't exist.

  QED.

 Except that the described representation used by Emacs is a variant of
 UTF-8, not an FSR.  It doesn't have three different possible encodings
 for the letter 'a' depending on what other characters happen to be in
 the string.

 As I understand it, jfm would be perfectly happy if Python used UTF-8
 (or presumably the Emacs variant) as its internal string
 representation.

--

And emacs is probably working smoothly.

Your comment summarized all this stuff very correctly and
very briefly.

utf8/16/32? I do not care. They are all working correctly,
smoothly and efficiently. In fact, these utf's are already
doing correctly what this FSR is doing in a wrong way.

My preference? utf32. Why? It is the simplest and
consequently the best-performing choice. I'm not a narrow-minded
ascii user. (I do not pretend to belong to those who
are solving the quadrature of the circle; I pretend to
belong to those who know the quadrature of the circle
is not solvable).

Note: text processing tools or tools that have to process
characters — and the tools to build these tools — are all
moving to utf32, if not already done. There are technical
reasons behind this, which go beyond
pure raw unicode. They are however still 100% Unicode
compliant.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread wxjmfauth
Le vendredi 26 juillet 2013 05:09:34 UTC+2, Michael Torrie a écrit :
 On 07/25/2013 11:18 AM, Steven D'Aprano wrote:
  JMF has explained that it is impossible, impossible I say!, to write an
  editor using a flexible string representation. Since Emacs uses such a
  flexible string representation, Emacs is impossible, and therefore Emacs
  doesn't exist.

 Now I'm even more confused.  He once pointed to Go as an example of how
 unicode should be done in a language.  yet Go uses UTF-8 I think.

 But I don't think UTF-8 is what JMF refers to as flexible string
 representation.  FSR does use 1,2 or 4 bytes per character, but each
 character in the string uses the same width.  That's different from
 UTF-8 or UTF-16, which is variable width per character.

-

 sys.getsizeof('––') - sys.getsizeof('–')

I have already explained / commented this.




Hint: To understand Unicode (and every coding scheme), you should
understand utf. The how and the *why*.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread wxjmfauth
Le vendredi 26 juillet 2013 05:20:45 UTC+2, Ian a écrit :
 On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
  UTF-8 uses a flexible representation on a character-by-character basis.
  When parsing UTF-8, one needs to look at EVERY character to decide how
  many bytes you need to read. In Python 3, the flexible representation is
  on a string-by-string basis: once Python has looked at the string header,
  it can tell whether the *entire* string takes 1, 2 or 4 bytes per
  character, and the string is then fixed-width. You can't do that with
  UTF-8.

 UTF-8 does not use a flexible representation.  A codec that is
 encoding a string in UTF-8 and examining a particular character does
 not have any choice of how to encode that character; there is exactly
 one sequence of bits that is the UTF-8 encoding for the character.
 Further, for any given sequence of code points there is exactly one
 sequence of bytes that is the UTF-8 encoding of those code points.  In
 contrast, with the FSR there are as many as three different sequences
 of bytes that encode a sequence of code points, with one of them (the
 shortest) being canonical.  That's what makes it flexible.

 Anyway, my point was just that Emacs is not a counter-example to jmf's
 claim about implementing text editors, because UTF-8 is not what he
 (or anybody else) is referring to when speaking of the FSR or
 something like the FSR.




BTW, it is not necessary to use an endorsed Unicode coding
scheme (utf*), a string literal would have been possible,
but then one falls on memory issues.

All these utf are following the basic coding scheme.

I repeat again.
A coding scheme works with a unique set of characters
and its implementation works with a unique set of
encoded code points (the utf's, in case of Unicode).

And again, that is why we live today with all these coding
schemes; or, to take the problem from the other side, it is
because one has to work with a unique set of
encoded code points that all these coding schemes had to
be created.

utf's have not been created by newbies ;-)

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread wxjmfauth
Le vendredi 26 juillet 2013 05:20:45 UTC+2, Ian a écrit :
 On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 [snip]

 Anyway, my point was just that Emacs is not a counter-example to jmf's
 claim about implementing text editors, because UTF-8 is not what he
 (or anybody else) is referring to when speaking of the FSR or
 something like the FSR.

-

Let's be clear. I understand perfectly what utf-8 is,
and it is for that precise reason that I put the editor
example on the table.

This FSR is not *a* coding scheme. It is more a composite
coding scheme. (And from there, all the problems.)

BTW, I'm pleased to read sequence of bits and not bytes.
Again, utf transformers are producing sequences of bits,
called Unicode Transformation Units, with lengths of
8/16/32 *bits*, from there the names utf8/16/32.
UCS transformers are (were) producing bytes, from there
the names ucs-2/4.

jmf


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread Michael Torrie
On 07/26/2013 07:21 AM, wxjmfa...@gmail.com wrote:
 sys.getsizeof('––') - sys.getsizeof('–')
 
 I have already explained / commented this.

Maybe it got lost in translation, but I don't understand your point with
that.

 Hint: To understand Unicode (and every coding scheme), you should
 understand utf. The how and the *why*.

Hmm, so if python used utf-8 internally to represent unicode strings
would not that punish *all* users (not just non-ascii users) since
searching a string for a certain character position requires an O(n)
operation?  UTF-32 I could see (and indeed that's essentially what FSR
uses when necessary does it not?), but not utf-8 or utf-16.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread Steven D'Aprano
On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote:

 On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 UTF-8 uses a flexible representation on a character-by-character basis.
 When parsing UTF-8, one needs to look at EVERY character to decide how
 many bytes you need to read. In Python 3, the flexible representation
 is on a string-by-string basis: once Python has looked at the string
 header, it can tell whether the *entire* string takes 1, 2 or 4 bytes
 per character, and the string is then fixed-width. You can't do that
 with UTF-8.
 
 UTF-8 does not use a flexible representation.

I disagree, and so does Jeremy Sanders who first pointed out the 
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the 
Emacs documentation again:

To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc.

And the Python FSR:

To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any 
all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
BMP string takes up 2 bytes per character, etc.

See the similarity now? Both flexibly change the width used by code-
points, UTF-8 based on the code-point itself regardless of the rest of 
the string, Python based on the largest code-point in the string.
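
A quick way to see that per-character width on a CPython 3.3+ build is the
marginal getsizeof cost, so the fixed object header cancels out (a sketch;
exact header sizes differ between builds):

import sys

def bytes_per_char(ch, n=1000):
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

print(bytes_per_char('a'))           # 1: Latin-1-only string
print(bytes_per_char('\u20ac'))      # 2: BMP character (euro sign)
print(bytes_per_char('\U0001d11e'))  # 4: SMP character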


[...]
 Anyway, my point was just that Emacs is not a counter-example to jmf's
 claim about implementing text editors, because UTF-8 is not what he (or
 anybody else) is referring to when speaking of the FSR or something
 like the FSR.

Whether JMF can see the similarities between different implementations of 
strings or not is beside the point, those similarities do exist. As do 
the differences, of course, but in this case the differences are in 
favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8 
implementation *cannot know that*, and still has to walk the string byte-
by-byte checking whether the current code point requires 1, 2, 3, or 4 
bytes, while a FSR implementation can simply record the fact that the 
string is pure Latin1 at creation time, and then treat it as fixed-width 
from then on.

JMF claims that FSR is impossible to use efficiently, and yet he 
supports encoding schemes which are *less* efficient. Go figure. He tells 
us he has no problem with any of the established UTF encodings, and yet 
the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not 
UTF-16, since there are no surrogate pairs. But the difference is 
insignificant.)

Having watched this issue from Day One when JMF first complained about 
it, I believe this is entirely about denying any benefit to ASCII users. 
Had Python implemented a system identical to the current FSR except that 
it added a fourth category, all ASCII, which used an eight-byte 
encoding scheme (thus making ASCII strings twice as expensive as strings 
including code points from the Supplementary Multilingual Planes), JMF 
would be the scheme's number one champion.

I cannot see any other rational explanation for why JMF prefers broken, 
buggy Unicode implementations, or implementations which are equally 
expensive for all strings, over one which is demonstrably correct, 
demonstrably saves memory, and for realistic, non-contrived benchmarks, 
demonstrably faster, except that he wants to punish ASCII users more than 
he wants to support Unicode users.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread Ian Kelly
On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 See the similarity now? Both flexibly change the width used by code-
 points, UTF-8 based on the code-point itself regardless of the rest of
 the string, Python based on the largest code-point in the string.

No, I think we're just using the word flexible differently.  In my
view, simply being variable-width does not make an encoding flexible
in the sense of the FSR.  But I'm not going to keep repeating myself
in order to argue about it.

 Having watched this issue from Day One when JMF first complained about
 it, I believe this is entirely about denying any benefit to ASCII users.
 Had Python implemented a system identical to the current FSR except that
 it added a fourth category, all ASCII, which used an eight-byte
 encoding scheme (thus making ASCII strings twice as expensive as strings
 including code points from the Supplementary Multilingual Planes), JMF
 would be the scheme's number one champion.

I agree.  In fact I made a similar observation back in December:

http://mail.python.org/pipermail/python-list/2012-December/636942.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-26 Thread Steven D'Aprano
On Fri, 26 Jul 2013 22:12:36 -0600, Ian Kelly wrote:

 On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 See the similarity now? Both flexibly change the width used by code-
 points, UTF-8 based on the code-point itself regardless of the rest of
 the string, Python based on the largest code-point in the string.
 
 No, I think we're just using the word flexible differently.  In my
 view, simply being variable-width does not make an encoding flexible
 in the sense of the FSR.  But I'm not going to keep repeating myself in
 order to argue about it.

But I paid for the full half hour!

http://en.wikipedia.org/wiki/The_Argument_Sketch


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF:

 His most recent argument that Python should use UTF as a representation
 is very strange to be honest.

He's not arguing for anything, he is just hating on anything that gives 
even the tiniest benefit to ASCII users. This isn't about Python 3.3. 
hurting non-ASCII users, because that is demonstrably untrue: they are 
*better off* in Python 3.3. This is about denying even a tiny benefit to 
ASCII users.

In Python 3.3, non-ASCII users have these advantages compared to previous 
versions:

- strings will usually take less memory, and aside from trivial changes 
to the object header, they never take more memory than a wide build would 
use;

- consequently nearly all objects will take less memory (especially 
builtins and standard library objects, which are all ASCII), since 
objects contain dozens of internal strings (attribute and method names in 
__dict__, class name, etc.);

- consequently whole-application benchmarks show most applications will 
use significantly less memory, which leads to faster speeds;

- you cannot break surrogate pairs apart by accident, which you can do in 
narrow builds;

- in previous versions, code which works when run in a wide build may 
fail in a narrow build, but that is no longer an issue since the 
distinction between wide and narrow builds is gone;

- Latin1 users, which includes JMF himself, will likewise see memory 
savings, since Latin1 strings will take half the size of narrow builds 
and a quarter the size of wide builds.


The cost of all these benefits is a small overhead when creating a string 
in the first place, and some purely internal added complication to the 
string implementation.

I'm the first to argue against complication unless there is a 
corresponding benefit. This is a case where the benefit has proven itself 
doubly: Python 3.3's Unicode implementation is *more correct* than 
before, and it uses less memory to do so.

 The cons of UTF are apparent and widely
 known.  The main con is that UTF strings are O(n) for indexing a
 position within the string.

Not so for UTF-32.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka storch...@gmail.com wrote:
 24.07.13 21:15, Chris Angelico написав(ла):

 To my mind, exposing UTF-16
 surrogates to the application is a bug to be fixed, not a feature to
 be maintained.


 Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates
 area) to represent undecodable bytes with surrogateescape error handler.

That's a deliberate and conscious use of the codepoints; that's not
what I'm talking about here. Suppose you read a UTF-8 stream of bytes
from a file, and decode them into your language's standard string
type. At this point, you should be working with a string of Unicode
codepoints:

\22\341\210\264\360\222\215\205

--

\x12\u1234\U00012345

The incoming byte stream has a length of 8, the resulting character
stream has a length of 3. Now, if the language wants to use UTF-16
internally, it's free to do so:

0012 1234 d808 df45

When I referred to exposing surrogates to the application, this is
what I'm talking about. If decoding the above byte stream results in a
length 4 string where the last two are \xd808 and \xdf45, then it's
exposing them. If it's a length 3 string where the last is \U00012345,
then it's hiding them. To be honest, I don't imagine I'll ever see a
language that stores strings in UTF-16 and then exposes them to the
application as UTF-32; there's very little point. But such *is*
possible, and if it's working closely with libraries that demand
UTF-16, it might well make sense to do things that way.
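
The same example in Python terms, which exposes code points to the application
whatever it stores internally; this is just the decode/encode round trip:

import struct

raw = b'\x12\xe1\x88\xb4\xf0\x92\x8d\x85'
s = raw.decode('utf-8')
print(len(raw), len(s))                      # 8 bytes in, 3 code points out

units = struct.unpack('>4H', s.encode('utf-16-be'))
print(' '.join('%04x' % u for u in units))   # 0012 1234 d808 df45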

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Thu, 25 Jul 2013 00:34:24 +1000, Chris Angelico wrote:

 But mainly, I'm just wondering how many people here have any basis from
 which to argue the point he's trying to make. I doubt most of us have
 (a) implemented an editor widget, or (b) tested multiple different
 internal representations to learn the true pros and cons of each. And
 even if any of us had, that still wouldn't have any bearing on PEP 393,
 which is about applications, not editor widgets. As stated above, Python
 strings before AND after PEP 393 are poor choices for an editor, ergo
 arguing from that standpoint is pretty useless.

That's a misleading way to put it. Using immutable strings as editor 
buffers might be a bad way to implement all but the most trivial, low-
performance (i.e. slow) editor, but the basic concept of PEP 393, picking 
an internal representation of the text based on its contents, is not. 
That's just normal. The only difference with PEP 393 is that the choice 
is made on the fly, at runtime, instead of decided in advance by the 
programmer.

I expect that the PEP 393 concept of optimizing memory per string buffer 
would work well in an editor. However the internal buffer is arranged, 
you can safely assume that each chunk of text (word, sentence, paragraph, 
buffer...) will very rarely shift from all Latin 1 to all BMP to 
includes SMP chars. So, for example, entering a SMP character will need 
to immediately up-cast the chunk from 1-byte per char to 4-bytes per 
char, which is relatively pricey, but it's a one-off cost. Down-casting 
when the SMP character is deleted doesn't need to be done immediately, it 
can be performed when the application is idle.

If the chunks are relatively small (say, a paragraph rather than multiple 
pages of text) then even that initial conversion will be invisible. A 
fast touch typist hits a key about every 0.1 of a second; if it takes a 
millisecond to convert the chunk, you wouldn't even notice the delay. You 
can copy and up-cast a lot of bytes in a millisecond.
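
A rough way to put a number on that one-off cost (a sketch; the chunk size and
the figures are machine-dependent):

import timeit

setup = "chunk = 'x' * 2000; smp = '\\U0001d11e'"
n = 100000
t_up   = timeit.timeit("chunk + smp", setup=setup, number=n)   # copy + up-cast to 4 bytes/char
t_same = timeit.timeit("chunk + 'y'", setup=setup, number=n)   # copy at the same 1-byte width
print("up-cast copy:    %.2f microseconds" % (t_up / n * 1e6))
print("same-width copy: %.2f microseconds" % (t_same / n * 1e6))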


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:

 If nobody had ever thought of doing a multi-format string
 representation, I could well imagine the Python core devs debating
 whether the cost of UTF-32 strings is worth the correctness and
 consistency improvements... and most likely concluding that narrow
 builds get abolished. And if any other language (eg ECMAScript) decides
 to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
 even if it broke code to do so.

Unfortunately, so long as most language designers are European-centric, 
there is going to be a lot of push-back against any attempt to fix (say) 
Javascript, or Java just for the sake of a bunch of dead languages in 
the SMPs. Thank goodness for emoji. Wait til the young kids start 
complaining that their emoticons and emoji are broken in Javascript, and 
eventually it will get fixed. It may take a decade, for the young kids to 
grow up and take over Javascript from the old-codgers, but it will happen.


 To my mind, exposing UTF-16 surrogates
 to the application is a bug to be fixed, not a feature to be maintained.

This, times a thousand.

It is *possible* to have non-buggy string routines using UTF-16, but the 
implementation is a lot more complex than most language developers can be 
bothered with. I'm not aware of any language that uses UTF-16 internally 
that doesn't give wrong results for surrogate pairs.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Thu, Jul 25, 2013 at 5:02 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 00:34:24 +1000, Chris Angelico wrote:

 But mainly, I'm just wondering how many people here have any basis from
 which to argue the point he's trying to make. I doubt most of us have
 (a) implemented an editor widget, or (b) tested multiple different
 internal representations to learn the true pros and cons of each. And
 even if any of us had, that still wouldn't have any bearing on PEP 393,
 which is about applications, not editor widgets. As stated above, Python
 strings before AND after PEP 393 are poor choices for an editor, ergo
 arguing from that standpoint is pretty useless.

 That's a misleading way to put it. Using immutable strings as editor
 buffers might be a bad way to implement all but the most trivial, low-
 performance (i.e. slow) editor, but the basic concept of PEP 393, picking
 an internal representation of the text based on its contents, is not.
 That's just normal. The only difference with PEP 393 is that the choice
 is made on the fly, at runtime, instead of decided in advance by the
 programmer.

Maybe I worded it poorly, but my point was the same as you're saying
here: that a Python string is a poor buffer for editing, regardless of
PEP 393. It's not that PEP 393 makes Python strings worse for writing
a text editor, it's that immutability does that.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:

 If nobody had ever thought of doing a multi-format string
 representation, I could well imagine the Python core devs debating
 whether the cost of UTF-32 strings is worth the correctness and
 consistency improvements... and most likely concluding that narrow
 builds get abolished. And if any other language (eg ECMAScript) decides
 to move from UTF-16 to UTF-32, I would wholeheartedly support the move,
 even if it broke code to do so.

 Unfortunately, so long as most language designers are European-centric,
 there is going to be a lot of push-back against any attempt to fix (say)
 Javascript, or Java just for the sake of a bunch of dead languages in
 the SMPs. Thank goodness for emoji. Wait til the young kids start
 complaining that their emoticons and emoji are broken in Javascript, and
 eventually it will get fixed. It may take a decade, for the young kids to
 grow up and take over Javascript from the old-codgers, but it will happen.

I don't know that that'll happen like that. Emoticons aren't broken in
Javascript - you can use them just fine. You only start seeing
problems when you index into that string. People will start to wonder
why, for instance, a 500 character maximum field deducts two from
the limit when an emoticon goes in. Example:

Type here:<br><textarea id="content" oninput="showlimit(this)"></textarea>
<br>You have <span id="limit1">500</span> characters left (self.value.length).
<br>You have <span id="limit2">500</span> characters left (self.textLength).
<script>
function showlimit(self)
{
    document.getElementById("limit1").innerHTML = 500 - self.value.length;
    document.getElementById("limit2").innerHTML = 500 - self.textLength;
}
</script>

I've included an attribute documented here[1] as the codepoint length
of the control's value, but in Chrome on Windows, it still counts
UTF-16 code units. However, I very much doubt that this will result in
language changes. People will just live with it. Chinese and Japanese
users will complain, perhaps, and the developers will write it off as
whinging, and just say That's what the internet does. Maybe, if
you're really lucky, they'll acknowledge that that's what JavaScript
does, but even then I doubt it'd result in language changes.
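As a rough illustration (Python 3.3+, where len() counts code points; the
message and the 500-unit limit here are made up for the example):

msg = 'hello \U0001F600'                           # six BMP chars plus one emoji
code_points = len(msg)                             # 7
utf16_units = len(msg.encode('utf-16-le')) // 2    # 8 -- the emoji costs two units
print(500 - code_points)                           # 493: what the user expects
print(500 - utf16_units)                           # 492: what a UTF-16 code-unit .length reports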

 To my mind, exposing UTF-16 surrogates
 to the application is a bug to be fixed, not a feature to be maintained.

 This, times a thousand.

 It is *possible* to have non-buggy string routines using UTF-16, but the
 implementation is a lot more complex than most language developers can be
 bothered with. I'm not aware of any language that uses UTF-16 internally
 that doesn't give wrong results for surrogate pairs.

The problem isn't the underlying representation, the problem is what
gets exposed to the application. Once you've decided to expose
codepoints to the app (abstracting over your UTF-16 underlying
representation), the change to using UTF-32, or mimicking PEP 393, or
some other structure, is purely internal and an optimization. So I
doubt any language will use UTF-16 internally and UTF-32 to the app.
It'd be needlessly complex.

ChrisA

[1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote:

 On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote:

 If nobody had ever thought of doing a multi-format string
 representation, I could well imagine the Python core devs debating
 whether the cost of UTF-32 strings is worth the correctness and
 consistency improvements... and most likely concluding that narrow
 builds get abolished. And if any other language (eg ECMAScript)
 decides to move from UTF-16 to UTF-32, I would wholeheartedly support
 the move, even if it broke code to do so.

 Unfortunately, so long as most language designers are European-centric,
 there is going to be a lot of push-back against any attempt to fix
 (say) Javascript, or Java just for the sake of a bunch of dead
 languages in the SMPs. Thank goodness for emoji. Wait til the young
 kids start complaining that their emoticons and emoji are broken in
 Javascript, and eventually it will get fixed. It may take a decade, for
 the young kids to grow up and take over Javascript from the
 old-codgers, but it will happen.
 
 I don't know that that'll happen like that. Emoticons aren't broken in
 Javascript - you can use them just fine. You only start seeing problems
 when you index into that string. People will start to wonder why, for
 instance, a 500 character maximum field deducts two from the limit
 when an emoticon goes in.

I get that. I meant *Javascript developers*, not end-users. The young 
kids today who become Javascript developers tomorrow will grow up in a 
world where they expect to be able to write band names like
▼□■□■□■ (yes, really, I didn't make that one up) and have it just work.
Okay, all those characters are in the BMP, but emoji aren't, and I 
guarantee that even as we speak some new hipster band is trying to decide 
whether to name themselves Smiling  or Crying .

:-)



 It is *possible* to have non-buggy string routines using UTF-16, but
 the implementation is a lot more complex than most language developers
 can be bothered with. I'm not aware of any language that uses UTF-16
 internally that doesn't give wrong results for surrogate pairs.
 
 The problem isn't the underlying representation, the problem is what
 gets exposed to the application. Once you've decided to expose
 codepoints to the app (abstracting over your UTF-16 underlying
 representation), the change to using UTF-32, or mimicking PEP 393, or
 some other structure, is purely internal and an optimization. So I doubt
 any language will use UTF-16 internally and UTF-32 to the app. It'd be
 needlessly complex.

To be honest, I don't understand what you are trying to say.

What I'm trying to say is that it is possible to use UTF-16 internally, 
but *not* assume that every code point (character) is represented by a 
single 2-byte unit. For example, the len() of a UTF-16 string should not 
be calculated by counting the number of bytes and dividing by two. You 
actually need to walk the string, inspecting each double-byte:

# calculate length
count = 0
inside_surrogate = False
for bb in buffer:  # get two bytes at a time
    if is_lower_surrogate(bb):
        inside_surrogate = True
        continue
    if is_upper_surrogate(bb):
        if inside_surrogate:
            count += 1
            inside_surrogate = False
            continue
        raise ValueError("missing lower surrogate")
    if inside_surrogate:
        break
    count += 1
if inside_surrogate:
    raise ValueError("missing upper surrogate")


Given immutable strings, you could validate the string once, on creation, 
and from then on assume they are well-formed:

# calculate length, assuming the string is well-formed:
count = 0
skip = False
for bb in buffer:  # get two bytes at a time
    if skip:
        # second unit of a surrogate pair, already counted
        skip = False
        continue
    if is_surrogate(bb):
        skip = True
    count += 1


String operations such as slicing become much more complex once you can 
no longer assume a 1:1 relationship between code points and code units, 
whether they are 1, 2 or 4 bytes. Most (all?) language developers don't 
handle that complexity, and push responsibility for it back onto the 
coder using the language. 
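A runnable sketch of the same idea (the function name and test string are
invented for illustration; the surrogate range is the one defined by UTF-16,
and the buffer is assumed to be well-formed little-endian UTF-16):

import struct

def utf16_codepoint_count(buf):
    # buf holds well-formed little-endian UTF-16, no BOM, no validation
    count = 0
    i = 0
    while i < len(buf):
        unit, = struct.unpack_from('<H', buf, i)
        if 0xD800 <= unit <= 0xDBFF:   # lead surrogate: a trail unit follows
            i += 4
        else:
            i += 2
        count += 1
    return count

buf = 'a\U0001D11E€'.encode('utf-16-le')
print(utf16_codepoint_count(buf))   # 3 code points
print(len(buf) // 2)                # 4 -- naive "count bytes / 2" overcounts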



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread wxjmfauth
On Wednesday, July 24, 2013 4:47:36 PM UTC+2, Michael Torrie wrote:
 On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote:
  Sorry, you are not understanding Unicode. What is a Unicode
  Transformation Format (UTF), what is the goal of a UTF and
  why it is important for an implementation to work with a UTF.
 
 Really?  Enlighten me.
 
 Personally, I would never use UTF as a representation *in memory* for a
 unicode string if it were up to me.  Why?  Because UTF characters are
 not uniform in byte width so accessing positions within the string is
 terribly slow and has to always be done by starting at the beginning of
 the string.  That's at minimum O(n) compared to FSR's O(1).  Surely you
 understand this.  Do you dispute this fact?
 
 UTF is a great choice for interchange, though, and indeed that's what it
 was designed for.
 
 Are you calling for UTF to be adopted as the internal, in-memory
 representation of unicode?  Or would you simply settle for UCS-4?
 Please be clear here.  What are you saying?
 
  Short example. Writing an editor with something like the
  FSR is simply impossible (properly).
 
 How? FSR is just an implementation detail.  It could be UCS-4 and it
 would also work.

-

A coding scheme works with a unique set of characters (the repertoire),
and the implementation (the programming) works with a unique set
of encoded code points. The critical step is the path
{unique set of characters} -- {unique set of encoded code points}


Fact: there is no other way to do it properly. (This explains
why we have to live today with all these coding schemes, and also
why so many coding schemes had to be created.)

How to understand it? With a sheet of paper and a pencil.

In the byte string world, this step is a no-op.

In Unicode, it is exactly the purpose of a utf to achieve this
step. utf: a confusing name covering at the same time the
process and the result of the process.
A utf chunk, a series of bits (not bytes), holds intrinsically
the information about the character it is representing.

Other exotic coding schemes, like iso6937 or CID-fonts, are working
in the same way.

Unicode with the help of utf(s) does not differ from the basic
rule.

-

ucs-2: ucs-2 is a perfectly and correctly working coding scheme.
ucs-2 is not different from the other coding schemes and does
not behave differently (cp... or iso-... or ...). It only
covers a smaller repertoire.

-

utf32: as I pointed out many times, you are already using it (maybe
without knowing it). Where? In fonts (OpenType technology),
rendering engines, pdf files. Why? Because there is no other
way to do it better.

--

The Unicode table (its construction) is a problem per se.
It is not a technical problem, but a very important linguistic
aspect of Unicode.
See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

--

If you are not understanding my editor analogy, here is another
proposed exercise: build/create a flexible iso-8859-X coding
scheme. You will quickly understand where the bottleneck
is.
Two working ways:
- stupidly, with an editor and your fingers.
- lazily, with a sheet of paper and your head.




About my benchmarks: no offense. You are not understanding them,
because you do not understand what this FSR does and how characters
are coded. It's a bit of a vicious circle.

Conceptually, this FSR is spending its time solving the
problem it creates itself, with plenty of side effects.

-

There is a clear difference between FSR and ucs-4/utf32.

-

See also:
http://www.unicode.org/reports/tr17/

(In my mind, quite dry and not easy to understand at
a first reading).


jmf


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 What I'm trying to say is that it is possible to use UTF-16 internally,
 but *not* assume that every code point (character) is represented by a
 single 2-byte unit. For example, the len() of a UTF-16 string should not
 be calculated by counting the number of bytes and dividing by two. You
 actually need to walk the string, inspecting each double-byte

Anything's possible. But since underlying representations can be
changed fairly easily (relative term of course - it's a lot of work,
but it can be changed in a single release, no deprecation required or
anything), there's very little reason to continue using UTF-16
underneath. May as well switch to UTF-32 for convenience, or PEP 393
for convenience and efficiency, or maybe some other system that's
still mostly fixed-width.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Thu, Jul 25, 2013 at 7:27 PM,  wxjmfa...@gmail.com wrote:
 A coding scheme works with a unique set of characters (the repertoire),
 and the implementation (the programming) works with a unique set
 of encoded code points. The critical step is the path
 {unique set of characters} -- {unique set of encoded code points}

That's called Unicode. It maps the character 'A' to the code point
U+0041 and so on. Code points are integers. In fact, they are very
well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

 In the byte string world, this step is a no-op.

 In Unicode, it is exactly the purpose of a utf to achieve this
 step. utf: a confusing name covering at the same time the
 process and the result of the process.
 A utf chunk, a series of bits (not bytes), hold intrisically
 the information about the character it is representing.

No, now you're looking at another level: how to store codepoints in
memory. That demands that they be stored as bits and bytes, because PC
memory works that way.

 utf32: as a pointed many times. You are already using it (maybe
 without knowing it). Where? in fonts (OpenType technology),
 rendering engines, pdf files. Why? Because there is not other
 way to do it better.

And UTF-32 is an excellent system... as long as you're okay with
spending four bytes for every character.

 See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

I refuse to click this link. Give us a link to the
python-list@python.org archive, or gmane, or something else more
suited to the audience. I'm not going to Google Groups just to figure
out what you're saying.

 If you are not understanding my editor analogy. One other
 proposed exercise. Build/create a flexible iso-8859-X coding
 scheme. You will quickly understand where the bottleneck
 is.
 Two working ways:
 - stupidly with an editor and your fingers.
 - lazily with a sheet of paper and you head.

What has this to do with the editor?

 There is a clear difference between FSR and ucs-4/utf32.

Yes. Memory usage. PEP 393 strings might take up half or even a
quarter of what they'd take up in fixed UTF-32. Other than that,
there's no difference.
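A quick way to see that difference in practice (a sketch; sys.getsizeof
includes a fixed per-object header, and exact numbers vary by build):

import sys

for s in ('x' * 100, '\u20ac' * 100, '\U0001D11E' * 100):
    print(len(s), sys.getsizeof(s), 4 * len(s))
    # PEP 393 stores these at roughly 1, 2 and 4 bytes per character plus
    # the header; a fixed UTF-32 buffer would need 4 * len(s) for all three.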

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Jeremy Sanders
wxjmfa...@gmail.com wrote:

 Short example. Writing an editor with something like the
 FSR is simply impossible (properly).

http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations

To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are 
codepoints of text characters within buffers and strings. Rather, Emacs uses a 
variable-length internal representation of characters, that stores each 
character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of 
its codepoint[1]. For example, any ASCII character takes up only 1 byte, a 
Latin-1 character takes up 2 bytes, etc. We call this representation of text 
multibyte.

...

[1] This internal representation is based on one of the encodings defined by 
the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but 
Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-
bit bytes and characters not unified with Unicode.
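For concreteness, plain UTF-8 (without Emacs's extension for raw bytes)
spreads code points over 1 to 4 bytes; a small sketch:

for ch in ('a', '\u00e9', '\u20ac', '\U0001D11E'):
    print('U+%04X' % ord(ch), len(ch.encode('utf-8')), 'byte(s) in UTF-8')
# U+0061 -> 1, U+00E9 -> 2, U+20AC -> 3, U+1D11E -> 4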



Jeremy


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Devyn Collier Johnson


On 07/25/2013 09:36 AM, Jeremy Sanders wrote:

wxjmfa...@gmail.com wrote:


Short example. Writing an editor with something like the
FSR is simply impossible (properly).

http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations

To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are
codepoints of text characters within buffers and strings. Rather, Emacs uses a
variable-length internal representation of characters, that stores each
character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of
its codepoint[1]. For example, any ASCII character takes up only 1 byte, a
Latin-1 character takes up 2 bytes, etc. We call this representation of text
multibyte.

...

[1] This internal representation is based on one of the encodings defined by
the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but
Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-
bit bytes and characters not unified with Unicode.



Jeremy


Wow! The thread that I started has changed a lot and lived a long time. 
I look forward to its first birthday (^u^).


Devyn Collier Johnson
--
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:

 wxjmfa...@gmail.com wrote:
 
 Short example. Writing an editor with something like the FSR is simply
 impossible (properly).
 
 http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-
Representations.html#Text-Representations
 
 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
 that are codepoints of text characters within buffers and strings.
 Rather, Emacs uses a variable-length internal representation of
 characters, that stores each character as a sequence of 1 to 5 8-bit
 bytes, depending on the magnitude of its codepoint[1]. For example, any
 ASCII character takes up only 1 byte, a Latin-1 character takes up 2
 bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs 
doesn't really exist.


 [1] This internal representation is based on one of the encodings
 defined by the Unicode Standard, called UTF-8, for representing any
 Unicode codepoint, but Emacs extends UTF-8 to represent the additional
 codepoints it uses for raw 8- bit bytes and characters not unified with
 Unicode.
 

Do you know what those characters not unified with Unicode are? Is there 
a list somewhere? I've read all of the pages from here to no avail:

http://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
 that are codepoints of text characters within buffers and strings.
 Rather, Emacs uses a variable-length internal representation of
 characters, that stores each character as a sequence of 1 to 5 8-bit
 bytes, depending on the magnitude of its codepoint[1]. For example, any
 ASCII character takes up only 1 byte, a Latin-1 character takes up 2
 bytes, etc. We call this representation of text multibyte.

 Well, you've just proven what Vim users have always suspected: Emacs
 doesn't really exist.

... lolwut?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:

 On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
 that are codepoints of text characters within buffers and strings.
 Rather, Emacs uses a variable-length internal representation of
 characters, that stores each character as a sequence of 1 to 5 8-bit
 bytes, depending on the magnitude of its codepoint[1]. For example,
 any ASCII character takes up only 1 byte, a Latin-1 character takes up
 2 bytes, etc. We call this representation of text multibyte.

 Well, you've just proven what Vim users have always suspected: Emacs
 doesn't really exist.
 
 ... lolwut?


JMF has explained that it is impossible, impossible I say!, to write an 
editor using a flexible string representation. Since Emacs uses such a 
flexible string representation, Emacs is impossible, and therefore Emacs 
doesn't exist.

QED.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Fri, Jul 26, 2013 at 3:18 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:

 On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
 that are codepoints of text characters within buffers and strings.
 Rather, Emacs uses a variable-length internal representation of
 characters, that stores each character as a sequence of 1 to 5 8-bit
 bytes, depending on the magnitude of its codepoint[1]. For example,
 any ASCII character takes up only 1 byte, a Latin-1 character takes up
 2 bytes, etc. We call this representation of text multibyte.

 Well, you've just proven what Vim users have always suspected: Emacs
 doesn't really exist.

 ... lolwut?


 JMF has explained that it is impossible, impossible I say!, to write an
 editor using a flexible string representation. Since Emacs uses such a
 flexible string representation, Emacs is impossible, and therefore Emacs
 doesn't exist.

 QED.

Quad Error Demonstrated.

I never got past the level of Canis Latinicus in debating class.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread wxjmfauth
On Thursday, July 25, 2013 12:14:46 PM UTC+2, Chris Angelico wrote:
 On Thu, Jul 25, 2013 at 7:27 PM,  wxjmfa...@gmail.com wrote:
  A coding scheme works with a unique set of characters (the repertoire),
  and the implementation (the programming) works with a unique set
  of encoded code points. The critical step is the path
  {unique set of characters} -- {unique set of encoded code points}
 
 That's called Unicode. It maps the character 'A' to the code point
 U+0041 and so on. Code points are integers. In fact, they are very
 well represented in Python that way (also in Pike, fwiw):
 
  ord('A')
 65
  chr(65)
 'A'
  chr(123456)
 '\U0001e240'
  ord(_)
 123456
 
  In the byte string world, this step is a no-op.
 
  In Unicode, it is exactly the purpose of a utf to achieve this
  step. utf: a confusing name covering at the same time the
  process and the result of the process.
  A utf chunk, a series of bits (not bytes), hold intrisically
  the information about the character it is representing.
 
 No, now you're looking at another level: how to store codepoints in
 memory. That demands that they be stored as bits and bytes, because PC
 memory works that way.
 
  utf32: as a pointed many times. You are already using it (maybe
  without knowing it). Where? in fonts (OpenType technology),
  rendering engines, pdf files. Why? Because there is not other
  way to do it better.
 
 And UTF-32 is an excellent system... as long as you're okay with
 spending four bytes for every character.
 
  See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0
 
 I refuse to click this link. Give us a link to the
 python-list@python.org archive, or gmane, or something else more
 suited to the audience. I'm not going to Google Groups just to figure
 out what you're saying.
 
  If you are not understanding my editor analogy. One other
  proposed exercise. Build/create a flexible iso-8859-X coding
  scheme. You will quickly understand where the bottleneck
  is.
  Two working ways:
  - stupidly with an editor and your fingers.
  - lazily with a sheet of paper and you head.
 
 What has this to do with the editor?
 
  There is a clear difference between FSR and ucs-4/utf32.
 
 Yes. Memory usage. PEP 393 strings might take up half or even a
 quarter of what they'd take up in fixed UTF-32. Other than that,
 there's no difference.
 
 ChrisA




Let's start with a simple string \textemdash or \texttendash

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

jmf

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Chris Angelico
On Fri, Jul 26, 2013 at 5:07 AM,  wxjmfa...@gmail.com wrote:
 Let start with a simple string \textemdash or \texttendash

 sys.getsizeof('–')
 40
 sys.getsizeof('a')
 26

Most of the cost is in those two apostrophes, look:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof(a)
8

Okay, that's slightly unfair (bonus points: figure out what I did to
make this work; there are at least two right answers) but still, look
at what an empty string costs:

>>> sys.getsizeof('')
25

Or look at the difference between one of these characters and two:

>>> sys.getsizeof('aa')-sys.getsizeof('a')
1
>>> sys.getsizeof('––')-sys.getsizeof('–')
2

That's what the characters really cost. The overhead is fixed. It is,
in fact, almost completely insignificant. The storage requirement for
a non-ASCII, BMP-only string converges to two bytes per character.
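The same measurement with longer strings makes the convergence even clearer
(a sketch; the exact sizes are CPython 3.3 implementation details):

import sys

print(sys.getsizeof('–' * 1000) - sys.getsizeof('–' * 999))   # 2: bytes per extra BMP char
print(sys.getsizeof('a' * 1000) - sys.getsizeof('a' * 999))   # 1: bytes per extra ASCII char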

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: RE Module Performance

2013-07-25 Thread Prasad, Ramit
Chris Angelico wrote:
 On Fri, Jul 26, 2013 at 5:07 AM,  wxjmfa...@gmail.com wrote:
  Let start with a simple string \textemdash or \texttendash
 
  sys.getsizeof('-')
  40
  sys.getsizeof('a')
  26
 
 Most of the cost is in those two apostrophes, look:
 
  sys.getsizeof('a')
 26
  sys.getsizeof(a)
 8
 
 Okay, that's slightly unfair (bonus points: figure out what I did to
 make this work; there are at least two right answers) but still, look
 at what an empty string costs:

I like bonus points. :)
>>> a = None
>>> sys.getsizeof(a)
8

Not sure what the other right answer is...booleans take 12 bytes (on 2.6)

 
  sys.getsizeof('')
 25
 
 Or look at the difference between one of these characters and two:
 
  sys.getsizeof('aa')-sys.getsizeof('a')
 1
  sys.getsizeof('--')-sys.getsizeof('-')
 2
 
 That's what the characters really cost. The overhead is fixed. It is,
 in fact, almost completely insignificant. The storage requirement for
 a non-ASCII, BMP-only string converges to two bytes per character.
 
 ChrisA
 --
 http://mail.python.org/mailman/listinfo/python-list


Ramit



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Ian Kelly
On Wed, Jul 24, 2013 at 9:34 AM, Chris Angelico ros...@gmail.com wrote:
 On Thu, Jul 25, 2013 at 12:17 AM, David Hutto dwightdhu...@gmail.com wrote:
 I've screwed up plenty of times in python, but can write code like a pro
 when I'm feeling better(on SSI and medicaid). An editor can be built simply,
 but it's preference that makes the difference. Some might have used tkinter,
 gtk. wxpython or other methods for the task.

 I think the main issue in responding is your library preference, or widget
 set preference. These can make you right with some in your response, or
 wrong with others that have a preferable gui library that coincides with
 one's personal cognitive structure that makes t

 jmf's point is more about writing the editor widget (Scintilla, as
 opposed to SciTE), which most people will never bother to do. I've
 written several text editors, always by embedding someone else's
 widget, and therefore not concerning myself with its internal string
 representation. Frankly, Python's strings are a *terrible* internal
 representation for an editor widget - not because of PEP 393, but
 simply because they are immutable, and every keypress would result in
 a rebuilding of the string. On the flip side, I could quite plausibly
 imagine using a list of strings; whenever text gets inserted, the
 string gets split at that point, and a new string created for the
 insert (which also means that an Undo operation simply removes one
 entire string). In this usage, the FSR is beneficial, as it's possible
 to have different strings at different widths.

 But mainly, I'm just wondering how many people here have any basis
 from which to argue the point he's trying to make. I doubt most of us
 have (a) implemented an editor widget, or (b) tested multiple
 different internal representations to learn the true pros and cons of
 each. And even if any of us had, that still wouldn't have any bearing
 on PEP 393, which is about applications, not editor widgets. As stated
 above, Python strings before AND after PEP 393 are poor choices for an
 editor, ergo arguing from that standpoint is pretty useless. Not that
 that bothers jmf...

I think you've just motivated me to finally get around to writing the
custom output widget for my MUD client.  Of course that will be
simpler than a standard rich text editor widget, since it will never
receive input from the user and modifications will (typically) always
come in the form of append operations.  I intend to write it in pure
Python (well, wxPython), however.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Ian Kelly
On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:

 On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
 that are codepoints of text characters within buffers and strings.
 Rather, Emacs uses a variable-length internal representation of
 characters, that stores each character as a sequence of 1 to 5 8-bit
 bytes, depending on the magnitude of its codepoint[1]. For example,
 any ASCII character takes up only 1 byte, a Latin-1 character takes up
 2 bytes, etc. We call this representation of text multibyte.

 Well, you've just proven what Vim users have always suspected: Emacs
 doesn't really exist.

 ... lolwut?


 JMF has explained that it is impossible, impossible I say!, to write an
 editor using a flexible string representation. Since Emacs uses such a
 flexible string representation, Emacs is impossible, and therefore Emacs
 doesn't exist.

 QED.

Except that the described representation used by Emacs is a variant of
UTF-8, not an FSR.  It doesn't have three different possible encodings
for the letter 'a' depending on what other characters happen to be in
the string.

As I understand it, jmf would be perfectly happy if Python used UTF-8
(or presumably the Emacs variant) as its internal string
representation.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Steven D'Aprano
On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote:

 On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote:

 On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote:
 To conserve memory, Emacs does not hold fixed-length 22-bit numbers
 that are codepoints of text characters within buffers and strings.
 Rather, Emacs uses a variable-length internal representation of
 characters, that stores each character as a sequence of 1 to 5 8-bit
 bytes, depending on the magnitude of its codepoint[1]. For example,
 any ASCII character takes up only 1 byte, a Latin-1 character takes
 up 2 bytes, etc. We call this representation of text multibyte.

 Well, you've just proven what Vim users have always suspected: Emacs
 doesn't really exist.

 ... lolwut?


 JMF has explained that it is impossible, impossible I say!, to write an
 editor using a flexible string representation. Since Emacs uses such a
 flexible string representation, Emacs is impossible, and therefore
 Emacs doesn't exist.

 QED.
 
 Except that the described representation used by Emacs is a variant of
 UTF-8, not an FSR.  It doesn't have three different possible encodings
 for the letter 'a' depending on what other characters happen to be in
 the string.
 
 As I understand it, jfm would be perfectly happy if Python used UTF-8
 (or presumably the Emacs variant) as its internal string representation.


UTF-8 uses a flexible representation on a character-by-character basis. 
When parsing UTF-8, one needs to look at EVERY character to decide how 
many bytes you need to read. In Python 3, the flexible representation is 
on a string-by-string basis: once Python has looked at the string header, 
it can tell whether the *entire* string takes 1, 2 or 4 bytes per 
character, and the string is then fixed-width. You can't do that with 
UTF-8.

To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else:
        count = 4
    while not done:
        char = convert(next(count bytes))


# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if uses 1 byte:
            char = convert(b)
        elif uses 2 bytes:
            char = convert(b, next(1 byte))
        elif uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))


So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs's 
variation can in fact require more bytes per character than either. 
(UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.) 
I'm not surprised that JMF would prefer UTF-8 -- he is completely out of 
his depth, and is a fine example of the Dunning-Kruger effect in action. 
He is so sure he is right based on so little evidence.

One advantage of UTF-8 is that for some BMP characters, you can get away 
with only three bytes instead of four. For transmitting data over the 
wire, or storage on disk, that's potentially up to a 25% reduction in 
space, which is not to be sneezed at. (Although in practice it's usually 
much less than that, since the most common characters are encoded to 1 or 
2 bytes, not 4). But that comes at the cost of much more runtime 
overhead, which in my opinion makes UTF-8 a second-class string 
representation compared to fixed-width representations.
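A sketch of that size trade-off, comparing encoded lengths for a few sample
strings (the strings are made up for the example):

for text in ('pythön', 'ελληνικά', '日本語'):
    u8, u32 = len(text.encode('utf-8')), len(text.encode('utf-32-be'))
    print(text, u8, u32, '-> %.0f%% saved' % (100 * (1 - u8 / u32)))
# roughly 71%, 50% and 25% smaller than UTF-32 for these three examples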



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Michael Torrie
On 07/25/2013 01:07 PM, wxjmfa...@gmail.com wrote:
 Let start with a simple string \textemdash or \texttendash
 
 sys.getsizeof('–')
 40
 sys.getsizeof('a')
 26

That's meaningless.  You're comparing the overhead of a string object
itself (a one-time cost anyway), not the overhead of storing the actual
characters.  This is the only meaningful comparison:

 sys.getsizeof('––') - sys.getsizeof('–')

 sys.getsizeof('aa') - sys.getsizeof('a')

Actually I'm not even sure what your point is after all this time of
railing against FSR.  You have said in the past that Python penalizes
users of character sets that require wider byte encodings, but what
would you have us do? use 4-byte characters and penalize everyone
equally?  Use 2-byte characters that incorrectly expose surrogate pairs
for some characters? Use UTF-8 in memory and do O(n) indexing?  Are your
programs (actual programs, not contrived benchmarks) actually slower
because of FSR?  Is FSR incorrect?  If so, according to what part of the
unicode standard?  I'm not trying to troll, or feed the troll.  I'm
actually curious.

I think perhaps you feel that many of us who don't use unicode often
don't understand unicode because some of us don't understand you.  If
so, I'm not sure that's actually true.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Michael Torrie
On 07/25/2013 11:18 AM, Steven D'Aprano wrote:
 JMF has explained that it is impossible, impossible I say!, to write an 
 editor using a flexible string representation. Since Emacs uses such a 
 flexible string representation, Emacs is impossible, and therefore Emacs 
 doesn't exist.

Now I'm even more confused.  He once pointed to Go as an example of how
unicode should be done in a language.  Yet Go uses UTF-8, I think.

But I don't think UTF-8 is what JMF refers to as flexible string
representation.  FSR does use 1,2 or 4 bytes per character, but each
character in the string uses the same width.  That's different from
UTF-8 or UTF-16, which is variable width per character.
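A sketch of that per-string width selection: appending a single astral
character to an otherwise ASCII string widens every character in the new
string (sizes are CPython 3.3 details and include a fixed header):

import sys

ascii_only = 'x' * 100
widened = ascii_only + '\U0001F600'     # one astral character appended

print(sys.getsizeof(ascii_only))        # about 100 * 1 byte + header
print(sys.getsizeof(widened))           # about 101 * 4 bytes + header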
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-25 Thread Ian Kelly
On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 UTF-8 uses a flexible representation on a character-by-character basis.
 When parsing UTF-8, one needs to look at EVERY character to decide how
 many bytes you need to read. In Python 3, the flexible representation is
 on a string-by-string basis: once Python has looked at the string header,
 it can tell whether the *entire* string takes 1, 2 or 4 bytes per
 character, and the string is then fixed-width. You can't do that with
 UTF-8.

UTF-8 does not use a flexible representation.  A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points.  In
contrast, with the FSR there are as many as three different sequences
of bytes that encode a sequence of code points, with one of them (the
shortest) being canonical.  That's what makes it flexible.

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
something like the FSR.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread wxjmfauth
On Saturday, July 13, 2013 1:13:47 AM UTC+2, Michael Torrie wrote:
 On 07/12/2013 09:59 AM, Joshua Landau wrote:
  If you're interested, the basic of it is that strings now use a
  variable number of bytes to encode their values depending on whether
  values outside of the ASCII range and some other range are used, as an
  optimisation.
 
 Variable number of bytes is a problematic way to saying it.  UTF-8 is a
 variable-number-of-bytes encoding scheme where each character can be 1,
 2, 4, or more bytes, depending on the unicode character.  As you can
 imagine this sort of encoding scheme would be very slow to do slicing
 with (looking up a character at a certain position).  Python uses
 fixed-width encoding schemes, so they preserve the O(n) lookup speeds,
 but python will use 1, 2, or 4 bytes per every character in the string,
 depending on what is needed.  Just in case the OP might have
 misunderstood what you are saying.
 
 jmf sees the case where a string is promoted from one width to another,
 and thinks that the brief slowdown in string operations to accomplish
 this is a problem.  In reality I have never seen anyone use the types of
 string operations his pseudo benchmarks use, and in general Python 3's
 string behavior is pretty fast.  And apparently much more correct than
 if jmf's ideas of unicode were implemented.

--

Sorry, you are not understanding Unicode. What is a Unicode
Transformation Format (UTF), what is the goal of a UTF and
why it is important for an implementation to work with a UTF.

Short example. Writing an editor with something like the
FSR is simply impossible (properly).

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread Chris Angelico
On Wed, Jul 24, 2013 at 11:40 PM,  wxjmfa...@gmail.com wrote:
 Short example. Writing an editor with something like the
 FSR is simply impossible (properly).

jmf, have you ever written an editor with *any* string representation?
Are you speaking from any level of experience at all?

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread David Hutto
I've screwed up plenty of times in python, but can write code like a pro
when I'm feeling better(on SSI and medicaid). An editor can be built
simply, but it's preference that makes the difference. Some might have used
tkinter, gtk. wxpython or other methods for the task.

I think the main issue in responding is your library preference, or widget
set preference. These can make you right with some in your response, or
wrong with others that have a preferable gui library that coincides with
one's personal cognitive structure that makes t




On Wed, Jul 24, 2013 at 9:48 AM, Chris Angelico ros...@gmail.com wrote:

 On Wed, Jul 24, 2013 at 11:40 PM,  wxjmfa...@gmail.com wrote:
  Short example. Writing an editor with something like the
  FSR is simply impossible (properly).

 jmf, have you ever written an editor with *any* string representation?
 Are you speaking from any level of experience at all?

 ChrisA
 --
 http://mail.python.org/mailman/listinfo/python-list




-- 
Best Regards,
David Hutto
*CEO:* *http://www.hitwebdevelopment.com*
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread David Hutto
I've screwed up plenty of times in python, but can write code like a pro
when I'm feeling better(on SSI and medicaid). An editor can be built
simply, but it's preference that makes the difference. Some might have used
tkinter, gtk. wxpython or other methods for the task.

I think the main issue in responding is your library preference, or widget
set preference. These can make you right with some in your response, or
wrong with others that have a preferable gui library that coincides with
one's personal cognitive structure that makes it more usable in relation to
how you learned a preferable gui kit.


On Wed, Jul 24, 2013 at 10:17 AM, David Hutto dwightdhu...@gmail.comwrote:

 I've screwed up plenty of times in python, but can write code like a pro
 when I'm feeling better(on SSI and medicaid). An editor can be built
 simply, but it's preference that makes the difference. Some might have used
 tkinter, gtk. wxpython or other methods for the task.

 I think the main issue in responding is your library preference, or widget
 set preference. These can make you right with some in your response, or
 wrong with others that have a preferable gui library that coincides with
 one's personal cognitive structure that makes t




 On Wed, Jul 24, 2013 at 9:48 AM, Chris Angelico ros...@gmail.com wrote:

 On Wed, Jul 24, 2013 at 11:40 PM,  wxjmfa...@gmail.com wrote:
  Short example. Writing an editor with something like the
  FSR is simply impossible (properly).

 jmf, have you ever written an editor with *any* string representation?
 Are you speaking from any level of experience at all?

 ChrisA
 --
 http://mail.python.org/mailman/listinfo/python-list




 --
 Best Regards,
 David Hutto
 *CEO:* *http://www.hitwebdevelopment.com*




-- 
Best Regards,
David Hutto
*CEO:* *http://www.hitwebdevelopment.com*
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread Chris Angelico
On Thu, Jul 25, 2013 at 12:17 AM, David Hutto dwightdhu...@gmail.com wrote:
 I've screwed up plenty of times in python, but can write code like a pro
 when I'm feeling better(on SSI and medicaid). An editor can be built simply,
 but it's preference that makes the difference. Some might have used tkinter,
 gtk. wxpython or other methods for the task.

 I think the main issue in responding is your library preference, or widget
 set preference. These can make you right with some in your response, or
 wrong with others that have a preferable gui library that coincides with
 one's personal cognitive structure that makes t

jmf's point is more about writing the editor widget (Scintilla, as
opposed to SciTE), which most people will never bother to do. I've
written several text editors, always by embedding someone else's
widget, and therefore not concerning myself with its internal string
representation. Frankly, Python's strings are a *terrible* internal
representation for an editor widget - not because of PEP 393, but
simply because they are immutable, and every keypress would result in
a rebuilding of the string. On the flip side, I could quite plausibly
imagine using a list of strings; whenever text gets inserted, the
string gets split at that point, and a new string created for the
insert (which also means that an Undo operation simply removes one
entire string). In this usage, the FSR is beneficial, as it's possible
to have different strings at different widths.
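A toy sketch of that list-of-strings idea (all names invented; no attempt
at efficiency or real editor features):

class Buffer:
    def __init__(self, text=''):
        self.pieces = [text] if text else []

    def insert(self, pos, text):
        # Split the piece containing pos and drop the new text in between.
        i = 0
        for n, piece in enumerate(self.pieces):
            if pos <= i + len(piece):
                before, after = piece[:pos - i], piece[pos - i:]
                self.pieces[n:n + 1] = [p for p in (before, text, after) if p]
                return
            i += len(piece)
        self.pieces.append(text)

    def undo_insert(self, text):
        # Undo simply removes one entire piece.
        self.pieces.remove(text)

    def text(self):
        return ''.join(self.pieces)

buf = Buffer('hello world')
buf.insert(5, ' brave')       # pieces become ['hello', ' brave', ' world']
print(buf.text())             # hello brave world
buf.undo_insert(' brave')
print(buf.text())             # hello world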

But mainly, I'm just wondering how many people here have any basis
from which to argue the point he's trying to make. I doubt most of us
have (a) implemented an editor widget, or (b) tested multiple
different internal representations to learn the true pros and cons of
each. And even if any of us had, that still wouldn't have any bearing
on PEP 393, which is about applications, not editor widgets. As stated
above, Python strings before AND after PEP 393 are poor choices for an
editor, ergo arguing from that standpoint is pretty useless. Not that
that bothers jmf...

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread Michael Torrie
On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote:
 Sorry, you are not understanding Unicode. What is a Unicode
 Transformation Format (UTF), what is the goal of a UTF and
 why it is important for an implementation to work with a UTF.

Really?  Enlighten me.

Personally, I would never use UTF as a representation *in memory* for a
unicode string if it were up to me.  Why?  Because UTF characters are
not uniform in byte width so accessing positions within the string is
terribly slow and has to always be done by starting at the beginning of
the string.  That's at minimum O(n) compared to FSR's O(1).  Surely you
understand this.  Do you dispute this fact?

UTF is a great choice for interchange, though, and indeed that's what it
was designed for.

Are you calling for UTF to be adopted as the internal, in-memory
representation of unicode?  Or would you simply settle for UCS-4?
Please be clear here.  What are you saying?

 Short example. Writing an editor with something like the
 FSR is simply impossible (properly).

How? FSR is just an implementation detail.  It could be UCS-4 and it
would also work.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread Michael Torrie
On 07/24/2013 08:34 AM, Chris Angelico wrote:
 Frankly, Python's strings are a *terrible* internal representation
 for an editor widget - not because of PEP 393, but simply because
 they are immutable, and every keypress would result in a rebuilding
 of the string. On the flip side, I could quite plausibly imagine
 using a list of strings; whenever text gets inserted, the string gets
 split at that point, and a new string created for the insert (which
 also means that an Undo operation simply removes one entire string).
 In this usage, the FSR is beneficial, as it's possible to have
 different strings at different widths.

Very good point.  Seems like this is exactly what is tripping up jmf in
general.  His pseudo benchmarks are bogus for this exact reason. No one
uses python strings in this fashion.  Editors certainly would not.  But
then again his argument in the past does not mention editors.  But it
makes me wonder if jmf is using python strings appropriately, or even
realizes they are immutable.

 But mainly, I'm just wondering how many people here have any basis 
 from which to argue the point he's trying to make. I doubt most of
 us have (a) implemented an editor widget, or (b) tested multiple 
 different internal representations to learn the true pros and cons
 of each. 

Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
pros and cons, and the cons of using UCS-2 (the old narrow builds) are
well known.  UCS-2 simply cannot represent all of unicode correctly.
This is in the PEP of course.

His most recent argument that Python should use UTF as a representation
is very strange to be honest.  The cons of UTF are apparent and widely
known.  The main con is that UTF strings are O(n) for indexing a
position within the string.
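A sketch of why that is: to find code point i in a UTF-8 buffer you have to
walk every lead byte before it (helper name invented; assumes well-formed
input, no validation):

def utf8_codepoint_at(buf, index):
    # O(index): walk lead bytes from the start of the UTF-8 bytes.
    pos = 0
    for _ in range(index):
        b = buf[pos]
        if b < 0x80:
            pos += 1
        elif b < 0xE0:
            pos += 2
        elif b < 0xF0:
            pos += 3
        else:
            pos += 4
    end = pos + 1
    while end < len(buf) and 0x80 <= buf[end] < 0xC0:   # skip continuation bytes
        end += 1
    return buf[pos:end].decode('utf-8')

data = 'aé€\U0001D11E!'.encode('utf-8')
print(utf8_codepoint_at(data, 3))   # the astral char, reached only after walking a, é, €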
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread Chris Angelico
On Thu, Jul 25, 2013 at 12:47 AM, Michael Torrie torr...@gmail.com wrote:
 On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote:
 Sorry, you are not understanding Unicode. What is a Unicode
 Transformation Format (UTF), what is the goal of a UTF and
 why it is important for an implementation to work with a UTF.

 Really?  Enlighten me.

 Personally, I would never use UTF as a representation *in memory* for a
 unicode string if it were up to me.  Why?  Because UTF characters are
 not uniform in byte width so accessing positions within the string is
 terribly slow and has to always be done by starting at the beginning of
 the string.  That's at minimum O(n) compared to FSR's O(1).  Surely you
 understand this.  Do you dispute this fact?

Take care here; UTF is a general term for Unicode Transformation Formats,
of which one (UTF-32) is fixed-width. Every other UTF-n is variable
width, though, so your point still stands. UTF-32 is the basis for
Python's FSR.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: RE Module Performance

2013-07-24 Thread Terry Reedy

On 7/24/2013 11:00 AM, Michael Torrie wrote:

On 07/24/2013 08:34 AM, Chris Angelico wrote:

Frankly, Python's strings are a *terrible* internal representation
for an editor widget - not because of PEP 393, but simply because
they are immutable, and every keypress would result in a rebuilding
of the string. On the flip side, I could quite plausibly imagine
using a list of strings;


I used exactly this, a list of strings, for a Python-coded text-only 
mock editor to replace the tk Text widget in idle tests. It works fine 
for the purpose. For small test texts, the inefficiency of immutable 
strings is not relevant.


Tk apparently uses a C-coded btree rather than a Python list. All 
details are hidden, unless one finds and reads the source ;-), but 
it uses C arrays rather than Python strings.



In this usage, the FSR is beneficial, as it's possible to have
different strings at different widths.


For my purpose, the mock Text works the same in 2.7 and 3.3+.


Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
pros and cons,


They both have the pro that indexing is direct *and correct*. The cons 
are different.



and the cons of using UCS-2 (the old narrow builds) are
well known.  UCS-2 simply cannot represent all of unicode correctly.


Python's narrow builds, at least for several releases, were in between 
UCS-2 and UTF-16 in that they used surrogates to represent all unicodes 
but did not correct indexing for the presence of astral chars. This is a 
nuisance for those who do use astral chars, such as emotes and CJK name 
chars, on an everyday basis.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

