Re: RE Module Performance
On Wed, Jul 31, 2013 at 6:45 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: if you care about minimizing every possible byte, you should use a low-level language like C. Then you can give every character 21 bits, and be happy that you don't waste even one bit. Could go one better! Since not every character has been assigned, and some are specifically banned (e.g. U+FFFE and U+D800-U+DFFF), you could cut them out of your representation system and save memory! ChrisA -- http://mail.python.org/mailman/listinfo/python-list
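For what it's worth, the arithmetic behind the joke still holds. A back-of-the-envelope count in Python (illustrative only; it assumes just the 2048 surrogates and the 66 noncharacters are cut) shows you would still need 21 bits per character:

import math

total = 0x110000                 # code points U+0000 through U+10FFFF
surrogates = 0xE000 - 0xD800     # U+D800-U+DFFF: 2048 banned code points
noncharacters = 66               # U+FDD0-U+FDEF plus U+xxFFFE/U+xxFFFF in each plane

usable = total - surrogates - noncharacters
print(usable)                        # 1111998
print(math.ceil(math.log2(usable)))  # 21 -- no bits saved after all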
Re: RE Module Performance
On 31-07-13 05:30, Michael Torrie wrote: On 07/30/2013 12:19 PM, Antoon Pardon wrote: So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. I for one found it very interesting. In fact this thread caused me to wonder how one actually does create an efficient editor. Off the original topic, true, but still very interesting. Yes, it can be interesting. But I really think if that is what you want to discuss, it deserves its own subject thread. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 21:09, wxjmfa...@gmail.com wrote: Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. Which is a far cry from your previous claim that it happened every time you enter a char. This of course makes your case harder to argue, because the impact of something that happens sometime, somewhere is vastly less than something that happens every time you enter a char. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.). It will just start producing wrong results when someone starts using characters that don't fit into ucs-2.

>>> timeit.timeit("'a'*1 + '€'")
7.087220684719967
>>> timeit.timeit("'a'*1 + 'z'")
1.5685214234430873
>>> timeit.timeit("z = 'a'*1; z = z + '€'")
7.169538866162213
>>> timeit.timeit("z = 'a'*1; z = z + 'z'")
1.5815893830557286
>>> timeit.timeit("z = 'a'*1; z += 'z'")
1.606955741596181
>>> timeit.timeit("z = 'a'*1; z += '€'")
7.160483334521416

And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Nonsense.

>>> sys.getsizeof('a'.encode('utf-8'))
18

-- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
FSR: === The 'a' in 'a€' and 'a\U0001d11e':

>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001', '0b00000000', '0b00000001', '0b11010001', '0b00011110']

Has to be done.

>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27

Unicode/utf* i) (primary key) Create and use a unique set of encoded code points. ii) (secondary key) Depending on the wish, memory/performance: utf-8/16/32. Two advantages in the light of the above example: iii) The 'a' never has to be reencoded. iv) The size of an 'a' never exceeds 4 bytes. Hard job to solve/satisfy i), ii), iii) and iv) at the same time. Is it possible? ;-) The solution is in the problem. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 31-07-13 10:32, wxjmfa...@gmail.com wrote: Unicode/utf* i) (primary key) Create and use a unique set of encoded code points. FSR does this.

>>> st1 = 'a€'
>>> st2 = 'aa'
>>> ord(st1[0])
97
>>> ord(st2[0])
97

ii) (secondary key) Depending on the wish, memory/performance: utf-8/16/32. Whose wish? I don't know any language that allows the programmer to choose the internal representation of its strings. If it is the designer's choice, FSR does this; if it is the programmer's choice, I don't see why this is necessary for compliance. Two advantages in the light of the above example: iii) The 'a' never has to be reencoded. FSR: check. Using a container with wider slots is not a reëncoding. If such widening is encoding, then your 'choice' between utf-8/16/32 implies that it will also have to reencode when it changes from utf-8 to utf-16 or utf-32. iv) The size of an 'a' never exceeds 4 bytes. FSR: check. Hard job to solve/satisfy i), ii), iii) and iv) at the same time. Is it possible? ;-) The solution is in the problem. Maybe you should use bytes or bytearrays if that is really what you want. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/31/2013 01:23 AM, Antoon Pardon wrote: On 31-07-13 05:30, Michael Torrie wrote: On 07/30/2013 12:19 PM, Antoon Pardon wrote: So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. I for one found it very interesting. In fact this thread caused me to wonder how one actually does create an efficient editor. Off the original topic, true, but still very interesting. Yes, it can be interesting. But I really think if that is what you want to discuss, it deserves its own subject thread. Subject lines can and should be changed to reflect the ebbs and flows of the discussion. In fact this thread's subject should have been changed a long time ago, since the original topic was RE module performance! -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/31/2013 02:32 AM, wxjmfa...@gmail.com wrote: Unicode/utf* Why do you keep using the terms utf and Unicode interchangeably? -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wednesday, July 31, 2013 07:45:18 UTC+2, Steven D'Aprano wrote: On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote: And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Neither character above is larger than 4 bytes. You forgot to deduct the size of the object header. Python is a high-level object-oriented language; if you care about minimizing every possible byte, you should use a low-level language like C. Then you can give every character 21 bits, and be happy that you don't waste even one bit. -- Steven ... char never consumes or requires more than 4 bytes ... jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, Jul 31, 2013 at 9:15 PM, wxjmfa...@gmail.com wrote: ... char never consumes or requires more than 4 bytes ... The integer 5 should be able to be stored in 3 bits.

>>> sys.getsizeof(5)
14

Clearly Python is doing something really horribly wrong here. In fact, sys.getsizeof needs to be changed to return a float, to allow it to more properly reflect these important facts. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sunday, July 28, 2013 05:53:22 UTC+2, Ian wrote: On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk holds intrinsically the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, it's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16 without an auxiliary data structure that is an O(n) operation. -- Same conceptual mistake as Steven's example with its buffers: the buffer does not know it holds characters. This is not the point to discuss. - I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Somebody wrote "FSR is just an optimization". Yes, but in the case of an editor à la FSR, this optimization takes place every time you enter a char. Your poor editor, in fact the FSR, is finally spending its time in optimizing, and finally it optimizes nothing. (It is even worse.) If you correctly type a z instead of an €, it is not necessary to reencode the buffer. Problem: how do you know that you do not have to reencode? Simple, just check it; and by checking it you waste time testing whether you have to optimize or not, and hurt a little bit more what is supposed to be an optimization. Do not confuse the process of optimization and the result of optimization (funny, it's like the utf's). There is a trick to let the editor know if it has to be optimized: just put some flag somewhere. Then you fall on the Houston syndrome: Houston, we've got a problem, our buffer consumes many more bytes than expected.

>>> sys.getsizeof('€')
40
>>> sys.getsizeof('a')
26

Now the good news. In an editor à la FSR, the composition is not so important. You know, practicality beats purity. The hard job is the text rendering engine and the handling of the font (even in a raw unicode editor). And as these tools are luckily not working à la FSR (probably because they understand the coding of the characters), your editor is still working not so badly. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. So adding characters would mean continuously copying the string buffer into a new string with the next character added. Copying 127504 characters into a new string will not make that much of a difference whether the octets are just copied to octets or are unpacked into 32-bit words. Somebody wrote "FSR is just an optimization". Yes, but in the case of an editor à la FSR, this optimization takes place every time you enter a char. Your poor editor, in fact the FSR, is finally spending its time in optimizing, and finally it optimizes nothing. (It is even worse.) Even if you would do it this way, it would *not* take place every time you enter a char. Once your buffer contained a wide character, it would just need to convert the single character that is added after each keystroke. It would not need to convert the whole buffer after each keystroke. If you correctly type a z instead of an €, it is not necessary to reencode the buffer. Problem: how do you know that you do not have to reencode? Simple, just check it; and by checking it you waste time testing whether you have to optimize or not, and hurt a little bit more what is supposed to be an optimization. Your scenario is totally unrealistic: first of all because of the immutable nature of Python strings, second because you suggest that real-time usage would result in frequent conversions, which is highly unlikely. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Tue, Jul 30, 2013 at 3:01 PM, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. You're still thinking that the editor's buffer is a Python string. As I've shown earlier, this is a really bad idea, and that has nothing to do with FSR/PEP 393. An immutable string is *horribly* inefficient at this; if you want to keep concatenating onto a string, the recommended method is a list of strings that gets join()d at the end, and the same technique works well here. Here's a little demo class that could make the basis for such a system:

class EditorBuffer:
    def __init__(self, fn):
        self.fn = fn
        self.buffer = [open(fn).read()]
    def insert(self, pos, char):
        if pos == 0:
            # Special case: insertion at beginning of buffer
            if len(self.buffer[0]) > 1024:
                self.buffer.insert(0, char)
            else:
                self.buffer[0] = char + self.buffer[0]
            return
        for idx, part in enumerate(self.buffer):
            l = len(part)
            if pos > l:
                pos -= l
                continue
            if pos < l:
                # Cursor is somewhere inside this string
                splitme = self.buffer[idx]
                self.buffer[idx:idx+1] = splitme[:pos], splitme[pos:]
                l = pos
            # Cursor is now at the end of this string
            if l > 1024:
                self.buffer[idx:idx+1] = self.buffer[idx], char
            else:
                self.buffer[idx] += char
            return
        raise ValueError("Cannot insert past end of buffer")
    def __str__(self):
        return ''.join(self.buffer)
    def save(self):
        open(self.fn, "w").write(str(self))

It guarantees that inserts will never need to resize more than 1KB of text. As a real basis for an editor, it still sucks, but it's purely to prove this one point. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
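To make the chunking behaviour above concrete, here is a quick usage sketch (the file name and sizes are made up for illustration; it assumes the EditorBuffer class just shown):

# Hypothetical demo: a 2000-char ASCII file, then one non-latin-1 insertion.
with open("demo.txt", "w") as f:
    f.write("a" * 2000)

buf = EditorBuffer("demo.txt")
buf.insert(2000, "€")      # chunk is full (>1024), so '€' becomes a new chunk
print(len(buf.buffer))     # 2 -- the 2000 'a's were never copied or widened
print(str(buf)[-3:])       # 'aa€'

Only the final join in str() touches the whole text; typing a wide character never forces the big ASCII chunk to be recoded, which is exactly the point being argued.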
Re: RE Module Performance
On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. So adding characters would mean continuously copying the string buffer into a new string with the next character added. Copying 127504 characters into a new string will not make that much of a difference whether the octets are just copied to octets or are unpacked into 32-bit words. Somebody wrote "FSR is just an optimization". Yes, but in the case of an editor à la FSR, this optimization takes place every time you enter a char. Your poor editor, in fact the FSR, is finally spending its time in optimizing, and finally it optimizes nothing. (It is even worse.) Even if you would do it this way, it would *not* take place every time you enter a char. Once your buffer contained a wide character, it would just need to convert the single character that is added after each keystroke. It would not need to convert the whole buffer after each keystroke. If you correctly type a z instead of an €, it is not necessary to reencode the buffer. Problem: how do you know that you do not have to reencode? Simple, just check it; and by checking it you waste time testing whether you have to optimize or not, and hurt a little bit more what is supposed to be an optimization. Your scenario is totally unrealistic: first of all because of the immutable nature of Python strings, second because you suggest that real-time usage would result in frequent conversions, which is highly unlikely. What you would have is a list of mutable chunks. Inserting into a chunk would be fast, and a chunk would be split if it's already full. Also, small adjacent chunks would be joined together. Finally, a chunk could use FSR to reduce memory usage. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. -- Antoon Pardon. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30/07/2013 17:39, Antoon Pardon wrote: On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. The disadvantage there is that when you move the cursor you must move characters around. For example, what if the cursor was at the start and you wanted to move it to the end? Also, when the gap has been filled, you need to make a new one. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 31 July 2013 00:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. And here we come to the root of your complete misunderstanding and mischaracterisation of the FSR. You don't appear to understand that strings in Python are immutable and that to add a character to an existing string requires copying the entire string + new character. In your hypothetical situation above, you have already performed 127504 copy + new character operations before you ever get to a single widening operation. The overhead of the copy + new character repeated 127504 times dwarfs the overhead of a single widening operation. Given your misunderstanding, it's no surprise that you are focused on microbenchmarks that demonstrate that copying entire strings and adding a character can be slower in some situations than others. When the only use case you have is implementing the buffer of an editor using an immutable string, I can fully understand why you would be concerned about the performance of adding and removing individual characters. However, in that case *you're focused on the wrong problem*. Until you can demonstrate an understanding that doing the above in any language which has immutable strings is completely insane, you will have no credibility, and the only interest anyone will pay to your posts is refuting your FUD so that people new to the language are not driven off by you. Tim Delaney -- http://mail.python.org/mailman/listinfo/python-list
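Tim's point about repeated copying is easy to demonstrate. A minimal, illustrative timing sketch (the sizes and loop counts are arbitrary choices here; absolute numbers will vary by machine):

import timeit

# Appending one char at a time to an immutable string copies it every step.
# Keeping a second reference (t = s) defeats CPython's in-place resize trick,
# so this really is quadratic:
quadratic = "s = ''\nfor c in 'a' * 1000:\n    t = s\n    s = s + c"
# Collecting pieces in a list and joining once avoids the repeated copies:
linear = "parts = []\nfor c in 'a' * 1000:\n    parts.append(c)\ns = ''.join(parts)"

print(timeit.timeit(quadratic, number=100))  # grows roughly with n**2
print(timeit.timeit(linear, number=100))     # grows roughly with n

The 127504 copy-plus-one-character operations in the scenario above are the quadratic case; a single widening operation is lost in that noise.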
Re: RE Module Performance
On 30 July 2013 17:39, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. Additionally, who says a language couldn't use, say, B-Trees for all of its list-like types, including strings? -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 19:14, MRAB wrote: On 30/07/2013 17:39, Antoon Pardon wrote: On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. The disadvantage there is that when you move the cursor you must move characters around. For example, what if the cursor was at the start and you wanted to move it to the end? Also, when the gap has been filled, you need to make a new one. So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. If you prefer another data structure in the editor you are working on, I will not dissuade you. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.).

>>> timeit.timeit("'a'*1 + '€'")
7.087220684719967
>>> timeit.timeit("'a'*1 + 'z'")
1.5685214234430873
>>> timeit.timeit("z = 'a'*1; z = z + '€'")
7.169538866162213
>>> timeit.timeit("z = 'a'*1; z = z + 'z'")
1.5815893830557286
>>> timeit.timeit("z = 'a'*1; z += 'z'")
1.606955741596181
>>> timeit.timeit("z = 'a'*1; z += '€'")
7.160483334521416

And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Tue, Jul 30, 2013 at 8:09 PM, wxjmfa...@gmail.com wrote: Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.). But probably not on the entire document. With even a brainless scheme like I posted code for, no more than 1024 bytes will need to be recoded at a time (except in some odd edge cases, and even then, no more than once for any given file). And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Yeah, you have a few odd issues like, oh, I dunno, GC overhead, reference count, object class, and string length, all stored somewhere there. Honestly jmf, if you want raw assembly you know where to get it. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 7/30/2013 1:40 PM, Joshua Landau wrote: Additionally, who says a language couldn't use, say, B-Trees for all of its list-like types, including strings? Tk apparently uses a B-tree in its text widget. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
MRAB: The disadvantage there is that when you move the cursor you must move characters around. For example, what if the cursor was at the start and you wanted to move it to the end? Also, when the gap has been filled, you need to make a new one. The normal technique is to only move the gap when text is added or removed, not when the cursor moves. Code that reads the contents, such as for display, handles the gap by checking the requested position and using a different offset when the position is after the gap. Gap buffers work well because changes are generally close to the previous change, so require moving only a relatively small amount of text. Even an occasional move of the whole contents won't cause too much trouble for interactivity with current processors moving multiple megabytes per millisecond. Neil -- http://mail.python.org/mailman/listinfo/python-list
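A gap buffer along the lines Neil describes can be sketched in a few lines of Python. Purely illustrative (a two-stack variant; real editors use one contiguous array with an actual gap), and all names here are made up:

class GapBuffer:
    # Characters before the cursor live in `pre`; characters after it
    # in `post`, kept reversed so both ends can push/pop in O(1).
    def __init__(self, text=""):
        self.pre = list(text)
        self.post = []          # reversed tail
    def move_gap(self, pos):
        # Only called on edits -- merely moving the cursor costs nothing.
        while len(self.pre) > pos:
            self.post.append(self.pre.pop())
        while len(self.pre) < pos:
            self.pre.append(self.post.pop())
    def insert(self, pos, char):
        self.move_gap(pos)
        self.pre.append(char)   # repeated typing at one spot is O(1) each
    def delete(self, pos):
        self.move_gap(pos)      # removes the character just after `pos`
        if self.post:
            self.post.pop()
    def __str__(self):
        return ''.join(self.pre) + ''.join(reversed(self.post))

buf = GapBuffer("a" * 127504)
buf.insert(127504, "€")   # gap is already at the end: nothing is shuffled

A jump from one end of the buffer to the other before an edit does cost one big shuffle, which is exactly the occasional whole-contents move Neil mentions.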
Re: RE Module Performance
On 07/30/2013 12:19 PM, Antoon Pardon wrote: So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. I for one found it very interesting. In fact this thread caused me to wonder how one actually does create an efficient editor. Off the original topic, true, but still very interesting. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/30/2013 01:09 PM, wxjmfa...@gmail.com wrote: Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.). So what major Python project are you working on where you've found FSR in general to be a problem? Maybe we can help you work out a more appropriate data structure and algorithm to use. But if you're not developing something, and not developing in Python, perhaps you should withdraw and let us use our horrible FSR in peace, because it doesn't seem to bother the vast majority of Python programmers, and does not bother some large Python projects out there. In fact I think most of us welcome integrated, correct, full unicode. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote: And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Neither character above is larger than 4 bytes. You forgot to deduct the size of the object header. Python is a high-level object-oriented language; if you care about minimizing every possible byte, you should use a low-level language like C. Then you can give every character 21 bits, and be happy that you don't waste even one bit. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 26-07-13 15:21, wxjmfa...@gmail.com wrote: Hint: To understand Unicode (and every coding scheme), you should understand utf. The how and the *why*. No you don't. You are mixing the information with how the information is coded. utf is like base64, a way of coding the information that is useful for storage or transfer. But once you have decoded the byte stream, you no longer need any understanding of base64 to process your information. Likewise, once you have decoded the byte stream into unicode information you don't need knowledge of utf to process unicode strings. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
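A small sketch of that point, using nothing but the standard codecs (the string is an arbitrary example):

raw = "a€".encode("utf-8")       # transfer/storage form: b'a\xe2\x82\xac'
s = raw.decode("utf-8")          # decode once at the boundary...
print(len(s), ord(s[1]))         # 2 8364 -- processing sees code points,
                                 # not utf-8 chunks; the encoding is gone
# Any utf would have carried the same information:
print(s == "a€".encode("utf-16").decode("utf-16"))   # True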
Re: RE Module Performance
On 28-07-13 20:19, Joshua Landau wrote: On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 27-07-13 20:21, wxjmfa...@gmail.com wrote: utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. No, that is not what I am saying. But if jmf were complaining about garbage collection in an analogous way to how he is complaining about the FSR, he wouldn't be complaining about real-world circumstances but about theoretical possibilities and micro-benchmarks. In those circumstances the garbage collection problem wouldn't be worthy of much attention. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 28-07-13 21:30, wxjmfa...@gmail.com wrote: To be short, this is *never* the FSR, always something else. Suggestion: start by solving all these micro-benchmarks, all the memory cases. It's a good start, no? There is nothing to solve. Unicode doesn't force implementations to use the same size of memory for strings of the same length. So you pointing out examples of same-length strings that don't use the same size of memory doesn't point at something that must be solved. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sun, Jul 28, 2013 at 11:14 PM, Joshua Landau jos...@landau.ws wrote: GC does sometimes have a severe impact in memory-constrained environments, though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/, about half-way down, specifically http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png. The best verification of these graphs I could find was https://blog.mozilla.org/nnethercote/category/garbage-collection/, although it's not immediately clear in Chrome's and Opera's case, mainly due to none of the benchmarks pushing memory usage significantly. I also don't quite agree with the first post (sealedabstract) because I get by *fine* on 2GB memory, so I don't see why you can't on a phone. Maybe iOS is just really heavy. Nonetheless, the benchmarks aren't lying. The ultimate in non-managed memory (the opposite of a GC) would have to be the assembly language programming I did in my earlier days, firing up DEBUG.EXE and writing a .COM file that lived inside a single 64KB segment for everything (256-byte Program Segment Prefix, then code, then initialized data, then uninitialized data and stack), crashing the computer with remarkable ease. Everything higher level than that (even malloc/free) has its conveniences and its costs, usually memory wastage. If you malloc random-sized blocks, free them at random, and ensure that your total allocated size stays below some limit, you'll still eventually run yourself out of memory. This is unsurprising. The only question is, how bad is the wastage and how much time gets devoted to it? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote: On Sun, 28 Jul 2013 12:23:04 -0700, wxjmfauth wrote: Do not forget that an à la FSR mechanism for a non-ascii user is *irrelevant*. You have been told repeatedly, Python's internals are *full* of ASCII-only strings.

py> dir(list)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

There's 45 ASCII-only strings right there, in only one built-in type, out of dozens. There are dozens, hundreds of ASCII-only strings in Python: builtin functions and classes, attributes, exceptions, internal attributes, variable names, and so on. You already know this, and yet you persist in repeating nonsense. -- Steven

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I did not post your example to contradict you. I was expecting such a result even before testing. Now, if you do not understand why, you do not understand. There is nothing wrong. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Mon, Jul 29, 2013 at 12:43 PM, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

3.2:
>>> len(dir(list))
42
3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 29.07.2013 13:43, wxjmfa...@gmail.com wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I did not post your example to contradict you. I was expecting such a result even before testing. Now, if you do not understand why, you do not understand. There is nothing wrong. Please give a single *proof* (not your gut feeling) that this is related to the FSR, and not rather due to other side-effects such as changes in how dir() works or (as Chris pointed out) due to more members on the list type in 3.3. If you can't or won't give that proof, there's no sense in continuing the discussion. -- --- Heiko. -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 07/29/2013 08:06 AM, Heiko Wundram wrote: On 29.07.2013 13:43, wxjmfa...@gmail.com wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I did not post your example to contradict you. I was expecting such a result even before testing. Now, if you do not understand why, you do not understand. There is nothing wrong. Please give a single *proof* (not your gut feeling) that this is related to the FSR, and not rather due to other side-effects such as changes in how dir() works or (as Chris pointed out) due to more members on the list type in 3.3. If you can't or won't give that proof, there's no sense in continuing the discussion. Wow! The RE Module thread I created is evolving into Unicode topics. That thread grew up so fast! DCJ -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Monday, July 29, 2013 13:57:47 UTC+2, Chris Angelico wrote: On Mon, Jul 29, 2013 at 12:43 PM, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

3.2:
>>> len(dir(list))
42
3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings. ChrisA Good point. I stupidly forgot this. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sunday, July 28, 2013 19:36:00 UTC+2, Terry Reedy wrote: On 7/28/2013 11:52 AM, Michael Torrie wrote: 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, Not necessarily so. See below. and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. Slicing is at least O(m) where m is the length of the slice. UTF-8, 16 and any variable-width encoding are always O(n). I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string. This uses an auxiliary array of k ints. An auxiliary array of n ints would make UTF-16 lookup O(1), but then one is using more space than with UTF-32. Similar comments apply to UTF-8. The unicode standard says that a single string should use exactly one coding scheme. It does *not* say that all strings in an application must use the same scheme. I just rechecked a few days ago. It also does not say that an application cannot associate additional data with a string to make processing of the string easier. -- Terry Jan Reedy To my knowledge, the Unicode doc always speaks about the misc. utf* coding schemes in an exclusive-or way. Having multiple encoded strings is one thing. Manipulating multiple encoded strings is something else. Maybe the mistake was to not emphasize the fact that one has to work with a unique set of encoded code points (utf-8 or utf-16 or utf-32), because it was considered too obvious that one cannot work properly with multiple coding schemes. You are also right in saying "...application cannot associate additional data". The doc does not specify it either. It is superfluous. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Monday, July 29, 2013 13:57:47 UTC+2, Chris Angelico wrote: On Mon, Jul 29, 2013 at 12:43 PM, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

3.2:
>>> len(dir(list))
42
3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings. ChrisA

class C:
    a = 'abc'
    b = 'def'
    def aaa(self): pass
    def bbb(self): pass
    def ccc(self): pass

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("r = dir(C)", setup="from __main__ import C"))

c:\python32\pythonw -u timitmod.py
15.258061416225663
Exit code: 0
c:\Python33\pythonw -u timitmod.py
17.052203122286194
Exit code: 0

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Mon, Jul 29, 2013 at 3:20 PM, wxjmfa...@gmail.com wrote: c:\python32\pythonw -u timitmod.py 15.258061416225663 Exit code: 0 c:\Python33\pythonw -u timitmod.py 17.052203122286194 Exit code: 0 len(dir(C)) Did you even think to check that before you posted timings? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Monday, July 29, 2013 16:49:34 UTC+2, Chris Angelico wrote: On Mon, Jul 29, 2013 at 3:20 PM, wxjmfa...@gmail.com wrote: c:\python32\pythonw -u timitmod.py 15.258061416225663 Exit code: 0 c:\Python33\pythonw -u timitmod.py 17.052203122286194 Exit code: 0 len(dir(C)) Did you even think to check that before you posted timings? ChrisA Boum, no! The diff is one. I have however noticed that if I increase the number of (ascii) attributes, the timing difference is very well marked. I do not draw conclusions. Such a factor for one unit... jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 27-07-13 20:21, wxjmfa...@gmail.com wrote: Quickly: sys.getsizeof() in the light of what I explained. 1) As this FSR works with multiple encodings, it has to keep track of the encoding. It puts it in the overhead of the str class (overhead = real overhead + encoding). In such an absurd way that a

>>> sys.getsizeof('€')
40

needs 14 bytes more than a

>>> sys.getsizeof('z')
26

You may vary the length of the str; the problem is still here. Not bad for a coding scheme. 2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*100 + 'c')
126
>>> sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. So? The same effect can be seen with other datatypes.

>>> nr = 32767
>>> sys.getsizeof(nr)
14
>>> nr += 1
>>> sys.getsizeof(nr)
16

This FSR is not even a copy of the utf-8.

>>> len(('b'*100 + '€').encode('utf-8'))
103

Why should it be? Why should a unicode string be a copy of its utf-8 encoding? That makes as much sense as expecting that a number would be a copy of its string representation. utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Until then you are merely harboring a pet peeve. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
FSR and unicode compliance - was Re: RE Module Performance
On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote: Good point. FSR, nice tool for those who wish to teach Unicode. It is not every day one has such an opportunity. I had a long e-mail composed, but decided to chop it down, but it was still too long, so I ditched a lot of the context, which jmf also seems to do. Apologies. 1. FSR *is* UTF-32, so it is as unicode compliant as UTF-32, since UTF-32 is an official encoding. FSR only differs from UTF-32 in that the padding zeros are stripped off such that it is stored in the most compact form that can handle all the characters in the string, which is always known at string creation time. Now you can argue many things, but to say FSR is not unicode compliant is quite a stretch! What unicode entities or characters cannot be stored in strings using FSR? What sequences of bytes in FSR result in invalid Unicode entities? 2. Strings in Python *never change*. They are immutable. The + operator always copies strings character by character into a new string object, even if Python had used UTF-8 internally. If you're doing a lot of string concatenations, perhaps you're using the wrong data type. A byte buffer might be better for you, where you can stuff utf-8 sequences into it to your heart's content. 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any variable-width encoding are always O(n). A lot slower! 4. Unicode is, well, unicode. You seem to hop all over the place from talking about code points to bytes to bits, using them all interchangeably. And now you seem to be claiming that a particular byte encoding standard is by definition unicode (UTF-8). Or at least that's how it sounds. And also claim FSR is not compliant with unicode standards, which appears to me to be completely false. Is my understanding of these things wrong? -- http://mail.python.org/mailman/listinfo/python-list
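The width selection described in point 1 can be modelled in a few lines of Python. A sketch of the idea only (the function name is made up; CPython's actual implementation is in C, per PEP 393):

def fsr_width(s):
    # Pick the narrowest fixed width that fits every character.
    m = max(ord(c) for c in s) if s else 0
    return 1 if m < 0x100 else (2 if m < 0x10000 else 4)

print(fsr_width("abc"))           # 1: latin-1 range
print(fsr_width("aü"))            # 1: ü is U+00FC, still one byte
print(fsr_width("a€"))            # 2: € is U+20AC, needs two bytes
print(fsr_width("a\U0001d11e"))   # 4: a non-BMP character

Because every character in a given string then occupies the same number of bytes, indexing and slicing stay simple pointer arithmetic, which is point 3.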
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sun, Jul 28, 2013 at 4:52 PM, Michael Torrie torr...@gmail.com wrote: Is my understanding of these things wrong? No, your understanding of those matters is fine. There's just one area you seem to be misunderstanding; you appear to think that jmf actually cares about logical argument. I gave up on that theory a long time ago, and now I respond for the benefit of those reading, rather than jmf himself. I've also given up on trying to figure out what he actually wants; the nearest I can come up with is that he's King Gama-esque - that he just wants to complain. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 7/28/2013 11:52 AM, Michael Torrie wrote: 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, Not necessarily so. See below. and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. Slicing is at least O(m) where m is the length of the slice. UTF-8, 16 and any variable-width encoding are always O(n). I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string. This uses an auxiliary array of k ints. An auxiliary array of n ints would make UTF-16 lookup O(1), but then one is using more space than with UTF-32. Similar comments apply to UTF-8. The unicode standard says that a single string should use exactly one coding scheme. It does *not* say that all strings in an application must use the same scheme. I just rechecked a few days ago. It also does not say that an application cannot associate additional data with a string to make processing of the string easier. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
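Terry's auxiliary-array idea can be sketched with the bisect module. Illustrative only: the names are invented here, it assumes well-formed UTF-16, and a real implementation would live next to the string's internal buffer:

from bisect import bisect_left

def build_aux(s):
    # Character indices of the k non-BMP characters, in order.
    return [i for i, c in enumerate(s) if ord(c) > 0xFFFF]

def unit_offset(aux, char_index):
    # Every astral char before char_index contributes one extra UTF-16
    # code unit, so one binary search gives the offset: O(log2 k),
    # and effectively O(1) when k == 0.
    return char_index + bisect_left(aux, char_index)

s = "a\U0001d11eb\U0001d11ec"      # 5 characters, 7 UTF-16 code units
units = s.encode("utf-16-le")      # stand-in for the internal buffer
aux = build_aux(s)                 # [1, 3]
off = unit_offset(aux, 4)          # s[4] starts at code unit 6
print(units[2*off:2*off+2].decode("utf-16-le"))   # 'c'

The auxiliary array costs k ints; for the common all-BMP case it is empty and lookup degenerates to plain indexing.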
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sun, Jul 28, 2013 at 6:36 PM, Terry Reedy tjre...@udel.edu wrote: I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string. Which is an optimization choice that favours strings containing very few non-BMP characters. To justify the extra complexity of out-of-band storage, you would need to be working with almost exclusively the BMP. That would drastically improve jmf's microbenchmarks which do exactly that, but it would penalize strings that are almost exclusively higher-codepoint characters. Its quality, then, would be based on a major survey of string usage: are there enough strings with mostly-BMP-but-a-few-SMP? Bearing in mind that pure BMP is handled better by PEP 393, so this is only of value when there are actually those mixed strings. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sunday, July 28, 2013 05:53:22 UTC+2, Ian wrote: On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk holds intrinsically the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, it's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16 without an auxiliary data structure that is an O(n) operation. 2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*100 + 'c')
126
>>> sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. Large strings in practical usage do not need to be resized like this often. Python 3.3 has been in production use for months now, and you still have yet to produce any real-world application code that demonstrates a performance regression. If there is no real-world regression, then there is no problem. 3) Unicode compliance. We know retrospectively that latin-1 was a bad choice. Unusable for 17 European languages. Believe it or not, 20 years of Unicode incubation is not long enough to learn it. When discussing once with a French Python core dev, one with commit access, he did not know one cannot use latin-1 for the French language! Probably because for many French strings, one can. As far as I am aware, the only characters that are missing from Latin-1 are the Euro sign (an unfortunate victim of history), the ligature œ (I have no doubt that many users just type oe anyway), and the rare capital Ÿ (the minuscule version is present in Latin-1). All French strings that are fortunate enough to be absent these characters can be represented in Latin-1 and so will have a 1-byte width in the FSR. -- latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 27-07-13 20:21, wxjmfa...@gmail.com wrote: utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote: On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 27-07-13 20:21, wxjmfa...@gmail.com wrote: utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. If it's done properly, garbage collection shouldn't hurt the *overall* performance of the app; most of the issues with GC timing are when one operation gets unexpectedly delayed for a GC run (making performance measurement hard, and such). It should certainly never cause your program to behave faultily, though I have seen cases where the GC run appears to cause the program to crash - something like this:

some_string = buggy_call()
...
gc()
...
print(some_string)

The buggy call mucked up the reference count, so the gc run actually wiped the string from memory - resulting in a segfault on next usage. But the GC wasn't at fault, the original call was. (Which, btw, was quite a debugging search, especially since the function in question wasn't my code.) ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 28/07/2013 19:13, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 05:53:22 UTC+2, Ian wrote: On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk holds intrinsically the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, it's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16 without an auxiliary data structure that is an O(n) operation. 2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*100 + 'c')
126
>>> sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. Large strings in practical usage do not need to be resized like this often. Python 3.3 has been in production use for months now, and you still have yet to produce any real-world application code that demonstrates a performance regression. If there is no real-world regression, then there is no problem. 3) Unicode compliance. We know retrospectively that latin-1 was a bad choice. Unusable for 17 European languages. Believe it or not, 20 years of Unicode incubation is not long enough to learn it. When discussing once with a French Python core dev, one with commit access, he did not know one cannot use latin-1 for the French language! Probably because for many French strings, one can. As far as I am aware, the only characters that are missing from Latin-1 are the Euro sign (an unfortunate victim of history), the ligature œ (I have no doubt that many users just type oe anyway), and the rare capital Ÿ (the minuscule version is present in Latin-1). All French strings that are fortunate enough to be absent these characters can be represented in Latin-1 and so will have a 1-byte width in the FSR. -- latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1

One byte per codepoint.

>>> sys.getsizeof('üü') - sys.getsizeof('ü')
1

Also one byte per codepoint.

>>> sys.getsizeof('ü') - sys.getsizeof('a')
12

Clearly there's more going on here. FSR is an optimisation. You'll always be able to find some circumstances where an optimisation makes things worse, but what matters is the overall result. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 7/28/2013 2:29 PM, Chris Angelico wrote: On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote: Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. If it's done properly, garbage collection shouldn't hurt the *overall* performance of the app; There are situations, some discussed on this list, where doing gc 'right' means turning off the cycle garbage collector. As I remember, an example is creating a list of a million tuples, which otherwise triggers a lot of useless background bookkeeping. The cyclic gc is tuned for 'normal' use patterns. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
Le dimanche 28 juillet 2013 17:52:47 UTC+2, Michael Torrie a écrit : On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote: Good point. FSR, nice tool for those who wish to teach Unicode. It is not every day one has such an opportunity. I had a long e-mail composed, but decided to chop it down, but it was still too long, so I ditched a lot of the context, which jmf also seems to do. Apologies. 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32 is an official encoding. FSR only differs from UTF-32 in that the padding zeros are stripped off such that it is stored in the most compact form that can handle all the characters in the string, which is always known at string creation time. Now you can argue many things, but to say FSR is not unicode compliant is quite a stretch! What unicode entities or characters cannot be stored in strings using FSR? What sequences of bytes in FSR result in invalid Unicode entities? 2. strings in Python *never change*. They are immutable. The + operator always copies strings character by character into a new string object, even if Python had used UTF-8 internally. If you're doing a lot of string concatenations, perhaps you're using the wrong data type. A byte buffer might be better for you, where you can stuff utf-8 sequences into it to your heart's content. 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any variable-width encoding are always O(n). A lot slower! 4. Unicode is, well, unicode. You seem to hop all over the place from talking about code points to bytes to bits, using them all interchangeably. And now you seem to be claiming that a particular byte encoding standard is by definition unicode (UTF-8). Or at least that's how it sounds. And also claim FSR is not compliant with unicode standards, which appears to me to be completely false. Is my understanding of these things wrong? -- Compare these (a BDFL example, where I'm using a non-ascii char) Py 3.2 (narrow build) timeit.timeit("a = 'hundred'; 'x' in a") 0.09897159682121348 timeit.timeit("a = 'hundre€'; 'x' in a") 0.09079501961732461 sys.getsizeof('d') 32 sys.getsizeof('€') 32 sys.getsizeof('dd') 34 sys.getsizeof('d€') 34 Py3.3 timeit.timeit("a = 'hundred'; 'x' in a") 0.12183182740848858 timeit.timeit("a = 'hundre€'; 'x' in a") 0.2365732969632326 sys.getsizeof('d') 26 sys.getsizeof('€') 40 sys.getsizeof('dd') 27 sys.getsizeof('d€') 42 Tell me which one seems to be more unicode compliant? The goal of Unicode is to handle every char equally. Now, the problem: memory. Do not forget that the à la FSR mechanism for a non-ascii user is *irrelevant*. As soon as one uses one single non-ascii, your ascii feature is lost. (That's why we have all these dedicated coding schemes, utfs included). sys.getsizeof('abc' * 1000 + 'z') 3026 sys.getsizeof('abc' * 1000 + '\U00010010') 12044 A bit secret. The larger a repertoire of characters is, the more bits you need. Secret #2. You can not escape from this. jmf -- http://mail.python.org/mailman/listinfo/python-list
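Torrie's point 3 can be made concrete. A minimal sketch of why indexing into a UTF-8 buffer is O(n) (the helper is hypothetical, written for illustration): the byte offset of code point i can only be found by scanning lead bytes from the start, whereas a fixed-width layout computes it as i times the width.

    def utf8_index(buf, i):
        # Byte offset of code point i in a UTF-8 bytes object.
        # Every byte except continuations (0b10xxxxxx) starts a new
        # character, so we must walk from the beginning: O(n).
        seen = 0
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:      # not a continuation byte
                if seen == i:
                    return offset
                seen += 1
        raise IndexError(i)

    buf = 'héllo€'.encode('utf-8')
    print(utf8_index(buf, 5))   # 6: the euro sign starts at byte 6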
Re: RE Module Performance
Le dimanche 28 juillet 2013 21:04:56 UTC+2, MRAB a écrit : [snip] FSR is an optimisation. You'll always be able to find some circumstances where an optimisation makes things worse, but what matters is the overall result. Yes, I know my examples are always wrong, never real examples. 
If I point at long strings, I should point at short strings. If I point at a short string (a char), it is not long enough. Strings as dict keys? No, the problem is in the Python dict. Performance? No, that's a memory issue. Memory? No, it's a question of keeping performance. I am using this char; no, you should not, it's not common. The nabla operator in a TeX file: who is so stupid as to use that char? Many times, I'm just mimicking 'BDFL' examples, just by replacing his ascii chars by non-ascii chars ;-) And so on. To be short, it is *never* the FSR, always something else. Suggestion. Start by solving all these micro-benchmarks, all the memory cases. It's a good start, no? jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 28/07/2013 20:23, wxjmfa...@gmail.com wrote: [snip] Compare these (a BDFL example, where I'm using a non-ascii char) Py 3.2 (narrow build) Why are you using a narrow build of Python 3.2? It doesn't treat all codepoints equally (those outside the BMP can't be stored in one code unit) and, therefore, it isn't Unicode compliant! [snip] -- http://mail.python.org/mailman/listinfo/python-list
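For concreteness, a check one could run to see the narrow-build behaviour described above (the output comments contrast a 3.2 narrow build with 3.3):

    import sys

    s = '\U0001d11e'        # MUSICAL SYMBOL G CLEF, outside the BMP
    print(sys.maxunicode)   # 65535 (0xFFFF) on a narrow build, 1114111 on 3.3
    print(len(s))           # 2 on a narrow build (a surrogate pair), 1 on 3.3
    print(repr(s[0]))       # a lone surrogate on a narrow build; the whole char on 3.3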
Re: FSR and unicode compliance - was Re: RE Module Performance
Op 28-07-13 21:23, wxjmfa...@gmail.com schreef: [snip] Tell me which one seems to be more unicode compliant? Can't tell; you give no relevant information on which one could decide this question. The goal of Unicode is to handle every char equally. Not to this kind of detail, which is looking at irrelevant implementation details. Now, the problem: memory. Do not forget that the à la FSR mechanism for a non-ascii user is *irrelevant*. As soon as one uses one single non-ascii, your ascii feature is lost. (That's why we have all these dedicated coding schemes, utfs included). So? Why should that trouble me? As far as I understand, whether I have an ascii string or not is totally irrelevant to the application programmer. 
Within the application I just process strings and let the programming environment keep track of these details in a transparent way, unless you start looking at things like getsizeof, which gives you implementation details that are mostly irrelevant in deciding whether the behaviour is compliant or not. sys.getsizeof('abc' * 1000 + 'z') 3026 sys.getsizeof('abc' * 1000 + '\U00010010') 12044 A bit secret. The larger a repertoire of characters is, the more bits you need. Secret #2. You can not escape from this. And totally unimportant for deciding compliance. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
wxjmfa...@gmail.com writes: Suggestion. Start by solving all these micro-benchmarks, all the memory cases. It's a good start, no? Since you seem to be the only one who has this dramatic problem with such micro-benchmarks, which BTW have nothing to do with unicode compliance, I'd suggest *you* should find a better implementation and propose it to the core devs. An even better suggestion, with due respect, is to get a life and find something more interesting to do, or at least better arguments :-) ciao, lele. --
nickname: Lele Gaifax | When I live off what I thought yesterday,
real: Emanuele Gaifas | I will begin to fear those who copy me.
l...@metapensiero.it | -- Fortunato Depero, 1929.
-- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sun, 28 Jul 2013 12:23:04 -0700, wxjmfauth wrote: Do not forget that the à la FSR mechanism for a non-ascii user is *irrelevant*. You have been told repeatedly, Python's internals are *full* of ASCII-only strings. py dir(list) ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'] There are 45 ASCII-only strings right there, in only one built-in type, out of dozens. There are dozens, hundreds of ASCII-only strings in Python: builtin functions and classes, attributes, exceptions, internal attributes, variable names, and so on. You already know this, and yet you persist in repeating nonsense. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
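A rough way to put a number on that claim (a quick count of my own devising; exact totals vary by Python version):

    # Count ASCII-only attribute names across all built-in types.
    import builtins

    names = set()
    for obj in vars(builtins).values():
        if isinstance(obj, type):
            names.update(dir(obj))

    ascii_only = [n for n in names if all(ord(c) < 128 for c in n)]
    print(len(ascii_only), "of", len(names), "builtin attribute names are pure ASCII")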
Re: RE Module Performance
On 28 July 2013 19:29, Chris Angelico ros...@gmail.com wrote: On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote: On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: Op 27-07-13 20:21, wxjmfa...@gmail.com schreef: utf-8 or any (utf) never need and never spend their time in reencoding. So? That python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string, that doesn't matter. If you've got a real world example where one of those things noticeably slows your program down or makes the program behave faultily then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. If it's done properly, garbage collection shouldn't hurt the *overall* performance of the app; most of the issues with GC timing are when one operation gets unexpectedly delayed for a GC run (making performance measurement hard, and such). It should certainly never cause your program to behave faultily, though I have seen cases where the GC run appears to cause the program to crash - something like this:

    some_string = buggy_call()
    ...
    gc()
    ...
    print(some_string)

The buggy call mucked up the reference count, so the gc run actually wiped the string from memory - resulting in a segfault on next usage. But the GC wasn't at fault, the original call was. (Which, btw, was quite a debugging search, especially since the function in question wasn't my code.) GC does sometimes have severe impact in memory-constrained environments, though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/, about half-way down, specifically http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png . The best verification of these graphs I could find was https://blog.mozilla.org/nnethercote/category/garbage-collection/, although it's not immediately clear in Chrome's and Opera's case mainly due to none of the benchmarks pushing memory usage significantly. I also don't quite agree with the first post (sealedabstract) because I get by *fine* on 2GB memory, so I don't see why you can't on a phone. Maybe iOS is just really heavy. Nonetheless, the benchmarks aren't lying. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, 26 Jul 2013 08:46:58 -0700, wxjmfauth wrote: BTW, I'm pleased to read sequences of bits and not bytes. Again, utf transformers are producing sequences of bits, called Unicode Transformation Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32. UCS transformers are (were) producing bytes, from there the names ucs-2/4. Not only does your distinction between bits and bytes make no practical difference on nearly all hardware in common use today[1], but the Unicode Consortium disagrees with you, and defines UTF in terms of bytes: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. http://www.unicode.org/faq/utf_bom.html#gen2 [1] There may still be some old supercomputers where a byte is more than 8 bits in use, but they're unlikely to support Unicode. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
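The definition quoted above, made concrete with the standard library (the sample characters are mine): each code point maps to exactly one byte sequence under a given UTF.

    for ch in 'aé€\U0001d11e':
        print('U+{:06X} -> {}'.format(ord(ch), list(ch.encode('utf-8'))))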
Re: RE Module Performance
Le samedi 27 juillet 2013 04:05:03 UTC+2, Michael Torrie a écrit : [snip] Hmm, so if python used utf-8 internally to represent unicode strings, wouldn't that punish *all* users (not just non-ascii users) since searching a string for a certain character position requires an O(n) operation? UTF-32 I could see (and indeed that's essentially what FSR uses when necessary, does it not?), but not utf-8 or utf-16. -- Did you read my previous link? Unicode Character Encoding Model. Did you understand it? Unicode only - No FSR (I skip some points and still attempt to be correct.) Unicode is a four-step process. [ {unique set of characters} --> {unique set of code points, the labels} --> {unique set of encoded code points} ] --> implementation (bytes) First point to notice: pure unicode, [...], is different from the implementation. *This is a deliberate choice*. The critical step is the path {unique set of characters} --> {unique set of encoded code points}, in such a way that the implementation can work comfortably with this *unique* set of encoded code points. Conceptually, the implementation works with a unique set of already prepared encoded code points. This is a very critical step. To explain it in a dirty way: in the above chain, this problem is already eliminated and solved. Like the byte/char coding schemes, where this step is a no-op. Now, if you wish, this is a separate/different problem. To create this unique set of encoded code points, Unicode uses these utf(s). I repeat again, a confusing name, for the process and the result of the process. (I neglect ucs). What are these? Chunks of bits, groups of 8/16/32 bits, words. It is up to the implementation to convert these sequences of bits into bytes, ***if you wish to convert these into bytes!***. Surprise! Why not put two of the 32-bit words in a 64-bit machine? (see golang / rune / int32). Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk intrinsically holds the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, that's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character, without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Next step, and another separate problem. Why all these utf versions? It is always the same story. Some prefer the universality (utf-32) and some prefer, well, some kind of conservatism. utf-8 is more complicated, it demands more work and logically, in an expected way, some performance regression. utf-8 is more suited to producing bytes, utf16/32 to internal processing. utf-8 had no choice but to lose the indexing. And so on. Fact: all these coding schemes are working with a unique set of encoded code points (surprise again, it's like a byte string!). 
The loss of performance of utf-8 is very minimal compared to the loss of performance one can get with a multiple coding scheme. This kind of work has been done, and if my information is correct, even by the creators of utf-8. (There are sometimes good scientists). There are plenty of advantages in using utf instead of something else, and advantages in other fields than just the pure coding. utf-16/32 schemes have the advantage of ditching ascii for ever. The ascii concept no longer exists. One should also understand that all this stuff has not been created from scratch. It was a balance between existing technologies. MS stuck with the idea, no more ascii, let's use ucs-2, and the *x world broke unicode adoption as much as possible. utf-8 is one of the compromises for the adoption of Unicode. Retrospectively, a not so good compromise. Computer scientists are funny scientists. They do love to solve the problems they created themselves. - Quickly. sys.getsizeof() in the light of what I explained. 1) As this FSR works with multiple encodings, it has to keep track of the encoding. It puts it in the overhead of the str class (overhead = real overhead + encoding). In such an absurd way that a sys.getsizeof('€') 40 needs 14 bytes more than a sys.getsizeof('z') 26 You may vary the length of the str. The problem is still here. Not bad for a coding scheme. 2) Take a look at this. Get rid of the overhead. sys.getsizeof('b'*100 + 'c') 126 sys.getsizeof('b'*100 + '€') 240 What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings.
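The overhead claim in point 1 above can be probed without guessing at the layout. A hedged sketch (the byte counts are CPython 3.3 implementation details, not language guarantees): doubling a string isolates the per-character width, and subtracting it from the total leaves the fixed header, which is indeed larger for non-ASCII strings.

    import sys

    def header_bytes(ch):
        # Fixed per-string overhead: total size minus one character's width.
        width = sys.getsizeof(ch * 2) - sys.getsizeof(ch)
        return sys.getsizeof(ch) - width

    print(header_bytes('z'))        # the compact ASCII header
    print(header_bytes('\u20ac'))   # the larger non-compact header

On that reading, the 14-byte gap between 'z' and '€' is mostly header growth, plus one extra byte of character width.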
Re: RE Module Performance
On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk intrinsically holds the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, that's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character, without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16, without an auxiliary data structure, that is an O(n) operation. 2) Take a look at this. Get rid of the overhead. sys.getsizeof('b'*100 + 'c') 126 sys.getsizeof('b'*100 + '€') 240 What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. Large strings in practical usage do not need to be resized like this often. Python 3.3 has been in production use for months now, and you still have yet to produce any real-world application code that demonstrates a performance regression. If there is no real-world regression, then there is no problem. 3) Unicode compliance. We know retrospectively, latin-1, it was a bad choice. Unusable for 17 European languages. Believe it or not, 20 years of Unicode incubation is not long enough to learn it. When discussing once with a French Python core dev, one with commit access, he did not know one can not use latin-1 for the French language! Probably because for many French strings, one can. As far as I am aware, the only characters that are missing from Latin-1 are the Euro sign (an unfortunate victim of history), the ligature œ (I have no doubt that many users just type oe anyway), and the rare capital Ÿ (the minuscule version is present in Latin-1). All French strings that are fortunate enough to be absent these characters can be represented in Latin-1 and so will have a 1-byte width in the FSR. -- http://mail.python.org/mailman/listinfo/python-list
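The Latin-1 claim above is easy to check for any given French string (the sample phrases are mine, not from the thread):

    def fits_latin1(s):
        # True if every character can be encoded in Latin-1, i.e. the
        # string gets 1-byte-per-char storage under the FSR.
        try:
            s.encode('latin-1')
            return True
        except UnicodeEncodeError:
            return False

    print(fits_latin1('déjà vu'))        # True
    print(fits_latin1('un café à 3 €'))  # False: the euro sign
    print(fits_latin1('un bœuf'))        # False: the ligature œ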
Re: RE Module Performance
Le jeudi 25 juillet 2013 22:45:38 UTC+2, Ian a écrit : On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string. As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation. -- And emacs is probably working smoothly. Your comment summarized all this stuff very correctly and very succinctly. utf8/16/32? I do not care. They are all working correctly, smoothly and efficiently. In fact, these utf's already do correctly what this FSR does in a wrong way. My preference? utf32. Why? It is the simplest and consequently the best-performing choice. I'm not a narrow-minded ascii user. (I do not pretend to belong to those who are solving the quadrature of the circle, I pretend to belong to those who know that the quadrature of the circle is not solvable). Note: text processing tools, or tools that have to process characters, and the tools to build these tools, are all moving to utf32, if not already done. There are technical reasons behind this, which go beyond pure raw unicode. They are however still 100% Unicode compliant. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Le vendredi 26 juillet 2013 05:09:34 UTC+2, Michael Torrie a écrit : On 07/25/2013 11:18 AM, Steven D'Aprano wrote: JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. Now I'm even more confused. He once pointed to Go as an example of how unicode should be done in a language, yet Go uses UTF-8 I think. But I don't think UTF-8 is what JMF refers to as flexible string representation. FSR does use 1, 2 or 4 bytes per character, but each character in the string uses the same width. That's different from UTF-8 or UTF-16, which are variable width per character. - sys.getsizeof('––') - sys.getsizeof('–') I have already explained / commented this. Hint: To understand Unicode (and every coding scheme), you should understand utf. The how and the *why*. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Le vendredi 26 juillet 2013 05:20:45 UTC+2, Ian a écrit : On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: UTF-8 uses a flexible representation on a character-by-character basis. When parsing UTF-8, one needs to look at EVERY character to decide how many bytes you need to read. In Python 3, the flexible representation is on a string-by-string basis: once Python has looked at the string header, it can tell whether the *entire* string takes 1, 2 or 4 bytes per character, and the string is then fixed-width. You can't do that with UTF-8. UTF-8 does not use a flexible representation. A codec that is encoding a string in UTF-8 and examining a particular character does not have any choice of how to encode that character; there is exactly one sequence of bits that is the UTF-8 encoding for the character. Further, for any given sequence of code points there is exactly one sequence of bytes that is the UTF-8 encoding of those code points. In contrast, with the FSR there are as many as three different sequences of bytes that encode a sequence of code points, with one of them (the shortest) being canonical. That's what makes it flexible. Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. BTW, it is not necessary to use an endorsed Unicode coding scheme (utf*), a string literal would have been possible, but then one falls on memory issues. All these utf's follow the basic coding scheme. I repeat again. A coding scheme works with a unique set of characters, and its implementation works with a unique set of encoded code points (the utf's, in the case of Unicode). And again, that's why we live today with all these coding schemes, or, to take the problem from the other side, it's because one has to work with a unique set of encoded code points that all these coding schemes had to be created. utf's have not been created by newbies ;-) jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Le vendredi 26 juillet 2013 05:20:45 UTC+2, Ian a écrit : [snip] Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. - Let's be clear. I'm perfectly understanding what utf-8 is, and it's for that precise reason that I put the editor as an example on the table. This FSR is not *a* coding scheme. It is more a composite coding scheme. (And from there, all the problems). BTW, I'm pleased to read sequences of bits and not bytes. Again, utf transformers are producing sequences of bits, called Unicode Transformation Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32. UCS transformers are (were) producing bytes, from there the names ucs-2/4. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/26/2013 07:21 AM, wxjmfa...@gmail.com wrote: sys.getsizeof('––') - sys.getsizeof('–') I have already explained / commented this. Maybe it got lost in translation, but I don't understand your point with that. Hint: To understand Unicode (and every coding scheme), you should understand utf. The how and the *why*. Hmm, so if python used utf-8 internally to represent unicode strings, wouldn't that punish *all* users (not just non-ascii users) since searching a string for a certain character position requires an O(n) operation? UTF-32 I could see (and indeed that's essentially what FSR uses when necessary, does it not?), but not utf-8 or utf-16. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote: [snip] UTF-8 does not use a flexible representation. I disagree, and so does Jeremy Sanders, who first pointed out the similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the Emacs documentation again: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. And the Python FSR: To conserve memory, Python does not hold fixed-length 21-bit numbers that are codepoints of text characters within buffers and strings. Rather, Python uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 4 8-bit bytes, depending on the magnitude of the largest codepoint in the string. For example, any all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-BMP string takes up 2 bytes per character, etc. See the similarity now? Both flexibly change the width used by code-points, UTF-8 based on the code-point itself regardless of the rest of the string, Python based on the largest code-point in the string. [...] Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. Whether JMF can see the similarities between different implementations of strings or not is beside the point, those similarities do exist. As do the differences, of course, but in this case the differences are in favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8 implementation *cannot know that*, and still has to walk the string byte-by-byte checking whether the current code point requires 1, 2, 3, or 4 bytes, while a FSR implementation can simply record the fact that the string is pure Latin1 at creation time, and then treat it as fixed-width from then on. JMF claims that FSR is impossible to use efficiently, and yet he supports encoding schemes which are *less* efficient. Go figure. He tells us he has no problem with any of the established UTF encodings, and yet the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not UTF-16, since there are no surrogate pairs. But the difference is insignificant.) Having watched this issue from Day One when JMF first complained about it, I believe this is entirely about denying any benefit to ASCII users. 
Had Python implemented a system identical to the current FSR except that it added a fourth category, all ASCII, which used an eight-byte encoding scheme (thus making ASCII strings twice as expensive as strings including code points from the Supplementary Multilingual Planes), JMF would be the scheme's number one champion. I cannot see any other rational explanation for why JMF prefers broken, buggy Unicode implementations, or implementations which are equally expensive for all strings, over one which is demonstrably correct, demonstrably saves memory, and for realistic, non-contrived benchmarks, demonstrably faster, except that he wants to punish ASCII users more than he wants to support Unicode users. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: See the similarity now? Both flexibly change the width used by code- points, UTF-8 based on the code-point itself regardless of the rest of the string, Python based on the largest code-point in the string. No, I think we're just using the word flexible differently. In my view, simply being variable-width does not make an encoding flexible in the sense of the FSR. But I'm not going to keep repeating myself in order to argue about it. Having watched this issue from Day One when JMF first complained about it, I believe this is entirely about denying any benefit to ASCII users. Had Python implemented a system identical to the current FSR except that it added a fourth category, all ASCII, which used an eight-byte encoding scheme (thus making ASCII strings twice as expensive as strings including code points from the Supplementary Multilingual Planes), JMF would be the scheme's number one champion. I agree. In fact I made a similar observation back in December: http://mail.python.org/pipermail/python-list/2012-December/636942.html -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, 26 Jul 2013 22:12:36 -0600, Ian Kelly wrote: On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: See the similarity now? Both flexibly change the width used by code- points, UTF-8 based on the code-point itself regardless of the rest of the string, Python based on the largest code-point in the string. No, I think we're just using the word flexible differently. In my view, simply being variable-width does not make an encoding flexible in the sense of the FSR. But I'm not going to keep repeating myself in order to argue about it. But I paid for the full half hour! http://en.wikipedia.org/wiki/The_Argument_Sketch -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF: His most recent argument that Python should use UTF as a representation is very strange, to be honest. He's not arguing for anything, he is just hating on anything that gives even the tiniest benefit to ASCII users. This isn't about Python 3.3 hurting non-ASCII users, because that is demonstrably untrue: they are *better off* in Python 3.3. This is about denying even a tiny benefit to ASCII users. In Python 3.3, non-ASCII users have these advantages compared to previous versions:
- strings will usually take less memory, and aside from trivial changes to the object header, they never take more memory than a wide build would use;
- consequently nearly all objects will take less memory (especially builtins and standard library objects, which are all ASCII), since objects contain dozens of internal strings (attribute and method names in __dict__, class name, etc.);
- consequently whole-application benchmarks show most applications will use significantly less memory, which leads to faster speeds;
- you cannot break surrogate pairs apart by accident, which you can do in narrow builds;
- in previous versions, code which works when run in a wide build may fail in a narrow build, but that is no longer an issue since the distinction between wide and narrow builds is gone;
- Latin1 users, which includes JMF himself, will likewise see memory savings, since Latin1 strings will take half the size of narrow builds and a quarter the size of wide builds.
The cost of all these benefits is a small overhead when creating a string in the first place, and some purely internal added complication to the string implementation. I'm the first to argue against complication unless there is a corresponding benefit. This is a case where the benefit has proven itself doubly: Python 3.3's Unicode implementation is *more correct* than before, and it uses less memory to do so. The cons of UTF are apparent and widely known. The main con is that UTF strings are O(n) for indexing a position within the string. Not so for UTF-32. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka storch...@gmail.com wrote: 24.07.13 21:15, Chris Angelico написав(ла): To my mind, exposing UTF-16 surrogates to the application is a bug to be fixed, not a feature to be maintained. Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates area) to represent undecodable bytes with surrogateescape error handler. That's a deliberate and conscious use of the codepoints; that's not what I'm talking about here. Suppose you read a UTF-8 stream of bytes from a file, and decode them into your language's standard string type. At this point, you should be working with a string of Unicode codepoints: \22\341\210\264\360\222\215\205 -- \x12\u1234\U00012345 The incoming byte stream has a length of 8, the resulting character stream has a length of 3. Now, if the language wants to use UTF-16 internally, it's free to do so: 0012 1234 d808 df45 When I referred to exposing surrogates to the application, this is what I'm talking about. If decoding the above byte stream results in a length 4 string where the last two are \xd808 and \xdf45, then it's exposing them. If it's a length 3 string where the last is \U00012345, then it's hiding them. To be honest, I don't imagine I'll ever see a language that stores strings in UTF-16 and then exposes them to the application as UTF-32; there's very little point. But such *is* possible, and if it's working closely with libraries that demand UTF-16, it might well make sense to do things that way. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
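Chris's byte-stream example, in runnable form (Python 3.3+, where the result is the same on every build): eight UTF-8 bytes decode to three code points, and the surrogateescape handler Serhiy mentions smuggles undecodable bytes through as lone surrogates.

    data = bytes([0o22, 0o341, 0o210, 0o264, 0o360, 0o222, 0o215, 0o205])
    s = data.decode('utf-8')
    print(len(data), len(s))              # 8 3
    print(s == '\x12\u1234\U00012345')    # True

    raw = b'abc\xff'                      # 0xFF is invalid UTF-8
    t = raw.decode('utf-8', errors='surrogateescape')
    print(repr(t))                        # 'abc\udcff': the byte became U+DCFF
    print(t.encode('utf-8', errors='surrogateescape') == raw)  # True, round-trips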
Re: RE Module Performance
On Thu, 25 Jul 2013 00:34:24 +1000, Chris Angelico wrote: But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. That's a misleading way to put it. Using immutable strings as editor buffers might be a bad way to implement all but the most trivial, low-performance (i.e. slow) editor, but the basic concept of PEP 393, picking an internal representation of the text based on its contents, is not. That's just normal. The only difference with PEP 393 is that the choice is made on the fly, at runtime, instead of decided in advance by the programmer. I expect that the PEP 393 concept of optimizing memory per string buffer would work well in an editor. However the internal buffer is arranged, you can safely assume that each chunk of text (word, sentence, paragraph, buffer...) will very rarely shift from all Latin-1 to all BMP to including SMP chars. So, for example, entering an SMP character will need to immediately up-cast the chunk from 1 byte per char to 4 bytes per char, which is relatively pricey, but it's a one-off cost. Down-casting when the SMP character is deleted doesn't need to be done immediately, it can be performed when the application is idle. If the chunks are relatively small (say, a paragraph rather than multiple pages of text) then even that initial conversion will be invisible. A fast touch typist hits a key about every 0.1 of a second; if it takes a millisecond to convert the chunk, you wouldn't even notice the delay. You can copy and up-cast a lot of bytes in a millisecond. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
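A toy sketch of that per-chunk idea (my illustration, not Steven's code): store each chunk in the narrowest array that fits its widest character, and up-cast the whole chunk, once, when a wider character arrives.

    from array import array

    def best_typecode(codepoint):
        # Narrowest unsigned array type that can hold this code point;
        # conveniently 'B' < 'H' < 'L' also sorts by width.
        return 'B' if codepoint < 0x100 else 'H' if codepoint < 0x10000 else 'L'

    def insert_char(chunk, pos, ch):
        # Up-cast the whole chunk (a one-off O(len) copy) if ch does not fit.
        tc = best_typecode(ord(ch))
        if tc > chunk.typecode:
            chunk = array(tc, chunk)
        chunk.insert(pos, ord(ch))
        return chunk

    chunk = array('B', b'hello')                  # all Latin-1: 1 byte per char
    chunk = insert_char(chunk, 5, '\U0001d11e')   # up-casts the chunk once
    print(chunk.typecode, len(chunk))             # L 6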
Re: RE Module Performance
On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote: If nobody had ever thought of doing a multi-format string representation, I could well imagine the Python core devs debating whether the cost of UTF-32 strings is worth the correctness and consistency improvements... and most likely concluding that narrow builds get abolished. And if any other language (eg ECMAScript) decides to move from UTF-16 to UTF-32, I would wholeheartedly support the move, even if it broke code to do so. Unfortunately, so long as most language designers are European-centric, there is going to be a lot of push-back against any attempt to fix (say) Javascript, or Java just for the sake of a bunch of dead languages in the SMPs. Thank goodness for emoji. Wait til the young kids start complaining that their emoticons and emoji are broken in Javascript, and eventually it will get fixed. It may take a decade, for the young kids to grow up and take over Javascript from the old-codgers, but it will happen. To my mind, exposing UTF-16 surrogates to the application is a bug to be fixed, not a feature to be maintained. This, times a thousand. It is *possible* to have non-buggy string routines using UTF-16, but the implementation is a lot more complex than most language developers can be bothered with. I'm not aware of any language that uses UTF-16 internally that doesn't give wrong results for surrogate pairs. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
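Python 3.3 itself is a handy existence proof here: the operations that give wrong answers in languages exposing UTF-16 code units behave correctly (the sample string is mine).

    s = 'a\U0001f600b'                # an emoji outside the BMP
    print(len(s))                     # 3: code points, not code units
    print(s[1] == '\U0001f600')       # True: indexing cannot return half a pair
    print(s[::-1] == 'b\U0001f600a')  # True: reversing keeps the pair intact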
Re: RE Module Performance
On Thu, Jul 25, 2013 at 5:02 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 00:34:24 +1000, Chris Angelico wrote: But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. That's a misleading way to put it. Using immutable strings as editor buffers might be a bad way to implement all but the most trivial, low- performance (i.e. slow) editor, but the basic concept of PEP 393, picking an internal representation of the text based on its contents, is not. That's just normal. The only difference with PEP 393 is that the choice is made on the fly, at runtime, instead of decided in advance by the programmer. Maybe I worded it poorly, but my point was the same as you're saying here: that a Python string is a poor buffer for editing, regardless of PEP 393. It's not that PEP 393 makes Python strings worse for writing a text editor, it's that immutability does that. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote: [snip] Thank goodness for emoji. Wait til the young kids start complaining that their emoticons and emoji are broken in Javascript, and eventually it will get fixed. It may take a decade, for the young kids to grow up and take over Javascript from the old-codgers, but it will happen. I don't know that that'll happen like that. Emoticons aren't broken in Javascript - you can use them just fine. You only start seeing problems when you index into that string. People will start to wonder why, for instance, a 500 character maximum field deducts two from the limit when an emoticon goes in. Example:

    Type here:<br><textarea id="content" oninput="showlimit(this)"></textarea>
    <br>You have <span id="limit1">500</span> characters left (self.value.length).
    <br>You have <span id="limit2">500</span> characters left (self.textLength).
    <script>
    function showlimit(self) {
        document.getElementById("limit1").innerHTML = 500 - self.value.length;
        document.getElementById("limit2").innerHTML = 500 - self.textLength;
    }
    </script>

I've included an attribute documented here[1] as the codepoint length of the control's value, but in Chrome on Windows, it still counts UTF-16 code units. However, I very much doubt that this will result in language changes. People will just live with it. Chinese and Japanese users will complain, perhaps, and the developers will write it off as whinging, and just say "That's what the internet does". Maybe, if you're really lucky, they'll acknowledge that that's what JavaScript does, but even then I doubt it'd result in language changes. To my mind, exposing UTF-16 surrogates to the application is a bug to be fixed, not a feature to be maintained. This, times a thousand. It is *possible* to have non-buggy string routines using UTF-16, but the implementation is a lot more complex than most language developers can be bothered with. I'm not aware of any language that uses UTF-16 internally that doesn't give wrong results for surrogate pairs. The problem isn't the underlying representation, the problem is what gets exposed to the application. Once you've decided to expose codepoints to the app (abstracting over your UTF-16 underlying representation), the change to using UTF-32, or mimicking PEP 393, or some other structure, is purely internal and an optimization. So I doubt any language will use UTF-16 internally and UTF-32 to the app. It'd be needlessly complex. ChrisA [1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement -- http://mail.python.org/mailman/listinfo/python-list
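The same count in Python terms, for contrast (my example string): len() reports code points, while the UTF-16 unit count that JavaScript's .length reports can be recovered explicitly.

    s = 'smile: \U0001f600'
    print(len(s))                           # 8 code points
    print(len(s.encode('utf-16-le')) // 2)  # 9 UTF-16 code units, what JS counts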
Re: RE Module Performance
On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote: On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote: If nobody had ever thought of doing a multi-format string representation, I could well imagine the Python core devs debating whether the cost of UTF-32 strings is worth the correctness and consistency improvements... and most likely concluding that narrow builds get abolished. And if any other language (eg ECMAScript) decides to move from UTF-16 to UTF-32, I would wholeheartedly support the move, even if it broke code to do so. Unfortunately, so long as most language designers are European-centric, there is going to be a lot of push-back against any attempt to fix (say) Javascript, or Java just for the sake of a bunch of dead languages in the SMPs. Thank goodness for emoji. Wait til the young kids start complaining that their emoticons and emoji are broken in Javascript, and eventually it will get fixed. It may take a decade, for the young kids to grow up and take over Javascript from the old-codgers, but it will happen. I don't know that that'll happen like that. Emoticons aren't broken in Javascript - you can use them just fine. You only start seeing problems when you index into that string. People will start to wonder why, for instance, a 500 character maximum field deducts two from the limit when an emoticon goes in. I get that. I meant *Javascript developers*, not end-users. The young kids today who become Javascript developers tomorrow will grow up in a world where they expect to be able to write band names like ▼□■□■□■ (yes, really, I didn't make that one up) and have it just work. Okay, all those characters are in the BMP, but emoji aren't, and I guarantee that even as we speak some new hipster band is trying to decide whether to name themselves Smiling or Crying . :-) It is *possible* to have non-buggy string routines using UTF-16, but the implementation is a lot more complex than most language developers can be bothered with. I'm not aware of any language that uses UTF-16 internally that doesn't give wrong results for surrogate pairs. The problem isn't the underlying representation, the problem is what gets exposed to the application. Once you've decided to expose codepoints to the app (abstracting over your UTF-16 underlying representation), the change to using UTF-32, or mimicking PEP 393, or some other structure, is purely internal and an optimization. So I doubt any language will use UTF-16 internally and UTF-32 to the app. It'd be needlessly complex. To be honest, I don't understand what you are trying to say. What I'm trying to say is that it is possible to use UTF-16 internally, but *not* assume that every code point (character) is represented by a single 2-byte unit. For example, the len() of a UTF-16 string should not be calculated by counting the number of bytes and dividing by two. 
You actually need to walk the string, inspecting each double-byte:

    # calculate length
    count = 0
    inside_surrogate = False
    for bb in buffer:  # get two bytes at a time
        if is_lower_surrogate(bb):
            inside_surrogate = True
            continue
        if is_upper_surrogate(bb):
            if inside_surrogate:
                count += 1
                inside_surrogate = False
                continue
            raise ValueError("missing lower surrogate")
        if inside_surrogate:
            break
        count += 1
    if inside_surrogate:
        raise ValueError("missing upper surrogate")

Given immutable strings, you could validate the string once, on creation, and from then on assume they are well-formed:

    # calculate length, assuming the string is well-formed:
    count = 0
    skip = False
    for bb in buffer:  # get two bytes at a time
        if skip:
            # second half of a surrogate pair, already counted
            skip = False
            continue
        if is_surrogate(bb):
            skip = True
        count += 1

String operations such as slicing become much more complex once you can no longer assume a 1:1 relationship between code points and code units, whether they are 1, 2 or 4 bytes. Most (all?) language developers don't handle that complexity, and push responsibility for it back onto the coder using the language. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wednesday, July 24, 2013 4:47:36 PM UTC+2, Michael Torrie wrote: On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote: Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Really? Enlighten me. Personally, I would never use UTF as a representation *in memory* for a unicode string if it were up to me. Why? Because UTF characters are not uniform in byte width so accessing positions within the string is terribly slow and has to always be done by starting at the beginning of the string. That's at minimum O(n) compared to FSR's O(1). Surely you understand this. Do you dispute this fact? UTF is a great choice for interchange, though, and indeed that's what it was designed for. Are you calling for UTF to be adopted as the internal, in-memory representation of unicode? Or would you simply settle for UCS-4? Please be clear here. What are you saying? Short example. Writing an editor with something like the FSR is simply impossible (properly). How? FSR is just an implementation detail. It could be UCS-4 and it would also work. - A coding scheme works with a unique set of characters (the repertoire), and the implementation (the programming) works with a unique set of encoded code points. The critical step is the path {unique set of characters} -> {unique set of encoded code points}. Fact: there is no other way to do it properly (this explains why we have to live today with all these coding schemes, and why so many coding schemes had to be created). How to understand it? With a sheet of paper and a pencil. In the byte string world, this step is a no-op. In Unicode, it is exactly the purpose of a utf to achieve this step. utf: a confusing name covering at the same time the process and the result of the process. A utf chunk, a series of bits (not bytes), holds intrinsically the information about the character it is representing. Other exotic coding schemes like iso6937 or CID-fonts work in the same way. Unicode with the help of utf(s) does not differ from the basic rule. - ucs-2: ucs-2 is a perfectly and correctly working coding scheme. ucs-2 is not different from the other coding schemes and does not behave differently (cp... or iso-... or ...). It only covers a smaller repertoire. - utf32: as I pointed out many times, you are already using it (maybe without knowing it). Where? In fonts (OpenType technology), rendering engines, pdf files. Why? Because there is no better way to do it. -- The Unicode table (its construction) is a problem per se. It is not a technical problem, but a very important linguistic aspect of Unicode. See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0 -- If you are not understanding my editor analogy, one other proposed exercise: build/create a flexible iso-8859-X coding scheme. You will quickly understand where the bottleneck is. Two working ways: - stupidly, with an editor and your fingers. - lazily, with a sheet of paper and your head. About my benchmarks: no offense. You are not understanding them, because you do not understand what this FSR does and the coding of characters. It's a little bit of a vicious circle. Conceptually, this FSR is spending its time solving the problem it creates itself, with plenty of side effects. - There is a clear difference between FSR and ucs-4/utf32. - See also: http://www.unicode.org/reports/tr17/ (In my mind, quite dry and not easy to understand on a first reading). 
jmf -- http://mail.python.org/mailman/listinfo/python-list
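The two steps jmf keeps distinguishing can be separated in a few lines of Python: ord() performs the character-to-code-point step once, and a UTF merely decides how that one integer is laid out in bytes. A small illustration (Python 3; the byte counts are the point, not the exact repr):

ch = '€'
print(hex(ord(ch)))    # 0x20ac: the encoded code point, one integer per character
for codec in ('utf-8', 'utf-16-be', 'utf-32-be'):
    print(codec, len(ch.encode(codec)), 'bytes')
# utf-8 3 bytes, utf-16-be 2 bytes, utf-32-be 4 bytes: one mapping, three layouts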
Re: RE Module Performance
On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: What I'm trying to say is that it is possible to use UTF-16 internally, but *not* assume that every code point (character) is represented by a single 2-byte unit. For example, the len() of a UTF-16 string should not be calculated by counting the number of bytes and dividing by two. You actually need to walk the string, inspecting each double-byte Anything's possible. But since underlying representations can be changed fairly easily (relative term of course - it's a lot of work, but it can be changed in a single release, no deprecation required or anything), there's very little reason to continue using UTF-16 underneath. May as well switch to UTF-32 for convenience, or PEP 393 for convenience and efficiency, or maybe some other system that's still mostly fixed-width. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 7:27 PM, wxjmfa...@gmail.com wrote: A coding scheme works with a unique set of characters (the repertoire), and the implementation (the programming) works with a unique set of encoded code points. The critical step is the path {unique set of characters} -> {unique set of encoded code points} That's called Unicode. It maps the character 'A' to the code point U+0041 and so on. Code points are integers. In fact, they are very well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

In the byte string world, this step is a no-op. In Unicode, it is exactly the purpose of a utf to achieve this step. utf: a confusing name covering at the same time the process and the result of the process. A utf chunk, a series of bits (not bytes), holds intrinsically the information about the character it is representing. No, now you're looking at another level: how to store codepoints in memory. That demands that they be stored as bits and bytes, because PC memory works that way. utf32: as I pointed out many times, you are already using it (maybe without knowing it). Where? In fonts (OpenType technology), rendering engines, pdf files. Why? Because there is no better way to do it. And UTF-32 is an excellent system... as long as you're okay with spending four bytes for every character. See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0 I refuse to click this link. Give us a link to the python-list@python.org archive, or gmane, or something else more suited to the audience. I'm not going to Google Groups just to figure out what you're saying. If you are not understanding my editor analogy. One other proposed exercise. Build/create a flexible iso-8859-X coding scheme. You will quickly understand where the bottleneck is. Two working ways: - stupidly with an editor and your fingers. - lazily with a sheet of paper and your head. What has this to do with the editor? There is a clear difference between FSR and ucs-4/utf32. Yes. Memory usage. PEP 393 strings might take up half or even a quarter of what they'd take up in fixed UTF-32. Other than that, there's no difference. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
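Chris's "four bytes for every character" is easy to verify, and it holds no matter what the characters are; the counts below are encoded payload only, with no object overhead (Python 3):

for s in ('abc', 'a€\U0001D11E'):
    print(len(s), 'chars ->', len(s.encode('utf-32-be')), 'bytes in UTF-32')
# 3 chars -> 12 bytes in UTF-32
# 3 chars -> 12 bytes in UTF-32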
Re: RE Module Performance
wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. ... [1] This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode. Jeremy -- http://mail.python.org/mailman/listinfo/python-list
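For anything inside the Unicode range, the 1-to-5-byte scheme quoted above is ordinary UTF-8; only Emacs's private extension needs a fifth byte. The standard widths follow from the code point alone, as this sketch checks (Python 3; utf8_width is a name made up here):

def utf8_width(cp):
    # bytes needed for code point cp under standard UTF-8
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

for ch in ('a', 'é', '€', '\U0001D11E'):
    assert utf8_width(ord(ch)) == len(ch.encode('utf-8'))
    print(repr(ch), utf8_width(ord(ch)))   # 1, 2, 3 and 4 bytes respectively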
Re: RE Module Performance
On 07/25/2013 09:36 AM, Jeremy Sanders wrote: wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. ... [1] This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode. Jeremy Wow! The thread that I started has changed a lot and lived a long time. I look forward to its first birthday (^u^). Devyn Collier Johnson -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. [1] This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode. Do you know what those characters not unified with Unicode are? Is there a list somewhere? I've read all of the pages from here to no avail: http://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 3:18 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Quad Error Demonstrated. I never got past the level of Canis Latinicus in debating class. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thursday, July 25, 2013 12:14:46 PM UTC+2, Chris Angelico wrote: On Thu, Jul 25, 2013 at 7:27 PM, wxjmfa...@gmail.com wrote: A coding scheme works with a unique set of characters (the repertoire), and the implementation (the programming) works with a unique set of encoded code points. The critical step is the path {unique set of characters} -> {unique set of encoded code points} That's called Unicode. It maps the character 'A' to the code point U+0041 and so on. Code points are integers. In fact, they are very well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

In the byte string world, this step is a no-op. In Unicode, it is exactly the purpose of a utf to achieve this step. utf: a confusing name covering at the same time the process and the result of the process. A utf chunk, a series of bits (not bytes), holds intrinsically the information about the character it is representing. No, now you're looking at another level: how to store codepoints in memory. That demands that they be stored as bits and bytes, because PC memory works that way. utf32: as I pointed out many times, you are already using it (maybe without knowing it). Where? In fonts (OpenType technology), rendering engines, pdf files. Why? Because there is no better way to do it. And UTF-32 is an excellent system... as long as you're okay with spending four bytes for every character. See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0 I refuse to click this link. Give us a link to the python-list@python.org archive, or gmane, or something else more suited to the audience. I'm not going to Google Groups just to figure out what you're saying. If you are not understanding my editor analogy. One other proposed exercise. Build/create a flexible iso-8859-X coding scheme. You will quickly understand where the bottleneck is. Two working ways: - stupidly with an editor and your fingers. - lazily with a sheet of paper and your head. What has this to do with the editor? There is a clear difference between FSR and ucs-4/utf32. Yes. Memory usage. PEP 393 strings might take up half or even a quarter of what they'd take up in fixed UTF-32. Other than that, there's no difference. ChrisA Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 5:07 AM, wxjmfa...@gmail.com wrote: Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

Most of the cost is in those two apostrophes, look:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof(a)
8

Okay, that's slightly unfair (bonus points: figure out what I did to make this work; there are at least two right answers) but still, look at what an empty string costs:

>>> sys.getsizeof('')
25

Or look at the difference between one of these characters and two:

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1
>>> sys.getsizeof('––') - sys.getsizeof('–')
2

That's what the characters really cost. The overhead is fixed. It is, in fact, almost completely insignificant. The storage requirement for a non-ASCII, BMP-only string converges to two bytes per character. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
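Chris's marginal-cost measurement generalizes to all three FSR widths; spelled out as a loop it reads (CPython 3.3+; the absolute getsizeof figures vary by platform, the differences do not):

import sys

for ch in ('a', '€', '\U0001D11E'):
    marginal = sys.getsizeof(ch * 2) - sys.getsizeof(ch)
    print(repr(ch), marginal)   # 1, 2 and 4: the true per-character storage cost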
RE: RE Module Performance
Chris Angelico wrote: On Fri, Jul 26, 2013 at 5:07 AM, wxjmfa...@gmail.com wrote: Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

Most of the cost is in those two apostrophes, look:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof(a)
8

Okay, that's slightly unfair (bonus points: figure out what I did to make this work; there are at least two right answers) but still, look at what an empty string costs: I like bonus points. :)

>>> a = None
>>> sys.getsizeof(a)
8

Not sure what the other right answer is... booleans take 12 bytes (on 2.6).

>>> sys.getsizeof('')
25

Or look at the difference between one of these characters and two:

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1
>>> sys.getsizeof('––') - sys.getsizeof('–')
2

That's what the characters really cost. The overhead is fixed. It is, in fact, almost completely insignificant. The storage requirement for a non-ASCII, BMP-only string converges to two bytes per character. ChrisA -- http://mail.python.org/mailman/listinfo/python-list Ramit -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, Jul 24, 2013 at 9:34 AM, Chris Angelico ros...@gmail.com wrote: On Thu, Jul 25, 2013 at 12:17 AM, David Hutto dwightdhu...@gmail.com wrote: I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t jmf's point is more about writing the editor widget (Scintilla, as opposed to SciTE), which most people will never bother to do. I've written several text editors, always by embedding someone else's widget, and therefore not concerning myself with its internal string representation. Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; whenever text gets inserted, the string gets split at that point, and a new string created for the insert (which also means that an Undo operation simply removes one entire string). In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. Not that that bothers jmf... I think you've just motivated me to finally get around to writing the custom output widget for my MUD client. Of course that will be simpler than a standard rich text editor widget, since it will never receive input from the user and modifications will (typically) always come in the form of append operations. I intend to write it in pure Python (well, wxPython), however. -- http://mail.python.org/mailman/listinfo/python-list
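Chris's list-of-strings design is small enough to sketch directly. This is an illustrative toy, not anyone's production widget: an insert splits one piece instead of rebuilding the whole text, and undoing that insert would simply delete the piece it added.

class PieceBuffer:
    def __init__(self, text=''):
        # the document is a list of immutable string pieces
        self.pieces = [text] if text else []

    def insert(self, offset, text):
        # split the piece containing `offset` and slot the new text in
        i = 0
        for n, piece in enumerate(self.pieces):
            if offset <= i + len(piece):
                k = offset - i
                parts = [p for p in (piece[:k], text, piece[k:]) if p]
                self.pieces[n:n + 1] = parts
                return
            i += len(piece)
        self.pieces.append(text)

    def text(self):
        return ''.join(self.pieces)

buf = PieceBuffer('hello world')
buf.insert(5, ',')
print(buf.text())     # hello, world
print(buf.pieces)     # ['hello', ',', ' world']: only one piece was split

Under PEP 393 each piece is stored at its own width, so an all-ASCII paragraph never pays for an emoji inserted somewhere else in the buffer.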
Re: RE Module Performance
On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string. As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote: On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string. As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation. UTF-8 uses a flexible representation on a character-by-character basis. When parsing UTF-8, one needs to look at EVERY character to decide how many bytes you need to read. In Python 3, the flexible representation is on a string-by-string basis: once Python has looked at the string header, it can tell whether the *entire* string takes 1, 2 or 4 bytes per character, and the string is then fixed-width. You can't do that with UTF-8. To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else:
        count = 4
    while not done:
        char = convert(next(count bytes))

# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if b uses 1 byte:
            char = convert(b)
        elif b uses 2 bytes:
            char = convert(b, next(1 byte))
        elif b uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))

So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs's variation can in fact require more bytes per character than either. (UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.) I'm not surprised that JMF would prefer UTF-8 -- he is completely out of his depth, and is a fine example of the Dunning-Kruger effect in action. He is so sure he is right based on so little evidence. One advantage of UTF-8 is that for some BMP characters, you can get away with only three bytes instead of four. For transmitting data over the wire, or storage on disk, that's potentially up to a 25% reduction in space, which is not to be sneezed at. (Although in practice it's usually much less than that, since the most common characters are encoded to 1 or 2 bytes, not 4). But that comes at the cost of much more runtime overhead, which in my opinion makes UTF-8 a second-class string representation compared to fixed-width representations. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
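The per-character decision above is exactly why indexing into UTF-8 is a scan rather than an array lookup. A sketch of character lookup by position (Python 3, assuming well-formed input; the lead-byte thresholds are the standard UTF-8 ones):

def utf8_char_at(data, index):
    # find the code point at `index` by scanning from the start: O(index)
    pos = 0
    for _ in range(index + 1):
        lead = data[pos]
        if lead < 0x80:
            size = 1
        elif lead < 0xE0:
            size = 2
        elif lead < 0xF0:
            size = 3
        else:
            size = 4
        start, pos = pos, pos + size
    return data[start:start + size].decode('utf-8')

s = 'a€\U0001D11E!'
assert utf8_char_at(s.encode('utf-8'), 2) == '\U0001D11E'

A fixed-width layout answers the same question with pos = index * width, which is the O(1) lookup the FSR preserves.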
Re: RE Module Performance
On 07/25/2013 01:07 PM, wxjmfa...@gmail.com wrote: Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

That's meaningless. You're comparing the overhead of a string object itself (a one-time cost anyway), not the overhead of storing the actual characters. This is the only meaningful comparison:

sys.getsizeof('––') - sys.getsizeof('–')
sys.getsizeof('aa') - sys.getsizeof('a')

Actually I'm not even sure what your point is after all this time of railing against FSR. You have said in the past that Python penalizes users of character sets that require wider byte encodings, but what would you have us do? Use 4-byte characters and penalize everyone equally? Use 2-byte characters that incorrectly expose surrogate pairs for some characters? Use UTF-8 in memory and do O(n) indexing? Are your programs (actual programs, not contrived benchmarks) actually slower because of FSR? Is FSR incorrect? If so, according to what part of the Unicode standard? I'm not trying to troll, or feed the troll. I'm actually curious. I think perhaps you feel that many of us who don't use unicode often don't understand unicode because some of us don't understand you. If so, I'm not sure that's actually true. -- http://mail.python.org/mailman/listinfo/python-list
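The one genuine cost Michael alludes to, the penalty jmf keeps measuring, is also easy to show honestly: appending a single wider character re-widens the entire string. A quick check (CPython 3.3+; exact totals depend on your build's per-object overhead):

import sys

narrow = 'a' * 1000
widened = narrow + '€'    # one BMP character forces 2-byte units throughout

print(sys.getsizeof(narrow))    # about 1000 bytes of payload: 1 byte per char
print(sys.getsizeof(widened))   # about 2002 bytes of payload: 2 bytes per char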
Re: RE Module Performance
On 07/25/2013 11:18 AM, Steven D'Aprano wrote: JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. Now I'm even more confused. He once pointed to Go as an example of how unicode should be done in a language, yet Go uses UTF-8, I think. But I don't think UTF-8 is what JMF refers to as flexible string representation. FSR does use 1, 2 or 4 bytes per character, but each character in the string uses the same width. That's different from UTF-8 or UTF-16, which are variable width per character. -- http://mail.python.org/mailman/listinfo/python-list
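CPython exposes no public API for the width PEP 393 chose, but the width is fully determined by the widest character, so it can be inferred. A sketch (needs Python 3.4+ for max() with a default; fsr_width is a name invented here):

def fsr_width(s):
    # per-character unit size the FSR would pick: the widest char decides
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 1    # Latin-1 storage
    if widest < 0x10000:
        return 2    # UCS-2 storage
    return 4        # UCS-4 storage

print(fsr_width('abc'))               # 1
print(fsr_width('abc€'))              # 2: one BMP char widens every slot
print(fsr_width('abc\U0001F600'))     # 4: an astral char widens them further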
Re: RE Module Performance
On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: UTF-8 uses a flexible representation on a character-by-character basis. When parsing UTF-8, one needs to look at EVERY character to decide how many bytes you need to read. In Python 3, the flexible representation is on a string-by-string basis: once Python has looked at the string header, it can tell whether the *entire* string takes 1, 2 or 4 bytes per character, and the string is then fixed-width. You can't do that with UTF-8. UTF-8 does not use a flexible representation. A codec that is encoding a string in UTF-8 and examining a particular character does not have any choice of how to encode that character; there is exactly one sequence of bits that is the UTF-8 encoding for the character. Further, for any given sequence of code points there is exactly one sequence of bytes that is the UTF-8 encoding of those code points. In contrast, with the FSR there are as many as three different sequences of bytes that encode a sequence of code points, with one of them (the shortest) being canonical. That's what makes it flexible. Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Saturday, July 13, 2013 1:13:47 AM UTC+2, Michael Torrie wrote: On 07/12/2013 09:59 AM, Joshua Landau wrote: If you're interested, the basics of it are that strings now use a variable number of bytes to encode their values depending on whether values outside of the ASCII range and some other range are used, as an optimisation. Variable number of bytes is a problematic way to say it. UTF-8 is a variable-number-of-bytes encoding scheme where each character can be 1 to 4 bytes, depending on the unicode character. As you can imagine, this sort of encoding scheme would be very slow to do slicing with (looking up a character at a certain position). Python uses fixed-width encoding schemes, so they preserve the O(1) lookup speeds, but python will use 1, 2, or 4 bytes per character in the string, depending on what is needed. Just in case the OP might have misunderstood what you are saying. jmf sees the case where a string is promoted from one width to another, and thinks that the brief slowdown in string operations to accomplish this is a problem. In reality I have never seen anyone use the types of string operations his pseudo benchmarks use, and in general Python 3's string behavior is pretty fast. And apparently much more correct than if jmf's ideas of unicode were implemented. -- Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, Jul 24, 2013 at 11:40 PM, wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf, have you ever written an editor with *any* string representation? Are you speaking from any level of experience at all? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t On Wed, Jul 24, 2013 at 9:48 AM, Chris Angelico ros...@gmail.com wrote: On Wed, Jul 24, 2013 at 11:40 PM, wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf, have you ever written an editor with *any* string representation? Are you speaking from any level of experience at all? ChrisA -- http://mail.python.org/mailman/listinfo/python-list -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes it more usable in relation to how you learned a preferable gui kit. On Wed, Jul 24, 2013 at 10:17 AM, David Hutto dwightdhu...@gmail.comwrote: I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t On Wed, Jul 24, 2013 at 9:48 AM, Chris Angelico ros...@gmail.com wrote: On Wed, Jul 24, 2013 at 11:40 PM, wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf, have you ever written an editor with *any* string representation? Are you speaking from any level of experience at all? ChrisA -- http://mail.python.org/mailman/listinfo/python-list -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 12:17 AM, David Hutto dwightdhu...@gmail.com wrote: I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t jmf's point is more about writing the editor widget (Scintilla, as opposed to SciTE), which most people will never bother to do. I've written several text editors, always by embedding someone else's widget, and therefore not concerning myself with its internal string representation. Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; whenever text gets inserted, the string gets split at that point, and a new string created for the insert (which also means that an Undo operation simply removes one entire string). In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. Not that that bothers jmf... ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote: Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Really? Enlighten me. Personally, I would never use UTF as a representation *in memory* for a unicode string if it were up to me. Why? Because UTF characters are not uniform in byte width so accessing positions within the string is terribly slow and has to always be done by starting at the beginning of the string. That's at minimum O(n) compared to FSR's O(1). Surely you understand this. Do you dispute this fact? UTF is a great choice for interchange, though, and indeed that's what it was designed for. Are you calling for UTF to be adopted as the internal, in-memory representation of unicode? Or would you simply settle for UCS-4? Please be clear here. What are you saying? Short example. Writing an editor with something like the FSR is simply impossible (properly). How? FSR is just an implementation detail. It could be UCS-4 and it would also work. -- http://mail.python.org/mailman/listinfo/python-list
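Michael's O(1)-versus-O(n) claim can be checked empirically: under a fixed-width layout, fetching the last character of a long string costs the same as fetching the first. A rough timing sketch (CPython; absolute numbers are machine-dependent, their near-equality is the point):

import timeit

setup = "s = 'é' * 10000000"
print(timeit.timeit('s[0]', setup=setup, number=1000000))
print(timeit.timeit('s[-1]', setup=setup, number=1000000))
# the two times match closely: position lookup is arithmetic, not a scan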
Re: RE Module Performance
On 07/24/2013 08:34 AM, Chris Angelico wrote: Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; whenever text gets inserted, the string gets split at that point, and a new string created for the insert (which also means that an Undo operation simply removes one entire string). In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. Very good point. Seems like this is exactly what is tripping up jmf in general. His pseudo benchmarks are bogus for this exact reason. No one uses python strings in this fashion. Editors certainly would not. But then again his argument in the past does not mention editors. But it makes me wonder if jmf is using python strings appropriately, or even realizes they are immutable. But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in pros and cons, and the cons of using UCS-2 (the old narrow builds) are well known. UCS-2 simply cannot represent all of unicode correctly. This is in the PEP of course. His most recent argument that Python should use UTF as a representation is very strange to be honest. The cons of UTF are apparent and widely known. The main con is that UTF strings are O(n) for indexing a position within the string. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 12:47 AM, Michael Torrie torr...@gmail.com wrote: On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote: Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Really? Enlighten me. Personally, I would never use UTF as a representation *in memory* for a unicode string if it were up to me. Why? Because UTF characters are not uniform in byte width so accessing positions within the string is terribly slow and has to always be done by starting at the beginning of the string. That's at minimum O(n) compared to FSR's O(1). Surely you understand this. Do you dispute this fact? Take care here; UTF is a general term for Unicode Transformation Formats, of which one (UTF-32) is fixed-width. Every other UTF-n is variable width, though, so your point still stands. UTF-32 is the basis for Python's FSR. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 7/24/2013 11:00 AM, Michael Torrie wrote: On 07/24/2013 08:34 AM, Chris Angelico wrote: Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; I used exactly this, a list of strings, for a Python-coded text-only mock editor to replace the tk Text widget in idle tests. It works fine for the purpose. For small test texts, the inefficiency of immutable strings is not relevant. Tk apparently uses a C-coded btree rather than a Python list. All details are hidden, unless one finds and reads the source ;-), but it uses C arrays rather than Python strings. In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. For my purpose, the mock Text works the same in 2.7 and 3.3+. Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in pros and cons, They both have the pro that indexing is direct *and correct*. The cons are different. and the cons of using UCS-2 (the old narrow builds) are well known. UCS-2 simply cannot represent all of unicode correctly. Python's narrow builds, at least for several releases, were in between UCS-2 and UTF-16 in that they used surrogates to represent all of Unicode but did not correct indexing for the presence of astral chars. This is a nuisance for those who do use astral chars, such as emotes and CJK name chars, on an everyday basis. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
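Terry's uncorrected-indexing nuisance can be simulated on a modern wide build by indexing 2-byte code units directly, which is effectively what a narrow build did (Python 3.4+ for surrogatepass with the UTF-16 codec; narrow_index is a hypothetical stand-in, not real narrow-build code):

def narrow_index(s, i):
    # index 2-byte code units, the way a narrow build indexed strings
    data = s.encode('utf-16-le')
    return data[2 * i:2 * i + 2].decode('utf-16-le', 'surrogatepass')

s = '\U0001D11E!'
print(repr(s[1]))                # '!': a wide build counts code points
print(repr(narrow_index(s, 1)))  # '\udd1e': the narrow view saw a lone surrogate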