On 23/08/2012 19:33, wxjmfa...@gmail.com wrote:
On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
wxjmfa...@gmail.com:

Small illustration. Take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization efforts destroyed.

>>> sys.getsizeof('a' * 80 * 50)
4025
>>> sys.getsizeof('a' * 80 * 50 + '•')
8040

     This example is still benefiting from shrinking the number of bytes
in half compared with the 32 bits per character used in Python 3.2:

>>> sys.getsizeof('a' * 80 * 50)
16032
>>> sys.getsizeof('a' * 80 * 50 + '•')
16036
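
For anyone who wants to reproduce this, here is a minimal sketch, assuming
CPython 3.3+ with PEP 393 flexible string storage; the exact byte counts
printed will differ by interpreter version and build:

import sys

# 4000 code points that all fit in Latin-1, so PEP 393 stores one byte each.
ascii_page = 'a' * 80 * 50
# Appending a single BULLET (U+2022) pushes the whole string to two bytes
# per character.
mixed_page = ascii_page + '\u2022'

for label, s in [('ascii only ', ascii_page), ('plus bullet', mixed_page)]:
    print(label, len(s), 'chars ->', sys.getsizeof(s), 'bytes')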

Correct, but how many times does it happen?
Practically never.

In all this unicode work, I'm fascinated by the obsession
with solving a problem which is, by the very nature of
Unicode, unsolvable.

For every optimization algorithm, and for every code
point range you optimize for, it is always possible
to find a case that breaks that optimization.
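
As an illustration of that claim, here is a sketch under the same PEP 393
assumptions (byte counts are again build-dependent): each time the highest
code point in a string crosses a width boundary, the whole string is stored
more widely.

import sys

base = 'a' * 1000
cases = [
    ('latin-1 only     ', base),                 # 1 byte per character
    ('+ one BMP char   ', base + '\u2022'),      # 2 bytes per character
    ('+ one astral char', base + '\U0001F600'),  # 4 bytes per character
]
for label, s in cases:
    print(label, sys.getsizeof(s), 'bytes for', len(s), 'chars')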

This roughly follows mathematical logic: to prove that a
law is valid, you have to prove that all cases
are valid; to prove that a law is invalid, it is enough
to find one case that breaks it.

Sure, it is possible to optimize unicode usage
by not using French characters, punctuation, mathematical
symbols, currency symbols, CJK characters...
(select your undesired characters here: http://www.unicode.org/charts/).

In that case, why use unicode at all?
(A problem not specific to Python.)

jmf


What do you propose should be used instead, as you appear to be the resident expert in the field?

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list
