On 23/08/2012 19:33, wxjmfa...@gmail.com wrote:
On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
wxjmfa...@gmail.com:

Small illustration. Take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization efforts destroyed.

>>> sys.getsizeof('a' * 80 * 50)
4025
>>> sys.getsizeof('a' * 80 * 50 + '•')
8040

     This example is still benefiting from shrinking the number of bytes
in half compared with the 32 bits per character used in Python 3.2:

>>> sys.getsizeof('a' * 80 * 50)
16032
>>> sys.getsizeof('a' * 80 * 50 + '•')
16036
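
For anyone who wants to reproduce this, here is a minimal sketch, assuming
CPython 3.3+ with PEP 393 flexible string storage; the exact byte counts
printed will differ by interpreter version and build:

import sys

# 4000 code points that all fit in Latin-1, so PEP 393 stores one byte each.
ascii_page = 'a' * 80 * 50
# Appending a single BULLET (U+2022) pushes the whole string to two bytes
# per character.
mixed_page = ascii_page + '\u2022'

for label, s in [('ascii only ', ascii_page), ('plus bullet', mixed_page)]:
    print(label, len(s), 'chars ->', sys.getsizeof(s), 'bytes')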

Correct, but how many times does it happen?
Practically never.

In all this unicode work, I'm fascinated by the obsession
with solving a problem which is, by the very nature of
Unicode, unsolvable.

For every optimization algorithm, and for every code
point range you optimize for, it is always possible
to find a case that breaks that optimization.
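
As an illustration of that claim, here is a sketch under the same PEP 393
assumptions (byte counts are again build-dependent): each time the highest
code point in a string crosses a width boundary, the whole string is stored
more widely.

import sys

base = 'a' * 1000
cases = [
    ('latin-1 only     ', base),                 # 1 byte per character
    ('+ one BMP char   ', base + '\u2022'),      # 2 bytes per character
    ('+ one astral char', base + '\U0001F600'),  # 4 bytes per character
]
for label, s in cases:
    print(label, sys.getsizeof(s), 'bytes for', len(s), 'chars')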

This roughly follows mathematical logic: to prove that a
law is valid, you have to prove that all cases
are valid; to prove that a law is invalid, it is enough
to find one case that breaks it.

Sure, it is possible to optimize unicode usage
by not using French characters, punctuation, mathematical
symbols, currency symbols, CJK characters...
(select your undesired characters here: http://www.unicode.org/charts/).

In that case, why use unicode at all?
(A problem not specific to Python.)

jmf


What do you propose should be used instead, as you appear to be the resident expert in the field?

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list
