On 23/08/2012 13:47, wxjmfa...@gmail.com wrote:
This is neither a complaint nor a question, just a comment.

In the previous discussion related to the flexible
string representation, Roy Smith added this comment:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42

Not only do I agree with his sentence:
"Clearly, the world has moved to a 32-bit character set."

but he also used a very interesting word in his comment: "punctuation".

There is a point which is, in my mind, not very well understood,
"digested", underestimated or neglected by many developers:
the relation between the coding of the characters and the typography.

Unicode (the consortium) does not only deal with the coding of
the characters; it also works on the *classification* of characters.

A deliberately simplistic representation: "letters" in the bottom
of the table, lower code points/integers; "typographic characters"
like punctuation, common symbols, ... high in the table, high code
points/integers.

The conclusion is inescapable: if one wishes to work in a "unicode
mode", one is forced to use the whole palette of the unicode
code points. This is the *nature* of Unicode.

Technically, believing that it is possible to optimize only a subrange
of the unicode code points range is simply an illusion. A lot of
work, probably quite complicated, which finally solves nothing.

Python, in my mind, fell into this trap.

"Simple is better than complex."
   -> hard to maintain
"Flat is better than nested."
   -> code points range
"Special cases aren't special enough to break the rules."
   -> special unicode code points?
"Although practicality beats purity."
   -> or the opposite?
"In the face of ambiguity, refuse the temptation to guess."
   -> guessing a user will only work with the "optimized" char subrange.
...

Small illustration. Take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization efforts destroyed.

>>> import sys
>>> sys.getsizeof('a' * 80 * 50)
4025
>>> sys.getsizeof('a' * 80 * 50 + '•')
8040
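
The jump in size above can also be looked at per character. What follows
is only a minimal sketch, assuming CPython 3.3 or later (the flexible
string representation) and relying on sys.getsizeof, whose results are
implementation-specific. The helper name bytes_per_char is hypothetical
and not part of any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the storage used per character for a string made of `ch`.

    Hypothetical helper for illustration only. It relies on CPython's
    sys.getsizeof, so the numbers are implementation-specific.
    """
    # Compare two repeat lengths so that the fixed object header
    # cancels out and only the per-character cost remains.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# 'a' is ASCII, U+2022 is BULLET, U+1F600 lies outside the BMP.
for ch in ('a', '\u2022', '\U0001F600'):
    print('U+%04X' % ord(ch), bytes_per_char(ch))
```

On such an interpreter one would expect 1, 2 and 4 bytes per character
respectively, which matches the roughly doubled size reported above once
a single BULLET is appended to an all-ASCII string.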

Just my 2 € (code point 0x20ac) cents.

jmf


I'm looking forward to all the patches you are going to provide to correct all these (presumably) CPython deficiencies. When do they start arriving on the bug tracker?

--
Cheers.

Mark Lawrence.
