Steven D'Aprano:

> So while you might save memory by using "UTF-24" instead of UTF-32, it
> would probably be slower because you would have to grab three bytes at a
> time instead of four, and the hardware probably does not directly support
> that.

Low-level string manipulation often deals with blocks larger than an individual character for speed: generally 32 or 64 bits at a time using the CPU, or 128 or 256 bits using the vector unit. There may then be entry/exit code to handle initial alignment to a block boundary and to deal with a smaller-than-block-size tail.

For an example of this kind of thing, see find_max_char in CPython's Objects/stringlib/find_max_char.h, which can examine a char* 32 or 64 bits at a time.
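A rough sketch of that word-at-a-time idea in Python (the real find_max_char is C, works on properly aligned machine words, and comes in per-character-width variants; the structure below, including the head/tail handling, is only illustrative):

```python
# SWAR-style mask: the top bit of each byte position in a 64-bit word.
# If (word & mask) is nonzero, some byte in the block is >= 0x80.
ASCII_CHAR_MASK = 0x8080808080808080

def find_max_char(data: bytes) -> int:
    """Classify a byte buffer as pure ASCII (0x7F) or Latin-1 (0xFF)."""
    # Body: examine one 8-byte block per iteration.
    for i in range(0, len(data) - 7, 8):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & ASCII_CHAR_MASK:      # some byte has its top bit set
            return 0xFF
    # Tail: the remaining < 8 bytes, one at a time.
    for b in data[len(data) - len(data) % 8:]:
        if b >= 0x80:
            return 0xFF
    return 0x7F

print(find_max_char(b"hello world"))        # 127 - all ASCII
print(find_max_char("héllo".encode("latin-1")))  # 255 - 0xE9 present
```

The point is that one mask-and-test per 8-byte block replaces eight per-byte comparisons on the hot path.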

24-bit is likely to be a win in many circumstances due to decreased memory traffic. A 12-bit representation may also be worthwhile, as the low 0x1000 code points of Unicode contain Latin (with extensions), Greek, Cyrillic, Arabic, Hebrew, and most Indic scripts.
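To make the memory argument concrete, here is a toy "UTF-24" codec (hypothetical names; no such encoding exists in the standard library): every code point, up to the maximum U+10FFFF, fits in three little-endian bytes, saving 25% over UTF-32's four.

```python
def encode_utf24(text: str) -> bytes:
    """Pack each code point into three little-endian bytes."""
    out = bytearray()
    for ch in text:
        out += ord(ch).to_bytes(3, "little")  # U+10FFFF fits in 3 bytes
    return bytes(out)

def decode_utf24(data: bytes) -> str:
    """Unpack three bytes per code point."""
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "little"))
        for i in range(0, len(data), 3)
    )

s = "Ωmega \U0001F40D"
blob = encode_utf24(s)
assert decode_utf24(blob) == s
assert len(blob) == 3 * len(s)   # vs. 4 * len(s) for UTF-32
```

The decode loop's unaligned 3-byte loads are exactly the cost Steven points out: the saving in traffic has to beat the extra shuffling per character.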

   Neil
--
http://mail.python.org/mailman/listinfo/python-list
