From: "Theo" <[EMAIL PROTECTED]>
From: Asmus Freytag <[EMAIL PROTECTED]>
So, despite being case-insensitive over UTF-8, it was still blazingly fast. (One person reported counting words at 1 MB/second of pure text, from within a mixed BASIC/C environment.) Keep in mind that the counter must search through thousands of words (every single word it has come across in the text) on every single word lookup.
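To make the idea concrete, here is a minimal Python sketch of that kind of case-insensitive word counter; this is an illustrative reconstruction, not the poster's actual code, and `count_words` is a hypothetical name.

```python
import re
from collections import Counter

def count_words(text: str) -> Counter:
    # Split on word characters, then fold case with casefold(),
    # which does Unicode-aware case-insensitive matching.
    # Every word seen is a key in the Counter, so each lookup
    # searches the full vocabulary gathered so far.
    words = re.findall(r"\w+", text)
    return Counter(w.casefold() for w in words)

print(count_words("The cat and the CAT"))
```

A hash-table lookup per word keeps this fast even when the vocabulary grows to thousands of entries.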

Anyhow, from my experience, UTF-8 is great for speed and RAM.

Probably true for English or most Western European Latin-based languages (plus Greek and Coptic).


But for other languages that still use lots of characters in the range U+0000 to U+03FF (C0 and C1 controls, Basic Latin, Latin-1 Supplement, Latin Extended-A and -B, IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek and Coptic), UTF-8 and UTF-16 may be nearly as efficient.

For all the others, which need lots of characters outside the range U+0000 to U+03FF (Cyrillic, Armenian, Hebrew, Arabic, and all Asian, Native American, or African scripts, or even PUAs), UTF-16 is better (more compact in memory, and therefore faster).

UTF-32 will be better only for historic texts written almost entirely with characters outside the BMP (for now, only Old Italic, Gothic, Ugaritic, Deseret, Shavian, Osmanya, and Cypriot Syllabary), and only if C0 controls (such as TAB, CR, and LF), ASCII SPACE, and NBSP are a minority.
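The size comparison above is easy to check directly. The following Python sketch measures the encoded byte count of a few sample strings under each encoding form; the sample texts are my own illustrative choices, not from this thread.

```python
# Compare encoded sizes of sample text in different script ranges.
# Little-endian variants are used so no BOM is prepended to the count.
samples = {
    "English (Basic Latin)": "Hello, world!",
    "Greek": "Καλημέρα",
    "Cyrillic": "Привет",
    "CJK": "こんにちは世界",
    "Deseret (outside BMP)": "\U00010400\U00010401\U00010402",
}

for name, text in samples.items():
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    u32 = len(text.encode("utf-32-le"))
    print(f"{name}: UTF-8={u8} B, UTF-16={u16} B, UTF-32={u32} B")
```

Running this shows the pattern discussed: UTF-8 wins for Latin text, UTF-16 wins for East Asian scripts (2 bytes per character versus 3 in UTF-8), and for supplementary-plane characters all three forms take 4 bytes per character.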
