From: Asmus Freytag <[EMAIL PROTECTED]>

I use ... and UTF-32 for most internal processing that I write
myself. Let people say UTF-32 is wasteful if they want; I don't tend to
store huge amounts of text in memory at once, so the overhead matters
much less than having one code unit per character.


For performance-critical applications, on the other hand, you need to use
whichever UTF gives you the right balance of speed and average storage
size for your data.


If you have very large amounts of data, you'll be sensitive to cache
overruns, enough so that UTF-32 may be disqualified from the start.
I have encountered systems for which that was true.

For both of these, I'd recommend UTF-8. It's compact (especially when parsing source code, which is mostly ASCII even if it contains other languages), and it's fast to process: just use the byte processing functions.
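
For instance, here is a minimal sketch (mine, not from the post) of a byte-level word counter in C that works on UTF-8 unchanged. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so a plain byte comparison against an ASCII delimiter can never split a character:

    #include <stdio.h>
    #include <string.h>

    /* Count words in a UTF-8 buffer by scanning raw bytes.
     * Safe because bytes < 0x80 never occur inside a multi-byte
     * UTF-8 sequence, so comparing against ASCII delimiters
     * cannot split a character. */
    static size_t count_words(const unsigned char *s, size_t len)
    {
        size_t words = 0;
        int in_word = 0;
        for (size_t i = 0; i < len; i++) {
            /* ASCII whitespace delimits; every other byte,
             * including all bytes >= 0x80, is word material. */
            int delim = (s[i] == ' '  || s[i] == '\t' ||
                         s[i] == '\n' || s[i] == '\r');
            if (!delim && !in_word)
                words++;
            in_word = !delim;
        }
        return words;
    }

    int main(void)
    {
        const char *text = "caf\xC3\xA9 au lait"; /* "café au lait" in UTF-8 */
        printf("%zu words\n",
               count_words((const unsigned char *)text, strlen(text)));
        return 0;
    }

The same property is what makes ordinary byte-oriented string functions (strchr, or strtok with ASCII delimiters, and so on) safe to use on UTF-8.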


I've done natural-language word-processing functions on UTF-8 as well, and it's damn fast even there, despite being case-insensitive!

My test was word counting, to measure word frequencies. All strings were entered as UTF-8 into a special "scanner" which I invented, in both their uppercase and lowercase forms, so the scanner ended up holding both case variants of every word in UTF-8.

The scanner itself, however, only does byte-level (case-sensitive) searching.

So, despite being case-insensitive over UTF-8, it was totally blazingly fast. (One person reported counting words at 1 MB/second of pure text, from within a mixed Basic / C environment.) Keep in mind that the counter must search through thousands of words (every single word it has come across in the text so far) on every single word lookup.
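
To make the trick concrete, here is a toy reconstruction of such a scanner in C (the hash table, the names, and the ASCII-only case folding are my assumptions, not the original implementation): each word is registered under both its lowercase and uppercase forms, with the two entries sharing one counter, so lookups stay byte-exact while matching behaves case-insensitively.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Toy "dual-case entry" table: every word is stored under both
     * its lowercase and uppercase forms, and both entries share one
     * counter.  Lookup is a plain case-sensitive byte comparison on
     * the UTF-8 bytes, yet matching is case-insensitive for ASCII. */

    #define NBUCKETS 4096

    struct entry {
        char *key;           /* exact UTF-8 bytes to match */
        long *count;         /* counter shared by both case variants */
        struct entry *next;
    };

    static struct entry *buckets[NBUCKETS];

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Byte-exact search, like the scanner described above. */
    static long *lookup(const char *word)
    {
        for (struct entry *e = buckets[hash(word)]; e; e = e->next)
            if (strcmp(e->key, word) == 0)
                return e->count;
        return NULL;
    }

    static void add_variant(const char *key, long *count)
    {
        struct entry *e = malloc(sizeof *e);
        e->key = malloc(strlen(key) + 1);
        strcpy(e->key, key);
        e->count = count;
        e->next = buckets[hash(key)];
        buckets[hash(key)] = e;
    }

    /* Register a word under both ASCII case variants; non-ASCII
     * bytes pass through untouched (ASCII-only folding). */
    static void register_word(const char *word)
    {
        char lo[256], up[256];
        long *count = calloc(1, sizeof *count);
        size_t i;
        for (i = 0; word[i] && i < 255; i++) {
            unsigned char c = (unsigned char)word[i];
            lo[i] = (c >= 'A' && c <= 'Z') ? (char)(c + 32) : (char)c;
            up[i] = (c >= 'a' && c <= 'z') ? (char)(c - 32) : (char)c;
        }
        lo[i] = up[i] = '\0';
        add_variant(lo, count);
        if (strcmp(lo, up) != 0)
            add_variant(up, count);
    }

    int main(void)
    {
        register_word("hello");
        *lookup("HELLO") += 1;  /* hits the uppercase entry */
        *lookup("hello") += 1;  /* hits the lowercase entry */
        printf("hello: %ld\n", *lookup("hello"));  /* prints 2 */
        return 0;
    }

Per the description above, only the all-lowercase and all-uppercase variants are registered, so a mixed-case spelling like "Hello" would need its own entry (or normalization before lookup).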

Anyhow, in my experience, UTF-8 is great for both speed and RAM.



