If only MS Word was coded this well (was Re: Nicest UTF)

Theodore H. Smith Tue, 07 Dec 2004 14:56:10 -0800

From: "D. Starner" <[EMAIL PROTECTED]>

(Sorry for sending this twice, Marcin.)

"Marcin 'Qrczak' Kowalczyk" writes:

UTF-8 is poorly suitable for internal processing of strings in a
modern programming language (i.e. one which doesn't already have a
pile of legacy functions working on bytes, but which can be designed
to make Unicode convenient at all). It's because code points have
variable lengths in bytes, so extracting individual characters is
almost meaningless

Same with UTF-16 and UTF-32. A character can be multiple code points, remember? (Decomposed characters?)

(unless you care only about the ASCII subset, and
sequences of all other characters are treated as non-interpreted bags
of bytes).

Nope. I've done tons of UTF-8 string processing. I've even done a case-insensitive word-frequency measuring algorithm on UTF-8. It runs blastingly fast, because I can do the processing with bytes.

It just requires you to understand the actual logic of UTF-8 well enough to know that you can treat it as bytes, most of the time.
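To make the "treat it as bytes" point concrete, here is a minimal Python sketch. The sample text and the function names are made up for illustration. It relies on two properties of UTF-8: bytes below 0x80 never occur inside a multi-byte sequence, and lead bytes and continuation bytes use disjoint ranges, so a byte-level match cannot begin in the middle of a character.

```python
# Hypothetical helpers, for illustration only.

def split_on_ascii(data: bytes, delimiter: bytes) -> list[bytes]:
    """Split UTF-8 data on an ASCII delimiter without decoding it.

    Safe because ASCII bytes (< 0x80) never appear inside a
    multi-byte UTF-8 sequence.
    """
    return data.split(delimiter)


def find_bytes(data: bytes, needle: str) -> int:
    """Return the byte offset of `needle` in UTF-8 `data`, or -1.

    A plain byte search is enough: a valid UTF-8 needle can only
    match at a character boundary of valid UTF-8 data.
    """
    return data.find(needle.encode("utf-8"))


# Made-up sample text mixing ASCII and multi-byte characters.
text = "naïve café,日本語 text,Zürich"
data = text.encode("utf-8")

fields = split_on_ascii(data, b",")
offset = find_bytes(data, "日本語")
```

Every field still decodes cleanly on its own, and the offset is a byte offset, not a character index.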

And the times you can't treat it as bytes, usually you can't even treat UTF-32 as bytes!
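A small Python sketch of why fixed-width code units don't rescue you here, assuming the problem case is a decomposed character (the helper name `code_units` is made up for illustration). The same visible "é" is one code point when precomposed and two when decomposed, so even in UTF-32 one unit is not always one character.

```python
import unicodedata

composed = "\u00e9"                                  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0301


def code_units(s: str, encoding: str) -> int:
    """Count the code units of `s` in the given encoding (hypothetical helper)."""
    width = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(s.encode(encoding)) // width


# Code-unit counts per encoding, for both forms of the same character.
composed_units = {enc: code_units(composed, enc)
                  for enc in ("utf-8", "utf-16-le", "utf-32-le")}
decomposed_units = {enc: code_units(decomposed, enc)
                    for enc in ("utf-8", "utf-16-le", "utf-32-le")}
```

In all three encodings the decomposed form takes more than one unit, so "extract the Nth character" needs more than indexing whichever UTF you pick.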

If you are talking about creating an editfield or text control or something, it is true that UTF-32 is better. However, UTF-16 is the worst of all cases; you'd be better off using UTF-8 as the native encoding of an editfield.

The thing is, very very very few people write editfields.

I've seen tons of XML parsers in my lifetime (at least 3 I wrote myself), but only a few editfield libraries.

It's a shame that very few people understand the different UTFs properly.

As for isspace... sure, there are spaces that take more than one byte in UTF-8.

My case-insensitive UTF-8 word frequency counter (which runs blastingly fast), however, didn't find this to be any problem. It dealt with all sorts of multi-byte word breaks :o)

It appears to run at about 3MB/second on my laptop, which involves, for every word, checking it against the entire collection of words seen so far.
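For illustration, a byte-level word frequency counter can be sketched in a few lines of Python. This is not the counter described above, just a sketch under stated assumptions: the separator list is a small made-up sample (ASCII whitespace and punctuation plus two multi-byte spaces, U+00A0 and U+3000), case folding is ASCII-only, and a hash-based `Counter` stands in for whatever lookup structure the real code uses.

```python
import re
from collections import Counter

# Word breaks expressed as byte patterns. The multi-byte alternatives are
# the UTF-8 encodings of U+00A0 (C2 A0) and U+3000 (E3 80 80).
# Illustrative list only, not an exhaustive set of Unicode word breaks.
SEPARATORS = re.compile(rb'(?:[\x00-\x20.,;:!?"()]|\xc2\xa0|\xe3\x80\x80)+')


def word_frequencies(data: bytes) -> Counter:
    """Count words in UTF-8 `data` without ever decoding it.

    bytes.lower() folds only ASCII letters, so this is ASCII-only case
    insensitivity; folding non-ASCII letters such as "É" would need extra
    handling (an assumption of this sketch, not a claim about the original).
    """
    folded = data.lower()
    words = (w for w in SEPARATORS.split(folded) if w)
    return Counter(words)


# Made-up sample with a no-break space and an ideographic space as breaks.
sample = "Word word\u00a0WORD\u3000café Café, naïve".encode("utf-8")
counts = word_frequencies(sample)
```

The keys of `counts` are byte strings; decode them only when printing results.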

That's like having MS Word spell-check 3MB of pure Unicode text (no style junk bloating up the file size) in one second. (The words would all be spelt correctly though, so as not to require expensive RAM copying when doing the replacements.)

Yes, I do know how to code ;o)

Too bad so few others do.

--
   Theodore H. Smith - Software Developer - www.elfdata.com/plugin/
   Industrial strength string processing code, made easy.
   (If you believe that's an oxymoron, see for yourself.)
