> From:Mark H Weaver <m...@netris.org> > > Mike Gran <spk...@yahoo.com> writes: > > We do, in a matter of speaking, have a single string representation: > > UTF-32. The 'narrow' encoding is UTF-32 with the initial 3 bytes > of > > zero removed. > > Despite the similarity of these two representations, they are > sufficiently different that they cannot be handled by the same machine > code. That means you must either implement multiple inner loops, one > for each combination of string parameter representations, or else you > must dispatch on the string representation within the inner loop. On > modern architectures, wrongly predicted conditional branches are very > expensive.
Keep in mind that the UTF-8 forward iterator operation has conditional branches. Merely the act of advancing from one character to another could take one of four paths, or more if you include the possibility of invalid UTF-8 sequences. Also, you need to wrap what libunistring does into your calculations of what is expensive. > > > I actually at one point had a nearly complete version of Guile 1.8 > > that used UTF-8 and another that used UTF-32. There are some > > other reasons why UTF-8 is bad, which I could bore you with > > ad naseum. > > Can you please tell me why UTF-8 is bad, or point me to something that > explains it? Everything I have found suggests that UTF-8 is very good. OK Well, we covered O(1) vs O(n). To make UTF-8 O(1), you need to store additional indexing information of some sort. There are various schemes, but, depending the the scheme, you lose some of memory advantage of UTF-8 vs UTF-32. You can likely to better than UTF-32, though. Next, you have the question of string character validation. For example, in Latin-1, all 8-bit values are valid characters. In UTF-32, not all 32-bit values are valid codepoints: anything over 10FFFF or in the surrogate region D800 to D8FF is invalid. Likewise, in UTF-8, you can have invalid sequences. You can also have valid UTF-8 sequences that point to the invalid codepoints in the surrogate region. In our current scheme, we test the validity the codepoints in UTF-32 strings when the strings are created. So, one of the supposed benefits of UTF-8 strings for Guile is that you can more easily pass information back and forth between Guile and external libraries or programs. But in that case, when UTF-8 strings are received by a UTF-8-enabled Guile from an external source, you have to deal with both its O(1)-ness, the validity of its UTF-8 sequences, and the validity of the codepoints that the UTF-8 sequences decode to (if the string is passed back from program or library that you don't completely trust.) So the question becomes *when* you'd deal with those questions. The best policy would be to deal with them at the point a UTF-8 string enters your application. When a string enters your application, you should scrub it for invalid sequences and codepoints, throwing errors or replacing bad characters with the Unicode replacement character and whatnot. If you don't deal with O(1) indexing and invalid sequences at the point a UTF-8 string enters the system, you will have a couple of interesting effects. First, UTF-8 iterators that have to deal with the possibility of invalid sequences are less efficient that those that can trust the validity of a sequence, because of the extra code involved. And then what do you do when you find an invalid sequence? If you throw an error, will there the possibility of non-local exits at any time there is a string iteration? (You can't replace bad sequences with the the Unicode replacement character on the fly as bad sequences are discovered during iteration, because the string could be read-only or be being used by another thread.) And then there is a code-correctness argument against UTF-8. It is just too easy to confuse a string's codepoint count with its memory size, and it is just too easy to confuse iteration over bytes with iteration over characters. I know that the Guile maintainers are geniuses, but, they are still human. (I don't have any real data for this, but anecdotally it seems that languages that have gone the ISO-8859-n to 8-bit clean to UTF-8 upgrade route seem to have taken longer to get up to speed.) But this argument, for Guile, is now largely moot, since much of the string access is through encoding-agnostic accessor functions. None of this is insurmountable, of course; all of this could be overcome, if the costs outweighed the benefits. -Mike