On 2008-11-23 14:34, Mattias Gaertner wrote:
On Sun, 23 Nov 2008 14:11:50 +0200
listmember <[EMAIL PROTECTED]> wrote:

That leaves me wondering how much we lose performance-wise in
endlessly decompressing UTF-8 data, instead of using, say, UCS-4
strings.

I'm wondering what you mean by 'endlessly decompressing UTF-8
data'.

I am referring to going to the nth character in a string. With UTF-8 it is no longer simple arithmetic plus an index operation. You have to start at byte zero and iterate until you reach your character, at every step working out whether the current code point is 1, 2, 3 or 4 bytes long. That iteration is what I mean by decompression.
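
To make that concrete, here is a minimal sketch in Free Pascal of such a walk (Utf8SeqLen and Utf8CodePointIndex are hypothetical names for illustration, not existing RTL routines):

program Utf8Index;
{$mode objfpc}{$H+}

{ Length in bytes of a UTF-8 sequence, decoded from its lead byte. }
function Utf8SeqLen(Lead: Byte): Integer;
begin
  case Lead of
    $00..$7F: Result := 1;   // plain ASCII
    $C0..$DF: Result := 2;   // 2-byte sequence
    $E0..$EF: Result := 3;   // 3-byte sequence
    $F0..$F7: Result := 4;   // 4-byte sequence
  else
    Result := 1;             // invalid lead byte: skip one byte
  end;
end;

{ Byte index (1-based) of the Nth code point (0-based), or 0 if the
  string has fewer than N+1 code points. Note this is O(n): every
  lookup walks the string from the start. }
function Utf8CodePointIndex(const S: AnsiString; N: Integer): Integer;
var
  I: Integer;
begin
  I := 1;
  while (N > 0) and (I <= Length(S)) do
  begin
    Inc(I, Utf8SeqLen(Ord(S[I])));
    Dec(N);
  end;
  if I <= Length(S) then
    Result := I
  else
    Result := 0;
end;

begin
  { 'h' #$C3 #$A9 'l' 'l' 'o' is "héllo": the é takes two bytes, so
    code point 2 (the first 'l') starts at byte 4, not byte 3. }
  WriteLn(Utf8CodePointIndex('h' + #$C3#$A9 + 'llo', 2));  // prints 4
end.

With a fixed-width UCS-4 string, the same lookup is a single index operation.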

You have to make a compromise between memory, ease of use and
compatibility. There is no solution without drawbacks.

If you want to process large 8-bit text files, then UTF-8 is better.
If you want to paint glyphs, then normalized UTF-32 is better.
If you want Unicode with some memory overhead, reasonably easy usage,
and compiler support for some compatibility, then UTF-16 is better.

Do we have to think in terms of encodings (which are, after all, ways of compressing text) when what we actually mean is 1-byte, 2-byte and 4-byte per-char strings?
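
For illustration, a small sketch of those three widths, assuming the RTL conversion helpers UTF8Decode and WideStringToUCS4String; the same five code points cost a different number of bytes in each representation:

program EncodingWidths;
{$mode objfpc}{$H+}
var
  U8: AnsiString;     // 1 byte per code unit (UTF-8)
  U16: WideString;    // 2 bytes per code unit (UTF-16)
  U32: UCS4String;    // 4 bytes per code unit (UTF-32), zero-terminated
begin
  U8  := 'h' + #$C3#$A9 + 'llo';       // "héllo" as raw UTF-8 bytes
  U16 := UTF8Decode(U8);
  U32 := WideStringToUCS4String(U16);
  WriteLn('UTF-8 : ', Length(U8), ' bytes');                           // 6
  WriteLn('UTF-16: ', Length(U16) * SizeOf(WideChar), ' bytes');       // 10
  WriteLn('UTF-32: ', (Length(U32) - 1) * SizeOf(UCS4Char), ' bytes'); // 20
end.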