Richard

> How can we know which is in use for a given string?
> 
> Suppose I wanted to process a lot of text, so performance is critical. Using 
> bytes would be optimal, since any chunk type or even Unicode characters may 
> vary in length.
> 
> So if I wanted to create an index of byte offsets into a large chunk of text, 
> how would I know how long a character is?

Some Unicode characters, such as emojis, have to be represented by two 
16-bit code units in UTF-16 (known as a surrogate pair), so they take four 
bytes, not two. Additionally, a character with an accent will take either one 
codepoint or two, depending on whether it has been encoded in precomposed or 
decomposed form (e.g. ç can be either U+0063 U+0327 (decomposed) or U+00E7 
(precomposed)).
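
To make the sizes concrete, here is a small sketch in Python (not LiveCode, 
and it says nothing about LiveCode's internal storage). The helper name 
utf16_byte_len is my own, purely for illustration; it just encodes the text 
as UTF-16 without a byte order mark and counts the bytes.

```python
import unicodedata


def utf16_byte_len(text: str) -> int:
    """Number of bytes needed to encode text as UTF-16 (no byte order mark)."""
    return len(text.encode("utf-16-le"))


emoji = "\U0001F600"      # outside the BMP, needs a surrogate pair
precomposed = "\u00e7"    # ç as a single codepoint
decomposed = "c\u0327"    # c followed by a combining cedilla

print(utf16_byte_len("a"))          # 2
print(utf16_byte_len(emoji))        # 4
print(utf16_byte_len(precomposed))  # 2
print(utf16_byte_len(decomposed))   # 4

# Normalizing to NFC folds the decomposed form into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

If the text is normalized to NFC first, at least the precomposed/decomposed 
difference goes away for characters that have a precomposed form.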

So it isn’t easy to estimate the number of bytes in a UTF-16 string.
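
If you do need an index of byte offsets, one way is to walk the text once and 
add two or four bytes per codepoint. The following is only a sketch in Python, 
assuming the text is stored as UTF-16; the function name utf16_offsets is 
hypothetical, and it does not describe how LiveCode stores strings.

```python
def utf16_offsets(text: str):
    """Return (offsets, total): the UTF-16 byte offset of each codepoint,
    plus the total byte length of the text."""
    offsets = []
    pos = 0
    for ch in text:
        offsets.append(pos)
        # Codepoints above U+FFFF need a surrogate pair (4 bytes); others 2.
        pos += 4 if ord(ch) > 0xFFFF else 2
    return offsets, pos


offsets, total = utf16_offsets("a\U0001F600b")
print(offsets)  # [0, 2, 6]
print(total)    # 8
```

Note that the index is per codepoint, so a decomposed ç would still occupy 
two entries.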

I would guess that LiveCode will store the characters of a string in single 
bytes if all the letters of the string conform to ISO-8859-1. So if you can be 
certain that your text is all ISO-8859-1 encoded, you can estimate at 1 byte 
per character. (The guess is based on the fact that the first 256 Unicode code 
points replicate ISO-8859-1.)
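
As a rough illustration of that check, here is a Python sketch. The one byte 
per character figure is still only my guess about LiveCode; the function names 
is_latin1 and estimated_bytes are made up for this example.

```python
def is_latin1(text: str) -> bool:
    """True if every codepoint falls in U+0000..U+00FF (the ISO-8859-1 range)."""
    return all(ord(ch) <= 0xFF for ch in text)


def estimated_bytes(text: str) -> int:
    """Assume 1 byte per character for ISO-8859-1 text, otherwise count
    the bytes of the UTF-16 encoding."""
    if is_latin1(text):
        return len(text)
    return len(text.encode("utf-16-le"))


print(is_latin1("fa\u00e7ade"))         # True  (precomposed ç is U+00E7)
print(is_latin1("fac\u0327ade"))        # False (combining cedilla is U+0327)
print(estimated_bytes("fa\u00e7ade"))   # 6
print(estimated_bytes("fac\u0327ade"))  # 14
```

So the same visible word can fall on either side of the test, depending on 
how the ç was encoded.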

Regards

Peter


