Re: Why UTF-8/16 character encodings?

Peter Alexander Sat, 25 May 2013 07:20:25 -0700

On Saturday, 25 May 2013 at 13:47:42 UTC, Joakim wrote:

On Saturday, 25 May 2013 at 12:26:47 UTC, Vladimir Panteleev wrote:
On Saturday, 25 May 2013 at 11:07:54 UTC, Joakim wrote:
If you want to split a string by ASCII whitespace (newlines, tabs and spaces), it makes no difference whether the string is in ASCII or UTF-8 - the code will behave correctly in either case, variable-width-encodings regardless.
Except that a variable-width encoding will take longer to decode while splitting, when compared to a single-byte encoding.
No. Are you sure you understand UTF-8 properly?
Are you sure _you_ understand it properly? Both encodings have to check every single character to test for whitespace, but the single-byte encoding simply has to load each byte in the string and compare it against the whitespace-signifying bytes, while the variable-length code has to first load and parse potentially 4 bytes before it can compare, because it has to go through the state machine that you linked to above. Obviously the constant-width encoding will be faster. Did I really need to explain this?

I suggest you read up on UTF-8. You really don't understand it. There is no need to decode; you just treat the UTF-8 string as if it were an ASCII string.

This code will count all spaces in a string whether it is encoded as ASCII or UTF-8:


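// Counts the spaces in a null-terminated string (ASCII or UTF-8).
int countSpaces(const(char)* c)
{
    int n = 0;
    // Scan one byte at a time; nothing is decoded.
    while (*c)
    {
        if (*c == ' ')
            ++n;
        ++c; // advance to the next byte, otherwise the loop never ends
    }
    return n;
}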

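I repeat: there is no need to decode. Please read up on UTF-8. You do not understand it. The reason you don't need to decode is that UTF-8 is self-synchronising: every byte of a multi-byte sequence has its high bit set, so an ASCII byte such as a space can never appear inside one.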
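As a rough sketch of what self-synchronising means at the byte level: the top bits of any byte tell you whether it is an ASCII byte, the first byte of a multi-byte sequence, or a continuation byte. The helper names below (isAscii, isContinuation, isLead, codePointStart) are made up for illustration and do not come from any library.

```d
// Hypothetical helpers for illustration only.
bool isAscii(char b)        { return (b & 0x80) == 0x00; } // 0xxxxxxx
bool isContinuation(char b) { return (b & 0xC0) == 0x80; } // 10xxxxxx
bool isLead(char b)         { return (b & 0xC0) == 0xC0; } // 11xxxxxx

// Steps back from an arbitrary byte index to the first byte of the
// code point that contains it, by skipping continuation bytes.
size_t codePointStart(const(char)[] s, size_t i)
{
    while (i > 0 && isContinuation(s[i]))
        --i;
    return i;
}
```

Because the three classes never overlap, a comparison against an ASCII byte such as ' ' cannot match in the middle of a multi-byte sequence.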

The code above tests for spaces only, but it works the same when searching for any substring or single character. It is no slower than a fixed-width encoding for these operations.
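To illustrate the substring case, here is a minimal sketch of a byte-wise search, assuming both strings are valid UTF-8. The function name findBytes is hypothetical, and a real program would more likely use a library search routine; the point is only that the comparison works on raw code units without decoding.

```d
// Hypothetical helper for illustration only.
// Returns the byte index of the first occurrence of needle in haystack,
// or -1 if there is none.
ptrdiff_t findBytes(const(char)[] haystack, const(char)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        // Slice equality on char arrays compares code units (bytes);
        // no code point is decoded here.
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}
```

The returned index is a byte offset, not a character count. A match always starts on a code point boundary, since the first byte of the needle is either ASCII or a lead byte and neither can equal a continuation byte.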

Again, I urge you, please read up on UTF-8. It is very well designed.
