On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right, you propose to define a string as a header that denotes a set of windows in code space? I still fail to see how that would scale; see below.
Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
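To make the idea concrete, here is a minimal sketch of what such a header might look like. All names and the exact layout are illustrative assumptions, not a definitive format: one ID byte per language used, plus index pairs marking each run of pure single-language characters.

```python
# Hypothetical sketch of the proposed header format (names are illustrative).
# Header: one byte per language used in the string, plus (start, end) byte
# offsets for each run of single-language characters.
# Body: the runs' single-byte-encoded payloads, concatenated.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EncodedString:
    languages: List[int]               # one ID byte per language in the string
    runs: List[Tuple[int, int, int]]   # (language_index, start, end) per run
    payload: bytes                     # concatenated single-byte characters

def substring_runs(s: EncodedString):
    """Yield (language_id, bytes) for each pure single-language run."""
    for lang_idx, start, end in s.runs:
        yield s.languages[lang_idx], s.payload[start:end]
```

With this layout, an algorithm can walk the run table in the header and dispatch per language without decoding individual characters.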

Mm... strictly speaking (let's turn that argument backwards) - what are algorithms that require slicing say [5..$] of string without ever looking at it left to right, searching etc.?
Don't know, I was just pointing out that all the claims of easy slicing with UTF-8 are wrong. But a single-byte encoding would also be scanned much faster: as I've noted above, no decoding is necessary, and single bytes will always be faster to process than multiple bytes, even when the multi-byte data isn't being decoded.
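The slicing point can be illustrated with a small sketch (function names are mine, for illustration): slicing UTF-8 by character index requires scanning for lead bytes, while in a fixed single-byte encoding the character index is the byte index.

```python
# Sketch: slicing by character index.
# UTF-8: the byte offset of character i must be found by scanning,
# because characters occupy 1-4 bytes.
# Single-byte encoding: character index == byte index, so slicing is direct.

def utf8_char_slice(data: bytes, start: int, stop: int) -> bytes:
    """Slice UTF-8 by character index: must scan for lead bytes."""
    # Lead bytes are everything except 0b10xxxxxx continuation bytes.
    idxs = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    idxs.append(len(data))
    return data[idxs[start]:idxs[stop]]

def single_byte_char_slice(data: bytes, start: int, stop: int) -> bytes:
    """Single-byte encoding: no scan, no decode."""
    return data[start:stop]
```

The UTF-8 version is linear in the length of the prefix; the single-byte version is constant time.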

What would it look like? Or how would the processing go?
Detailed a bit above. As I mentioned earlier in this thread, functions like toUpper would execute much faster because you wouldn't have to scan substrings containing languages that don't have uppercase, which you have to scan in UTF-8.

long before you mentioned this Unicode compression scheme.

It uses inline headers, or rather tags, that hop between fixed char windows. It's not random access, nor does it claim to be.
I wasn't criticizing it, just saying that it seems to be superficially similar to my scheme. :)

version of my single-byte encoding scheme! You do raise a good point: the only reason why we're likely using such a bad encoding in UTF-8 is
that nobody else wants to tackle this hairy problem.

Yup, where have you been say almost 10 years ago? :)
I was in grad school, avoiding writing my thesis. :) I'd never have thought I'd be discussing Unicode today, didn't even know what it was back then.

Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above. toUpper is a no-op for a single-byte-encoded string with an Asian script; you can't do that with a UTF-8 string.

But you have to check what encoding it's in, and given that not all codepages are that simple to uppercase, some generic algorithm is required.
You have to check the language, but my point is that you can look at the header and know that toUpper has to do nothing for a single-byte-encoded string of an Asian script which doesn't have uppercase characters. With UTF-8, you have to decode the entire string to find that out.
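A rough sketch of that argument (the language IDs and run layout here are illustrative assumptions): with per-run language tags in the header, toUpper can skip whole runs whose script has no case distinction, rather than decoding every code unit the way a UTF-8 toUpper must.

```python
# Sketch: toUpper that consults per-run language tags instead of decoding.
# 0x10 stands in for a hypothetical caseless script ID (e.g. a CJK script).

CASELESS = {0x10}

def to_upper(runs):
    """runs: list of (language_id, bytes). Returns uppercased runs."""
    out = []
    for lang, data in runs:
        if lang in CASELESS:
            out.append((lang, data))          # no-op: whole run skipped
        else:
            out.append((lang, data.upper()))  # per-language case mapping
    return out
```

A string that is entirely in a caseless script never touches its payload at all; a UTF-8 toUpper has to walk the whole string to discover the same thing.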


They may seem superficially similar but they're not. For example, from the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution. I don't think code
pages provided that.

The problem is how you would define an uppercase algorithm for a multilingual string with 3 distinct 256-codepoint codespaces (windows). I bet it won't be pretty.
How is it done now? It isn't pretty with UTF-8 now either, as some languages have uppercase characters and others don't. The version of toUpper for my encoding will be similar, but it will do less work, because it doesn't have to be invoked for every character in the string.
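One plausible shape for such an algorithm, sketched under the assumption that each window gets its own 256-entry case-mapping table (all the tables below are illustrative, not real codepage data):

```python
# Sketch: per-window uppercase via lookup tables.
# Each language window gets a 256-entry table; windows whose script has no
# case use the identity table and are skipped without touching the payload.

IDENTITY = bytes(range(256))

# Illustrative table: ASCII-style a-z -> A-Z mapping for window 0.
ASCII_UPPER = bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in range(256))

UPPER_TABLES = {0: ASCII_UPPER, 2: IDENTITY}  # window ID -> case table

def to_upper_multi(runs):
    """runs: list of (window_id, bytes); applies each window's table per run."""
    return [(w, data if UPPER_TABLES[w] is IDENTITY
                 else data.translate(UPPER_TABLES[w]))
            for w, data in runs]
```

The per-run dispatch is one table lookup per run, not per character, which is where the "less work" claim comes from.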

I still don't see how your solution scales to beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ).
I assume you're talking about Chinese, Korean, etc. alphabets? I mentioned those to Walter earlier, they would have a two-byte encoding. No way around that, but they would still be easier to deal with than UTF-8, because of the header.
