On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
If I take you right, you propose to define a string as a header that denotes a set of windows in code space? I still fail to see how that would scale; see below.
Something like that. For a multi-language string encoding, the header would contain a single byte for every language used in the string, along with multiple index bytes to signify the start and finish of every run of single-language characters in the string. So, a list of languages and a list of pure single-language substrings. This is just off the top of my head, I'm not suggesting it is definitive.
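To make the idea concrete, here is a minimal sketch of what such a header might look like. All names and the exact layout are illustrative assumptions, not a definitive format: one ID byte per language used, plus index pairs marking each run of pure single-language characters.

```python
# Hypothetical sketch of the proposed header format (names are illustrative).
# Header: one byte per language used in the string, plus (start, end) byte
# offsets for each run of single-language characters.
# Body: the runs' single-byte-encoded payloads, concatenated.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EncodedString:
    languages: List[int]               # one ID byte per language in the string
    runs: List[Tuple[int, int, int]]   # (language_index, start, end) per run
    payload: bytes                     # concatenated single-byte characters

def substring_runs(s: EncodedString):
    """Yield (language_id, bytes) for each pure single-language run."""
    for lang_idx, start, end in s.runs:
        yield s.languages[lang_idx], s.payload[start:end]
```

With this layout, an algorithm can walk the run table in the header and dispatch per language without decoding individual characters.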

Mm... strictly speaking (let's turn that argument backwards) - what are algorithms that require slicing say [5..$] of string without ever looking at it left to right, searching etc.?
Don't know, I was just pointing out that all the claims of easy slicing with UTF-8 are wrong. But a single-byte encoding would also be scanned much faster: as I've noted above, no decoding is necessary, and single bytes will always be faster to process than multiple bytes, even when the multi-byte data isn't being decoded.
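The slicing point can be illustrated with a small sketch (function names are mine, for illustration): slicing UTF-8 by character index requires scanning for lead bytes, while in a fixed single-byte encoding the character index is the byte index.

```python
# Sketch: slicing by character index.
# UTF-8: the byte offset of character i must be found by scanning,
# because characters occupy 1-4 bytes.
# Single-byte encoding: character index == byte index, so slicing is direct.

def utf8_char_slice(data: bytes, start: int, stop: int) -> bytes:
    """Slice UTF-8 by character index: must scan for lead bytes."""
    # Lead bytes are everything except 0b10xxxxxx continuation bytes.
    idxs = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    idxs.append(len(data))
    return data[idxs[start]:idxs[stop]]

def single_byte_char_slice(data: bytes, start: int, stop: int) -> bytes:
    """Single-byte encoding: no scan, no decode."""
    return data[start:stop]
```

The UTF-8 version is linear in the length of the prefix; the single-byte version is constant time.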

What would it look like? Or how would the processing go?
Detailed a bit above. As I mentioned earlier in this thread, functions like toUpper would execute much faster because you wouldn't have to scan substrings containing languages that don't have uppercase, which you have to scan in UTF-8.

long before you mentioned this Unicode compression scheme.

It uses inline headers, or rather tags, that hop between fixed char windows. It's not random access, nor does it claim to be.
I wasn't criticizing it, just saying that it seems to be superficially similar to my scheme. :)

version of my single-byte encoding scheme! You do raise a good point: the only reason why we're likely using such a bad encoding in UTF-8 is
that nobody else wants to tackle this hairy problem.

Yup, where have you been say almost 10 years ago? :)
I was in grad school, avoiding writing my thesis. :) I'd never have thought I'd be discussing Unicode today, didn't even know what it was back then.

Not necessarily. But that is actually one of the advantages of single-byte encodings, as I have noted above. toUpper is a no-op for a single-byte-encoded string with an Asian script; you can't do that with a UTF-8 string.

But you have to check what encoding it's in, and given that not all codepages are that simple to uppercase, some generic algorithm is required.
You have to check the language, but my point is that you can look at the header and know that toUpper has to do nothing for a single-byte-encoded string of an Asian script which doesn't have uppercase characters. With UTF-8, you have to decode the entire string to find that out.
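A rough sketch of that argument (the language IDs and run layout here are illustrative assumptions): with per-run language tags in the header, toUpper can skip whole runs whose script has no case distinction, rather than decoding every code unit the way a UTF-8 toUpper must.

```python
# Sketch: toUpper that consults per-run language tags instead of decoding.
# 0x10 stands in for a hypothetical caseless script ID (e.g. a CJK script).

CASELESS = {0x10}

def to_upper(runs):
    """runs: list of (language_id, bytes). Returns uppercased runs."""
    out = []
    for lang, data in runs:
        if lang in CASELESS:
            out.append((lang, data))          # no-op: whole run skipped
        else:
            out.append((lang, data.upper()))  # per-language case mapping
    return out
```

A string that is entirely in a caseless script never touches its payload at all; a UTF-8 toUpper has to walk the whole string to discover the same thing.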


They may seem superficially similar but they're not. For example, from the beginning, I have suggested a more complex header that can enable multi-language strings, as one possible solution. I don't think code
pages provided that.

The problem is how you would define an uppercase algorithm for a multilingual string with 3 distinct 256-codepoint codespaces (windows). I bet it won't be pretty.
How is it done now? It isn't pretty with UTF-8 now either, as some languages have uppercase characters and others don't. The version of toUpper for my encoding will be similar, but it will do less work, because it doesn't have to be invoked for every character in the string.
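One plausible shape for such an algorithm, sketched under the assumption that each window gets its own 256-entry case-mapping table (all the tables below are illustrative, not real codepage data):

```python
# Sketch: per-window uppercase via lookup tables.
# Each language window gets a 256-entry table; windows whose script has no
# case use the identity table and are skipped without touching the payload.

IDENTITY = bytes(range(256))

# Illustrative table: ASCII-style a-z -> A-Z mapping for window 0.
ASCII_UPPER = bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in range(256))

UPPER_TABLES = {0: ASCII_UPPER, 2: IDENTITY}  # window ID -> case table

def to_upper_multi(runs):
    """runs: list of (window_id, bytes); applies each window's table per run."""
    return [(w, data if UPPER_TABLES[w] is IDENTITY
                 else data.translate(UPPER_TABLES[w]))
            for w, data in runs]
```

The per-run dispatch is one table lookup per run, not per character, which is where the "less work" claim comes from.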

I still don't see how your solution scales to beyond 256 different codepoints per string (= multiple pages/parts of UCS ;) ).
I assume you're talking about Chinese, Korean, etc. alphabets? I mentioned those to Walter earlier, they would have a two-byte encoding. No way around that, but they would still be easier to deal with than UTF-8, because of the header.
