25-May-2013 23:51, Joakim wrote:
On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
You can map a codepage to a subset of UCS :)
That's what they do internally anyway.
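To make the mapping concrete, here is a minimal Python sketch, assuming the standard `cp1251` codec is a fair stand-in for "a codepage". The helper name `codepage_to_ucs` is made up for illustration.

```python
def codepage_to_ucs(codec):
    """Build a byte -> Unicode code point table for a single-byte codec."""
    table = {}
    for b in range(256):
        try:
            table[b] = ord(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the codepage
    return table

cp1251 = codepage_to_ucs("cp1251")
print(hex(cp1251[0xC0]))  # code point that byte 0xC0 maps to
```

Each assigned byte lands on exactly one code point, so the codepage is just a small window onto the larger code space.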
If I understand you correctly, you propose to define a string as a header
that denotes a set of windows in code space? I still fail to see how that
would scale; see below.
Something like that. For a multi-language string encoding, the header
would contain a single byte for every language used in the string, along
with multiple index bytes to signify the start and finish of every run
of single-language characters in the string. So, a list of languages and
a list of pure single-language substrings. This is just off the top of
my head, I'm not suggesting it is definitive.
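A rough Python sketch of one possible layout, purely to make the idea tangible. Everything here is an assumption: the field order, the one-byte offsets, the numeric language IDs, and the names `pack_runs` and `unpack_runs` are invented for illustration and are not a proposed format.

```python
def pack_runs(runs):
    """runs: list of (language_id, payload) pairs, where each payload is
    already encoded in that language's single-byte codepage."""
    langs = []
    for lang, _ in runs:
        if lang not in langs:
            langs.append(lang)
    # Header: language count, language IDs, run count, then one
    # (language index, start, end) triple per run.
    header = bytearray([len(langs)]) + bytearray(langs) + bytearray([len(runs)])
    body = bytearray()
    for lang, payload in runs:
        start = len(body)
        body += payload
        end = len(body)
        if end > 255:
            raise ValueError("one-byte offsets cap the body at 255 bytes")
        header += bytes([langs.index(lang), start, end])
    return bytes(header + body)

def unpack_runs(blob):
    """Inverse of pack_runs: recover the (language_id, payload) pairs."""
    n_langs = blob[0]
    langs = blob[1:1 + n_langs]
    n_runs = blob[1 + n_langs]
    pos = 2 + n_langs
    body_start = pos + 3 * n_runs
    runs = []
    for _ in range(n_runs):
        idx, start, end = blob[pos:pos + 3]
        runs.append((langs[idx], blob[body_start + start:body_start + end]))
        pos += 3
    return runs

runs = [(1, b"hello "), (2, b"\xef\xf0\xe8\xe2\xe5\xf2")]
blob = pack_runs(runs)
```

With these two runs the header takes 10 bytes for 12 bytes of text, and one-byte offsets already limit the body to 255 bytes. Wider offsets would lift that limit at the cost of a larger header.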
Runs away in horror :) It's a mess even before you get to the details.
Another point about sometimes using a 2-byte encoding: welcome to the
nice world of big-endian vs. little-endian byte order, i.e. the very trap
UTF-16 has stepped into.
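A small Python illustration of the byte-order problem, using the standard `struct` module and the built-in UTF-16 codecs. The choice of the letter Ж (U+0416) is arbitrary.

```python
import struct

code_unit = 0x0416  # Cyrillic capital letter Zhe

big = struct.pack(">H", code_unit)     # most significant byte first
little = struct.pack("<H", code_unit)  # least significant byte first

# A reader that assumes the wrong byte order gets a different value.
misread = struct.unpack(">H", little)[0]

# Python's explicit-endian UTF-16 codecs produce the same two layouts.
same_as_be = "\u0416".encode("utf-16-be") == big
same_as_le = "\u0416".encode("utf-16-le") == little
```

The same two bytes decode to 0x0416 or 0x1604 depending on the assumed order, which is why UTF-16 needs a BOM or out-of-band agreement.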
--
Dmitry Olshansky