On Sat, May 25, 2013 at 09:51:42PM +0200, Joakim wrote:
> On Saturday, 25 May 2013 at 19:03:53 UTC, Dmitry Olshansky wrote:
> >If I take you right you propose to define string as a header that
> >denotes a set of windows in code space? I still fail to see how
> >that would scale see below.
>
> Something like that. For a multi-language string encoding, the
> header would contain a single byte for every language used in the
> string, along with multiple index bytes to signify the start and
> finish of every run of single-language characters in the string.
> So, a list of languages and a list of pure single-language
> substrings. This is just off the top of my head, I'm not suggesting
> it is definitive.
[...]

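A rough Python sketch of the layout described above, just to make the slicing question concrete. Everything here is an assumption for illustration: the `HeaderString` type, the `(language id, start, end)` run triples, and the language ids 1 and 2 are hypothetical, and nothing in the proposal pins down these details. It shows what a slice would have to do under such a scheme, next to the UTF-8 equivalent.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical header entry: (language id, start index, end index) for one
# run of single-language characters. End index is exclusive.
Run = Tuple[int, int, int]


@dataclass
class HeaderString:
    runs: List[Run]   # the "header": one entry per single-language run
    payload: bytes    # one byte per character, meaning depends on the run


def slice_header_string(s: HeaderString, lo: int, hi: int) -> HeaderString:
    """Return s[lo:hi], rebuilding the run list for the new string."""
    new_runs: List[Run] = []
    for lang, start, end in s.runs:           # walks every run in the header
        a, b = max(start, lo), min(end, hi)   # clip the run to the slice
        if a < b:
            # Rebase the clipped run so indices start at 0 in the new string.
            new_runs.append((lang, a - lo, b - lo))
    return HeaderString(new_runs, s.payload[lo:hi])


def slice_utf8(data: bytes, lo: int, hi: int) -> bytes:
    """Slice UTF-8 by code unit offsets; there is no header to rebuild.

    The offsets are assumed to fall on code point boundaries.
    """
    return data[lo:hi]


# Three runs: language 1, then language 2, then language 1 again.
s = HeaderString([(1, 0, 3), (2, 3, 6), (1, 6, 8)], b"abcXYZde")
t = slice_header_string(s, 2, 7)
print(t.runs)     # [(1, 0, 1), (2, 1, 4), (1, 4, 5)]
print(t.payload)  # b'cXYZd'

u = slice_utf8("h\u00e9llo".encode("utf-8"), 0, 3)
print(u.decode("utf-8"))  # hé
```

Even in this simplified form, every slice has to scan the header and allocate a new one, while the UTF-8 slice is a plain sub-range of the same bytes.
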
And just how exactly does that help with slicing? If anything, it makes
slicing way hairier and more error-prone than UTF-8. In fact, this one
point alone already defeats any performance gains you may have had with
a single-byte encoding. Now you can't do *any* slicing at all without
convoluted algorithms to determine what encoding is where at the
endpoints of your slice, and the resulting slice must have new headers
to indicate the start/end of every different-language substring. By the
time you're done with all that, you're going way slower than processing
UTF-8.

Again I say, I'm not 100% sold on UTF-8, but what you're proposing here
is far worse.


T

-- 
The best compiler is between your ears. -- Michael Abrash