Hi Simon,

On 09/25/2015 01:01 AM, Simon Spero wrote:
> [Some of this is may simple or prohibitively tricksy depending on alignment
> constraints (even though it's restricted to Prime Multilingual Plane :-) ]
>
> For some not un-realistic use cases, the most significant bytes for all the
> characters in a string are identical, even if the string is non-latin. For
> example, all the characters may be in the range U+0400--U+04FF, or
> U+0500--U+05FF.
> In these cases, it may be feasible to save the upper byte, then splat it
> into place when reconstituting the UTF-16 chars.
>
> Because of the assignment of unicode code-points, this technique is not as
> big as win as it might have been. Unlike (e.g.) 8859-5 or 8859-8, there are
> no punctuation marks, digits, or whitespace characters, which restricts use
> cases to very short strings (the lack of whitespace is the biggest
> problem). For the 254-like coding system I was experimenting with, for the
> cases were I didn't fall back to UTF-16, the savings were overwhelmed by
> the cost of header words and padding.
>
> It is possible to handle some of these mixtures, on some architectures,
> without resorting to LUTs or branches, but that's well in to non-goal
> territory for JEP-254. There might be some useful win just from being able
> to have an offset to be added to the packed value based if the high-bit is
> set or not. Anyone here from Москва?

Sure, many theoretical constructions may be devised. Not many of them are
practical. JEP-254 wins big time exactly because many strings *are*
single-byte storable in ASCII/8859-1, *especially* the long ones.

So, the very first thing you have to do is prove that an alternative scheme
successfully encodes a fair number of real strings. Otherwise, it is not
worth exploring any further. As you say, the lack of "usual" characters like
whitespace may be the deal breaker.

Adding an alternative coder is easy, but making sure it does not regress the
prevailing cases of 8859-1/UTF16 strings is much harder. Think about
branching costs, losing the bit tricks that are employed now with the binary
0/1 coder, etc.

Thanks,
-Aleksey
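
P.S. To make the shared-high-byte idea from the quoted message concrete, here
is a minimal sketch. It is not JDK code and not what JEP-254 does; the class
and method names (HighByteSketch, sharedHighByte, pack, unpack,
countNewlyPackable) are made up for illustration. It only shows the
"save the upper byte, splat it back" round trip, plus a crude way to count
how many sample strings such a scheme would pack beyond what 8859-1 already
covers.

```java
import java.util.List;

final class HighByteSketch {
    /** Returns the high byte shared by every char of s, or -1 if they differ or s is empty. */
    static int sharedHighByte(String s) {
        if (s.isEmpty()) {
            return -1;
        }
        int hi = s.charAt(0) >>> 8;
        for (int i = 1; i < s.length(); i++) {
            if ((s.charAt(i) >>> 8) != hi) {
                return -1;
            }
        }
        return hi;
    }

    /** Keeps only the low byte of each char; the caller stores the high byte once. */
    static byte[] pack(String s) {
        byte[] low = new byte[s.length()];
        for (int i = 0; i < low.length; i++) {
            low[i] = (byte) s.charAt(i);
        }
        return low;
    }

    /** Rebuilds the UTF-16 chars by putting the saved high byte back in place. */
    static String unpack(int hi, byte[] low) {
        char[] chars = new char[low.length];
        for (int i = 0; i < low.length; i++) {
            chars[i] = (char) ((hi << 8) | (low[i] & 0xFF));
        }
        return new String(chars);
    }

    /** Counts samples with a shared non-zero high byte, i.e. not already 8859-1. */
    static int countNewlyPackable(List<String> samples) {
        int count = 0;
        for (String s : samples) {
            if (sharedHighByte(s) > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String word = "Москва";
        int hi = sharedHighByte(word);
        System.out.println(unpack(hi, pack(word)).equals(word));
        // The space (U+0020) has a different high byte, so this one is not packable.
        System.out.println(sharedHighByte("Москва 2015"));
    }
}
```

Note how a single space or digit already defeats the scheme, which is the
whitespace problem mentioned in the quoted message. Running something like
countNewlyPackable over a corpus of real strings would be the first data
point to collect.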