On 02-Aug-12 22:42, Andrei Alexandrescu wrote:
On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
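char[] input = ...;
size_t idx = ...;
size_t len = stride(input, idx); // byte length of the sequence at idx
// unaligned 4-byte load: assumes little-endian and idx + 4 <= input.length
uint u8word = *cast(uint*)(input.ptr+idx);
// u8word contains the full UTF-8 sequence, plus up to 3 trailing bytes
u8word &= cast(uint)((1UL<<(8*len)) - 1); // mask out extra bytes; 1UL because len may be 4
// now u8word is a complete UTF-8 sequence in one uint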


Barring its hacky nature, I claim that the number obtained is in no way
worse than a distilled codepoint. It is a number that maps 1:1 to any
codepoint in the range [0..0x10FFFF]. Let me call it a UTF-8 word.
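
For illustration, here is a minimal sketch of the same idea without the pointer cast. It assumes valid UTF-8 input; utf8Word is a hypothetical helper name, not a Phobos function. It builds the word byte by byte, so it neither reads past the end of the array nor depends on endianness, at the cost of the single-load speed the original trick is after.

```d
import std.utf : stride;

// Hypothetical helper (not part of Phobos): packs the UTF-8 sequence
// starting at idx into a uint, first byte in the low 8 bits.
uint utf8Word(const(char)[] input, size_t idx)
{
    immutable len = stride(input, idx); // 1 to 4 for valid UTF-8
    uint word = 0;
    foreach (i; 0 .. len)
        word |= cast(uint) input[idx + i] << (8 * i);
    return word;
}

void main()
{
    string s = "aé€😀"; // 1-, 2-, 3- and 4-byte sequences
    bool[uint] seen;
    size_t idx = 0;
    while (idx < s.length)
    {
        immutable word = utf8Word(s, idx);
        assert(word !in seen); // distinct code points give distinct words
        seen[word] = true;
        idx += stride(s, idx);
    }
    assert(utf8Word("a", 0) == 0x61);
    assert(utf8Word("é", 0) == 0xA9C3); // bytes C3 A9, low byte first
}
```

The 1:1 property follows from valid UTF-8 having exactly one encoding per code point, so the packed bytes identify the code point without decoding it.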

I like this idea of a "minimally decoded" character a lot: it's
isomorphic with UTF-32 but much cheaper to extract. (We could use ulong
if they add 5- and 6-byte characters).

The good news is that there *used to be* 5- and 6-byte sequences. Now there are at most 4 bytes. That's probably why such a technique has not been widely deployed yet. I don't think such a decision is easy to roll back.

I wonder if people have come up with
this and given it a name. If not, I'd say we call such a number an "olsh".

Cool, though it'd better be olsh8 so that we can use olsh16 for UTF-16 :)
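
A rough sketch of what an olsh16 could look like, assuming well-formed UTF-16 input. The utf16Word name and the packing order (first code unit in the low 16 bits) are illustrative choices, not anything settled in this thread.

```d
// Hypothetical helper: packs one UTF-16 encoded code point
// (1 or 2 code units) into a uint, first code unit in the low 16 bits.
uint utf16Word(const(wchar)[] input, size_t idx)
{
    uint word = input[idx];
    // A lead surrogate (0xD800 to 0xDBFF) is followed by a trail surrogate.
    if (word >= 0xD800 && word <= 0xDBFF)
        word |= cast(uint) input[idx + 1] << 16;
    return word;
}

void main()
{
    assert(utf16Word("a"w, 0) == 0x61);
    // U+1F600 is encoded as the surrogate pair D83D DE00.
    assert(utf16Word("😀"w, 0) == 0xDE00D83D);
}
```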

--
Dmitry Olshansky
