Why do you decode ? (Seriously)

2012-08-02 Thread Dmitry Olshansky
Intrigued by a familiar topic in std.lexer, I've split it out. It's not as easy a question as it seems. Before you start with the usual "because a codepoint has semantic meaning, a codeunit is just bytes", ya-da, ya-da, let me explain something. Codepoint is indeed a complete piece of symbolic

Re: Why do you decode ? (Seriously)

2012-08-02 Thread Andrei Alexandrescu
On 8/2/12 12:47 PM, Dmitry Olshansky wrote: char[] input = ...; size_t idx = ...; size_t len = stride(input, idx); uint u8word = *cast(uint*)(input.ptr+idx); //u8word contains full UTF-8 sequence u8word &= (1<<(8*len)) -1; //mask out extra bytes //now u8word is a complete UTF-8 sequence in one uint
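
For readers following the quoted snippet, here is a minimal sketch of the same "UTF-8 word" idea in D. It is not the code from the thread: the helper name `utf8Word` is hypothetical, and it assembles the word byte by byte instead of casting the pointer, on the assumption that this sidesteps two concerns with the pointer version (an unaligned load, and reading up to three bytes past the end of the array). It also avoids the mask, whose shift count would reach 32 bits for a four-byte sequence. The first code unit lands in the lowest byte, which matches what the pointer cast would yield on a little-endian machine.

```d
import std.utf : stride;

/// Hypothetical helper (not from the thread): packs the UTF-8 sequence
/// that starts at input[idx] into one uint, first code unit in the low byte.
uint utf8Word(const(char)[] input, size_t idx)
{
    // stride returns the number of code units (1 to 4) in the sequence
    // and throws a UTFException if input[idx] is not a valid lead byte.
    immutable len = stride(input, idx);
    uint word = 0;
    // Copy only the bytes that belong to the sequence, so no masking is needed.
    foreach (i; 0 .. len)
        word |= cast(uint) input[idx + i] << (8 * i);
    return word;
}

unittest
{
    string s = "a\u00E9\u20AC";          // 'a', U+00E9, U+20AC
    assert(utf8Word(s, 0) == 0x61);      // 61
    assert(utf8Word(s, 1) == 0xA9C3);    // C3 A9
    assert(utf8Word(s, 3) == 0xAC82E2);  // E2 82 AC
}
```

Two distinct code points always produce distinct words here, so equality comparisons work on the packed value, but numeric ordering of the words does not follow code point order.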

Re: Why do you decode ? (Seriously)

2012-08-02 Thread Walter Bright
On 8/2/2012 11:42 AM, Andrei Alexandrescu wrote: I like this idea a lot: a minimally decoded character that's isomorphic with UTF-32 but much cheaper to extract. (We could use ulong if they add 5- and 6-byte characters). I wonder if people came up with this and gave it a name. If not, I'd say

Re: Why do you decode ? (Seriously)

2012-08-02 Thread Dmitry Olshansky
On 02-Aug-12 22:42, Andrei Alexandrescu wrote: On 8/2/12 12:47 PM, Dmitry Olshansky wrote: char[] input = ...; size_t idx = ...; size_t len = stride(input, idx); uint u8word = *cast(uint*)(input.ptr+idx); //u8word contains full UTF-8 sequence u8word &= (1<<(8*len)) -1; //mask out extra bytes //now

Re: Why do you decode ? (Seriously)

2012-08-02 Thread Artur Skawina
On 08/02/12 18:47, Dmitry Olshansky wrote: char[] input = ...; size_t idx = ...; size_t len = stride(input, idx); uint u8word = *cast(uint*)(input.ptr+idx); So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster to obtain? Iff unaligned accesses happen to be legal

Re: Why do you decode ? (Seriously)

2012-08-02 Thread Dmitry Olshansky
On 03-Aug-12 00:40, Artur Skawina wrote: On 08/02/12 18:47, Dmitry Olshansky wrote: char[] input = ...; size_t idx = ...; size_t len = stride(input, idx); uint u8word = *cast(uint*)(input.ptr+idx); So why do we use dchar and not UTF-8 word, as it's as good as dchar and faster to obtain?