On 2010-01-30 22:06:06 -0500, Lionello Lunesu <l...@lunesu.remove.com> said:

On 30-1-2010 1:59, Andrei Alexandrescu wrote:
bearophile wrote:
Andrei Alexandrescu:
Currently arrays of characters count as random-access ranges, which
is not true for arrays of char and wchar. I plan to make std.range
aware of that and only characterize char[] and wchar[] (and their
qualified versions) as bidirectional ranges.

32 bits are not enough to represent certain "characters", they need
more than one of such dchar. So dchar too may be a bidirectional range.

[citation needed]

I also doubt 32-bit is not enough. In fact, Unicode has 0x10FFFF
as the highest code point.

32 bits are enough to cover all code points. But there are many combining code points in Unicode, allowing you to combine a diacritic with various other characters, such as an acute accent with a 'k'. Some of these combinations also exist in precombined form, and the two forms are considered equivalent. So if you want to count the number of characters the user actually sees instead of counting code points, then you need to take these combining code points into account.
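
To make the distinction concrete, here is a minimal sketch in Python (chosen only for illustration, not D) using the standard unicodedata module. The helper name count_base_characters is hypothetical, and skipping code points in the Mark categories is only a rough approximation of what a user sees, not a full grapheme segmentation.

```python
import unicodedata

# The highest code point fits in 21 bits, so a 32-bit dchar can hold any of them.
MAX_CODE_POINT = 0x10FFFF
bits_needed = MAX_CODE_POINT.bit_length()

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE: one code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT: two code points


def count_base_characters(s):
    """Hypothetical helper: count code points that are not combining marks
    (general categories Mn, Mc, Me). A rough approximation only."""
    return sum(1 for ch in s if not unicodedata.category(ch).startswith("M"))


# Normalization maps the two equivalent spellings onto each other.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
```

Here len(decomposed) is 2 while count_base_characters(decomposed) is 1, which is the gap between counting code points and counting what the user sees.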

But if you really wanted to iterate over "characters" instead of code points, note that it can become quite hard once you take into account double diacritics, which are combining diacritic signs placed across two letters. So I think it's reasonable to have dchar, a code point, as the base unit for iterating over a string.
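
As a rough illustration of the double-diacritic problem, again in Python with the standard unicodedata module, the sketch below uses U+0361 (COMBINING DOUBLE INVERTED BREVE) as the example; the variable names are made up for this note.

```python
import unicodedata

# 't' + COMBINING DOUBLE INVERTED BREVE (U+0361) + 's':
# the tie bar is drawn across both letters.
tied = "t\u0361s"

code_points = len(tied)

# Split the string into combining marks and everything else.
marks = [ch for ch in tied if unicodedata.category(ch).startswith("M")]
bases = [ch for ch in tied if not unicodedata.category(ch).startswith("M")]
```

The string holds three code points, one mark and two base letters. The mark follows the 't' in memory yet visually spans both letters, so whether this is one "character" or two depends on the definition you pick. Iterating by code point avoids having to decide.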

http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/Unicode_normalization

Another interesting case:
http://en.wikipedia.org/wiki/Combining_grapheme_joiner
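
A small sketch of why that one is interesting, under the same assumptions as before (Python and its standard unicodedata module, illustrative variable names): the joiner is itself a nonspacing mark with combining class 0, so it stops canonical reordering of the marks on either side of it.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

cgj_category = unicodedata.category(CGJ)   # general category of the joiner
cgj_class = unicodedata.combining(CGJ)     # canonical combining class

# Without CGJ, normalization reorders the marks by combining class:
# the dot below (U+0323, class 220) moves ahead of the acute (U+0301, class 230).
plain = "a\u0301\u0323"
reordered = unicodedata.normalize("NFD", plain)

# With CGJ in between, the original order of the marks is preserved.
joined = "a\u0301" + CGJ + "\u0323"
preserved = unicodedata.normalize("NFD", joined)
```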

Unicode, isn't it great?


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
