On 2011-01-15 20:49:00 -0500, Jonathan M Davis <jmdavisp...@gmx.com> said:

On Saturday 15 January 2011 04:24:33 Michel Fortin wrote:
I have my idea.

I think it'd be a good idea to improve upon Andrei's first idea --
which was to treat char[], wchar[], and dchar[] all as ranges of dchar
elements -- by changing the element type to be the same as the string.
For instance, iterating on a char[] would give you slices of char[],
each having one grapheme.

The second component would be to make the string equality operator (==)
for strings compare them in their normalized form, so that ("e" with
combining acute accent) == (pre-combined "é"). I think this would make
D support for Unicode much more intuitive.

This implies some semantic changes, mainly that everywhere you write a
"character" you must use double-quotes (string "a") instead of single
quote (code point 'a'), but from the user's point of view that's pretty
much all there is to change.

There'll still be plenty of room for specialized algorithms, but their
purpose would be limited to optimization. Correctness would be taken
care of by the basic range interface, and foreach should follow suit
and iterate by grapheme by default.

I wrote this example (or something similar) earlier in this thread:

        foreach (grapheme; "exposé")
                if (grapheme == "é")
                        break;

In this example, even if one of these two strings uses the pre-combined
form of "é" and the other uses a combining acute accent, the equality
would still hold since foreach iterates on full graphemes and ==
compares using normalization.
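
The same behaviour can be approximated in library code. What follows is only a rough sketch, not the proposed language change: it assumes a Phobos whose std.uni provides graphemeStride and normalize, and the names GraphemeSlices and sameText are made up for illustration. The range yields slices of the original string, one grapheme each, and the comparison is done on the NFC-normalized forms.

```d
import std.uni : graphemeStride, normalize, NFC;

/// Hypothetical helper: equality after canonical normalization (NFC),
/// standing in for the proposed behaviour of == on strings.
bool sameText(const(char)[] a, const(char)[] b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

/// Hypothetical range whose elements are slices of the same string type,
/// each slice holding exactly one grapheme.
struct GraphemeSlices
{
    string rest;

    @property bool empty() { return rest.length == 0; }

    // graphemeStride gives the number of code units in the grapheme
    // that starts at index 0, so the slice covers one full grapheme.
    @property string front() { return rest[0 .. graphemeStride(rest, 0)]; }

    void popFront() { rest = rest[graphemeStride(rest, 0) .. $]; }
}

void main()
{
    string decomposed  = "expose\u0301"; // "e" followed by combining acute accent
    string precomposed = "expos\u00E9";  // pre-combined "é"

    assert(decomposed != precomposed);         // the code units differ
    assert(sameText(decomposed, precomposed)); // equal once normalized

    bool found = false;
    foreach (grapheme; GraphemeSlices(decomposed))
    {
        // grapheme is a string slice; the last one is "e" + accent
        if (sameText(grapheme, "\u00E9"))
        {
            found = true;
            break;
        }
    }
    assert(found);
}
```

Here the element type of the range is string, the same type as the string being iterated, which is the point of the first component above.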

I think that that would cause definite problems. Having the element type of the range be the same type as the range seems like it could cause a lot of problems in std.algorithm and the like, and it's _definitely_ going to confuse programmers. I'd expect it to be highly bug-prone. They _need_ to be separate types.

I remember that someone already complained about this issue because he had a tree of ranges, and Andrei said he would take a look at this problem eventually. Perhaps now would be a good time.


Now, given that dchar can't actually work completely as an element type, you'd either need the string type to be a new type or the element type to be a new type. So, either the string type has char[], wchar[], or dchar[] for its element type, or char[], wchar[], and dchar[] have something like uchar as their element type, where uchar is a struct which contains a char[], wchar[], or dchar[] which holds a single grapheme.

Having a new type for graphemes would work too. My preference still goes to reusing the string type because it makes the semantics simpler to understand, especially when comparing graphemes with literals.


--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/
