Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 15:55:48 -0500, Michel Fortin
> <michel.for...@michelf.com> wrote:
>
> > On 2011-01-15 15:20:08 -0500, "Steven Schveighoffer"
> > <schvei...@yahoo.com> said:
> >
> >>> I'm not suggesting we impose it, just that we make it the default. If
> >>> you want to iterate by dchar, wchar, or char, just write:
> >>> foreach (dchar c; "exposé") {}
> >>> foreach (wchar c; "exposé") {}
> >>> foreach (char c; "exposé") {}
> >>> // or
> >>> foreach (dchar c; "exposé".by!dchar()) {}
> >>> foreach (wchar c; "exposé".by!wchar()) {}
> >>> foreach (char c; "exposé".by!char()) {}
> >>> and it'll work. But the default would be a slice containing the
> >>> grapheme, because this is the right way to represent a Unicode
> >>> character.
> >> I think this is a good idea. I previously was nervous about it, but
> >> I'm not sure it makes a huge difference. Returning a char[] is
> >> certainly less work than normalizing a grapheme into one or more code
> >> points, and then returning them. All that it takes is to detect all
> >> the code points within the grapheme. Normalization can be done if
> >> needed, but would probably have to output another char[], since a
> >> normalized grapheme can occupy more than one dchar.
> >
> > I'm glad we agree on that now.
>
> It's a matter of me slowly wrapping my brain around unicode and how it's
> used. It seems like it's a typical committee defined standard where there
> are 10 ways to do everything, I was trying to weed out the lesser used (or
> so I perceived) pieces to allow a more implementable library. It's doubly
> hard for me since I have limited experience with other languages, and I've
> never tried to write them with a computer (my language classes in high
> school were back in the days of actually writing stuff down on paper).
>
> I once told a colleague who was on a standards committee that their
> proposed KLV standard (key length value) was ridiculous. The wise
> committee had decided that in order to avoid future issues, the length
> would be encoded as a single byte if < 128, or 128 + length of the length
> field for anything higher. This means you could potentially have to parse
> and process a 127-byte integer!
>
> >> What if I modified my proposed string_t type to return T[] as its
> >> element type, as you say, and string literals are typed as
> >> string_t!(whatever)? In addition, the restrictions I imposed on
> >> slicing a code point actually get imposed on slicing a grapheme. That
> >> is, it is illegal to substring a string_t in a way that slices through
> >> a grapheme (and by deduction, a code point)?
> >
> > I'm not opposed to that on principle. I'm a little uneasy about having
> > so many types representing a string however. Some other raw comments:
> >
> > I agree that things would be more coherent if char[], wchar[], and
> > dchar[] behaved like other arrays, but I can't really see a
> > justification for those types to be in the language if there's nothing
> > special about them (why not a library type?).
>
> I would not be opposed to getting rid of those types. But I am very
> opposed to char[] not being an array. If you want a string to be
> something other than an array, make it have a different syntax. We also
> have to consider C compatibility.
>
> However, we are in radical-change mode then, and this is probably pushed
> to D3 ;) If we can find some way to fix the situation without
> invalidating TDPL, we should strive for that first IMO.
>
> > If strings and arrays of code units are distinct, slicing in the middle
> > of a grapheme or in the middle of a code point could throw an error, but
> > for performance reasons it should probably check for that only when
> > array bounds checking is turned on (that would require compiler support
> > however).
>
> Not really, it could use assert, but that throws an assert error instead
> of a RangeError. Of course, both are errors and will abort the program.
> I do wish there was a version(noboundscheck) to do this kind of stuff
> with...
>
> >> Actually, we would need a grapheme to be its own type, because
> >> comparing two char[]'s that don't contain equivalent bits and having
> >> them be equal, violates the expectation that char[] is an array.
> >> So the string_t!char would return a grapheme_t!char (names to be
> >> discussed) as its element type.
> >
> > Or you could make a grapheme a string_t. ;-)
>
> I'm a little uneasy having a range return itself as its element type. For
> all intents and purposes, a grapheme is a string of one 'element', so it
> could potentially be a string_t.
>
> It does seem daunting to have so many types, but at the same time, types
> convey relationships at compile time that can make coding impossible to
> get wrong, or make things actually possible when having a single type
> doesn't.
>
> I'll give you an example from a previous life:
>
> Tango had a type called DateTime. This type represented *either* a point
> in time, or a span of time (depending on how you used it). But I proposed
> we switch to two distinct types, one for a point in time, one for a span
> of time. It was argued that both were so similar, why couldn't we just
> keep one type? The answer is simple -- having them be separate types
> allows me to express relationships that the compiler enforces. For
> example, you can add two time spans together, but you can't add two points
> in time together. Or maybe you want a function to accept a time span
> (like a sleep operation). If there was only one type, then
> sleep(DateTime.now()) compiles and sleeps for what, 2011 years? ;)
>
> I feel that making extra types when the relationship between them is
> important is worth the possible repetition of functionality. Catching
> bugs during compilation is soooo much better than experiencing them during
> runtime.
>
> -Steve
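To make the point-in-time versus span-of-time argument concrete, here is a minimal sketch of how I read it. The names TimePoint, TimeSpan and sleepFor are made up for illustration and are not Tango's actual API; the tick values are arbitrary.

```d
// Hypothetical types for illustration only; not Tango's real API.
struct TimeSpan
{
    long ticks;

    // span + span and span - span both yield a span
    TimeSpan opBinary(string op)(TimeSpan rhs) const
        if (op == "+" || op == "-")
    {
        return TimeSpan(mixin("ticks " ~ op ~ " rhs.ticks"));
    }
}

struct TimePoint
{
    long ticks;

    // point + span and point - span both yield a point
    TimePoint opBinary(string op)(TimeSpan rhs) const
        if (op == "+" || op == "-")
    {
        return TimePoint(mixin("ticks " ~ op ~ " rhs.ticks"));
    }

    // point - point yields a span; point + point is deliberately not defined
    TimeSpan opBinary(string op)(TimePoint rhs) const
        if (op == "-")
    {
        return TimeSpan(ticks - rhs.ticks);
    }
}

// Accepts only a span, so passing a point in time is rejected at compile time.
void sleepFor(TimeSpan span) {}

void main()
{
    auto start = TimePoint(1_000);
    auto end = start + TimeSpan(250);

    assert((end - start) == TimeSpan(250));

    // The relationships the compiler enforces:
    static assert(!__traits(compiles, start + end));      // no point + point
    static assert(!__traits(compiles, sleepFor(start)));  // no sleeping for a point
    static assert(__traits(compiles, sleepFor(end - start)));
}
```

With a single combined type, both of the rejected expressions would compile, which is exactly the sleep(DateTime.now()) problem described above.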
I like Michel's proposed semantics, and I also agree with you that it should be a distinct string type and not break the consistency of regular arrays. Regarding your last point: do you mean that a grapheme would be a sub-type of string (a specialization where the string represents a single element)? If so, then it sounds good to me.
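For anyone following along, a small sketch of why the choice of element type matters. It counts the same text as UTF-8 code units, UTF-16 code units, code points and graphemes. Note the assumptions: it uses std.uni.byGrapheme from present-day Phobos rather than the by!dchar() or string_t API proposed in this thread, and countUnits and countGraphemes are helper names invented for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Count elements when iterating as char, wchar or dchar.
// foreach transcodes the UTF-8 string to the requested element type.
size_t countUnits(C)(string s)
{
    size_t n;
    foreach (C c; s)
        ++n;
    return n;
}

// Count user-perceived characters (grapheme clusters).
size_t countGraphemes(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "exposé" with a precomposed é (U+00E9)
    string precomposed = "expos\u00E9";
    // "exposé" as 'e' followed by a combining acute accent (U+0301)
    string decomposed = "expose\u0301";

    assert(countUnits!char(precomposed) == 7);   // é takes two UTF-8 code units
    assert(countUnits!wchar(precomposed) == 6);
    assert(countUnits!dchar(precomposed) == 6);
    assert(countGraphemes(precomposed) == 6);

    assert(countUnits!char(decomposed) == 8);    // U+0301 takes two UTF-8 code units
    assert(countUnits!wchar(decomposed) == 7);
    assert(countUnits!dchar(decomposed) == 7);   // seven code points
    assert(countGraphemes(decomposed) == 6);     // still six characters to a reader
}
```

Only the grapheme count stays the same for both spellings, which is the argument for making the grapheme the default element.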