On Thu, Jun 02, 2016 at 04:38:28PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
> On 06/02/2016 04:36 PM, tsbockman wrote:
> > Your examples will pass or fail depending on how (and whether) the
> > 'ö' grapheme is normalized.
>
> And that's fine. Want graphemes, .byGrapheme wags its tail in that
> corner. Otherwise, you work on code points which is a completely
> meaningful way to go about things. What's not meaningful is the random
> results you get from operating on code units.
>
> > They only ever succeeds because 'ö' happens to be one of the
> > privileged graphemes that *can* be (but often isn't!) represented as
> > a single code point. Many other graphemes have no such
> > representation.
>
> Then there's no dchar for them so no problem to start with.
>
> s.find(c) ----> "Find code unit c in string s"
[...]

This is a ridiculous argument.  We might as well say, "there's no
single-byte UTF-8 encoding that can represent Ш, so that's no problem to
start with" -- since we can just define it away by saying s.find(c) ==
"find byte c in string s", and thereby justify using ASCII as our
standard string representation.

The point is that dchar is NOT ENOUGH TO REPRESENT A SINGLE CHARACTER in
the general case.  It is adequate for a subset of characters -- just
like ASCII is also adequate for a subset of characters.  If you only
need to work with ASCII, it suffices to work with ubyte[]. Similarly, if
your work is restricted to languages without combining diacritics, then
a range of dchar suffices. But a range of dchar is NOT good enough in
the general case, and arguing that it is only makes you look like a
fool.

Appealing to normalization doesn't change anything either, since only a
subset of base character + diacritic combinations will normalize to a
single code point.  If the string has a base character + diacritic
combination that doesn't have a precomposed code point, it will NOT fit
in a dchar.  (And keep in mind that the notion of diacritic is still
very Euro-centric. In Korean, for example, a single character is
composed of multiple parts, each of which occupies 1 code point. While
some precomposed combinations do exist, they don't cover all of the
possibilities, so normalization won't help you there.)


T

-- 
Frank disagreement binds closer than feigned agreement.
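
To make the normalization point concrete, here is a minimal D sketch, not
a definitive treatment. It assumes Phobos' std.uni (byGrapheme, normalize)
and the usual auto-decoding of strings into ranges of dchar. The helper
names codePoints and graphemes are hypothetical, and 'q' + U+0308 is just
one illustrative combination that, as far as I know, has no precomposed
code point.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical helper: number of code points in s
/// (a string used as a range auto-decodes to dchar).
size_t codePoints(string s) { return s.walkLength; }

/// Hypothetical helper: number of graphemes in s.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    dchar oUmlaut = '\u00F6';          // precomposed 'ö'
    string precomposed = "\u00F6";     // 'ö' as one code point
    string decomposed  = "o\u0308";    // 'o' + combining diaeresis

    // Same grapheme, different code point counts.
    assert(codePoints(precomposed) == 1);
    assert(codePoints(decomposed) == 2);
    assert(graphemes(precomposed) == 1);
    assert(graphemes(decomposed) == 1);

    // Searching by dchar depends on how the text happens to be encoded.
    assert(precomposed.canFind(oUmlaut));
    assert(!decomposed.canFind(oUmlaut));

    // NFC rescues 'ö' only because a precomposed code point exists.
    assert(normalize!NFC(decomposed).canFind(oUmlaut));

    // 'q' + combining diaeresis: still two code points after NFC,
    // so this one grapheme cannot be held in a single dchar.
    string noPrecomposed = "q\u0308";
    assert(codePoints(normalize!NFC(noPrecomposed)) == 2);
    assert(graphemes(noPrecomposed) == 1);
}
```

The last two assertions are the case in question: normalization has
nothing to compose to, so only grapheme-level iteration treats the
sequence as one character.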