Steven Schveighoffer Wrote:

> On Sat, 15 Jan 2011 12:11:59 -0500, Lutger Blijdestijn
> <lutger.blijdest...@gmail.com> wrote:
>
> > Steven Schveighoffer wrote:
> >
> > ...
> >>> I think a good standard to evaluate our handling of Unicode is to see
> >>> how easy it is to do things the right way. In the above, foreach would
> >>> slice the string grapheme by grapheme, and the == operator would
> >>> perform a normalized comparison. While it works correctly, it's
> >>> probably not the most efficient way to do thing however.
> >>
> >> I think this is a good alternative, but I'd rather not impose this on
> >> people like myself who deal mostly with English. I think this should be
> >> possible to do with wrapper types or intermediate ranges which have
> >> graphemes as elements (per my suggestion above).
> >>
> >> Does this sound reasonable?
> >>
> >> -Steve
> >
> > If its a matter of choosing which is the 'default' range, I'd think
> > proper unicode handling is more reasonable than catering for english /
> > ascii only.
> > Especially since this is already the case in phobos string algorithms.
>
> English and (if I understand correctly) most other languages. Any
> language which can be built from composable graphemes would work. And in
> fact, ones that use some graphemes that cannot be composed will also work
> to some degree (for example, opEquals).
>
> What I'm proposing (or think I'm proposing) is not exactly catering to
> English and ASCII, what I'm proposing is simply not catering to more
> complex languages such as Hebrew and Arabic. What I'm trying to find is a
> middle ground where most languages work, and the code is simple and
> efficient, with possibilities to jump down to lower levels for performance
> (i.e. switch to char[] when you know ASCII is all you are using) or jump
> up to full unicode when necessary.
>
> Essentially, we would have three levels of types:
>
> char[], wchar[], dchar[] -- Considered to be arrays in every way.
> string_t!T (string, wstring, dstring) -- Specialized string types that do
> normalization to dchars, but do not handle perfectly all graphemes. Works
> with any algorithm that deals with bidirectional ranges. This is the
> default string type, and the type for string literals. Represented
> internally by a single char[], wchar[] or dchar[] array.
> * utfstring_t!T -- specialized string to deal with full unicode, which may
> perform worse than string_t, but supports everything unicode supports.
> May require a battery of specialized algorithms.
>
> * - name up for discussion
>
> Also note that phobos currently does *no* normalization as far as I can
> tell for things like opEquals. Two char[]'s that represent equivalent
> strings, but not in the same way, will compare as !=.
>
> -Steve

The above compromise provides zero benefit. The proposed default type, string_t, is incorrect and will cause bugs. I would prefer that the standard library provide no normalization at all and force me to use a third-party library, rather than provide an incomplete implementation that gives me a false sense of correctness and causes very subtle, hard-to-find bugs.

Moreover, even if you ignore Hebrew as a tiny, insignificant minority, you cannot do the same for Arabic, which is used by hundreds of *millions* of people.

I firmly believe that, in accordance with D's principle that the default behavior should be the correct and safe option, D should have the full Unicode type (utfstring_t above) as the default.

You need only a subset of the functionality because you only use English? You don't want the Unicode overhead? Then use an ASCII type instead. In the same vein, a geneticist should use a DNA sequence type and not Unicode text.
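
To make the concern concrete, here is a minimal sketch in Python rather than D, chosen only because its standard unicodedata module makes the behavior easy to check. It is not a proposal for Phobos. The helper names code_points and normalized_equal are made up for illustration. The sketch shows two things discussed above: canonically equivalent strings compare unequal unless both sides are normalized, and normalizing to single code points cannot cover every grapheme, because some sequences (here a Hebrew bet with dagesh) have no composed form that NFC will produce.

```python
import unicodedata


def code_points(s):
    """Return the code points of s as 'U+XXXX' labels (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]


def normalized_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The same user-perceived character, encoded two ways.
precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT, two code points

# A plain comparison works on code points and sees two different strings.
plain_equal = precomposed == decomposed

# Normalizing both sides first makes the equivalent strings compare equal.
canon_equal = normalized_equal(precomposed, decomposed)

# Hebrew bet + dagesh: NFC does not compose this pair into one code point,
# so one grapheme still spans two code points after normalization.
bet_dagesh = "\u05D1\u05BC"
nfc_bet = unicodedata.normalize("NFC", bet_dagesh)

if __name__ == "__main__":
    print(plain_equal)            # False
    print(canon_equal)            # True
    print(code_points(nfc_bet))   # ['U+05D1', 'U+05BC']
```

Iterating nfc_bet code point by code point, as a range of dchar would, still splits one user-perceived character in two. That is the kind of case where a type that only normalizes to code points gives an answer that looks right for Latin text and is wrong elsewhere.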