On Tuesday, May 31, 2016 23:36:20 Marco Leise via Digitalmars-d wrote:
> Am Tue, 31 May 2016 16:56:43 -0400
> schrieb Andrei Alexandrescu <seewebsiteforem...@erdani.org>:
> > On 05/31/2016 03:44 PM, Jonathan M Davis via Digitalmars-d wrote:
> > > In the vast majority of cases what folks care about is full character
> >
> > How are you so sure? -- Andrei
>
> Because a full character is the typical unit of a written
> language. It's what we visualize in our heads when we think
> about finding a substring or counting characters. A special
> case of this is the reduction to ASCII where we can use code
> units in place of grapheme clusters.
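[Editor's note: a minimal sketch of that ASCII special case. For a pure-ASCII string, code units, code points, and grapheme clusters all coincide, so code-unit operations already give character-correct results:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // For pure ASCII, all three levels of abstraction agree.
    string s = "hello";
    assert(s.length == 5);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points (autodecoded)
    assert(s.byGrapheme.walkLength == 5); // grapheme clusters
}
```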
Exactly. How many folks here have written code where the correct thing to do is to search on code points? Under what circumstances is that even useful? Code points are a mid-level abstraction between UTF-8/16 and graphemes that are not particularly useful on their own. Yes, by using code points, we eliminate the differences between the encodings, but how much code even operates on multiple string types? Having all of your strings use the same encoding fixes the consistency problem just as well as autodecoding to dchar everywhere does - and without the efficiency hit.

Typically, folks operate on string or char[] unless they're talking to the Windows API, in which case, they need wchar[]. Our general recommendation is that D code operate on UTF-8 except when it has to interact with something that requires a different encoding (like the Win32 API). In that case, ideally those strings are converted to UTF-8 once they enter the D code and are operated on as UTF-8, and anything that has to be output in a different encoding stays UTF-8 until output time, when it's converted to UTF-16 or whatever the target encoding is. Not much of anyone is recommending that you use dchar[] everywhere, but that's essentially what the range API is trying to force.

I think that it's very safe to say that the vast majority of string processing is looking to operate either on strings as a whole or on individual, full characters within a string. Code points are neither. While code may play tricks with Unicode to be efficient (e.g. operating at the code unit level where it can rather than decoding to either code points or graphemes), or it might make assumptions about its data being ASCII-only, aside from explicit Unicode processing code, I have _never_ seen code that was actually looking to logically operate on only pieces of characters.
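[Editor's note: a short sketch of where the three counts diverge once combining characters are involved, using std.uni.byGrapheme. Here "noël" is spelled with 'e' plus a combining diaeresis; autodecoding iterates the 5 code points, which is neither the code-unit count nor the character count:]

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" spelled as 'e' + U+0308 COMBINING DIAERESIS
    string s = "noe\u0308l";
    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points - what autodecoding iterates
    assert(s.byGrapheme.walkLength == 4); // full characters: n, o, ë, l
}
```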
While it may operate on code units for efficiency, it's always looking to logically operate on a string as a unit or on whole characters. Anyone looking to operate on code points is going to need to take into account the fact that they're not full characters, just like anyone who operates on code units needs to take into account the fact that they're not whole characters. Operating on code points as if they were characters - which is exactly what D currently does with ranges - is just plain wrong. We need to support operating at the code point level for those rare cases where it's actually useful, but autodecoding makes no sense. It incurs a performance penalty without actually giving correct results except in those rare cases where you want code points instead of full characters. And only Unicode experts are ever going to want that.

The average programmer who is not super Unicode savvy doesn't even know what code points are. They're clearly going to be looking to operate on strings as sequences of characters, not sequences of code points. I don't see how anyone could expect otherwise. Code points are a mid-level Unicode abstraction that only those who are Unicode savvy are going to know or care about, let alone want to operate on.

- Jonathan M Davis
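[Editor's note: a sketch of how treating code points as characters silently goes wrong. Reversing a string with std.range.retro operates, via autodecoding, on code points, so a combining mark ends up detached from its base character:]

```d
import std.conv : to;
import std.range : retro;

void main()
{
    string s = "noe\u0308l";      // "noël" with a combining diaeresis
    string r = s.retro.to!string; // reverses code points, not characters
    // The diaeresis now follows 'l', so the result renders as "l̈eon"
    // rather than the character-correct reversal "lëon".
    assert(r == "l\u0308eon");
}
```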