On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
[...]
>>Clearly one might argue that their app has no business dealing with
>>diacriticals or Asian characters. But that's the typical provincial
>>view that marred many languages' approach to UTF and
>>internationalization.
>
>So is yours, if you think that making everything magically a dchar is
>going to solve all problems.
>
>The TDPL example only showcases the problem. Yes, it works with
>Swedish. Now try it again with Sanskrit.
+1
In Indian languages, a character consists of one or more Unicode code
points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 Unicode code points. So to search for this char I have to
use string search.
[...]
That's what I've been arguing for. The most general form of character
searching in Unicode requires substring searching, and similarly many
character-based operations on Unicode strings are effectively
substring-based operations, because said "character" may be a multibyte
code point, or, in your case, multiple code points. Since that's the
case, we might as well just forget about the distinction between
"character" and "string", and treat all such operations as substring
operations (even if the operand is supposedly "just 1 character long").
This would allow us to get rid of the hackish auto-decoding of narrow
strings, and thus eliminate the needless overhead of always decoding.
That won't work, because your needle might be in a different
normalization form than your haystack, so a byte-by-byte
comparison will not be able to find it.
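
To make the objection concrete, here is a minimal sketch. It is in Python rather than D only because its standard library exposes normalization directly through unicodedata.normalize; the behaviour shown is a property of Unicode, not of either language. The names DDHRYA and find_normalized are made up for this illustration, and the "café" strings are my own example, not from the thread.

```python
import unicodedata

# "ddhrya" from the quoted post: da, virama, dha, virama, ra, virama, ya.
DDHRYA = "\u0926\u094d\u0927\u094d\u0930\u094d\u092f"


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> int:
    """Substring search after bringing both operands to the same form.

    The returned index refers to the normalized haystack, not the original.
    """
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle)
    )


# One user-perceived character, seven code points, found by substring search.
assert len(DDHRYA) == 7
assert ("x" + DDHRYA + "y").find(DDHRYA) == 1

# Same visible text, different normalization forms.
haystack = "caf\u00e9"  # precomposed e-acute (NFC)
needle = "e\u0301"      # 'e' followed by a combining acute accent (NFD)

# Plain substring search misses it, at the code point level and the byte level.
assert haystack.find(needle) == -1
assert haystack.encode("utf-8").find(needle.encode("utf-8")) == -1

# It matches once both sides are normalized to the same form.
assert find_normalized(haystack, needle) == 3
```

So substring search handles the multi-code-point case fine as long as both operands are in the same form; the cost is that something has to normalize them first, which is not free either.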