On Friday, 7 March 2014 at 23:13:50 UTC, H. S. Teoh wrote:
On Fri, Mar 07, 2014 at 10:35:46PM +0000, Sarath Kodali wrote:
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev wrote:
>On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu wrote:
[...]
>>Clearly one might argue that their app has no business dealing
>>with diacriticals or Asian characters. But that's the typical
>>provincial view that marred many languages' approach to UTF and
>>internationalization.
>
>So is yours, if you think that making everything magically a dchar
>is going to solve all problems.
>
>The TDPL example only showcases the problem. Yes, it works with
>Swedish. Now try it again with Sanskrit.

+1
In Indian languages, a character consists of one or more Unicode
code points. For example, the Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 Unicode code points. So to search for this character I
have to use a substring search.
[...]
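
To make the quoted point concrete, here is a minimal D sketch
(hypothetical code, assuming the cluster's standard Devanagari
spelling: da, virama, dha, virama, ra, virama, ya). Counting code
points sees seven elements where the user sees one character:

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // Devanagari "ddhrya": da, virama, dha, virama, ra, virama, ya
        string s = "\u0926\u094D\u0927\u094D\u0930\u094D\u092F";

        writeln(s.walkLength);            // 7 -- code points (dchars)
        writeln(s.byGrapheme.walkLength); // 1 -- user-perceived character
    }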

That's what I've been arguing for. The most general form of character
searching in Unicode requires substring searching, and similarly many
character-based operations on Unicode strings are effectively
substring operations, because said "character" may be a single code
point encoded as multiple code units, or, as in your case, multiple
code points. Since that's the case, we might as well forget the
distinction between "character" and "string", and treat all such
operations as substring operations (even if the operand is supposedly
"just one character long").

This would allow us to get rid of the hackish auto-decoding of narrow strings, and thus eliminate the needless overhead of always decoding.
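
As a rough sketch of what decoding-free searching could look like
(not the library's actual mechanism), substring search over the raw
UTF-8 code units already works, e.g. via std.string.representation:

    import std.algorithm.searching : canFind;
    import std.string : representation;

    void main()
    {
        string haystack = "fa\u00E7ade"; // "façade"
        string needle = "\u00E7";        // "ç"

        // Substring search over raw UTF-8 code units: no decoding,
        // yet multibyte sequences are still matched correctly.
        assert(haystack.representation.canFind(needle.representation));
    }

Because UTF-8 is self-synchronizing, a byte-level match of a valid
needle can only occur at a real character boundary.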

That won't work, because your needle might be in a different normalization form than your haystack, so a byte-by-byte comparison will not be able to find it.
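
For example (a minimal sketch; "café" spelled in NFD form in the
haystack but NFC form in the needle):

    import std.algorithm.searching : canFind;
    import std.uni : normalize, NFC;

    void main()
    {
        string haystack = "caf\u0065\u0301"; // "café", decomposed (NFD)
        string needle = "caf\u00E9";         // "café", precomposed (NFC)

        // A raw byte-by-byte search misses the match entirely.
        assert(!haystack.canFind(needle));

        // Normalizing both sides to the same form first makes it work.
        assert(normalize!NFC(haystack).canFind(normalize!NFC(needle)));
    }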
