On 21.10.2011 06:06, Jonathan M Davis wrote:
It's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.
Indeed - expressing strings as arrays of characters will always fall short of the Unicode concept in some way. A truly Unicode-compliant language has to handle strings as opaque objects that do not have any inherent encoding. A number of operations can be performed on these objects (concatenation, comparison, searching, etc.); any concrete memory representation can only be obtained by an explicit encoding operation.
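To make that concrete, here is a small sketch in Python 3 (whose str type already behaves roughly like such an opaque object): the string operations above need no encoding at all, and a byte representation only comes into existence through an explicit encode call.

```python
# Opaque string operations: no encoding involved anywhere here.
s = "Grüße" + ", " + "Welt"       # concatenation
assert "Welt" in s                 # searching
assert s == "Grüße, Welt"          # comparison

# A concrete memory representation exists only after explicit encoding.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")
assert isinstance(utf8, bytes)
assert len(utf8) != len(utf16)     # the representations differ; the string does not
assert utf8.decode("utf-8") == s   # decoding restores the same abstract string
```

The point is that len(utf8) and len(utf16) are properties of particular encodings, not of the string itself.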
Python 3, for example, took a fundamental step by introducing exactly this distinction. At first it seems silly, having to think about encodings so often when writing trivial code. After a short while, though, the strict conceptual separation between unencoded "strings" and encoded "arrays of something" really helps avoid ugly problems.
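A sketch of how that separation pays off in practice: in Python 3, str and bytes are distinct types, and mixing them is an immediate TypeError instead of silent mojibake further downstream.

```python
# Text and its encoded form are different types with different lengths.
text = "naïve"
data = text.encode("utf-8")
assert len(text) == 5              # five code points
assert len(data) == 6              # six bytes: "ï" encodes to two bytes in UTF-8

# Mixing the two is refused outright rather than producing garbage.
try:
    text + data                    # str + bytes
except TypeError:
    mixed = False
else:
    mixed = True
assert mixed is False

# At I/O boundaries the programmer must decide: decode on input, encode on output.
assert data.decode("utf-8") == text
```

This is the "ugly problems" part: the errors surface at the point where an encoding decision was forgotten, not three modules later.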
Sure, for a performance-critical language the issue becomes a lot trickier. I still think it is possible, and ultimately the only way to solve the tricky problems that will otherwise always crop up somewhere.