On Thursday, 16 January 2014 at 06:59:43 UTC, Maxim Fomin wrote:
This is wrong. A string in D is de facto (by implementation; the spec
may say whatever is convenient for advertising D) an array of
single bytes which can hold UTF-8 code units. In no way is the string
type in D always a string in the sense of code
points/characters. Sometimes the string type happens to
behave like a 'string', but if you put UTF-16 or UTF-32 text in it,
it will remind you what the string type really is.
By implementation they are also UTF strings. String literals use
UTF, `char.init` is 0xFF and `wchar.init` is 0xFFFF, foreach over
narrow strings with `dchar` iterator variable type does UTF
decoding etc.
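These points are easy to check. A minimal sketch, assuming the source file is saved as UTF-8; `countCodePoints` is a hypothetical helper name of mine, not a library function:

```d
// Hypothetical helper: counts code points by letting foreach decode UTF-8.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) // a dchar loop variable makes foreach decode
        ++n;
    return n;
}

void main()
{
    // The default initializers are deliberately invalid UTF code units.
    static assert(char.init == 0xFF);
    static assert(wchar.init == 0xFFFF);

    string s = "säд";                // string literal, stored as UTF-8
    assert(s.length == 5);           // 1 + 2 + 2 code units
    assert(countCodePoints(s) == 3); // three code points after decoding
}
```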
I don't think you know what you're talking about; putting UTF-16
or UTF-32 in `string` is utter madness and not trivially
possible. We have `wchar`/`wstring` and `dchar`/`dstring` for
UTF-16 and UTF-32, respectively.
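To make the distinction concrete, here is a small sketch, again assuming a UTF-8 source file. The variable names are only for illustration:

```d
void main()
{
    string  s8  = "säд";  // UTF-8 code units
    wstring s16 = "säд"w; // UTF-16 code units
    dstring s32 = "säд"d; // UTF-32 code units

    // .length counts code units of the respective encoding.
    assert(s8.length == 5);
    assert(s16.length == 3);
    assert(s32.length == 3);

    // There is no implicit conversion from the wider string types
    // to string, so UTF-16/UTF-32 data does not end up there by accident.
    static assert(!is(wstring : string));
    static assert(!is(dstring : string));
}
```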
Operations on code units are rare, which is why the standard
library instead treats strings as ranges of code points, for
correctness by default. However, we must not prevent the user
from being able to work on arrays of code units, as many
string algorithms can be optimized by not doing full UTF
decoding. The standard library does this on many occasions,
and there are more to come.
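As a sketch of both views, assuming the Phobos range primitives behave as I describe (narrow strings decode to `dchar`) and using `std.string.representation` to get at the raw code units:

```d
import std.range : front, walkLength;
import std.string : representation;

void main()
{
    string s = "säд";

    // Range primitives on narrow strings yield code points.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 's');
    assert(s.walkLength == 3);

    // Opting out of decoding: work on the raw UTF-8 code units.
    auto units = s.representation; // immutable(ubyte)[]
    assert(units.length == 5);
    assert(units[1] == 0xC3 && units[2] == 0xA4); // 'ä' encodes as C3 A4
}
```

The second half is the kind of access an optimized algorithm would use when it can prove that decoding is unnecessary.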
This is an attempt to explain a problematic design as a wise decision.
No, it's not. Please leave crappy, unsubstantiated arguments like
this out of these forums.
[1] http://dlang.org/type
By the way, the link you provided says char is an unsigned 8-bit
type which can hold the value of a UTF-8 code unit.
Not *can*, but *does*. Otherwise it is an error in the program.
The specification, compiler implementation (as shown above) and
standard library all treat `char` as a UTF-8 code unit. Treat it
otherwise at your own peril.
UTF is irrelevant because the problem is in the D implementation.
See
http://forum.dlang.org/thread/hoopiiobddbapybbw...@forum.dlang.org
(in particular 2nd page).
The root of the issue is that D does not provide a 'utf' type
which would correctly handle strings and characters
irrespective of the format. Instead, the language pretends
that it supports such a type by allowing both literals "sad" and
"säд" to be converted to a single-byte char array. And ['s', 'ä',
'д'] is, by the way, neither char[], nor wchar[], nor even dchar[],
but a sequence of integers, which compounds the oddities in the
character types.
The only implementation problem you illustrate here is that
`['s', 'ä', 'д']` is of type `int[]`, which is a bug. It
should be `dchar[]`. The length of `char[]` works as intended.
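Anyone can ask their own compiler what it infers. A minimal sketch; I make no claim about what the `pragma(msg)` line prints, since that depends on the compiler version:

```d
void main()
{
    // Prints the inferred type of the literal at compile time.
    pragma(msg, typeof(['s', 'ä', 'д']).stringof);

    // Stating the element type explicitly avoids relying on inference.
    dchar[] d = ['s', 'ä', 'д'];
    assert(d.length == 3); // three code points

    // char[] length counts UTF-8 code units, as intended.
    char[] c = "säд".dup;
    assert(c.length == 5);
}
```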
Problems with the string type can be illustrated by a hypothetical
situation in the domain of integer types. Assume that a user wants
a 'number' type which accepts integers, floats and doubles
and treats them properly. This would require either a library
solution or a new special type in the language, supported
by both the compiler and the runtime library, which performs
operations at runtime on objects of the number type according to
their effective type.
The D designers want to support such a feature (to make the language
better), but as happens in other situations, the support is
only limited: the compiler allows one to do
alias immutable(int)[] number;
number my_number = [0, 3.14, 3.14l];
I don't understand this example. The compiler does *not* allow
that code; try it for yourself.
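Here is a way to let the compiler answer the question itself. This is a sketch under two assumptions of mine: I write the alias in the `alias name = type;` form, and I use the `L` suffix instead of `l`, since as far as I know current compilers do not accept the lowercase one:

```d
alias number = immutable(int)[];

void main()
{
    // Floating-point elements do not implicitly convert to int,
    // so the initialization from the example is rejected.
    static assert(!__traits(compiles, { number n = [0, 3.14, 3.14L]; }));

    // Only integer elements are accepted.
    number ok = [0, 3, 4];
    assert(ok.length == 3);
}
```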