On 13.10.2013 16:14, nickles wrote:
> Ok, I understand that "length" is - obviously - used in analogy to any
> array's length value.
>
> Still, this seems to be inconsistent. D elaborates on implementing
> "char"s as UTF-8, which means that a "char" in D can be of any length
> between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
> then this (i.e. the character's length) be the "unit of measurement"
> for "char"s - like e.g. the size of the underlying struct in an array
> of "struct"s? The story continues with indexing "string"s: in a
> consistent implementation, shouldn't
>
>     writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this Cyrillic letter?

This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4). However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is normalized: "ä" may be stored as the single precomposed code point U+00E4, or as an "a" followed by the combining diaeresis U+0308. If the string were in UTF-32, [2] could yield either the Cyrillic character or the combining diaeresis, and the .length of the UTF-32 string could accordingly be either 3 or 4.
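
To make the difference concrete, here is a small sketch of what the various encodings and normalization forms give for that string (assuming the precomposed U+00E4 for "ä" unless noted otherwise):

    import std.stdio;

    void main()
    {
        // Precomposed form: 'ä' is the single code point U+00E4.
        string  s8  = "säд";   // UTF-8:  's' = 1, 'ä' = 2, 'д' = 2 code units
        wstring s16 = "säд"w;  // UTF-16: one code unit per character here
        dstring s32 = "säд"d;  // UTF-32: one code unit per code point

        writeln(s8.length);    // 5 (code units, not characters)
        writeln(s16.length);   // 3
        writeln(s32.length);   // 3

        writefln("0x%02X", cast(ubyte) s8[2]);  // 0xA4, second code unit of 'ä'
        writeln(s32[2]);                        // д

        // Decomposed form: 'a' followed by the combining diaeresis U+0308.
        dstring dec = "sa\u0308д"d;
        writeln(dec.length);                    // 4
        writefln("U+%04X", cast(uint) dec[2]);  // U+0308, not 'д'
    }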

There are multiple reasons why .length and index access are based on code units rather than code points or any higher-level representation. One is complexity: to find the n-th code point, the string has to be decoded from the start, so indexing would suddenly be O(n) instead of O(1). In-place modification of char[] arrays would also no longer be possible, because replacing one code point with another can change the number of code units and therefore the size of the underlying array.
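
When code-point or grapheme level access is actually wanted, it is available explicitly, at that O(n) cost; a rough sketch using Phobos:

    import std.conv : to;
    import std.range : walkLength;
    import std.stdio;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "säд";

        // Code-unit view: O(1) length and indexing over the UTF-8 bytes.
        writeln(s.length);        // 5

        // Code-point view: O(n), the string is decoded front to back.
        writeln(s.walkLength);    // 3
        foreach (i, dchar c; s)   // i is the code-unit offset of each code point
            writefln("%s: %s", i, c);

        // Grapheme view: treats 'a' + combining diaeresis as one character.
        writeln("sa\u0308д".byGrapheme.walkLength);  // 3

        // If random access by code point is really needed, convert once to UTF-32.
        dstring d = s.to!dstring;
        writeln(d[2]);            // д
    }

Converting once to dstring buys back O(1) indexing by code point, at the cost of an up-front decode and extra memory.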
