On 13.10.2013 16:14, nickles wrote:
> Ok, I understand that "length" is - obviously - used in analogy to any
> array's length value.
>
> Still, this seems to be inconsistent. D elaborates on implementing
> "char"s as UTF-8, which means that a "char" in D can be of any length
> between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
> then this (i.e. the character's length) be the "unit of measurement"
> for "char"s - like e.g. the size of the underlying struct in an array
> of "struct"s? The story continues with indexing "string"s: in a
> consistent implementation, shouldn't
>
>     writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this Cyrillic letter?

This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4). However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is normalized: "ä" may be stored as the single precomposed code point U+00E4, or as an "a" followed by the combining diaeresis U+0308. If the string were in UTF-32, [2] could yield either the Cyrillic character or the combining diaeresis, and the .length of the UTF-32 string could accordingly be either 3 or 4.
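
To make the difference concrete, here is a small sketch of what the various encodings and normalization forms give for that string (assuming the precomposed U+00E4 for "ä" unless noted otherwise):

    import std.stdio;

    void main()
    {
        // Precomposed form: 'ä' is the single code point U+00E4.
        string  s8  = "säд";   // UTF-8:  's' = 1, 'ä' = 2, 'д' = 2 code units
        wstring s16 = "säд"w;  // UTF-16: one code unit per character here
        dstring s32 = "säд"d;  // UTF-32: one code unit per code point

        writeln(s8.length);    // 5 (code units, not characters)
        writeln(s16.length);   // 3
        writeln(s32.length);   // 3

        writefln("0x%02X", cast(ubyte) s8[2]);  // 0xA4, second code unit of 'ä'
        writeln(s32[2]);                        // д

        // Decomposed form: 'a' followed by the combining diaeresis U+0308.
        dstring dec = "sa\u0308д"d;
        writeln(dec.length);                    // 4
        writefln("U+%04X", cast(uint) dec[2]);  // U+0308, not 'д'
    }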

There are multiple reasons why .length and index access are based on code units rather than code points or any higher-level representation. One is complexity: to find the n-th code point, the string has to be decoded from the start, so indexing would suddenly be O(n) instead of O(1). In-place modification of char[] arrays would also no longer be possible, because replacing one code point with another can change the number of code units and therefore the size of the underlying array.
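
When code-point or grapheme level access is actually wanted, it is available explicitly, at that O(n) cost; a rough sketch using Phobos:

    import std.conv : to;
    import std.range : walkLength;
    import std.stdio;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "säд";

        // Code-unit view: O(1) length and indexing over the UTF-8 bytes.
        writeln(s.length);        // 5

        // Code-point view: O(n), the string is decoded front to back.
        writeln(s.walkLength);    // 3
        foreach (i, dchar c; s)   // i is the code-unit offset of each code point
            writefln("%s: %s", i, c);

        // Grapheme view: treats 'a' + combining diaeresis as one character.
        writeln("sa\u0308д".byGrapheme.walkLength);  // 3

        // If random access by code point is really needed, convert once to UTF-32.
        dstring d = s.to!dstring;
        writeln(d[2]);            // д
    }

Converting once to dstring buys back O(1) indexing by code point, at the cost of an up-front decode and extra memory.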
