Re: Inconsistency

2013-10-14 Thread nickles

It's easy to state this, but - please - don't get sarcastic!

I'm obviously (as I've learned) speaking about UTF-8 "char"s as 
they are NOT implemented right now in D; so I'm criticizing that 
D, as a language which emphasizes "UTF-8 characters", isn't 
taking "the last step", as e.g. Python does (and no, I'm not a 
Python fan, nor do I consider D a bad language).


Re: Inconsistency

2013-10-13 Thread nickles
This will _not_ return a trailing surrogate of a Cyrillic 
letter. It will return the second code unit of the "ä" 
character (U+00E4).


True. It's UTF-8, not UTF-16.

However, it could also yield the first code unit of the umlaut 
diacritic, depending on how the string is represented.


This is not true for UTF-8, which is not subject to "endianism".

If the string were in UTF-32, [2] could yield either the 
Cyrillic character, or the umlaut diacritic.

The .length of the UTF-32 string could be either 3 or 4.


Neither is true for UTF-32. There is no interpretation (except 
for the "endianness", which could be taken care of in a 
library/the core) of the code point.


There are multiple reasons why .length and index access is 
based on code units rather than code points or any higher level 
representation, but one is that the complexity would suddenly 
be O(n) instead of O(1).


see my last statement below

In-place modifications of char[] arrays also wouldn't be 
possible anymore


They would be, but


as the size of the underlying array might have to change.


Well, that's a point; on the other hand, D is constantly creating 
and throwing away new strings, so this isn't much of an argument. 
The current solution puts the programmer in charge of dealing 
with UTF-x, where a more consistent implementation would put the 
burden on the implementors of the libraries/core, i.e. the ones 
who usually have a better understanding of Unicode than the 
average programmer.


Also, implementing such a semantics would not per se abandon a 
byte-wise access, would it?


So, how do you guys handle UTF-8 strings in D? What are your 
solutions to the problems described? Does it all come down to 
converting "string"s and "wstring"s to "dstring"s, manipulating 
them, and converting them back to "string"s? Btw, what would this 
mean in terms of speed?


There is no irony in my questions. I'm really looking for 
solutions...
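For what it's worth, one common answer can be sketched like this (a sketch, not a prescription; it uses only `std.conv.to` and plain `foreach`): iterate with `dchar` when a single pass over the characters is enough, and convert to `dstring` once when random access by code point is needed. That also answers the speed question in part: the conversion costs one O(n) decoding pass plus an allocation each way.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string s = "säд";        // 5 UTF-8 code units, 3 code points

    // A plain pass: foreach with dchar decodes UTF-8 on the fly,
    // with no intermediate dstring.
    foreach (dchar c; s)
        writeln(c);

    // Random access by code point: convert once, then index in O(1).
    // This is the "convert, manipulate, convert back" pattern; it
    // costs one O(n) decode pass and one allocation per conversion.
    dstring d = s.to!dstring;
    assert(d.length == 3);
    assert(d[2] == 'д');

    string back = d.to!string; // and back to UTF-8
    assert(back == s);
}
```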


Re: Inconsistency

2013-10-13 Thread nickles
Ok, I understand that "length" is - obviously - used in analogy 
to any array's length value.


Still, this seems to be inconsistent. D elaborates on 
implementing "char"s as UTF-8, which means that a character in D 
can occupy anywhere between 1 and 4 bytes for an arbitrary 
Unicode code point. Shouldn't then this (i.e. the character's 
length) be the "unit of measurement" for "char"s - like e.g. the 
size of the underlying struct in an array of "struct"s? The story 
continues with indexing "string"s: In a consistent 
implementation, shouldn't


   writeln("säд"[2])

return "д" instead of the trailing surrogate of this Cyrillic 
letter?
Btw. how do YOU implement this for "string" (for "dstring" it 
works - logically; for "wstring" the same problem arises for code 
points above U+FFFF, which need surrogate pairs)?
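As a sketch of one answer (assuming `std.utf.decode`, which returns the code point at a given index and advances that index past its code units), the "д" can be reached by code point without converting the whole string:

```d
import std.utf : decode;

void main()
{
    string s = "säд";
    size_t i = 0;
    dchar c;

    // decode() yields the code point starting at i and advances i,
    // so three calls step over s, ä and д despite their differing
    // byte widths (1, 2 and 2 code units).
    foreach (_; 0 .. 3)
        c = decode(s, i);

    assert(c == 'д');   // the letter itself, not a trailing code unit
    assert(i == 5);     // all 5 UTF-8 code units consumed
}
```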


Also, I understand, that there is the std.utf.count() function 
which returns the length that I was searching for. However, why - 
if D is so UTF-8-centric - isn't this function implemented in the 
core like ".length"?




Re: Inconsistency

2013-10-13 Thread nickles
Ok, if my understanding is wrong, how do YOU measure the length 
of a string?

Do you always use count(), or is there an alternative?
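One alternative worth mentioning (a sketch, not the only idiom): since D strings iterate as ranges of `dchar`, `std.range.walkLength` gives the code-point count for all three encodings. Like `std.utf.count`, it is O(n) for "string" and "wstring", because it has to decode as it walks.

```d
import std.range : walkLength;
import std.utf : count;

void main()
{
    // walkLength counts range elements; strings iterate by dchar,
    // so every encoding reports the same 3 code points.
    assert("säд".walkLength == 3);    // UTF-8,  5 code units
    assert("säд"w.walkLength == 3);   // UTF-16, 3 code units
    assert("säд"d.walkLength == 3);   // UTF-32, 3 code units
    assert(count("säд") == 3);        // same answer as std.utf.count
}
```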




Re: Inconsistency

2013-10-13 Thread nickles
This is simply wrong. All strings return the number of code 
units. And it's only UTF-32 where a code point (~ character) 
happens to fit into one code unit.


I do not agree:

   writeln("säд".length);          => 5   bytes: 5 (1 + 2 [C3 A4] + 2 [D0 94], UTF-8)
   writeln(std.utf.count("säд"));  => 3   bytes: 5 (ibidem)
   writeln("säд"w.length);         => 3   bytes: 6 (3 x 2, UTF-16)
   writeln("säд"d.length);         => 3   bytes: 12 (3 x 4, UTF-32)

This is not consistent - from my point of view.


Inconsistency

2013-10-13 Thread nickles

Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have string.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count())?