Re: Inconsitency

2013-10-20 Thread Kagamin

On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:

Most code might be buggy then.


All code is buggy.

An issue that often comes up is file names. A file called "bär" 
will be normalized differently depending on the operating 
system. In both cases it is one grapheme. However, on Linux it 
is one code point, but on OS X it is two code points.


And on Windows it's case-insensitive - 2^^N variants of each 
string. So what?


Re: Inconsitency

2013-10-16 Thread monarch_dodra

On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote:
I agree with your point. Nevertheless your understanding of 
grapheme is off. U+0308 is not a grapheme. "a\u0308" is one 
grapheme. U+00e4 is the same grapheme as "a\u0308".


http://en.wikipedia.org/wiki/Grapheme


Ah. Learn something new every day. :)


Re: Inconsitency

2013-10-16 Thread Dmitry Olshansky

16-Oct-2013 23:42, qznc wrote:

On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:

On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:

On 2013-10-16 14:33, qznc wrote:


It is either [U+00E4] as one code point or [a,U+0308] for two code
points. The second is "combining diaeresis" [0]. Not required, but
possible. Those combining characters [1] provide a nearly infinite
number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character


Aha, now I see.


One of the interesting points is with "ba\u00E4r" vs "baa\u0308r",
you can run a replace to replace 'a' with 'o'. Then, you'll get:
"boär" vs "boör"

Which is the correct behavior? There is no correct answer.

So while a grapheme should never be separated from its "letter" (eg,
sorting "oäa" should *not* generate "aaö". What it *should* generate
is up for debate), you can't entirely consider that a letter+grapheme
is a single entity.

Long story short: unicode is f***ing complicated.

And I think D does a *damn* fine job of supporting it. In particular,
it does an awesome job of *teaching* the coder *what* unicode is.
Virtually everyone here has solid knowledge of unicode (I feel). They
understand, and can explain it, and can work with it.

On the other hand, I don't know many C++ coders that understand unicode.


I agree with your point. Nevertheless your understanding of grapheme is
off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the
same grapheme as "a\u0308".


s/the same/canonically equivalent/ :)



http://en.wikipedia.org/wiki/Grapheme



--
Dmitry Olshansky


Re: Inconsitency

2013-10-16 Thread qznc
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra 
wrote:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
wrote:

On 2013-10-16 14:33, qznc wrote:

It is either [U+00E4] as one code point or [a,U+0308] for two 
code
points. The second is "combining diaeresis" [0]. Not 
required, but
possible. Those combining characters [1] provide a nearly 
infinite

number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] 
http://www.fileformat.info/info/unicode/char/0308/index.htm

[1] http://en.wikipedia.org/wiki/Combining_character


Aha, now I see.


One of the interesting points is with "ba\u00E4r" vs 
"baa\u0308r", you can run a replace to replace 'a' with 'o'. 
Then, you'll get: "boär" vs "boör"


Which is the correct behavior? There is no correct answer.

So while a grapheme should never be separated from its 
"letter" (eg, sorting "oäa" should *not* generate "aaö". What 
it *should* generate is up for debate), you can't entirely 
consider that a letter+grapheme is a single entity.


Long story short: unicode is f***ing complicated.

And I think D does a *damn* fine job of supporting it. In 
particular, it does an awesome job of *teaching* the coder 
*what* unicode is. Virtually everyone here has solid knowledge 
of unicode (I feel). They understand, and can explain it, and 
can work with it.


On the other hand, I don't know many C++ coders that understand 
unicode.


I agree with your point. Nevertheless your understanding of 
grapheme is off. U+0308 is not a grapheme. "a\u0308" is one 
grapheme. U+00e4 is the same grapheme as "a\u0308".


http://en.wikipedia.org/wiki/Grapheme


Re: Inconsitency

2013-10-16 Thread monarch_dodra
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg 
wrote:

On 2013-10-16 14:33, qznc wrote:

It is either [U+00E4] as one code point or [a,U+0308] for two 
code
points. The second is "combining diaeresis" [0]. Not required, 
but
possible. Those combining characters [1] provide a nearly 
infinite

number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character


Aha, now I see.


One of the interesting points is with "ba\u00E4r" vs 
"baa\u0308r", you can run a replace to replace 'a' with 'o'. 
Then, you'll get: "boär" vs "boör"


Which is the correct behavior? There is no correct answer.

So while a grapheme should never be separated from its "letter" 
(eg, sorting "oäa" should *not* generate "aaö". What it *should* 
generate is up for debate), you can't entirely consider that a 
letter+grapheme is a single entity.


Long story short: unicode is f***ing complicated.

And I think D does a *damn* fine job of supporting it. In 
particular, it does an awesome job of *teaching* the coder *what* 
unicode is. Virtually everyone here has solid knowledge of 
unicode (I feel). They understand, and can explain it, and can 
work with it.


On the other hand, I don't know many C++ coders that understand 
unicode.
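
The divergence described above is easy to reproduce; here is a small 
sketch using std.array.replace, which matches the plain 'a' code unit 
and so never splits the two-byte 'ä', but also never keeps a combining 
mark attached to its base letter:

import std.array;
import std.stdio;

void main()
{
  string composed   = "ba\u00E4r";  // b, a, precomposed 'ä', r
  string decomposed = "baa\u0308r"; // b, a, a + combining diaeresis, r

  // Every plain 'a' is replaced; in the decomposed string the
  // diaeresis ends up riding on a freshly inserted 'o'.
  writeln(composed.replace("a", "o"));   // boär
  writeln(decomposed.replace("a", "o")); // boör
}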


Re: Inconsitency

2013-10-16 Thread Jacob Carlborg

On 2013-10-16 14:33, qznc wrote:


It is either [U+00E4] as one code point or [a,U+0308] for two code
points. The second is "combining diaeresis" [0]. Not required, but
possible. Those combining characters [1] provide a nearly infinite
number of combinations. You can go crazy with it:
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character


Aha, now I see.

--
/Jacob Carlborg


Re: Inconsitency

2013-10-16 Thread qznc
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg 
wrote:

On 2013-10-16 10:03, qznc wrote:


Most code might be buggy then.

An issue that often comes up is file names. A file called "bär" 
will be
normalized differently depending on the operating system. In 
both cases
it is one grapheme. However, on Linux it is one code point, 
but on OS X

it is two code points.


Why would it require two code points?


It is either [U+00E4] as one code point or [a,U+0308] for two 
code points. The second is "combining diaeresis" [0]. Not 
required, but possible. Those combining characters [1] provide a 
nearly infinite number of combinations. You can go crazy with it: 
http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work


[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character


Re: Inconsitency

2013-10-16 Thread Jacob Carlborg

On 2013-10-16 10:03, qznc wrote:


Most code might be buggy then.

An issue that often comes up is file names. A file called "bär" will be
normalized differently depending on the operating system. In both cases
it is one grapheme. However, on Linux it is one code point, but on OS X
it is two code points.


Why would it require two code points?

--
/Jacob Carlborg


Re: Inconsitency

2013-10-16 Thread Maxim Fomin
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra 
wrote:

On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:

On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:

On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() 
function which returns the length that I was searching for. 
However, why - if D is so UTF-8-centric - isn't this 
function implemented in the core like ".length"?


Most code doesn't need to count graphemes and lives happily 
with just strings; that's why it's not in the core.


Most code might be buggy then.

An issue that often comes up is file names. A file called 
"bär" will be normalized differently depending on the 
operating system. In both cases it is one grapheme. However, 
on Linux it is one code point, but on OS X it is two code 
points.


Now that you mention it, I had a program written in D that would 
send strings to a socket. Before I could process the 
strings on OS X, I had to normalize the decomposed OS X 
version of the strings to the composed form that D could 
handle, else it wouldn't work. I used libutf8proc for it (only 
one tiny little function). It was no problem to interface to 
the C library; however, I thought it would have been nice if 
D could've handled this on its own without depending on third 
party libraries.


I'm not sure this is a "D" issue though: It's a fact of unicode
that there are two different ways to write ä.


As I argued previously, it is an implementation issue that treats 
"bär" as a sequence of objects which are not capable of 
representing values (like int[] x = [3.14]). By the way, it is a 
rare case of a type system hole. Usually in D you need a cast or 
a union to reinterpret some value; with "bär"[X] you need not.
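
The hole is visible without any cast or union; a tiny sketch:

import std.stdio;

void main()
{
  string s = "bär";

  // No cast needed, yet the char read here is not a character:
  // it is only the first of the two bytes that encode 'ä'.
  char c = s[1];
  writeln(cast(ubyte) c); // 195 (0xC3) - not a valid code point on its own
}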


Re: Inconsitency

2013-10-16 Thread Chris
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra 
wrote:

On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:

On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:

On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() 
function which returns the length that I was searching for. 
However, why - if D is so UTF-8-centric - isn't this 
function implemented in the core like ".length"?


Most code doesn't need to count graphemes and lives happily 
with just strings; that's why it's not in the core.


Most code might be buggy then.

An issue that often comes up is file names. A file called 
"bär" will be normalized differently depending on the 
operating system. In both cases it is one grapheme. However, 
on Linux it is one code point, but on OS X it is two code 
points.


Now that you mention it, I had a program written in D that would 
send strings to a socket. Before I could process the 
strings on OS X, I had to normalize the decomposed OS X 
version of the strings to the composed form that D could 
handle, else it wouldn't work. I used libutf8proc for it (only 
one tiny little function). It was no problem to interface to 
the C library; however, I thought it would have been nice if 
D could've handled this on its own without depending on third 
party libraries.


I'm not sure this is a "D" issue though: It's a fact of unicode
that there are two different ways to write ä.


My point was that it would have been nice to have a native D function 
that can convert between the two forms, especially because this 
is a well-known issue. NSString (Cocoa / Objective-C), for example, 
has things like precomposedStringWithCompatibilityMapping etc.


Re: Inconsitency

2013-10-16 Thread monarch_dodra

On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:

On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:

On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() 
function which returns the length that I was searching for. 
However, why - if D is so UTF-8-centric - isn't this 
function implemented in the core like ".length"?


Most code doesn't need to count graphemes and lives happily 
with just strings; that's why it's not in the core.


Most code might be buggy then.

An issue that often comes up is file names. A file called "bär" 
will be normalized differently depending on the operating 
system. In both cases it is one grapheme. However, on Linux it 
is one code point, but on OS X it is two code points.


Now that you mention it, I had a program written in D that would 
send strings to a socket. Before I could process the 
strings on OS X, I had to normalize the decomposed OS X version 
of the strings to the composed form that D could handle, else 
it wouldn't work. I used libutf8proc for it (only one tiny 
little function). It was no problem to interface to the C 
library; however, I thought it would have been nice if D 
could've handled this on its own without depending on third 
party libraries.


I'm not sure this is a "D" issue though: It's a fact of unicode
that there are two different ways to write ä.


Re: Inconsitency

2013-10-16 Thread Chris

On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:

On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() 
function which returns the length that I was searching for. 
However, why - if D is so UTF-8-centric - isn't this function 
implemented in the core like ".length"?


Most code doesn't need to count graphemes and lives happily 
with just strings; that's why it's not in the core.


Most code might be buggy then.

An issue that often comes up is file names. A file called "bär" 
will be normalized differently depending on the operating 
system. In both cases it is one grapheme. However, on Linux it 
is one code point, but on OS X it is two code points.


Now that you mention it, I had a program written in D that would 
send strings to a socket. Before I could process the strings on 
OS X, I had to normalize the decomposed OS X version of the 
strings to the composed form that D could handle, else it 
wouldn't work. I used libutf8proc for it (only one tiny little 
function). It was no problem to interface to the C library; 
however, I thought it would have been nice if D could've handled 
this on its own without depending on third party libraries.


Re: Inconsitency

2013-10-16 Thread qznc

On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() function 
which returns the length that I was searching for. However, 
why - if D is so UTF-8-centric - isn't this function 
implemented in the core like ".length"?


Most code doesn't need to count graphemes and lives happily 
with just strings; that's why it's not in the core.


Most code might be buggy then.

An issue that often comes up is file names. A file called "bär" 
will be normalized differently depending on the operating system. 
In both cases it is one grapheme. However, on Linux it is one 
code point, but on OS X it is two code points.


Re: Inconsitency

2013-10-15 Thread Kagamin

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() function 
which returns the length that I was searching for. However, why 
- if D is so UTF-8-centric - isn't this function implemented in 
the core like ".length"?


Most code doesn't need to count graphemes and lives happily with 
just strings; that's why it's not in the core.


Re: Inconsitency

2013-10-15 Thread Kagamin

On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote:
If single element access is needed, str.front yields decoded 
`dchar`. Or a simple `foreach (dchar d; str)` - it won't hide the 
fact that it is an O(n) operation at least. As `str.front` yields 
dchar, most `std.algorithm` and `std.range` utilities will also 
work correctly on default UTF-8 strings.


No, he needs graphemes, so `std.algorithm` won't work correctly 
for him, as Peter has shown: a grapheme doesn't fit in a dchar.
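
For grapheme-level work there is byGrapheme from the rewritten std.uni, 
which was brand new at the time of this thread (DMD 2.064); a sketch 
under that assumption:

import std.range; // walkLength
import std.stdio;
import std.uni;   // byGrapheme - assumes DMD 2.064+

void main()
{
  string s = "noe\u0308l"; // "noël" spelled with a combining diaeresis

  writeln(s.walkLength);            // 5 code points
  writeln(s.byGrapheme.walkLength); // 4 graphemes - what a reader sees
}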


Re: Inconsitency

2013-10-14 Thread Andrei Alexandrescu

On 10/14/13 1:09 AM, nickles wrote:

It's easy to state this, but - please - don't get sarcastic!


Thanks for making this point.

String handling in D follows two simple principles:

1. The support is a slice of code units (which often are immutable, 
seeing as string is an alias for immutable(char)[]). Slice primitives 
are readily accessible.


2. The standard library (and the foreach language construct) recognize 
that arrays of code units are special and define bidirectional range 
primitives on top of them. These are empty, save, front, back, popFront, 
and popBack.


So for a string you may use the range primitives and related algorithms 
to manipulate code points, or the slice primitives to manipulate code units.


This duality has been discussed in the past, and alternatives have been 
proposed (mainly gravitating around making one of the aspects explicit 
rather than implicit). It is my opinion that a better solution exists 
(in the form of making representation accessible only through a property 
.rep). But the current design has "won" not only because it's the 
existing one, but also because it has good simplicity and flexibility 
advantages. At this point there is no question about changing the 
semantics of existing constructs.



Andrei
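
A minimal sketch of the duality just described - slice primitives 
seeing code units, range primitives seeing decoded code points:

import std.range; // front/popFront for narrow strings
import std.stdio;

void main()
{
  string s = "säд";

  // Slice primitives work on code units (bytes, for UTF-8).
  writeln(s.length);         // 5
  writeln(cast(ubyte) s[0]); // 115, the single byte encoding 's'

  // Range primitives decode code points.
  writeln(s.front); // 's' as a dchar
  s.popFront();
  writeln(s.front); // ä
}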


Re: Inconsitency

2013-10-14 Thread Chris

On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote:

On 13.10.2013 15:25, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length 
of a string?

Do you always use count(), or is there an alternative?




The thing is that even count(), which gives you the number of 
*code points*, isn't necessarily what is desired - that is, the 
number of actual display characters. UTF is quite a complex 
beast and doing any operations on it _correctly_ generally 
requires a lot of care. If you need to do these kinds of 
operations, I would highly recommend reading up on the basics of 
UTF and Unicode first (there is a quick overview on Wikipedia).


arr.length is meant to be used in conjunction with array 
indexing and slicing (arr[...]) and its value is consistent for 
all string and array types for this purpose.


I recently discovered a bug in my program. If you take the letter 
"é" for example (Linux, Ubuntu 12.04), std.utf.count() returns 1 
and .length returns 2. I needed the length to slice the string at 
a given point. Using .length instead of std.utf.count() fixed the 
bug.
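
When a slice boundary has to land on a character rather than a byte, 
std.utf also provides toUTFindex, which converts a code-point count 
into the code-unit index that slicing expects. A minimal sketch (the 
counts assume a precomposed 'é'):

import std.stdio;
import std.utf; // count, toUTFindex

void main()
{
  string s = "\u00E9-test"; // "é-test": 'é' is 1 code point, 2 UTF-8 code units

  writeln(count(s)); // 6 code points
  writeln(s.length); // 7 code units

  size_t i = toUTFindex(s, 1); // code-unit index of the second code point
  writeln(s[i .. $]);          // -test
}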


Re: Inconsitency

2013-10-14 Thread nickles

It's easy to state this, but - please - don't get sarcastic!

I'm obviously (as I've learned) speaking about UTF-8 "char"s as 
they are NOT implemented right now in D; so I'm criticizing that 
D, as a language which emphasizes "UTF-8 characters", isn't 
taking "the last step", like e.g. Python does (and no, I'm not a 
Python fan, nor do I consider D a bad language).


Re: Inconsitency

2013-10-13 Thread deadalnix

On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote:

I've found another inconsistency problem.

void foo(const char *);
void foo(const wchar *);
void foo(const dchar *);

void main() {
    foo(`123`);
    foo(`123`w);
    foo(`123`d);
}

Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(wchar)[])
Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(dchar)[])


And typeof(`123`).stringof == `string`. Why can `123` be stored 
as a null-terminated UTF-8 string in the rdata segment while 
`123`w and `123`d cannot? For example, wide strings (UTF-16) are 
usable with the Windows *W functions.


The first one is made to interface with C. It is a special case.


Re: Inconsitency

2013-10-13 Thread Andrej Mitrovic
On 10/14/13, Temtaime wrote:
> And typeof(`123`).stringof == `string`. Why can `123` be stored
> as a null-terminated UTF-8 string in the rdata segment while
> `123`w and `123`d cannot? For example, wide strings (UTF-16) are
> usable with the Windows *W functions.
>

http://d.puremagic.com/issues/show_bug.cgi?id=6032


Re: Inconsitency

2013-10-13 Thread Temtaime

I've found another inconsistency problem.

void foo(const char *);
void foo(const wchar *);
void foo(const dchar *);

void main() {
    foo(`123`);
    foo(`123`w);
    foo(`123`d);
}

Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(wchar)[])
Error: function hello.foo (const(char*)) is not callable using 
argument types (immutable(dchar)[])


And typeof(`123`).stringof == `string`. Why can `123` be stored 
as a null-terminated UTF-8 string in the rdata segment while 
`123`w and `123`d cannot? For example, wide strings (UTF-16) are 
usable with the Windows *W functions.


Re: Inconsitency

2013-10-13 Thread deadalnix

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Ok, I understand that "length" is - obviously - used in 
analogy to any array's length value.




That isn't an analogy. It is usually a good idea to try to 
understand things before judging whether they are consistent.


Re: Inconsitency

2013-10-13 Thread monarch_dodra

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Ok, I understand that "length" is - obviously - used in 
analogy to any array's length value.


Still, this seems to be inconsistent. D elaborates on 
implementing "char"s as UTF-8 which means that a "char" in D 
can be of any length between 1 and 4 bytes for an arbitrary 
Unicode code point. Shouldn't then this (i.e. the character's 
length) be the "unit of measurement" for "char"s - like e.g. 
the size of the underlying struct in an array of "struct"s? The 
story continues with indexing "string"s: In a consistent 
implementation, shouldn't


   writeln("säд"[2])

return "д" instead of the trailing surrogate of this Cyrillic 
letter?


I think the root misunderstanding is that you think that a string 
is random access.


A string *isn't* random access. It is implemented *inside* an 
array, but unless you know *exactly* what you are doing, you 
shouldn't index, slice or take the length of a string.


A string should be handled like a bidirectional range.

Once you've understood that, it becomes much simpler.
You want the first character? front.
You want to skip the first character? popFront.

You want an arbitrary character in O(N) time?
myString.dropExactly(N).front;
You want an arbitrary character in O(1) time?
You can't.
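
As a runnable version of that access pattern (using std.range's 
dropExactly, which decodes and discards N code points in O(N)):

import std.range; // dropExactly, front
import std.stdio;

void main()
{
  string s = "säд";

  writeln(s.front);                // s
  writeln(s.dropExactly(1).front); // ä - reached by decoding from the start
  writeln(s.dropExactly(2).front); // д
}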


Re: Inconsitency

2013-10-13 Thread Peter Alexander

On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
However, it could also yield the first code unit of the umlaut 
diacritic, depending on how the string is represented.


This is not true for UTF-8, which is not subject to "endianism".


You are correct in that UTF-8 is endian agnostic, but I don't
believe that was Sönke's point. The point is that ä can be
produced in Unicode in more than one way. This program
illustrates:

import std.stdio;
void main()
{
  string a = "ä";
  string b = "a\u0308";
  writeln(a);
  writeln(b);
  writeln(cast(ubyte[])a);
  writeln(cast(ubyte[])b);
}

This prints:

ä
ä
[195, 164]
[97, 204, 136]

Notice that they are both the same "character" but have different
representations. The first is just the 'ä' code point, which
consists of two code units, the second is the 'a' code point
followed by a Combining Diaeresis code point.

In short, the string "ä" could be either 2 or 3 code units, and
either 1 or 2 code points.


Re: Inconsitency

2013-10-13 Thread anonymous

On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
However, it could also yield the first code unit of the umlaut 
diacritic, depending on how the string is represented.


This is not true for UTF-8, which is not subject to "endianism".


This is not about endianness. It's "\u00E4" vs "a\u0308". The 
first is the single code point 'ä', the second is two code 
points, 'a' plus umlaut dots.


[...]
Well that's a point; on the other hand, D is constantly 
creating and throwing away new strings, so this isn't quite an 
argument. The current solution puts the programmer in charge of 
dealing with UTF-x, where a more consistent implementation 
would put the burden on the implementors of the libraries/core, 
i.e. the ones who usually have a better understanding of 
Unicode than the average programmer.


Also, implementing such a semantics would not per se abandon a 
byte-wise access, would it?


So, how do you guys handle UTF-8 strings in D? What are your 
solutions to the problems described? Does it all come down to 
converting "string"s and "wstring"s to "dstring"s, manipulating 
them, and re-convert them to "string"s? Btw, what would this 
mean in terms of speed?


There is no irony in my questions. I'm really looking for 
solutions...


I think std.uni and std.utf are supposed to supply everything 
Unicode.


Re: Inconsitency

2013-10-13 Thread Maxim Fomin

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Ok, I understand that "length" is - obviously - used in 
analogy to any array's length value.


Still, this seems to be inconsistent. D elaborates on 
implementing "char"s as UTF-8 which means that a "char" in D 
can be of any length between 1 and 4 bytes for an arbitrary 
Unicode code point. Shouldn't then this (i.e. the character's 
length) be the "unit of measurement" for "char"s - like e.g. 
the size of the underlying struct in an array of "struct"s? The 
story continues with indexing "string"s: In a consistent 
implementation, shouldn't


   writeln("säд"[2])

return "д" instead of the trailing surrogate of this Cyrillic 
letter?


This is impossible given the current design. At runtime "säд" is 
viewed as struct { void *ptr; size_t length; }; ptr points to 
memory holding at least five bytes, and length has the value 5. 
Druntime hasn't taken a UTF course.


One option would be to add support in druntime so it can 
correctly handle such strings, or to implement a separate string 
type which does not default to char[]; but of course the easiest 
way is to convince everybody that everything is OK and to advise 
using some library function which does the job correctly, 
essentially implying that the language does the job wrong (pardon 
me, some D skepticism: the deeper I am in it, the more critically 
I view it).


Re: Inconsitency

2013-10-13 Thread Dicebot

On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
Well that's a point; on the other hand, D is constantly 
creating and throwing away new strings, so this isn't quite an 
argument. The current solution puts the programmer in charge of 
dealing with UTF-x, where a more consistent implementation 
would put the burden on the implementors of the libraries/core, 
i.e. the ones who usually have a better understanding of 
Unicode than the average programmer.


Ironically, the reason is consistency. `string` is just 
`immutable(char)[]` and it conforms to the usual array behavior 
rules. Saying that array element assignment may allocate is 
hardly a good option.


So, how do you guys handle UTF-8 strings in D? What are your 
solutions to the problems described? Does it all come down to 
converting "string"s and "wstring"s to "dstring"s, manipulating 
them, and re-convert them to "string"s? Btw, what would this 
mean in terms of speed?


If single element access is needed, str.front yields decoded 
`dchar`. Or a simple `foreach (dchar d; str)` - it won't hide the 
fact that it is an O(n) operation at least. As `str.front` yields dchar, 
most `std.algorithm` and `std.range` utilities will also work 
correctly on default UTF-8 strings.


Slicing / .length are probably the only operations that do not 
respect UTF-8 encoding (because they are exactly the same for all 
arrays).
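
The foreach form mentioned above, as a sketch - the declared element 
type selects between code units and decoded code points:

import std.stdio;

void main()
{
  string s = "säд";

  foreach (char c; s)  // 5 iterations, one per UTF-8 code unit
    write(cast(ubyte) c, ' ');
  writeln();

  foreach (dchar d; s) // 3 iterations, decoding on the fly
    write(d, ' ');
  writeln(); // s ä д
}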


Re: Inconsitency

2013-10-13 Thread Maxim Fomin

On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
This is simply wrong. All strings return the number of code 
units. And it's only UTF-32 where a code point (~ character) 
happens to fit into one code unit.


I do not agree:

   writeln("säд".length);         => 5   bytes: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")); => 3   bytes: 5 (ibidem)
   writeln("säд"w.length);        => 3   bytes: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);        => 3   bytes: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.


This is not the only inconsistency here.

First of all, typeof("säд") yields the string type 
(immutable(char)[]) while typeof(['s', 'ä', 'д']) yields neither 
char[], nor wchar[], nor even dchar[], but int[]. In this case D 
is close to C, which also treats character literals as an integer 
type. Secondly, character arrays are the only ones that have two 
kinds of array literals - the usual [item, item, item] and the 
special "blah" - and as you can see there is no correspondence 
between them.


If you try char[] x = cast(char[])['s', 'ä', 'д'] then the 
length would indeed be 3 (but don't use it - it is broken).


In D a dynamic array is at the binary level represented as 
struct { void *ptr; size_t length; }. When you perform some 
operations on dynamic arrays they are implemented by the compiler 
as calls to runtime functions. However, during runtime it is 
impossible to do something useful with arrays for which there is 
only information about the address of the beginning and the total 
number of elements (this is a source of other problems in D). To 
handle this, the compiler generates and passes a "TypeInfo" as a 
separate argument to runtime functions. TypeInfo contains some 
data; most relevant here is the size of the element.


What happens is as follows. The compiler recognizes that "säд" 
should be a string literal encoded as UTF-8 
(http://dlang.org/lex.html#DoubleQuotedString), so the element 
type should be char, which requires having 5 elements in the 
array. So, at runtime the object "säд" is treated as an array of 
5 elements, each 1 byte in size.


Basically string (and char[]) plays a dual role in the language: 
on the one hand, it is an array of elements having strictly 
1-byte size by definition; on the other hand, D tries to use it 
as a 'generic' UTF type for which the size is not fixed. So there 
is a contradiction: in source code such strings are viewed by the 
programmer as some abstract UTF string, but druntime views them 
as a 5-byte array. In my view, trouble begins when "säд" is 
internally cast to char (which is no better than int[] x = [3.14, 
5.6]). And indeed, char[] x = ['s', 'ä', 'д'] is refused by the 
language, so there is a great inconsistency here.


By the way, the UTF definition is irrelevant here; this is a pure 
implementation issue (I think it is a design fault).
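
A sketch of what druntime actually sees, per the description above - 
just a pointer and a count of 1-byte elements (the byte values assume 
the lowercase Cyrillic 'д', U+0434):

import std.stdio;

void main()
{
  string s = "säд";

  // The runtime view is essentially a pointer plus a length.
  writeln(s.length);               // 5 - elements, i.e. UTF-8 code units
  writeln(cast(const(ubyte)[]) s); // [115, 195, 164, 208, 180]

  // Indexing picks out single bytes, so s[1] is only half of 'ä'.
  writeln(cast(ubyte) s[1]); // 195 (0xC3)
}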


Re: Inconsitency

2013-10-13 Thread nickles
This will _not_ return a trailing surrogate of a Cyrillic 
letter. It will return the second code unit of the "ä" 
character (U+00E4).


True. It's UTF-8, not UTF-16.

However, it could also yield the first code unit of the umlaut 
diacritic, depending on how the string is represented.


This is not true for UTF-8, which is not subject to "endianism".

If the string were in UTF-32, [2] could yield either the 
Cyrillic character, or the umlaut diacritic.

The .length of the UTF-32 string could be either 3 or 4.


Neither is true for UTF-32. There is no interpretation (except 
for the "endianism", which could be taken care of in a 
library/the core) for the code point.


There are multiple reasons why .length and index access is 
based on code units rather than code points or any higher level 
representation, but one is that the complexity would suddenly 
be O(n) instead of O(1).


see my last statement below

In-place modifications of char[] arrays also wouldn't be 
possible anymore


They would be, but


as the size of the underlying array might have to change.


Well that's a point; on the other hand, D is constantly creating 
and throwing away new strings, so this isn't quite an argument. 
The current solution puts the programmer in charge of dealing 
with UTF-x, where a more consistent implementation would put the 
burden on the implementors of the libraries/core, i.e. the ones 
who usually have a better understanding of Unicode than the 
average programmer.


Also, implementing such a semantics would not per se abandon a 
byte-wise access, would it?


So, how do you guys handle UTF-8 strings in D? What are your 
solutions to the problems described? Does it all come down to 
converting "string"s and "wstring"s to "dstring"s, manipulating 
them, and re-convert them to "string"s? Btw, what would this mean 
in terms of speed?


There is no irony in my questions. I'm really looking for 
solutions...


Re: Inconsitency

2013-10-13 Thread Sönke Ludwig

On 13.10.2013 15:50, Dmitry Olshansky wrote:

13-Oct-2013 17:25, nickles wrote:

Ok, if my understanding is wrong, how do YOU measure the length of a
string?
Do you always use count(), or is there an alternative?



It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/

I measure string length in code units (as defined in the above
standard). This bears no easy relation to the number of visible
characters but I don't mind it.

Measuring the number of visible characters isn't trivial but can be done 
by counting the number of graphemes. For simple alphabets counting code 
points will do the trick as well (which is what count does).



But you have to take care to normalize the string WRT diacritics if the 
estimate is supposed to work. OS X for example (if I remember right) 
always uses explicit combining characters, while Windows uses 
precomposed characters if possible.


Re: Inconsitency

2013-10-13 Thread Sönke Ludwig

On 13.10.2013 16:14, nickles wrote:

Ok, I understand that "length" is - obviously - used in analogy to any
array's length value.

Still, this seems to be inconsistent. D elaborates on implementing
"char"s as UTF-8 which means that a "char" in D can be of any length
between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
then this (i.e. the character's length) be the "unit of measurement" for
"char"s - like e.g. the size of the underlying struct in an array of
"struct"s? The story continues with indexing "string"s: In a consistent
implementation, shouldn't

writeln("säд"[2])

return "д" instead of the trailing surrogate of this Cyrillic letter?


This will _not_ return a trailing surrogate of a Cyrillic letter. It 
will return the second code unit of the "ä" character (U+00E4). However, 
it could also yield the first code unit of the umlaut diacritic, 
depending on how the string is represented. If the string were in 
UTF-32, [2] could yield either the Cyrillic character, or the umlaut 
diacritic. The .length of the UTF-32 string could be either 3 or 4.


There are multiple reasons why .length and index access is based on code 
units rather than code points or any higher level representation, but 
one is that the complexity would suddenly be O(n) instead of O(1). 
In-place modifications of char[] arrays also wouldn't be possible 
anymore as the size of the underlying array might have to change.


Re: Inconsitency

2013-10-13 Thread Michael

implementation, shouldn't

   writeln("säд"[2])

return "д" instead of the trailing surrogate of this Cyrillic 
letter?


First index is zero, no?


Re: Inconsitency

2013-10-13 Thread nickles
Ok, I understand that "length" is - obviously - used in analogy 
to any array's length value.


Still, this seems to be inconsistent. D elaborates on 
implementing "char"s as UTF-8 which means that a "char" in D can 
be of any length between 1 and 4 bytes for an arbitrary Unicode 
code point. Shouldn't then this (i.e. the character's length) be 
the "unit of measurement" for "char"s - like e.g. the size of the 
underlying struct in an array of "struct"s? The story continues 
with indexing "string"s: In a consistent implementation, shouldn't


   writeln("säд"[2])

return "д" instead of the trailing surrogate of this Cyrillic 
letter?
Btw. how do YOU implement this for "string" (for "dstring" it 
works - logically, for "wstring" the same problem arises for code 
points above D800)?


Also, I understand that there is the std.utf.count() function 
which returns the length that I was searching for. However, why - 
if D is so UTF-8-centric - isn't this function implemented in the 
core like ".length"?




Re: Inconsitency

2013-10-13 Thread Dmitry Olshansky

13-Oct-2013 17:25, nickles wrote:

Ok, if my understanding is wrong, how do YOU measure the length of a string?
Do you always use count(), or is there an alternative?



It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/

I measure string length in code units (as defined in the above 
standard). This bears no easy relation to the number of visible 
characters but I don't mind it.


Measuring the number of visible characters isn't trivial but can be done by 
counting the number of graphemes. For simple alphabets counting code points 
will do the trick as well (which is what count does).


--
Dmitry Olshansky


Re: Inconsitency

2013-10-13 Thread Sönke Ludwig

On 13.10.2013 15:25, nickles wrote:

Ok, if my understanding is wrong, how do YOU measure the length of a string?
Do you always use count(), or is there an alternative?




The thing is that even count(), which gives you the number of *code 
points*, isn't necessarily what is desired - that is, the number of 
actual display characters. UTF is quite a complex beast and doing any 
operations on it _correctly_ generally requires a lot of care. If you 
need to do these kinds of operations, I would highly recommend reading 
up on the basics of UTF and Unicode first (there is a quick overview on 
Wikipedia).


arr.length is meant to be used in conjunction with array indexing and 
slicing (arr[...]) and its value is consistent for all string and array 
types for this purpose.


Re: Inconsitency

2013-10-13 Thread David Nadlinger

On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length 
of a string?


Depends on how you define the "length" of a string. Doing that is 
surprisingly difficult once the full variety of Unicode code 
points comes into play, even if you ignore the question of 
encoding (UTF-8, UTF-16, …).


David


Re: Inconsitency

2013-10-13 Thread nickles
Ok, if my understanding is wrong, how do YOU measure the length of 
a string?

Do you always use count(), or is there an alternative?




Re: Inconsitency

2013-10-13 Thread Dicebot

On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:

I do not agree:

   writeln("säд".length);         => 5   bytes: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")); => 3   bytes: 5 (ibidem)
   writeln("säд"w.length);        => 3   bytes: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);        => 3   bytes: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.


Because you have a wrong understanding of what "length" means.


Re: Inconsitency

2013-10-13 Thread nickles
This is simply wrong. All strings return the number of code 
units. And it's only UTF-32 where a code point (~ character) 
happens to fit into one code unit.


I do not agree:

   writeln("säд".length);         => 5   bytes: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")); => 3   bytes: 5 (ibidem)
   writeln("säд"w.length);        => 3   bytes: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);        => 3   bytes: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.


Re: Inconsitency

2013-10-13 Thread ilya-stromberg

On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:

Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have string.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count())?


Technically, UTF-16 can contain 2 ushorts for 1 character, so 
wstring.length returns the number of ushorts, not the number of 
UTF-16 characters.


Re: Inconsitency

2013-10-13 Thread Dmitry Olshansky

13-Oct-2013 16:36, nickles wrote:

Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?



???
This is simply wrong. All strings return the number of code units. And it's 
only UTF-32 where a code point (~ character) happens to fit into one code unit.



Wouldn't it be more consistent to have string.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count())?


It's consistent as is.

--
Dmitry Olshansky


Re: Inconsitency

2013-10-13 Thread Dicebot

On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:

Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have string.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count())?


Because `length` must be an O(1) operation for built-in arrays, 
and for UTF-8 strings that would require storing an additional 
length field, making them binary-incompatible with other array 
types.


Inconsitency

2013-10-13 Thread nickles

Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have string.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count())?
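
The question, condensed into a runnable sketch:

import std.stdio;
import std.utf;

void main()
{
  writeln("säд".length);         // 5 - UTF-8 code units (bytes)
  writeln("säд"w.length);        // 3 - UTF-16 code units
  writeln("säд"d.length);        // 3 - UTF-32 code units
  writeln(std.utf.count("säд")); // 3 - code points, the count asked about
}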