Re: Inconsitency

2013-10-20 Thread Kagamin
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote: Most code might be buggy then. All code is buggy. An issue that often comes up is file names. A file called bär will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it

Re: Inconsitency

2013-10-16 Thread qznc
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote: On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function

Re: Inconsitency

2013-10-16 Thread Chris
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote: On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote: On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for.

Re: Inconsitency

2013-10-16 Thread monarch_dodra
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote: On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote: On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote: On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Also, I understand, that there is the std.utf.count()

Re: Inconsitency

2013-10-16 Thread Chris
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra wrote: On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote: On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote: On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote: On Sunday, 13 October 2013 at 14:14:14 UTC,

Re: Inconsitency

2013-10-16 Thread Maxim Fomin
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra wrote: On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote: On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote: On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote: On Sunday, 13 October 2013 at 14:14:14 UTC,

Re: Inconsitency

2013-10-16 Thread Jacob Carlborg
On 2013-10-16 10:03, qznc wrote: Most code might be buggy then. An issue that often comes up is file names. A file called bär will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code

Re: Inconsitency

2013-10-16 Thread qznc
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg wrote: On 2013-10-16 10:03, qznc wrote: Most code might be buggy then. An issue that often comes up is file names. A file called bär will be normalized differently depending on the operating system. In both cases it is one

Re: Inconsitency

2013-10-16 Thread Jacob Carlborg
On 2013-10-16 14:33, qznc wrote: It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is combining diaeresis [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it:
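The two representations of ä described above can be compared directly. The following is a minimal sketch, assuming a Phobos version that ships std.uni with byGrapheme and normalize; the file name "bär" is taken from the thread, the variable names are illustrative only.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC, NFD;
import std.utf : count;

void main()
{
    // Escapes keep the example independent of the source file's encoding.
    string composed   = "b\u00E4r";   // ä as the single code point U+00E4
    string decomposed = "ba\u0308r";  // a followed by combining diaeresis U+0308

    // Different code unit sequences, so a plain comparison fails.
    assert(composed != decomposed);

    // Code point counts differ ...
    assert(count(composed) == 3);
    assert(count(decomposed) == 4);

    // ... but both spell the same three graphemes.
    assert(composed.byGrapheme.walkLength == 3);
    assert(decomposed.byGrapheme.walkLength == 3);

    // Normalizing to a common form makes them comparable.
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
}
```

Normalizing both sides before comparing is one way to make file names from different operating systems match; whether NFC or NFD is the better target depends on the application.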

Re: Inconsitency

2013-10-16 Thread monarch_dodra
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote: On 2013-10-16 14:33, qznc wrote: It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is combining diaeresis [0]. Not required, but possible. Those combining characters [1] provide a nearly

Re: Inconsitency

2013-10-16 Thread qznc
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote: On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote: On 2013-10-16 14:33, qznc wrote: It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is combining diaeresis [0]. Not

Re: Inconsitency

2013-10-16 Thread Dmitry Olshansky
16-Oct-2013 23:42, qznc wrote: On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote: On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote: On 2013-10-16 14:33, qznc wrote: It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is

Re: Inconsitency

2013-10-16 Thread monarch_dodra
On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote: I agree with your point. Nevertheless your understanding of grapheme is off. U+0308 is not a grapheme. a\u0308 is one grapheme. U+00e4 is the same grapheme as a\u0308. http://en.wikipedia.org/wiki/Grapheme Ah. Learn something new

Re: Inconsitency

2013-10-15 Thread Kagamin
On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote: If single element access is needed, str.front yields decoded `dchar`. Or simple `foreach (dchar d; str)` - it won't hide the fact it is O(n) operation at least. As `str.front` yields dchar, most `std.algorithm` and `std.range`
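The difference between range-style access and array-style access mentioned above can be sketched as follows. This assumes Phobos's default auto-decoding behaviour for narrow strings; the sample string is the one used elsewhere in the thread.

```d
import std.range : front, walkLength;
import std.stdio : writeln;

void main()
{
    string s = "s\u00E4\u0434"; // s, ä, д

    // As a range, a string yields decoded code points.
    static assert(is(typeof(s.front) == dchar));

    // walkLength iterates the range, so it counts code points in O(n).
    writeln(s.walkLength); // 3

    // foreach with dchar decodes on the fly.
    size_t points;
    foreach (dchar c; s) ++points;
    writeln(points); // 3

    // foreach with char visits the raw UTF-8 code units instead.
    size_t units;
    foreach (char c; s) ++units;
    writeln(units); // 5
}
```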

Re: Inconsitency

2013-10-15 Thread Kagamin
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like .length? Most code doesn't

Re: Inconsitency

2013-10-14 Thread nickles
It's easy to state this, but - please - don't get sarcastic! I'm obviously (as I've learned) speaking about UTF-8 chars as they are NOT implemented right now in D; so I'm criticizing that D, as a language which emphasizes UTF-8 characters, isn't taking the last step, like e.g. Python

Re: Inconsitency

2013-10-14 Thread Chris
On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote: On 13.10.2013 at 15:25, nickles wrote: Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative? The thing is that even count(), which gives you the

Re: Inconsitency

2013-10-14 Thread Andrei Alexandrescu
On 10/14/13 1:09 AM, nickles wrote: It's easy to state this, but - please - don't get sarcastic! Thanks for making this point. String handling in D follows two simple principles: 1. The support is a slice of code units (which often are immutable, seeing as string is an alias for
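The "slice of code units" principle can be checked at compile time. This is a small sketch based on the language's documented string aliases, not on the truncated remainder of the post above.

```d
// The three string types are plain slices of immutable code units.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

// Each code unit type has a fixed size; a char is never 2 to 4 bytes wide.
static assert(char.sizeof  == 1);
static assert(wchar.sizeof == 2);
static assert(dchar.sizeof == 4);

void main()
{
    string s = "s\u00E4\u0434"; // UTF-8 bytes: 73 | C3 A4 | D0 B4

    // .length is the ordinary array length, i.e. the number of code units.
    assert(s.length == 5);

    // Slicing works on code unit indices and takes constant time.
    string tail = s[3 .. $];
    assert(tail == "\u0434");
}
```

Because the length and slice operations are the ordinary array ones, they stay O(1); anything that counts code points or graphemes has to walk the string.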

Inconsitency

2013-10-13 Thread nickles
Why does string.length return the number of bytes and not the number of UTF-8 characters, whereas wstring.length and dstring.length return the number of UTF-16 and UTF-32 characters? Wouldn't it be more consistent to have string.length return the number of UTF-8 characters as well (instead of

Re: Inconsitency

2013-10-13 Thread Dicebot
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote: Why does string.length return the number of bytes and not the number of UTF-8 characters, whereas wstring.length and dstring.length return the number of UTF-16 and UTF-32 characters? Wouldn't it be more consistent to have string.length

Re: Inconsitency

2013-10-13 Thread Dmitry Olshansky
13-Oct-2013 16:36, nickles wrote: Why does string.length return the number of bytes and not the number of UTF-8 characters, whereas wstring.length and dstring.length return the number of UTF-16 and UTF-32 characters? ??? This is simply wrong. All strings return the number of code units. And it's

Re: Inconsitency

2013-10-13 Thread ilya-stromberg
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote: Why does string.length return the number of bytes and not the number of UTF-8 characters, whereas wstring.length and dstring.length return the number of UTF-16 and UTF-32 characters? Wouldn't it be more consistent to have string.length

Re: Inconsitency

2013-10-13 Thread nickles
This is simply wrong. All strings return the number of code units. And it's only UTF-32 where a code point (~ character) happens to fit into one code unit. I do not agree: writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8) writeln(std.utf.count("säд")) => 3 chars: 5
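The numbers in this exchange can be reproduced with a short program. It is a minimal sketch: the string is the one from the post, written with escapes so the result does not depend on the source file's encoding, and the clef example is an added illustration of a code point outside the Basic Multilingual Plane.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    string  s8  = "s\u00E4\u0434";   // UTF-8
    wstring s16 = "s\u00E4\u0434"w;  // UTF-16
    dstring s32 = "s\u00E4\u0434"d;  // UTF-32

    // .length is the number of code units in every encoding.
    writeln(s8.length);   // 5
    writeln(s16.length);  // 3
    writeln(s32.length);  // 3

    // std.utf.count is the number of code points in every encoding.
    writeln(count(s8));   // 3
    writeln(count(s16));  // 3
    writeln(count(s32));  // 3

    // UTF-16 only looks like "one unit per character" inside the BMP.
    wstring clef = "\U0001D11E"w;  // MUSICAL SYMBOL G CLEF
    writeln(clef.length);  // 2 (a surrogate pair)
    writeln(count(clef));  // 1
}
```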

Re: Inconsitency

2013-10-13 Thread Dicebot
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote: I do not agree: writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8) writeln(std.utf.count("säд")) => 3 chars: 5 (ibidem) writeln("säд"w.length); => 3 chars: 6 (2 x 3, UTF-16) writeln("säд"d.length);

Re: Inconsitency

2013-10-13 Thread nickles
Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative?

Re: Inconsitency

2013-10-13 Thread David Nadlinger
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote: Ok, if my understanding is wrong, how do YOU measure the length of a string? Depends on how you define the length of a string. Doing that is surprisingly difficult once the full variety of Unicode code points comes into play, even if

Re: Inconsitency

2013-10-13 Thread Sönke Ludwig
On 13.10.2013 at 15:25, nickles wrote: Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative? The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is,

Re: Inconsitency

2013-10-13 Thread Dmitry Olshansky
13-Oct-2013 17:25, nickles wrote: Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative? It's all there: http://www.unicode.org/glossary/ http://www.unicode.org/versions/Unicode6.3.0/ I measure string length in code

Re: Inconsitency

2013-10-13 Thread nickles
Ok, I understand, that length is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing chars as UTF-8 which means that a char in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't

Re: Inconsitency

2013-10-13 Thread Michael
implementation, shouldn't writeln("säд"[2]) return д instead of the trailing surrogate of this Cyrillic letter? First index is zero, no?

Re: Inconsitency

2013-10-13 Thread Sönke Ludwig
On 13.10.2013 at 16:14, nickles wrote: Ok, I understand, that length is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing chars as UTF-8 which means that a char in D can be of any length between 1 and 4 bytes for an

Re: Inconsitency

2013-10-13 Thread Sönke Ludwig
On 13.10.2013 at 15:50, Dmitry Olshansky wrote: 13-Oct-2013 17:25, nickles wrote: Ok, if my understanding is wrong, how do YOU measure the length of a string? Do you always use count(), or is there an alternative? It's all there: http://www.unicode.org/glossary/

Re: Inconsitency

2013-10-13 Thread nickles
This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the ä character (U+00E4). True. It's UTF-8, not UTF-16. However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented. This is not
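What index 2 actually holds can be printed directly. The sketch below assumes ä is stored precomposed as U+00E4; std.utf.decode is used to show how a whole code point is read starting from a code unit index.

```d
import std.stdio : writefln;
import std.utf : decode;

void main()
{
    string s = "s\u00E4\u0434"; // UTF-8 bytes: 73 | C3 A4 | D0 B4

    // Indexing returns a single code unit: here the second byte of ä.
    writefln("%02X", cast(ubyte) s[2]); // A4

    // decode reads one full code point and advances the index past it.
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        dchar c = decode(s, i);
        writefln("index %s: U+%04X", start, cast(uint) c);
    }
    // prints:
    // index 0: U+0073
    // index 1: U+00E4
    // index 3: U+0434
}
```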

Re: Inconsitency

2013-10-13 Thread Maxim Fomin
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote: This is simply wrong. All strings return the number of code units. And it's only UTF-32 where a code point (~ character) happens to fit into one code unit. I do not agree: writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] + 2

Re: Inconsitency

2013-10-13 Thread Dicebot
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote: Well that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent

Re: Inconsitency

2013-10-13 Thread Maxim Fomin
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Ok, I understand, that length is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing chars as UTF-8 which means that a char in D can be of any length between 1

Re: Inconsitency

2013-10-13 Thread anonymous
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote: However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented. This is not true for UTF-8, which is not subject to endianism. This is not about endianness. It's \u00E4 vs

Re: Inconsitency

2013-10-13 Thread Peter Alexander
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote: However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented. This is not true for UTF-8, which is not subject to endianism. You are correct in that UTF-8 is endian agnostic,

Re: Inconsitency

2013-10-13 Thread monarch_dodra
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Ok, I understand, that length is - obviously - used in analogy to any array's length value. Still, this seems to be inconsistent. D elaborates on implementing chars as UTF-8 which means that a char in D can be of any length between 1

Re: Inconsitency

2013-10-13 Thread deadalnix
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote: Ok, I understand, that length is - obviously - used in analogy to any array's length value. That isn't an analogy. It is usually a good idea to try to understand things before judging whether they are consistent.

Re: Inconsitency

2013-10-13 Thread Temtaime
I've found another inconsistency problem. void foo(const char *); void foo(const wchar *); void foo(const dchar *); void main() { foo(`123`); foo(`123`w); foo(`123`d); } Error: function hello.foo (const(char*)) is not callable using argument types (immutable(wchar

Re: Inconsitency

2013-10-13 Thread Andrej Mitrovic
On 10/14/13, Temtaime temta...@gmail.com wrote: And typeof(`123`).stringof == `string`. Why can `123` be stored as a null-terminated UTF-8 string in the rdata segment while `123`w and `123`d cannot? For example, wide strings (UTF-16) are usable with Windows *W functions.

Re: Inconsitency

2013-10-13 Thread deadalnix
On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote: I've found another inconsistency problem. void foo(const char *); void foo(const wchar *); void foo(const dchar *); void main() { foo(`123`); foo(`123`w); foo(`123`d); } Error: function hello.foo (const