On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
Most code might be buggy then.
All code is buggy.
An issue that often comes up is file names. A file called bär
will be normalized differently depending on the operating
system. In both cases it is one grapheme. However, on Linux it
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() function
which returns the length that I was searching for. However,
why - if D is so UTF-8-centric - isn't this function
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count()
function which returns the length that I was searching for.
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count()
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra
wrote:
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
On Sunday, 13 October 2013 at 14:14:14 UTC,
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra
wrote:
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
On Sunday, 13 October 2013 at 14:14:14 UTC,
On 2013-10-16 10:03, qznc wrote:
Most code might be buggy then.
An issue that often comes up is file names. A file called bär will be
normalized differently depending on the operating system. In both cases
it is one grapheme. However, on Linux it is one code point, but on OS X
it is two code
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg
wrote:
On 2013-10-16 10:03, qznc wrote:
Most code might be buggy then.
An issue that often comes up is file names. A file called bär
will be
normalized differently depending on the operating system. In
both cases
it is one
On 2013-10-16 14:33, qznc wrote:
It is either [U+00E4] as one code point or [a,U+0308] for two code
points. The second is combining diaeresis [0]. Not required, but
possible. Those combining characters [1] provide a nearly infinite
number of combinations. You can go crazy with it:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg
wrote:
On 2013-10-16 14:33, qznc wrote:
It is either [U+00E4] as one code point or [a,U+0308] for two
code
points. The second is combining diaeresis [0]. Not required,
but
possible. Those combining characters [1] provide a nearly
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra
wrote:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg
wrote:
On 2013-10-16 14:33, qznc wrote:
It is either [U+00E4] as one code point or [a,U+0308] for two
code
points. The second is combining diaeresis [0]. Not
16-Oct-2013 23:42, qznc wrote:
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
On 2013-10-16 14:33, qznc wrote:
It is either [U+00E4] as one code point or [a,U+0308] for two code
points. The second is
On Wednesday, 16 October 2013 at 19:42:59 UTC, qznc wrote:
I agree with your point. Nevertheless, your understanding of
grapheme is off. U+0308 is not a grapheme. a\u0308 is one
grapheme. U+00e4 is the same grapheme as a\u0308.
http://en.wikipedia.org/wiki/Grapheme
Ah. Learn something new
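To make the code unit / code point / grapheme distinction concrete, here is a minimal sketch using the file name from this thread. It assumes a Phobos recent enough to ship std.uni.byGrapheme and std.uni.normalize; the variable names are only for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC, NFD;
import std.utf : count;

void main()
{
    // The same visible name in two normalization forms.
    string composed   = "b\u00E4r";  // ä as the single code point U+00E4
    string decomposed = "ba\u0308r"; // a followed by combining diaeresis U+0308

    // .length counts UTF-8 code units (bytes).
    assert(composed.length == 4);
    assert(decomposed.length == 5);

    // std.utf.count counts code points.
    assert(count(composed) == 3);
    assert(count(decomposed) == 4);

    // byGrapheme counts what a reader would call characters.
    assert(composed.byGrapheme.walkLength == 3);
    assert(decomposed.byGrapheme.walkLength == 3);

    // The raw strings differ until both are brought to the same form.
    assert(composed != decomposed);
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);
}
```

Comparing file names only after normalizing both sides is one way to avoid the Linux vs OS X mismatch described above.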
On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote:
If single-element access is needed, str.front yields a decoded
`dchar`. Or simply use `foreach (dchar d; str)`; at least it won't
hide the fact that it is an O(n) operation. As `str.front` yields
dchar, most `std.algorithm` and `std.range`
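A small sketch of the difference between indexing and decoding, assuming a Phobos version where narrow strings are auto-decoded by the range primitives. The string uses escapes so the result does not depend on the source file's normalization.

```d
import std.range.primitives : front, popFront;

void main()
{
    string str = "s\u00E4\u0434"; // s, ä (U+00E4), д (U+0434)

    // Indexing yields raw UTF-8 code units, not characters.
    assert(str[1] == 0xC3); // first code unit of ä
    assert(str[2] == 0xA4); // second code unit of ä

    // foreach with dchar decodes one code point per iteration: O(n) overall.
    size_t codePoints;
    foreach (dchar c; str)
        ++codePoints;
    assert(codePoints == 3);

    // front/popFront decode as well, so range algorithms see dchar.
    assert(str.front == 's');
    str.popFront();
    assert(str.front == '\u00E4');
    str.popFront();
    assert(str.front == '\u0434');
}
```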
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Also, I understand that there is the std.utf.count() function
which returns the length that I was searching for. However, why
- if D is so UTF-8-centric - isn't this function implemented in
the core like .length?
Most code doesn't
It's easy to state this, but - please - don't get sarcastic!
I'm obviously (as I've learned) speaking about UTF-8 chars as
they are NOT implemented right now in D; so I'm criticizing that
D, as a language which emphasizes UTF-8 characters, isn't
taking the last step, like e.g. Python
On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote:
On 13.10.2013 15:25, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length
of a string?
Do you always use count(), or is there an alternative?
The thing is that even count(), which gives you the
On 10/14/13 1:09 AM, nickles wrote:
It's easy to state this, but - please - don't get sarcastic!
Thanks for making this point.
String handling in D follows two simple principles:
1. The support is a slice of code units (which often are immutable,
seeing as string is an alias for
Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?
Wouldn't it be more consistent to have string.length return the
number of UTF-8 characters as well (instead of
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?
Wouldn't it be more consistent to have string.length
13-Oct-2013 16:36, nickles wrote:
Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?
???
This is simply wrong. All strings return the number of code units. And it's
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
Why does string.length return the number of bytes and not the
number of UTF-8 characters, whereas wstring.length and
dstring.length return the number of UTF-16 and UTF-32
characters?
Wouldn't it be more consistent to have string.length
This is simply wrong. All strings return the number of code units.
And it's only in UTF-32 that a code point (~ character) happens to
fit into one code unit.
I do not agree:
writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
writeln(std.utf.count("säд")) => 3  chars: 5
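For reference, a minimal sketch of what .length and std.utf.count report for the three string types. The example text is written with escapes (assuming the precomposed ä, U+00E4), and the second string is a character outside the Basic Multilingual Plane, chosen here only to show that UTF-16 can also need more than one code unit.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    string  s = "s\u00E4\u0434";  // UTF-8
    wstring w = "s\u00E4\u0434"w; // UTF-16
    dstring d = "s\u00E4\u0434"d; // UTF-32

    // .length is always the number of code units of that encoding.
    writeln(s.length, " ", w.length, " ", d.length);    // prints "5 3 3"

    // std.utf.count is the number of code points, whatever the encoding.
    writeln(count(s), " ", count(w), " ", count(d));    // prints "3 3 3"

    // U+1D11E (musical symbol G clef) lies outside the BMP.
    writeln("\U0001D11E".length, " ",
            "\U0001D11E"w.length, " ",
            "\U0001D11E"d.length);                      // prints "4 2 1"
}
```

So wstring.length matching the character count for this text is a coincidence of the characters used, not a property of UTF-16.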
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
I do not agree:
writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
writeln("säд"d.length);
Ok, if my understanding is wrong, how do YOU measure the length of
a string?
Do you always use count(), or is there an alternative?
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length
of a string?
Depends on how you define the length of a string. Doing that is
surprisingly difficult once the full variety of Unicode code
points comes into play, even if
On 13.10.2013 15:25, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length of a string?
Do you always use count(), or is there an alternative?
The thing is that even count(), which gives you the number of *code
points*, isn't necessarily what is desired - that is,
13-Oct-2013 17:25, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length of a string?
Do you always use count(), or is there an alternative?
It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/
I measure string length in code
Ok, I understand that length is - obviously - used in analogy
to any array's length value.
Still, this seems to be inconsistent. D elaborates on
implementing chars as UTF-8 which means that a char in D can
be of any length between 1 and 4 bytes for an arbitrary Unicode
code point. Shouldn't
implementation, shouldn't
writeln("säд"[2])
return д instead of the trailing surrogate of this cyrillic
letter?
First index is zero, no?
On 13.10.2013 16:14, nickles wrote:
Ok, I understand that length is - obviously - used in analogy to any
array's length value.
Still, this seems to be inconsistent. D elaborates on implementing
chars as UTF-8 which means that a char in D can be of any length
between 1 and 4 bytes for an
On 13.10.2013 15:50, Dmitry Olshansky wrote:
13-Oct-2013 17:25, nickles wrote:
Ok, if my understanding is wrong, how do YOU measure the length of a
string?
Do you always use count(), or is there an alternative?
It's all there:
http://www.unicode.org/glossary/
This will _not_ return a trailing surrogate of a Cyrillic
letter. It will return the second code unit of the ä
character (U+00E4).
True. It's UTF-8, not UTF-16.
However, it could also yield the first code unit of the umlaut
diacritic, depending on how the string is represented.
This is not
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
This is simply wrong. All strings return the number of code units.
And it's only in UTF-32 that a code point (~ character) happens to
fit into one code unit.
I do not agree:
writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
Well that's a point; on the other hand, D is constantly
creating and throwing away new strings, so this isn't quite an
argument. The current solution puts the programmer in charge of
dealing with UTF-x, where a more consistent
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Ok, I understand that length is - obviously - used in
analogy to any array's length value.
Still, this seems to be inconsistent. D elaborates on
implementing chars as UTF-8 which means that a char in D
can be of any length between 1
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
However, it could also yield the first code unit of the umlaut
diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to endianism.
This is not about endianness. It's \u00E4 vs
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
However, it could also yield the first code unit of the umlaut
diacritic, depending on how the string is represented.
This is not true for UTF-8, which is not subject to endianism.
You are correct in that UTF-8 is endian agnostic,
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Ok, I understand that length is - obviously - used in
analogy to any array's length value.
Still, this seems to be inconsistent. D elaborates on
implementing chars as UTF-8 which means that a char in D
can be of any length between 1
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
Ok, I understand that length is - obviously - used in
analogy to any array's length value.
That isn't an analogy. It is usually a good idea to try to
understand a thing before judging whether it is consistent.
I've found another inconsistency problem.
void foo(const char *);
void foo(const wchar *);
void foo(const dchar *);
void main() {
foo(`123`);
foo(`123`w);
foo(`123`d);
}
Error: function hello.foo (const(char*)) is not callable using
argument types (immutable(wchar
On 10/14/13, Temtaime temta...@gmail.com wrote:
And typeof(`123`).stringof == `string`. Why can `123` be stored
as a null-terminated UTF-8 string in the rdata segment while
`123`w and `123`d cannot? For example, wide strings (UTF-16) are
usable with Windows *W functions.
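Independently of how literals convert, a run-time string can be turned into a null-terminated pointer explicitly. The following is only a sketch of that route using std.string.toStringz and std.utf.toUTF16z; it does not explain the overload error quoted above, and the variable names are illustrative.

```d
import std.string : toStringz;
import std.utf : toUTF16z;

void main()
{
    string text = "123";

    // Null-terminated UTF-8, suitable for C functions taking const char*.
    const(char)* narrow = toStringz(text);
    assert(narrow[0] == '1' && narrow[3] == '\0');

    // Null-terminated UTF-16, the shape the Windows *W functions expect.
    const(wchar)* wide = toUTF16z(text);
    assert(wide[0] == '1' && wide[3] == '\0');
}
```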
On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote:
I've found another inconsistency problem.
void foo(const char *);
void foo(const wchar *);
void foo(const dchar *);
void main() {
foo(`123`);
foo(`123`w);
foo(`123`d);
}
Error: function hello.foo (const