Re: size of a string in bytes

2017-01-28 Thread Nestor via Digitalmars-d-learn

On Saturday, 28 January 2017 at 19:09:01 UTC, ag0aep6g wrote:
In D, a `char` is a UTF-8 code unit. Its size is one byte, 
exactly and always.


A `char` is not a "character" in the common meaning of the 
word. There's a more specialized word for "character" as a 
visual unit: grapheme. For example, 'Ä' is a grapheme (a visual 
unit, a "character"), but there is no single `char` for it. To 
encode 'Ä' in UTF-8, a sequence of multiple code units is used.


...

The elements of a `string` are (immutable) `char`s. That is, 
`string` is an array of UTF-8 code units. It's not an array of 
graphemes.


A `string`'s .length gives you the number of `char`s in it, 
i.e. the number of UTF-8 code units, i.e. the number of bytes.


Very good explanation.
Thank you all for making this clear to me.


Re: size of a string in bytes

2017-01-28 Thread ag0aep6g via Digitalmars-d-learn

On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
I believe I saw somewhere that in D a char was not necessarily 
the same as a ubyte because chars sometimes take more than one 
byte,


In D, a `char` is a UTF-8 code unit. Its size is one byte, 
exactly and always.


A `char` is not a "character" in the common meaning of the word. 
There's a more specialized word for "character" as a visual unit: 
grapheme. For example, 'Ä' is a grapheme (a visual unit, a 
"character"), but there is no single `char` for it. To encode 'Ä' 
in UTF-8, a sequence of multiple code units is used.


so since a string is an array of chars, I thought length 
behaved like walkLength (which I had not seen), in other words, 
that it simply returned the amount of elements in the array.


The elements of a `string` are (immutable) `char`s. That is, 
`string` is an array of UTF-8 code units. It's not an array of 
graphemes.


A `string`'s .length gives you the number of `char`s in it, i.e. 
the number of UTF-8 code units, i.e. the number of bytes.
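
To make the point above concrete, here is a minimal sketch (an illustration added for this thread, not code from any poster). It writes 'Ä' as the escape \u00C4 so the result does not depend on how an editor saves the source file, and it shows that indexing a string yields single code units rather than characters.

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // 'Ä' as a single precomposed code point, U+00C4.
    string s = "\u00C4";

    // A char is always exactly one byte.
    static assert(char.sizeof == 1);

    // .length counts UTF-8 code units, which is the byte count.
    writeln(s.length);        // 2

    // Indexing returns individual code units, not characters.
    assert(s[0] == 0xC3 && s[1] == 0x84);

    // Decoding (here via walkLength) sees one code point.
    writeln(s.walkLength);    // 1

    // foreach with dchar also decodes the two code units into one code point.
    foreach (dchar c; s)
        assert(c == 0x00C4);
}
```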


Re: size of a string in bytes

2017-01-28 Thread Adam D. Ruppe via Digitalmars-d-learn

On Saturday, 28 January 2017 at 18:04:58 UTC, Nestor wrote:
I believe I saw somewhere that in D a char was not necessarily 
the same as a ubyte because chars sometimes take more than


Not true in the language, but the Phobos library does treat char 
and ubyte differently, because it decodes multi-char sequences 
into code points when it iterates a string as a range.


But the built-in .length on a string and indexing all work the 
same as bytes.


Note that .length on a wstring or dstring (UTF-16 or UTF-32) is 
not a byte count but a count of code units. So wstring.length = 
number of wchars = number of 16-bit items, and dstring.length 
counts 32-bit items. It is exactly the same as ushort[].length or 
uint[].length: it is the number of elements, so if you actually 
want the byte length, you'd multiply by the element size or cast 
to ubyte[] first.
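
A small sketch of the difference between element count and byte count for the three string types. The helper name byteLength is hypothetical (it is not a Phobos function); it simply multiplies the element count by the element size.

```d
import std.stdio : writeln;

// Hypothetical helper: number of bytes occupied by the elements of an array.
size_t byteLength(T)(T[] arr)
{
    return arr.length * T.sizeof;
}

void main()
{
    string  s = "Привет!";   // UTF-8,  1-byte code units
    wstring w = "Привет!"w;  // UTF-16, 2-byte code units
    dstring d = "Привет!"d;  // UTF-32, 4-byte code units

    writeln(s.length, " ", byteLength(s)); // 13 13
    writeln(w.length, " ", byteLength(w)); // 7 14
    writeln(d.length, " ", byteLength(d)); // 7 28

    // Casting the array reinterprets the same memory as bytes,
    // so the length of the result is the byte length.
    auto raw = cast(const(ubyte)[]) w;
    writeln(raw.length);                   // 14
}
```

Only for string do .length and the byte count coincide, because only char is one byte wide.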


Re: size of a string in bytes

2017-01-28 Thread Nestor via Digitalmars-d-learn

On Saturday, 28 January 2017 at 16:01:38 UTC, Ivan Kazmenko wrote:

As said, the byte count is indeed string.length.
The number of code points can be found by std.range.walkLength, 
but be aware it takes O(answer) time to compute.


Example:

-
import std.range, std.stdio;
void main () {
    auto s = "Привет!";
    writeln (s.length); // 13 bytes
    writeln (s.walkLength); // 7 code points
}


Thank you Ivan,

I believe I saw somewhere that in D a char was not necessarily 
the same as a ubyte because chars sometimes take more than one 
byte, so since a string is an array of chars, I thought length 
behaved like walkLength (which I had not seen), in other words, 
that it simply returned the amount of elements in the array.


Re: size of a string in bytes

2017-01-28 Thread Ivan Kazmenko via Digitalmars-d-learn

On Saturday, 28 January 2017 at 15:32:33 UTC, Nestor wrote:
I want to know variable size in memory. For example, say I have 
an UTF-8 string of only 2 characters, but each of them takes 2 
bytes. string length would be 2, but the content of the string 
would take 4 bytes in memory (excluding overhead for type size).


As said, the byte count is indeed string.length.
The number of code points can be found by std.range.walkLength, 
but be aware it takes O(answer) time to compute.


Example:

-
import std.range, std.stdio;
void main () {
    auto s = "Привет!";
    writeln (s.length); // 13 bytes
    writeln (s.walkLength); // 7 code points
}
-

Ivan Kazmenko.
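
For the exact case described in the question (two characters, two bytes each), a sketch along the same lines might look like this. The sample text "éñ" is an assumption chosen for illustration and is written with escapes so the encoding is unambiguous; std.string.representation exposes the underlying bytes without copying.

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : count;

void main()
{
    // "éñ": two code points, each encoded as two bytes in UTF-8.
    string s = "\u00E9\u00F1";

    writeln(s.length);   // 4, bytes used by the content
    writeln(s.count);    // 2, code points

    // View the same memory as raw bytes.
    immutable(ubyte)[] bytes = s.representation;
    writeln(bytes);      // [195, 169, 195, 177]
}
```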



Re: size of a string in bytes

2017-01-28 Thread rikki cattermole via Digitalmars-d-learn

On 29/01/2017 4:32 AM, Nestor wrote:

On Saturday, 28 January 2017 at 14:56:03 UTC, rikki cattermole wrote:

On 29/01/2017 3:51 AM, Nestor wrote:

Hi,

One can get the length of a string easily, however since strings are
UTF-8, sometimes characters take more than one byte. I would like to
know then how many bytes does a string take, but this code didn't work
as I expected:

import std.stdio;
void main() {
  string mystring1;
  string mystring2 = "A string of just 48 characters for testing size.";
  writeln(mystring1.sizeof);
  writeln( mystring2.sizeof);
}

In both cases the size is 8, so apparently sizeof is giving me just the
default size of a string type and not the size of the variable in
memory, which is what I want.

Ideas?


A few misconceptions going on here.
A string element is not a grapheme; it is a char (a UTF-8 code unit), which is one byte.

So what you want is mystring.length

Now sizeof is not telling you about the elements, it's telling you how
big the reference to it is. Specifically length + pointer. It would
have been 16 if you compiled in 64bit mode for example.

If you want to know about graphemes and code points that is another
story.
For that you'll want std.uni[0] and std.utf[1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html


I do not want string length or code points. Perhaps I didn't explain
myself.

I want to know variable size in memory. For example, say I have an UTF-8
string of only 2 characters, but each of them takes 2 bytes. string
length would be 2, but the content of the string would take 4 bytes in
memory (excluding overhead for type size).

How can I get that?


.length

You are misunderstanding: a char will always be exactly one byte in size.

Check[0] for proof.

Keep in mind here is the definition of string[1]:
alias immutable(char)[]  string;

There is nothing fancy going on.
What you were calling "characters" are actually graphemes as per the 
Unicode standard; a grapheme can span multiple bytes and code points, 
but a char cannot.


[0] http://dlang.org/spec/type.html
[1] https://github.com/dlang/druntime/blob/master/src/object.d
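
The claims in this message can be checked at compile time. The following is a minimal sketch, reusing the test string from the original question:

```d
import std.stdio : writeln;

void main()
{
    // string is just an alias for an array of immutable chars.
    static assert(is(string == immutable(char)[]));

    // A char is always exactly one byte.
    static assert(char.sizeof == 1);

    // .sizeof of a slice is the size of the reference: length + pointer.
    static assert(string.sizeof == size_t.sizeof + (void*).sizeof);

    string empty;
    string text = "A string of just 48 characters for testing size.";

    // Same value for both variables: 8 on a 32-bit target, 16 on a 64-bit one.
    writeln(empty.sizeof, " ", text.sizeof);

    // .length reports the bytes actually used by the content.
    writeln(empty.length, " ", text.length); // 0 48
}
```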


Re: size of a string in bytes

2017-01-28 Thread H. S. Teoh via Digitalmars-d-learn
On Sat, Jan 28, 2017 at 03:32:33PM +0000, Nestor via Digitalmars-d-learn wrote:
[...]
> I do not want string length or code points. Perhaps I didn't explain
> myself.

The .length property of a string is the number of bytes used to store
the string.


> I want to know variable size in memory. For example, say I have an
> UTF-8 string of only 2 characters, but each of them takes 2 bytes.
> string length would be 2, but the content of the string would take 4
> bytes in memory (excluding overhead for type size).

What you call "string length" is called grapheme count in D.  What you
want is the .length property.

The number of bytes in a UTF-8 string is the same thing as the number of
code units (note: do not confuse with code points, which is something
else).


--T
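
The three counts mentioned above (code units, code points, graphemes) can all differ for the same text. A small sketch, using an 'e' followed by a combining acute accent as an assumed example input:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' (U+0065) followed by COMBINING ACUTE ACCENT (U+0301):
    // one visible character built from two code points.
    string s = "e\u0301";

    writeln(s.length);                // 3 code units (bytes): 1 + 2
    writeln(s.walkLength);            // 2 code points
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```

Only the first number is the memory used by the string's content.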


Re: size of a string in bytes

2017-01-28 Thread Nestor via Digitalmars-d-learn
On Saturday, 28 January 2017 at 14:56:03 UTC, rikki cattermole wrote:

On 29/01/2017 3:51 AM, Nestor wrote:

Hi,

One can get the length of a string easily, however since strings are
UTF-8, sometimes characters take more than one byte. I would like to
know then how many bytes does a string take, but this code didn't work
as I expected:

import std.stdio;
void main() {
  string mystring1;
  string mystring2 = "A string of just 48 characters for testing size.";
  writeln(mystring1.sizeof);
  writeln( mystring2.sizeof);
}

In both cases the size is 8, so apparently sizeof is giving me just the
default size of a string type and not the size of the variable in
memory, which is what I want.

Ideas?


A few misconceptions going on here.
A string element is not a grapheme; it is a char (a UTF-8 code unit), 
which is one byte.


So what you want is mystring.length

Now sizeof is not telling you about the elements, it's telling 
you how big the reference to it is. Specifically length + 
pointer. It would have been 16 if you compiled in 64bit mode 
for example.


If you want to know about graphemes and code points that is another story.
For that you'll want std.uni[0] and std.utf[1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html


I do not want string length or code points. Perhaps I didn't 
explain myself.


I want to know variable size in memory. For example, say I have 
an UTF-8 string of only 2 characters, but each of them takes 2 
bytes. string length would be 2, but the content of the string 
would take 4 bytes in memory (excluding overhead for type size).


How can I get that?


Re: size of a string in bytes

2017-01-28 Thread rikki cattermole via Digitalmars-d-learn

On 29/01/2017 3:51 AM, Nestor wrote:

Hi,

One can get the length of a string easily, however since strings are
UTF-8, sometimes characters take more than one byte. I would like to
know then how many bytes does a string take, but this code didn't work
as I expected:

import std.stdio;
void main() {
  string mystring1;
  string mystring2 = "A string of just 48 characters for testing size.";
  writeln(mystring1.sizeof);
  writeln( mystring2.sizeof);
}

In both cases the size is 8, so apparently sizeof is giving me just the
default size of a string type and not the size of the variable in
memory, which is what I want.

Ideas?


A few misconceptions going on here.
A string element is not a grapheme; it is a char (a UTF-8 code unit), which is one byte.

So what you want is mystring.length

Now sizeof is not telling you about the elements, it's telling you how 
big the reference to it is. Specifically length + pointer. It would have 
been 16 if you compiled in 64bit mode for example.


If you want to know about graphemes and code points that is another story.
For that you'll want std.uni[0] and std.utf[1].

[0] http://dlang.org/phobos/std_uni.html
[1] http://dlang.org/phobos/std_utf.html