Some questions about strings

2020-06-21 Thread Denis via Digitalmars-d-learn

I have a few questions about how strings are stored.

- First, is there any difference between string, wstring and 
dstring? For example, a 3-byte Unicode character literal can be 
assigned to a variable of any of these types, then printed, etc, 
without errors.


- Are the characters of a string stored in memory by their 
Unicode codepoint(s), as opposed to some other encoding?


- Assuming that the answer to the first question is "no 
difference", do strings always allocate 4 bytes per codepoint?


- Can a series of codepoints, appropriately padded to the 
required width, and terminated by a null character, be directly 
assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING 
TRANSLATION?


The last question gets to the heart of what I'd ultimately like 
to accomplish and avoid.


Thanks for your help.


Re: Some questions about strings

2020-06-21 Thread Adam D. Ruppe via Digitalmars-d-learn

On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
> - First, is there any difference between string, wstring and
> dstring?


Yes, they encode the same content differently in the bytes. If 
you cast it to ubyte[] and print that out you can see the 
difference.
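
A minimal sketch of that experiment, for readers who want to try it. The helper name `rawBytes` is made up for illustration, and the sample character (the euro sign, U+20AC) is an arbitrary choice:

```d
import std.stdio : writefln;

// Hypothetical helper: views any string type as its raw bytes.
// The cast reinterprets the memory; it does not transcode anything.
const(ubyte)[] rawBytes(T)(const(T)[] text)
{
    return cast(const(ubyte)[]) text;
}

void main()
{
    // U+20AC (euro sign) stored in each of the three string types.
    string  s = "\u20AC";
    wstring w = "\u20AC"w;
    dstring d = "\u20AC"d;

    writefln("%(%02x %)", rawBytes(s)); // UTF-8:  e2 82 ac
    writefln("%(%02x %)", rawBytes(w)); // UTF-16: 2 bytes, order depends on endianness
    writefln("%(%02x %)", rawBytes(d)); // UTF-32: 4 bytes, order depends on endianness
}
```

The same code point occupies three, two, and four bytes respectively, which is the difference being described.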


> - Are the characters of a string stored in memory by their
> Unicode codepoint(s), as opposed to some other encoding?


no, they are encoded in utf-8, 16, or 32 for string, wstring, and 
dstring respectively.


> - Can a series of codepoints, appropriately padded to the
> required width, and terminated by a null character, be directly
> assigned to a string WITHOUT GOING THROUGH A DECODING /
> ENCODING TRANSLATION?


no, they must be encoded. Unicode code points are an abstract 
concept that must be encoded somehow to exist in memory (similar 
to the idea of a number).


Re: Some questions about strings

2020-06-21 Thread Ali Çehreli via Digitalmars-d-learn
On 6/21/20 8:17 PM, Denis wrote:
> I have a few questions about how strings are stored.

>
> - First, is there any difference between string, wstring and dstring?

string is immutable(char)[]
wstring is immutable(wchar)[]
dstring is immutable(dchar)[]

char is 1 byte: UTF-8 code unit
wchar is 2 bytes: UTF-16 code unit
dchar is 4 bytes: UTF-32 code unit

> For example, a 3-byte Unicode character literal can be assigned to a
> variable of any of these types, then printed, etc, without errors.

You can reveal some of the mystery by looking at their .length property. 
Additionally, foreach will visit these types element-by-element: char, 
wchar, and dchar, respectively.
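
A small sketch of the .length check, under the assumption that a two-code-point sample ('a' followed by the euro sign, U+20AC) is representative enough. The helper `byteCount` is hypothetical and only multiplies the number of code units by the size of one unit:

```d
import std.stdio : writefln;

// Hypothetical helper: bytes occupied by the string's code units.
size_t byteCount(T)(const(T)[] text)
{
    return text.length * T.sizeof;
}

void main()
{
    // Two code points: 'a' (U+0061) and the euro sign (U+20AC).
    string  s = "a\u20AC";
    wstring w = "a\u20AC"w;
    dstring d = "a\u20AC"d;

    // .length counts code units, not code points.
    writefln("string:  %s units, %s bytes", s.length, byteCount(s)); // 4 units, 4 bytes
    writefln("wstring: %s units, %s bytes", w.length, byteCount(w)); // 2 units, 4 bytes
    writefln("dstring: %s units, %s bytes", d.length, byteCount(d)); // 2 units, 8 bytes
}
```

Only the dstring length matches the number of code points here.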


> - Are the characters of a string stored in memory by their Unicode
> codepoint(s), as opposed to some other encoding?

As UTF encodings; nothing else.

> - Assuming that the answer to the first question is "no difference", do
> strings always allocate 4 bytes per codepoint?

No. They always allocate sufficient bytes to represent the code points 
in their respective UTF encodings. dstring is the only one where the 
number of code points equals the number of elements: UTF-32 code units, 
each being 4 bytes.


> - Can a series of codepoints, appropriately padded to the required
> width, and terminated by a null character,

null character is not required but may be a part of the strings.

> be directly assigned to a
> string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?

It will go through decoding/encoding.

> The last question gets to the heart of what I'd ultimately like to
> accomplish and avoid.
>
> Thanks for your help.

There is also the infamous "auto decoding" of Phobos algorithms (which 
is seen as a mistake). I think one tool to get away from auto decoding of 
strings is std.string.representation:


  https://dlang.org/phobos/std_string.html#.representation

Because it returns a type that is not a string, there is no auto 
decoding to speak of. :)
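
A short sketch of the effect, assuming std.range.walkLength as a stand-in for "any Phobos range algorithm". The sample text is arbitrary:

```d
import std.range : ElementType, walkLength;
import std.string : representation;

// As a range, a string yields decoded code points (dchar).
static assert(is(ElementType!string == dchar));

void main()
{
    string s = "a\u20AC"; // 2 code points, 4 UTF-8 code units

    // walkLength pops elements one by one, so on a string it decodes.
    assert(s.walkLength == 2);

    // representation returns the same memory as immutable(ubyte)[];
    // range algorithms then see plain bytes and decode nothing.
    auto bytes = s.representation;
    assert(bytes.walkLength == 4);
    assert(bytes.ptr is cast(immutable(ubyte)*) s.ptr); // no copy was made
}
```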


Ali



Re: Some questions about strings

2020-06-21 Thread Denis via Digitalmars-d-learn

On Monday, 22 June 2020 at 03:24:37 UTC, Adam D. Ruppe wrote:

> On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
>> - First, is there any difference between string, wstring and
>> dstring?
>
> Yes, they encode the same content differently in the bytes. If
> you cast it to ubyte[] and print that out you can see the
> difference.
>
>> - Are the characters of a string stored in memory by their
>> Unicode codepoint(s), as opposed to some other encoding?
>
> no, they are encoded in utf-8, 16, or 32 for string, wstring,
> and dstring respectively.
>
>> - Can a series of codepoints, appropriately padded to the
>> required width, and terminated by a null character, be
>> directly assigned to a string WITHOUT GOING THROUGH A DECODING
>> / ENCODING TRANSLATION?
>
> no, they must be encoded. Unicode code points are an abstract
> concept that must be encoded somehow to exist in memory
> (similar to the idea of a number).


OK, then that actually simplifies what's needed, because I won't 
need to decode the UTF-8, only validate it.


My code reads a UTF-8 encoded file into a buffer and validates, 
byte by byte, the UTF-8 encoding along with some additional 
validation. If I simply return the UTF-8 encoded string, there 
won't be another decoding/encoding done -- correct?
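
One way this could look, as a sketch only: the function name `asValidatedString` is invented, and std.utf.validate stands in for the hand-written byte-by-byte validation described above. The point is that the buffer is retyped in place, so no second pass over the data happens when the string is returned:

```d
import std.exception : assumeUnique;
import std.utf : UTFException, validate;

// Hypothetical helper: checks that buf holds well-formed UTF-8 and
// returns the very same memory typed as a string (no copy, no transcoding).
string asValidatedString(ubyte[] buf)
{
    char[] chars = cast(char[]) buf; // reinterpret bytes as UTF-8 code units
    validate(chars);                 // throws UTFException on malformed input
    return assumeUnique(chars);      // caller must not modify buf afterwards
}

void main()
{
    ubyte[] buf = [0x61, 0xE2, 0x82, 0xAC]; // 'a' and U+20AC encoded as UTF-8
    auto address = cast(size_t) buf.ptr;

    string s = asValidatedString(buf);
    assert(s == "a\u20AC");
    assert(cast(size_t) s.ptr == address); // same memory, nothing re-encoded
}
```

Because assumeUnique only changes the type, the caller has to guarantee that no mutable reference to the buffer is used afterwards.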


Re: Some questions about strings

2020-06-21 Thread Adam D. Ruppe via Digitalmars-d-learn

On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
> My code reads a UTF-8 encoded file into a buffer and validates,
> byte by byte, the UTF-8 encoding along with some additional
> validation. If I simply return the UTF-8 encoded string, there
> won't be another decoding/encoding done -- correct?


Yeah, D doesn't do extra work when you are just passing stuff 
around, only when you specifically ask for it by calling a 
function or maybe doing foreach (it depends on whether you ask for 
char or dchar as the foreach variable type).
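
The foreach case can be sketched as follows. The helper `countAs` is hypothetical; it simply counts loop iterations for a given loop variable type:

```d
import std.stdio : writeln;

// Hypothetical helper: counts the iterations of a foreach over text
// when the loop variable is declared with the given type.
size_t countAs(Unit)(string text)
{
    size_t n;
    foreach (Unit u; text)
        ++n;
    return n;
}

void main()
{
    string s = "a\u20AC"; // 'a' plus U+20AC: 4 UTF-8 code units

    writeln(countAs!char(s));  // 4: visits the stored code units as-is
    writeln(countAs!dchar(s)); // 2: decodes UTF-8 into code points on the fly
}
```

Asking for char walks the bytes already in memory, while asking for dchar is the explicit request that makes the loop decode.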


Re: Some questions about strings

2020-06-21 Thread Denis via Digitalmars-d-learn

On Monday, 22 June 2020 at 03:49:01 UTC, Adam D. Ruppe wrote:

> On Monday, 22 June 2020 at 03:43:58 UTC, Denis wrote:
>> My code reads a UTF-8 encoded file into a buffer and
>> validates, byte by byte, the UTF-8 encoding along with some
>> additional validation. If I simply return the UTF-8 encoded
>> string, there won't be another decoding/encoding done --
>> correct?
>
> Yeah, D doesn't do extra work when you are just passing stuff
> around, only when you specifically ask for it by calling a
> function or maybe doing foreach (it depends on whether you ask for
> char or dchar as the foreach variable type).


Excellent. I'm trying to make this efficient, so I'm doing all of 
the validation together, without using any external functions 
(apart from the buffer reads).


Thanks!


Re: Some questions about strings

2020-06-21 Thread Denis via Digitalmars-d-learn

On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
> :
>
> string is immutable(char)[]
> wstring is immutable(wchar)[]
> dstring is immutable(dchar)[]


Got it now. This is the critical piece I missed: I understand the 
relations between the char types and the UTF encodings (thanks to 
your book). But I mistakenly thought that the string types were 
different.


> You can reveal some of the mystery by looking at their .length
> property. Additionally, foreach will visit these types
> element-by-element: char, wchar, and dchar, respectively.


I did not try this test -- my bad.


> null character is not required but may be a part of the strings.


The terminating null character was one of the reasons I thought 
strings were different from char arrays. Now I know better.


Thank you for these clarifications.
Denis


Re: Some questions about strings

2020-06-21 Thread Mike Parker via Digitalmars-d-learn

On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:

> On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
>> :
>>
>> string is immutable(char)[]
>> wstring is immutable(wchar)[]
>> dstring is immutable(dchar)[]
>
> Got it now. This is the critical piece I missed: I understand
> the relations between the char types and the UTF encodings
> (thanks to your book). But I mistakenly thought that the string
> types were different.




They're aliases in object.d:

https://github.com/dlang/druntime/blob/master/src/object.d#L35
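
The aliases can be checked at compile time. A brief sketch (the sample text is arbitrary):

```d
// These hold because string, wstring and dstring are plain aliases.
static assert(is(string  == immutable(char)[]));
static assert(is(wstring == immutable(wchar)[]));
static assert(is(dstring == immutable(dchar)[]));

void main()
{
    string s = "abc";
    // s[0] = 'x';     // would not compile: the elements are immutable
    char[] m = s.dup;  // a mutable copy is a plain char[]
    m[0] = 'x';
    assert(m == "xbc");
    assert(s == "abc"); // the original is untouched
}
```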


Re: Some questions about strings

2020-06-21 Thread Denis via Digitalmars-d-learn

On Monday, 22 June 2020 at 04:32:32 UTC, Mike Parker wrote:

> On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:
>> On Monday, 22 June 2020 at 03:31:17 UTC, Ali Çehreli wrote:
>>> :
>>>
>>> string is immutable(char)[]
>>> wstring is immutable(wchar)[]
>>> dstring is immutable(dchar)[]
>>
>> Got it now. This is the critical piece I missed: I understand
>> the relations between the char types and the UTF encodings
>> (thanks to your book). But I mistakenly thought that the
>> string types were different.
>
> They're aliases in object.d:
>
> https://github.com/dlang/druntime/blob/master/src/object.d#L35


Right at the top and plain as day too... ;)

I appreciate the link to the source -- thanks!


Re: Some questions about strings

2020-06-22 Thread Jacob Carlborg via Digitalmars-d-learn

On Monday, 22 June 2020 at 04:08:10 UTC, Denis wrote:

> The terminating null character was one of the reasons I thought
> strings were different from char arrays. Now I know better.


String **literals** have a terminating null character, to help 
with integrating with C functions. But this null character will 
disappear when manipulating strings.


You cannot assume that a function parameter of type `string` will 
have a terminating null character, but calling `printf` with a 
string literal is fine:


printf("foobar\n"); // this works because string literals have a terminating null character
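
For strings that are not literals, a sketch of two common options: std.string.toStringz, which yields a null-terminated C string, and passing the length explicitly with `%.*s`. The helper `cLength` is made up for illustration:

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper: the length C sees after null-terminating the string.
size_t cLength(string s)
{
    return strlen(s.toStringz);
}

void main()
{
    string s = "foobar";
    string part = s[0 .. 3]; // "foo": the next byte in memory is 'b', not '\0'

    // printf("%s\n", part.ptr); // wrong: C would read past the end of the slice
    printf("%s\n", part.toStringz);                    // prints "foo"
    printf("%.*s\n", cast(int) part.length, part.ptr); // prints "foo" without a copy

    assert(cLength(part) == 3);
}
```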


--
/Jacob Carlborg


Re: Some questions about strings

2020-06-22 Thread Denis via Digitalmars-d-learn

On Monday, 22 June 2020 at 09:06:35 UTC, Jacob Carlborg wrote:

> String **literals** have a terminating null character, to help
> with integrating with C functions. But this null character will
> disappear when manipulating strings.
>
> You cannot assume that a function parameter of type `string`
> will have a terminating null character, but calling `printf`
> with a string literal is fine:
>
> printf("foobar\n"); // this works because string literals have a terminating null character


OK, it makes sense that the null terminator would be added where 
compatibility with C is required.


Good to know.