Re: String Type Usage. String vs DString vs WString

2018-01-15 Thread SimonN via Digitalmars-d-learn

On Monday, 15 January 2018 at 14:44:46 UTC, Adam D. Ruppe wrote:
> On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> > D's foreach [...] will autodecode and silently iterate over
> > dchar, not char, even when the input is string
>
> That's not true. foreach will only decode on demand:
> foreach(c; s) { /* c is a char here, it goes over bytes */ }


Thanks for the correction! Surprised I got foreach(c; s) wrong; 
its non-decoding iteration is even the prominent example in TDPL.


Even `each`, the template function that implements a foreach, 
still infers as char:


"aƤ".each!writeln; // prints a plus two broken characters

Only `map`



When I wrote "D's ranges", I meant Phobos's range-producing 
templates; a range itself is again encoding-agnostic.


Re: String Type Usage. String vs DString vs WString

2018-01-15 Thread Jonathan M Davis via Digitalmars-d-learn
On Monday, January 15, 2018 14:56:33 Kagamin via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> > D's foreach and D's ranges will autodecode and silently iterate
> > over dchar, not char
>
> foreach doesn't do it silently, decoding must be requested from
> it by explicitly specifying element type, it can also encode this
> way.

Yeah, one of the joys of that is that you have to be careful of using
foreach in range-based functions, because if you haven't specialized for
strings, and you use foreach without specifying the element type, then when
your function template is instantiated with a string, the foreach won't
match what front does.
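
The mismatch described above can be sketched as follows. This is only an illustration, assuming a Phobos version that still auto-decodes narrow strings; the helper names countWithForeach and countWithFront are made up for the example and are not part of Phobos.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

// Hypothetical helper: counts elements with a plain foreach
// (no element type given).
size_t countWithForeach(R)(R r) if (isInputRange!R)
{
    size_t n;
    foreach (e; r)
        ++n;
    return n;
}

// Hypothetical helper: counts elements through the range primitives,
// which is how range-based algorithms walk their input.
size_t countWithFront(R)(R r) if (isInputRange!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
        ++n;
    return n;
}

void main()
{
    string s = "a\u01A4"; // 'a' plus U+01A4: 3 UTF-8 code units, 2 code points

    static assert(is(ElementType!string == dchar)); // what front yields

    assert(countWithForeach(s) == 3); // foreach iterated over char
    assert(countWithFront(s) == 2);   // front/popFront iterated over dchar

    // Naming the range's element type in foreach makes the two agree.
    alias E = ElementType!string;
    size_t n;
    foreach (E e; s)
        ++n;
    assert(n == 2);
}
```

Spelling out the element type in the foreach is one way to keep a generic function consistent with front; specializing the function for strings is another.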

I really don't have any complaints about how foreach does this aside from
the fact that it doesn't currently use the replacement character (so, it
will throw on invalid Unicode if it's told to decode), but the way that
interacts with Phobos is poor.

Ideally, we'd get rid of auto-decoding, and we'd get rid of the whole
exception on bad Unicode thing and just use the replacement character, but
since changing it would break a lot of code... :|

- Jonathan M Davis



Re: String Type Usage. String vs DString vs WString

2018-01-15 Thread Kagamin via Digitalmars-d-learn

On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> D's foreach and D's ranges will autodecode and silently iterate
> over dchar, not char


foreach doesn't do it silently, decoding must be requested from 
it by explicitly specifying element type, it can also encode this 
way.


Re: String Type Usage. String vs DString vs WString

2018-01-15 Thread Adam D. Ruppe via Digitalmars-d-learn

On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> D's foreach [...] will autodecode and silently iterate over
> dchar, not char, even when the input is string



That's not true. foreach will only decode on demand:

string s;

foreach(c; s) { /* c is a char here, it goes over bytes */ }
foreach(char c; s) { /* c is a char here, same as above */ }
foreach(dchar c; s) { /* c is a dchar - this decodes */ }



Autodecoding is a Phobos library artifact, NOT something in the D 
language itself.
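
A small sketch of the point above, using no Phobos imports at all so that only the language's foreach is involved. The helper names countAs and countInferred are invented for the illustration.

```d
// Hypothetical helper: counts iterations when foreach is asked for
// element type T (char, wchar or dchar).
size_t countAs(T)(string s)
{
    size_t n;
    foreach (T c; s)
        ++n;
    return n;
}

// Hypothetical helper: same loop with the element type left to inference.
size_t countInferred(string s)
{
    size_t n;
    foreach (c; s)
    {
        static assert(c.sizeof == 1); // inferred as a one-byte code unit
        ++n;
    }
    return n;
}

void main()
{
    string s = "a\u01A4"; // 3 UTF-8 code units, 2 code points

    assert(countInferred(s) == 3);  // no decoding
    assert(countAs!char(s) == 3);   // same as above
    assert(countAs!wchar(s) == 2);  // transcoded to UTF-16 code units
    assert(countAs!dchar(s) == 2);  // decoded to code points
}
```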


Re: String Type Usage. String vs DString vs WString

2018-01-15 Thread Patrick Schluter via Digitalmars-d-learn
On Monday, 15 January 2018 at 04:27:15 UTC, Jonathan M Davis wrote:
> On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> > On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> > > Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> > > The size of a code point is 1, 2 or 4 bytes.
> >
> > I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
> > (UTF-32) bytes are referred to as "code units" and the size of a
> > code point varies in UTF-8 and UTF-16.
>
> Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6
> of them (IIRC) in a code point.


Nooo!!! Only 4 maximum for Unicode. Beyond that it's 
obsolete crap that is not Unicode since version 2 of Unicode.
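
The four-unit maximum can be checked with std.utf, as a rough sketch. The sample code points are arbitrary picks for illustration; codeLength and encode are the std.utf functions of those names.

```d
import std.utf : codeLength, encode;

void main()
{
    // UTF-8 code units needed per code point.
    assert(codeLength!char('a') == 1);           // U+0061
    assert(codeLength!char('\u00E9') == 2);      // U+00E9
    assert(codeLength!char('\u20AC') == 3);      // U+20AC
    assert(codeLength!char('\U0001F600') == 4);  // U+1F600
    assert(codeLength!char('\U0010FFFF') == 4);  // highest code point

    // UTF-16 needs one unit inside the BMP and two (a surrogate pair) above it.
    assert(codeLength!wchar('\u20AC') == 1);
    assert(codeLength!wchar('\U0001F600') == 2);

    // encode writes into a four-element buffer, which is all UTF-8 ever needs.
    char[4] buf;
    assert(encode(buf, '\U0001F600') == 4);
}
```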






Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread SimonN via Digitalmars-d-learn

On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> Is usage of one type over the others encouraged?


I would use string (UTF-8) throughout the program, but there 
seems to be no style guideline for this. Keep in mind two gotchas:


D's foreach and D's ranges will autodecode and silently iterate 
over dchar, not char, even when the input is string, not dstring. 
(It's also possible to explicitly decode strings, see std.utf and 
std.uni.)


If you call into the Windows API, some functions require extra 
care if everything in your program is UTF-8. But I still agree 
with the approach to keep everything as string in your program, 
and then wrap the Windows API calls, as the UTF-8 Everywhere 
manifesto suggests:

http://utf8everywhere.org/

-- Simon


Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread Jonathan M Davis via Digitalmars-d-learn
On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole
>
> wrote:
> > Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> > The size of a code point is 1, 2 or 4 bytes.
>
> I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
> (UTF-32) bytes are referred to as "code units" and the size of a
> code point varies in UTF-8 and UTF-16.

Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 of them
(IIRC) in a code point. For UTF-16, a code unit is 16 bits, and there are
either 1 or 2 code units per code point. For UTF-32, a code unit is 32 bits,
and there is always 1 code unit per code point.

For better or worse (mostly worse), ranges then treat all strings as ranges
of code points and decode them to code points such that you get a range of dchar
(which means fun things like isRandomAccessRange!string and hasLength!string
are false). As I understand it, each code point is then something which can
be physically printed, but either way, it's not necessarily a full
character.
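
Those range traits can be verified at compile time. A minimal sketch, assuming a Phobos version in which auto-decoding is still in place:

```d
import std.range.primitives : ElementType, hasLength, isRandomAccessRange,
    walkLength;

void main()
{
    // As ranges, narrow strings present dchar elements...
    static assert(is(ElementType!string == dchar));
    static assert(is(ElementType!wstring == dchar));

    // ...so they advertise neither length nor random access.
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);

    // dstring needs no decoding, so it keeps both.
    static assert(hasLength!dstring);
    static assert(isRandomAccessRange!dstring);

    string s = "h\u00E9llo";
    assert(s.length == 6);     // array property: UTF-8 code units
    assert(s.walkLength == 5); // range view: code points
}
```

The array's own .length still compiles for a string; it is only the range trait hasLength that reports false, which is what generic code checks.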

Multiple code points can then be combined to make a grapheme cluster (which
then corresponds to what we'd normally consider a full character - e.g. a
letter and an accent can each be a code point which are then combined to
create an accented character). std.uni provides the functionality for
operating on graphemes.
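
For instance, a letter followed by a combining accent is two code points but one grapheme. A short sketch using std.uni.byGrapheme (the sample string is just an illustration):

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points
    assert(s.byGrapheme.walkLength == 1); // grapheme clusters
}
```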

And std.utf.byCodeUnit can be used to treat strings as ranges of code units
instead of code points (and a fair bit of Phobos takes the solution of
specializing range-based code for strings to avoid the auto-decoding).
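
A brief sketch of byCodeUnit in use; the sample string is arbitrary:

```d
import std.range.primitives : ElementType, hasLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "h\u00E9llo"; // U+00E9 is the two bytes C3 A9 in UTF-8

    auto units = s.byCodeUnit;

    // Elements are one-byte code units again, and the range has a length.
    static assert(ElementType!(typeof(units)).sizeof == 1);
    static assert(hasLength!(typeof(units)));

    assert(units.length == 6);
    assert(units[1] == 0xC3); // first byte of the encoded U+00E9
}
```

Because no decoding happens, range algorithms applied to the wrapped string work on code units and never throw on invalid UTF-8; the trade-off is that multi-byte characters are seen in pieces.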

All in all, the whole thing is annoyingly complicated, though at least D is
much more explicit about it than most languages, and I suspect that your
average D programmer is better educated about Unicode than your average
programmer. And having to figure out why the heck strings and wstrings act
so bizarrely as ranges does have the positive side effect of putting it even
more in your face than it would be otherwise, making it that much more
likely that folks are going to learn about Unicode - though I still think
that we'd be better off if we could ever figure out how to treat all strings
as ranges of code units without breaking everything in the process. :|

- Jonathan M Davis



Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread Tony via Digitalmars-d-learn
On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:
> Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> The size of a code point is 1, 2 or 4 bytes.


I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4 
(UTF-32) bytes are referred to as "code units" and the size of a 
code point varies in UTF-8 and UTF-16.


Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread Jonathan M Davis via Digitalmars-d-learn
On Monday, January 15, 2018 02:22:09 Chris P via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
> > On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> >> [...]
> >>
> >  string == immutable( char)[],  char == utf8
> >
> > wstring == immutable(wchar)[], wchar == utf16
> > dstring == immutable(dchar)[], dchar == utf32
> >
> > Unless you are dealing with Windows, in which case you may need
> > to consider using wstring, there is very little reason to use
> > anything but string.
> >
> > N.B. when you iterate over a string there are a number of
> > different "flavours" (for want of a better term) you can
> > iterate over: bytes, Unicode code points and graphemes (I'm
> > possibly forgetting some). Have a look in std.uni and related
> > modules. Iteration in Phobos defaults to code points, I think.
> >
> > TLDR use string.
>
> Thank you (and rikki) for replying. Actually, I am using Windows
> (Doh!) but I now understand. Cheers!

Even with Windows, there usually isn't any reason to use wstring. The only
reason that wstring might be more desirable on Windows is that you need
UTF-16 when dealing with the Windows API calls, and that's normally only
going to come up if you're not writing platform-independent code. The common
stuff such as file access is already wrapped by Phobos (e.g. in std.file and
std.stdio), so most programs don't need to worry about the Windows API
calls. And even if you do, the best practice generally is to use string
everywhere in your code and then only convert to a zero-terminated wchar*
when making the Windows API calls (either by actually allocating a
zero-terminated wchar* or using a static array with the appropriate wchar
set to 0, depending on the context).

If you have to do a ton with Windows API calls, at some point, it arguably
becomes better to just keep them as wstrings to avoid the conversions, but
even then, because strings in D aren't zero-terminated, and the C API calls
usually require them to be, you're often forced to copy the string to pass
it to a Windows API call anyway, in which case, you lose most of the benefit
of keeping stuff around in wstrings instead of just using strings
everywhere.

If you do need to worry about calling a Windows API function, then check out toUTFz
in std.utf, since it will allow you to easily convert to zero-terminated
strings of any character type (std.string.toStringz handles zero-terminated
strings as well, but just for string).
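
A minimal sketch of that conversion step. No actual Windows function is called, so it runs on any platform; zLength is a made-up helper that only confirms where the terminator ended up.

```d
import std.string : toStringz;
import std.utf : toUTFz;

// Hypothetical helper: counts code units up to (not including) the
// terminating zero.
size_t zLength(C)(const(C)* p)
{
    size_t n;
    while (p[n] != 0)
        ++n;
    return n;
}

void main()
{
    string s = "h\u00E9llo"; // kept as UTF-8 throughout the program

    // Convert only at the API boundary: zero-terminated UTF-16,
    // the form a wide-character Windows function expects.
    const(wchar)* w = s.toUTFz!(const(wchar)*);
    assert(zLength(w) == 5); // 5 UTF-16 code units

    // For narrow C APIs, toStringz produces a zero-terminated char*.
    const(char)* z = s.toStringz;
    assert(zLength(z) == 6); // 6 UTF-8 code units
}
```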

- Jonathan M Davis



Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread Chris P via Digitalmars-d-learn

On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
> On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> > [...]
>
>  string == immutable( char)[],  char == utf8
> wstring == immutable(wchar)[], wchar == utf16
> dstring == immutable(dchar)[], dchar == utf32
>
> Unless you are dealing with Windows, in which case you may need
> to consider using wstring, there is very little reason to use
> anything but string.
>
> N.B. when you iterate over a string there are a number of
> different "flavours" (for want of a better term) you can
> iterate over: bytes, Unicode code points and graphemes (I'm
> possibly forgetting some). Have a look in std.uni and related
> modules. Iteration in Phobos defaults to code points, I think.
>
> TLDR use string.


Thank you (and rikki) for replying. Actually, I am using Windows 
(Doh!) but I now understand. Cheers!


Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread Nicholas Wilson via Digitalmars-d-learn

On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> Hello,
>
> I'm extremely new to D and have a quick question regarding
> common practice when using strings. Is usage of one type over
> the others encouraged? When using 'string' it appears there is
> a length mismatch between the string length and the char array
> if large Unicode characters are used. So I figured I'd ask.
>
> Thanks in advance,
>
> Chris P - Tampa


 string == immutable( char)[],  char == utf8
wstring == immutable(wchar)[], wchar == utf16
dstring == immutable(dchar)[], dchar == utf32

Unless you are dealing with Windows, in which case you may need 
to consider using wstring, there is very little reason to use 
anything but string.


N.B. when you iterate over a string there are a number of 
different "flavours" (for want of a better term) you can iterate 
over: bytes, Unicode code points and graphemes (I'm possibly 
forgetting some). Have a look in std.uni and related modules. 
Iteration in Phobos defaults to code points, I think.


TLDR use string.
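
The three aliases and their code-unit sizes can be confirmed with a few static asserts. A small sketch; the sample character is an arbitrary one from outside the Basic Multilingual Plane:

```d
void main()
{
    // The string types are plain aliases for immutable arrays.
    static assert(is(string  == immutable(char)[]));
    static assert(is(wstring == immutable(wchar)[]));
    static assert(is(dstring == immutable(dchar)[]));

    // Code unit sizes in bytes: UTF-8, UTF-16, UTF-32.
    static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

    // The same code point (U+1F600) in each encoding; .length counts
    // code units.
    string  a = "\U0001F600";
    wstring b = "\U0001F600"w;
    dstring c = "\U0001F600"d;
    assert(a.length == 4);
    assert(b.length == 2);
    assert(c.length == 1);
}
```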



String Type Usage. String vs DString vs WString

2018-01-14 Thread Chris P via Digitalmars-d-learn

Hello,

I'm extremely new to D and have a quick question regarding common 
practice when using strings. Is usage of one type over the others 
encouraged? When using 'string' it appears there is a length 
mismatch between the string length and the char array if large 
Unicode characters are used. So I figured I'd ask.


Thanks in advance,

Chris P - Tampa


Re: String Type Usage. String vs DString vs WString

2018-01-14 Thread rikki cattermole via Digitalmars-d-learn

On 15/01/2018 2:05 AM, Chris P wrote:
> Hello,
>
> I'm extremely new to D and have a quick question regarding common
> practice when using strings. Is usage of one type over the others
> encouraged? When using 'string' it appears there is a length mismatch
> between the string length and the char array if large Unicode characters
> are used. So I figured I'd ask.
>
> Thanks in advance,
>
> Chris P - Tampa


D's strings are Unicode.

Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
The size of a code point is 1, 2 or 4 bytes.
But here is the thing, what is displayed (a character) could be multiple 
code points and these can be combined to form a grapheme.


So yes, there will be length mismatches between them :)
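
To make the mismatch concrete, here is a rough sketch that measures the same text four ways. The Lengths struct and the measure function are invented for this illustration; the conversions use std.conv.to and the grapheme count uses std.uni.byGrapheme.

```d
import std.conv : to;
import std.range.primitives : walkLength;
import std.uni : byGrapheme;

// Hypothetical record of the different "lengths" of one piece of text.
struct Lengths
{
    size_t utf8;      // code units as string
    size_t utf16;     // code units as wstring
    size_t utf32;     // code units as dstring (equals code points)
    size_t graphemes; // user-perceived characters
}

// Hypothetical helper: measures s in every encoding and in graphemes.
Lengths measure(string s)
{
    return Lengths(s.length,
                   s.to!wstring.length,
                   s.to!dstring.length,
                   s.byGrapheme.walkLength);
}

void main()
{
    assert(measure("hello") == Lengths(5, 5, 5, 5));      // ASCII: all agree
    assert(measure("\U0001F600") == Lengths(4, 2, 1, 1)); // one emoji
    assert(measure("e\u0301") == Lengths(3, 2, 2, 1));    // e + combining accent
}
```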