Re: Proposal: clean up semantics of array literals vs string literals

2012-10-04 Thread Bernard Helyer

On Tuesday, 2 October 2012 at 14:03:36 UTC, monarch_dodra wrote:
If you want 0 termination, then make it explicit, that's my 
opinion.


That ship has long since sailed. You'll break code in an
incredibly dangerous way if you were to change it now.


Re: Proposal: clean up semantics of array literals vs string literals

2012-10-04 Thread Bernard Helyer
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
First, I think zero-terminated strings shouldn't be needed 
frequently enough in D code to make this necessary.


My experience has been much different. Interfacing with C occurs
in nearly every D program I write, and I usually end up passing
a string literal. Anecdotes!




Re: Proposal: clean up semantics of array literals vs string literals

2012-10-04 Thread Jakob Ovrum

On Thursday, 4 October 2012 at 07:57:16 UTC, Bernard Helyer wrote:
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
First, I think zero-terminated strings shouldn't be needed 
frequently enough in D code to make this necessary.


My experience has been much different. Interfacing with C occurs
in nearly every D program I write, and I usually end up passing
a string literal. Anecdotes!


Agreed. I'm always happy when I find that the particular C API I 
am working with supports passing strings as a pointer/length pair 
:)
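
A C API that takes a pointer/length pair can be sketched with the standard `snprintf` precision form `%.*s`. The helper name `bracketed` and the 64-byte buffer are assumptions made for this illustration; the point is only that no terminating \0 is needed after the slice.

```d
import core.stdc.stdio : snprintf;

/// Formats a D slice through a C function without requiring a \0 after `s`.
/// `bracketed` is a hypothetical helper, not part of any library.
string bracketed(const(char)[] s)
{
    char[64] buf;
    // "%.*s" consumes an int precision followed by the character pointer,
    // so the C side reads at most s.length bytes.
    const n = snprintf(buf.ptr, buf.length, "[%.*s]", cast(int) s.length, s.ptr);
    if (n < 0)
        return null;
    size_t used = cast(size_t) n;
    if (used >= buf.length)
        used = buf.length - 1;   // output was truncated by snprintf
    return buf[0 .. used].idup;
}
```

Passing a slice such as `"hello world"[0 .. 5]` works here even though the byte after the slice is a space rather than a \0.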


Anyway, toStringz (and the wchar and dchar equivalents in 
std.utf) needs to be fixed regardless - it currently does a 
dangerous optimization if the string is immutable, otherwise it 
unconditionally concatenates. We cannot rely on strings being GC 
allocated based on mutability. Memory is outside the scope of the 
D type system - we cannot make assumptions about memory based on 
types.




Re: Proposal: clean up semantics of array literals vs string literals

2012-10-04 Thread Don Clugston

On 02/10/12 17:14, Andrei Alexandrescu wrote:

On 10/2/12 7:11 AM, Don Clugston wrote:

The problem
---

String literals in D are a little bit magical; they have a trailing \0.

[snip]

I don't mean to be Debbie Downer on this because I reckon it addresses
an issue that some have, although I never do. With that warning, a few
candid opinions follow.

First, I think zero-terminated strings shouldn't be needed frequently
enough in D code to make this necessary.


[snip]

You're missing the point, a bit. The zero-terminator is only one symptom 
of the underlying problem: string literals and array literals have the 
same type but different semantics.

The other symptoms are:
* the implicit .dup that happens with array literals, but not string 
literals.
This is a silent performance killer. It's probably the most common 
performance bug we find in our code, and it's completely ungreppable.


* string literals are polysemous with width (c, w, d) but array literals 
are not (they are polysemous with constness).

For example,
"abc" ~ 'ü'
is legal, but
['a', 'b', 'c'] ~ 'ü'
is not.
This has nothing to do with the zero terminator.
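
The implicit .dup and the constness difference can be observed with a small sketch. The function names are made up for illustration, and the pointer-identity behaviour of the string literal is what I would expect from current compilers rather than something the thread states.

```d
// Hypothetical helpers, for illustration only.
string fromStringLiteral() { return "abc"; }            // refers to static data
char[] fromArrayLiteral()  { return ['a', 'b', 'c']; }  // fresh allocation per call

// Same element type, different semantics: only the array literal
// converts to a mutable char[].
static assert(!__traits(compiles, { char[] y = "abc"; }));
static assert( __traits(compiles, { char[] x = ['a', 'b', 'c']; }));
```

Because every call to `fromArrayLiteral` allocates, writing to one result leaves the other untouched, which is exactly why the hidden copy is easy to miss in a hot loop.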



Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Peter Alexander
On Tuesday, 2 October 2012 at 15:14:10 UTC, Andrei Alexandrescu 
wrote:
However, so far I held off of defining such a range because 
C-strings are seldom useful in D code [...]


I think your view of what is common in D code is not 
representative. You are primarily a library writer, which means 
you rarely have to interface with other code. Please correct me 
if I'm wrong, but I don't believe you've written much 
application-level D code.


For people that write applications, we have the unfortunate chore 
of having to call lots of C APIs to get things done. There's a 
long list of things for which there is no D interface (graphics, 
audio, input, GUI, database, platform APIs, various 3rd party 
libs). Invariably these interfaces require C strings. In short, 
if you write applications in D, you need C strings.


I don't know what the right decision is here, but please do not 
say that C-strings are seldom useful in D code.






Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread deadalnix

On 02/10/2012 15:12, Don Clugston wrote:

On 02/10/12 13:26, deadalnix wrote:

Well the whole mess comes from the fact that D conflates C strings and D
strings.

The first problem comes from the fact that D arrays are implicitly
convertible to pointers. So calling a D function that expects a char* is
possible with a D string even if it is unsafe and will not work in the
general case.

The fact that D provides tricks that make it work in special cases
is harmful, as previous discussions have shown (many D programmers assume
that this will always work because of toy tests they have made, whereas in
the general case it won't and toStringz must be used).

The only sane solution I can think of is to:
- disallow implicit conversion of slices to pointers. .ptr is made for that.
- Do not put any trailing 0 in string literals, unless it is specified
explicitly ( "foobar\0" ).
- Except if a const(char)* is expected from the string literal. In that
case it becomes a Cstring literal, with a trailing 0. This is made to
allow uses like printf("foobar");

In other words, the receiver type is used to decide whether the compiler
generates a string literal or a Cstring literal.


This still doesn't solve the problem of the difference between array
literals and string literals (the magical implicit .dup), which is the
key problem I'm trying to solve.



OK, in fact we have two different and unrelated problems here. I have to
say I have no idea how to solve the second one.


Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Andrei Alexandrescu

On 10/2/12 7:11 AM, Don Clugston wrote:

The problem
---

String literals in D are a little bit magical; they have a trailing \0.

[snip]

I don't mean to be Debbie Downer on this because I reckon it addresses 
an issue that some have, although I never do. With that warning, a few 
candid opinions follow.


First, I think zero-terminated strings shouldn't be needed frequently 
enough in D code to make this necessary.


Second, a simple and workable solution to this would be to address the 
matter dynamically: make toStringz opportunistically look whether 
there's a \0 beyond the end of the string, EXCEPT when the string 
happens to end exactly at a page boundary (in which case accessing 
memory beyond the end of the string may produce a page fault). With this 
simple dynamic test we don't need precise and stringent rules for the 
implementation.


Third, the complex set of rules proposed pushes the number of cases in 
which the \0 is guaranteed, but doesn't make for a clear and easy to 
remember boundary. Therefore people will need to remember some more 
rules to make sure they can, well, avoid a call to toStringz.
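
The dynamic test described in the second point could look roughly like the sketch below. The name `toStringzOpportunistic` and the fixed 4096-byte page size are assumptions for illustration; reading one byte past the slice is outside anything the type system guarantees, so this is a sketch of the idea and not a recommendation.

```d
import std.string : toStringz;

// Assumption for illustration: 4 KiB pages. Real code would ask the OS.
enum size_t assumedPageSize = 4096;

/// Hypothetical helper: returns s.ptr when a \0 already follows the string,
/// otherwise falls back to a terminated copy via toStringz.
immutable(char)* toStringzOpportunistic(string s) @system
{
    if (s.length == 0)
        return "".ptr;

    immutable(char)* end = s.ptr + s.length;

    // Never peek when `end` is the first byte of a new page: that page
    // may be unmapped and the read could fault.
    const onPageBoundary = (cast(size_t) end % assumedPageSize) == 0;
    if (!onPageBoundary && *end == '\0')
        return s.ptr;

    return toStringz(s);
}
```

For a slice such as `"abcdef"[0 .. 3]` the byte after the slice is `'d'`, so the sketch falls through to the copying path.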


On 10/2/12 10:55 AM, Regan Heath wrote:

Recent discussions on the zero terminated string problems and
inconsistency of string literals has me, again, wondering why D
doesn't have a 'type' to represent C's zero terminated strings.  It
seems to me that having a type, and typing C functions with it would
solve a lot of problems.

[snip]

I am probably missing something obvious, or I have forgotten one of
the array/slice complexities which makes this a nightmare.


You're not missing anything and defining a zero-terminated type is 
something I considered doing and have been highly interested in. My 
interest is motivated by the fact that sentinel-terminated structures 
are a very interesting example of forward ranges that are also 
contiguous. That sets them apart from both singly-linked lists and 
simple arrays, and gives them interesting properties.


I'd be interested in defining the more general:

struct SentinelTerminatedSlice(T, T terminator)
{
private T* data;
...
}

That would be a forward range and the instantiation 
SentinelTerminatedSlice!(char, 0) would be CString.


However, so far I held off of defining such a range because C-strings 
are seldom useful in D code and there are not many other compelling 
examples of sentinel-terminated ranges. Maybe it's time to dust off that 
idea, I'd love it if we gathered enough motivation for it.
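
One way the elided body of the struct might be filled in is sketched here. Everything beyond the `data` field and the template signature is an assumption for illustration (the constructor, the range primitives, and instantiating with `immutable(char)` so that a string literal can be wrapped).

```d
import std.range.primitives : isForwardRange, walkLength;

struct SentinelTerminatedSlice(T, T terminator)
{
    private T* data;

    // Assumed constructor: the caller promises that a terminator follows.
    this(T* start) { data = start; }

    @property bool empty() const { return *data == terminator; }
    @property T front() const { return *data; }
    void popFront() { ++data; }

    // Copying the pointer is enough to save the position.
    @property SentinelTerminatedSlice save() { return this; }
}

// Hypothetical alias; immutable(char) lets string literals be wrapped directly.
alias CString = SentinelTerminatedSlice!(immutable(char), '\0');

static assert(isForwardRange!CString);
```

With this sketch, `CString("hello").walkLength` walks to the sentinel much like `strlen` would, since the range carries no length of its own.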



Andrei


Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread monarch_dodra

On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:

[SNIP]
A proposal to clean up this mess
[SNIP]


While I think it is convenient to be able to write 
'printf("world");', as you point out, I think that the fact that 
it works "inconsistently" (and by that, I mean there are rules 
and exceptions), is even more dangerous.


If at all possible, I'd rather side with consistency than with the 
"we got your back... except when we don't" approach, i.e. strings 
are NEVER null terminated.


In theory, how often do you *really* need null terminated 
strings? And when you do, wouldn't it be safer to just write 
'printf("world\0")'? or 'printf(str ~ "world" ~ '\0');' rather 
than "Am I in a case where it is null terminated? Yeah... 90% 
confident I am..."


If you want 0 termination, then make it explicit, that's my 
opinion.


Besides, as you said, the null termination is not documented, so 
anything relying on it is a bug really. Just an observation of an 
implementation detail.
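
Explicit termination at run time can be sketched as follows. The helper name `explicitCLength` is made up for illustration; note that an explicit '\0' becomes part of the D string and is counted in its length.

```d
import core.stdc.string : strlen;

/// Hypothetical helper: builds a C string with an explicit terminator
/// and measures it from the C side.
size_t explicitCLength(string prefix)
{
    // The '\0' is an ordinary element of `z`, so z.length includes it.
    string z = prefix ~ "world" ~ '\0';
    return strlen(z.ptr);
}
```

The same holds for literals: `"world\0".length` is 6, while C code sees five characters.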


Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread kenji hara
2012/10/2 Don Clugston:
> The problem
> ---
>
> String literals in D are a little bit magical; they have a trailing \0. This
> means that it is possible to write,
>
> printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C.
> This trailing \0 is mentioned in the spec but only incidentally, and
> generally in connection with printf.
>
> But the semantics are not well defined.
>
> printf("Hello, W" ~ "orld!\n");
>
> Does this have a trailing \0 ? I think it should, because it improves
> readability of string literals that are longer than one line. Currently DMD
> adds a \0, but it is not in the spec.
>
> Now consider array literals.
>
> printf(['H','e', 'l', 'l','o','\n']);
>
> Does this have a trailing \0 ? Currently DMD does not put one in.
> How about ['H','e', 'l', 'l','o'] ~ " World!\n"  ?
>
> And "Hello " ~ ['W','o','r','l','d','\n']   ?
>
> And "Hello World!" ~ '\n' ?
> And  null ~ "Hello World!\n" ?
>
> Currently DMD puts \0 in some cases but not others, and it's rather random.
>
> The root cause is that this trailing zero is not part of the type, it's part
> of the literal. There are no rules for how literals are propagated inside
> expressions, they are just literals. This is a mess.
>
> There is a second difference.
> Array literals of char type, have completely different semantics from string
> literals. In module scope:
>
> char[] x = ['a'];  // OK -- array literals can have an implicit .dup
> char[] y = "b";    // illegal
>
> This is a big problem for CTFE, because for CTFE, a string is just a
> compile-time value, it's neither string literal nor array literal!
>
> See bug 8660 for further details of the problems this causes.
>
>
> A proposal to clean up this mess
> 
>
> Any compile-time value of type immutable(char)[] or const(char)[] behaves as
> string literals currently do, and will have a \0 appended when it is stored
> in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.
>
> Any value of type char[], which is generated at compile time, will not have
> the trailing \0, and it will do an implicit dup (as current array literals
> do).
>
> char [] foo()
> {
> return "abc";
> }
>
> char [] x = foo();
>
> // x does not have a trailing \0, and it is implicitly duped, even though it
> // was not declared with an array literal.
>
> ---
> So that the difference between string literals and char array literals would
> simply be that the latter are polysemous. There would be no semantics
> associated with the form of the literal itself.
>
>
> We still have this oddity:
>
>
> void foo(char qqq = 'b') {
>
>     string x = "abc";            // trailing \0
>     string y = ['a', 'b', 'c'];  // trailing \0
>     string z = ['a', qqq, 'c'];  // no trailing \0
> }
>
> This is because we made the (IMHO mistaken) decision to allow variables
> inside array literals.
> This is the reason why I listed _compile time value_ in the requirement for
> having a \0, rather than entirely basing it on the type.
>
> We could fix that with a language change: an array literal which contains a
> variable should not be of immutable type. It should be of mutable type (or
> const, in the case where it contains other, immutable values).
>
> So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even
> though w is allocated on the heap).
>
> But that's a separate proposal from the one I'm making here. I just need a
> decision on the main proposal so that I can fix a pile of CTFE bugs.

Maybe your proposal is correct.
I think the key idea is *polysemous typed string literal*.

Based on the Ideal D Interpreter in my brain, the organized rules
would be as follows.

1-1) At the semantic level, D should have just one polysemous string
literal, which is "an array of char".
1-2) At the token level, D has two representations for the polysemous
string literal: "str" and ['s','t','r'].

2) The polysemous string literal is implicitly convertible to
[wd]?char[] and immutable([wd]?char)[] (I think const([wd]?char)[] is
not needed, because immutable([wd]?char)[] is implicitly convertible to
it).

3) The result of concatenating polysemous literals is still
polysemous, but its representation depends on both sides
of the operator.

   "str" ~ "str"; // "strstr"
   "str" ~ ['s','t','r']; // ['s','t','r','s','t','r']
   "str" ~ 's';   // "strs"
   ['s','t','r'] ~ 's';   // ['s','t','r','s']
   "str" ~ null;  // "str"
   ['s','t','r'] ~ null;  // ['s','t','r']

4) After semantics _and_ optimization, a polysemous string literal which
is represented as
 4-1) "str" is typed as immutable([wd]?char)[] (the char type
depends on the literal suffix).
 4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type
depends on the common type of its elements).

5) In the object file generation phase, a string literal which is typed as
  5-1) immutabl

Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Don Clugston

On 02/10/12 13:18, Tobias Pankrath wrote:

On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:

The problem
---

String literals in D are a little bit magical; they have a trailing
\0. This means that it is possible to write,

printf("Hello, World!\n");

without including a trailing \0. This is important for compatibility
with C. This trailing \0 is mentioned in the spec but only
incidentally, and generally in connection with printf.

But the semantics are not well defined.

printf("Hello, W" ~ "orld!\n");


If every string literal is \0-terminated, then there should be two \0 in
the final string. I guess that's not the case and that's actually my
preferred behaviour, but the spec should make it crystal clear in which
situations a string literal gets a terminator and in which not.


The \0 is *not* part of the string, it lies after the string.
It's as if all memory is cleared, then the string literals are copied 
into it, with a gap of at least one byte between each. The 'trailing 0' 
is not part of the literal, it's the underlying cleared memory.


At least, that's how I understand it. The spec is very vague.



Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Don Clugston

On 02/10/12 13:26, deadalnix wrote:

Well the whole mess comes from the fact that D conflates C strings and D
strings.

The first problem comes from the fact that D arrays are implicitly
convertible to pointers. So calling a D function that expects a char* is
possible with a D string even if it is unsafe and will not work in the
general case.

The fact that D provides tricks that make it work in special cases
is harmful, as previous discussions have shown (many D programmers assume
that this will always work because of toy tests they have made, whereas in
the general case it won't and toStringz must be used).

The only sane solution I can think of is to:
  - disallow implicit conversion of slices to pointers. .ptr is made for that.
  - Do not put any trailing 0 in string literals, unless it is specified
explicitly ( "foobar\0" ).
  - Except if a const(char)* is expected from the string literal. In that
case it becomes a Cstring literal, with a trailing 0. This is made to
allow uses like printf("foobar");

In other words, the receiver type is used to decide whether the compiler
generates a string literal or a Cstring literal.


This still doesn't solve the problem of the difference between array 
literals and string literals (the magical implicit .dup), which is the 
key problem I'm trying to solve.




Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Don Clugston

On 02/10/12 14:02, Andrej Mitrovic wrote:

On 10/2/12, Don Clugston  wrote:

A proposal to clean up this mess


Any compile-time value of type immutable(char)[] or const(char)[]
behaves as string literals currently do, and will have a \0 appended when
it is stored in the executable.

ie,

enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
printf(hello);

will work.


What about these, will these pass?:

enum string x = "foo";
assert(x.length == 3);

void test(string x) { assert(x.length == 3); }
test(x);

If these don't pass the proposal will break code.


Yes, they pass. The \0 is not included in the string length. It's 
effectively in the data segment, not in the string.
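
The point that the \0 sits after the string rather than inside it can be checked with a short sketch. The names `sample`, `literalLength` and `byteAfterLiteral` are made up for illustration; the read past the end is only meaningful because the operand is a string literal.

```d
enum string sample = "foo";

/// The terminator is not counted in the length.
size_t literalLength() { return sample.length; }

/// Reads the byte that follows the literal's three characters.
/// Only valid here because `s` refers directly to a string literal.
char byteAfterLiteral()
{
    string s = "foo";
    return s.ptr[s.length];
}
```

So both of the assertions in the question hold, and the terminator is still available to C code that receives the pointer.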





Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Andrej Mitrovic
On 10/2/12, Don Clugston  wrote:
> A proposal to clean up this mess
> 
>
> Any compile-time value of type immutable(char)[] or const(char)[]
> behaves as string literals currently do, and will have a \0 appended when
> it is stored in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.

What about these, will these pass?:

enum string x = "foo";
assert(x.length == 3);

void test(string x) { assert(x.length == 3); }
test(x);

If these don't pass the proposal will break code.


Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread deadalnix
Well the whole mess comes from the fact that D conflates C strings and D 
strings.


The first problem comes from the fact that D arrays are implicitly 
convertible to pointers. So calling a D function that expects a char* is 
possible with a D string even if it is unsafe and will not work in the 
general case.


The fact that D provides tricks that make it work in special cases 
is harmful, as previous discussions have shown (many D programmers assume 
that this will always work because of toy tests they have made, whereas in 
the general case it won't and toStringz must be used).


The only sane solution I can think of is to:
 - disallow implicit conversion of slices to pointers. .ptr is made for that.
 - Do not put any trailing 0 in string literals, unless it is specified 
explicitly ( "foobar\0" ).
 - Except if a const(char)* is expected from the string literal. In that 
case it becomes a Cstring literal, with a trailing 0. This is made to 
allow uses like printf("foobar");


In other words, the receiver type is used to decide whether the compiler 
generates a string literal or a Cstring literal.


Other additions of 0 are just confusing, and will make incorrect code 
work in special cases, which is something you usually don't want. Code 
that works by accident often backfires in spectacular ways at the least 
expected moment.
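
For comparison, here is a sketch of how the literal-versus-slice distinction behaves in the D2 compilers I am familiar with, where only the literal converts to a pointer implicitly. That may differ from the compiler version discussed in this thread, and the helper names are made up for illustration.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// A string literal is accepted where a const(char)* is expected ...
static assert( __traits(compiles, { const(char)* p = "foobar"; }));
// ... but a string variable is not; .ptr or toStringz has to be spelled out.
static assert(!__traits(compiles, { string s = "foobar"; const(char)* p = s; }));

/// Hypothetical helpers showing both call styles.
size_t lengthOfLiteral() { return strlen("foobar"); }
size_t lengthOfSlice(string s) { return strlen(s.toStringz); }
```

Calling `lengthOfSlice("foobar baz"[0 .. 6])` gives 6 only because `toStringz` supplies the terminator; passing `s.ptr` directly would let `strlen` run on into `" baz"`.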


Re: Proposal: clean up semantics of array literals vs string literals

2012-10-02 Thread Tobias Pankrath

On Tuesday, 2 October 2012 at 11:10:46 UTC, Don Clugston wrote:

The problem
---

String literals in D are a little bit magical; they have a 
trailing \0. This means that it is possible to write,


printf("Hello, World!\n");

without including a trailing \0. This is important for 
compatibility with C. This trailing \0 is mentioned in the spec 
but only incidentally, and generally in connection with printf.


But the semantics are not well defined.

printf("Hello, W" ~ "orld!\n");

If every string literal is \0-terminated, then there should be 
two \0 in the final string. I guess that's not the case and 
that's actually my preferred behaviour, but the spec should make 
it crystal clear in which situations a string literal gets a 
terminator and in which not.