2012/10/2 Don Clugston <d...@nospam.com>:
> The problem
> -----------
>
> String literals in D are a little bit magical; they have a trailing \0. This
> means that it is possible to write,
>
> printf("Hello, World!\n");
>
> without including a trailing \0. This is important for compatibility with C.
> This trailing \0 is mentioned in the spec but only incidentally, and
> generally in connection with printf.
>
> But the semantics are not well defined.
>
> printf("Hello, W" ~ "orld!\n");
>
> Does this have a trailing \0 ? I think it should, because it improves
> readability of string literals that are longer than one line. Currently DMD
> adds a \0, but it is not in the spec.
>
> Now consider array literals.
>
> printf(['H','e', 'l', 'l','o','\n']);
>
> Does this have a trailing \0 ? Currently DMD does not put one in.
> How about ['H','e', 'l', 'l','o'] ~ " World!\n" ?
>
> And "Hello " ~ ['W','o','r','l','d','\n'] ?
>
> And "Hello World!" ~ '\n' ?
> And null ~ "Hello World!\n" ?
>
> Currently DMD puts \0 in some cases but not others, and it's rather random.
>
> The root cause is that this trailing zero is not part of the type, it's part
> of the literal. There are no rules for how literals are propagated inside
> expressions, they are just literals. This is a mess.
>
> There is a second difference.
> Array literals of char type have completely different semantics from string
> literals. In module scope:
>
> char[] x = ['a']; // OK -- array literals can have an implicit .dup
> char[] y = "b"; // illegal
>
> This is a big problem for CTFE, because for CTFE, a string is just a
> compile-time value, it's neither string literal nor array literal!
>
> See bug 8660 for further details of the problems this causes.
>
>
> A proposal to clean up this mess
> --------------------------------
>
> Any compile-time value of type immutable(char)[] or const(char)[] behaves as
> string literals currently do, and will have a \0 appended when it is stored
> in the executable.
>
> ie,
>
> enum hello = ['H', 'e', 'l', 'l', 'o', '\n'];
> printf(hello);
>
> will work.
>
> Any value of type char[], which is generated at compile time, will not have
> the trailing \0, and it will do an implicit dup (as current array literals
> do).
>
> char [] foo()
> {
>     return "abc";
> }
>
> char [] x = foo();
>
> // x does not have a trailing \0, and it is implicitly duped, even though it
> // was not declared with an array literal.
>
> -------------------
> So that the difference between string literals and char array literals would
> simply be that the latter are polysemous. There would be no semantics
> associated with the form of the literal itself.
>
>
> We still have this oddity:
>
>
> void foo(char qqq = 'b') {
>
>     string x = "abc"; // trailing \0
>     string y = ['a', 'b', 'c']; // trailing \0
>     string z = ['a', qqq, 'c']; // no trailing \0
> }
>
> This is because we made the (IMHO mistaken) decision to allow variables
> inside array literals.
> This is the reason why I listed _compile time value_ in the requirement for
> having a \0, rather than entirely basing it on the type.
>
> We could fix that with a language change: an array literal which contains a
> variable should not be of immutable type. It should be of mutable type (or
> const, in the case where it contains other, immutable values).
>
> So char [] w = ['a', qqq, 'c']; should compile (it currently doesn't, even
> though w is allocated on the heap).
>
> But that's a separate proposal from the one I'm making here. I just need a
> decision on the main proposal so that I can fix a pile of CTFE bugs.
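
For reference, the one part of the quoted behaviour that is already dependable (a plain string literal is followed by a \0 that is not counted in .length) can be checked with a short D program. This is only a sketch of existing literal behaviour, not of the proposal. The helper name literalHasTrailingNul is made up for illustration, and it is only meaningful when handed a string literal. For anything built at run time, std.string.toStringz is the explicit route.

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;
import std.string : toStringz;

// Hypothetical helper (not from the thread): peeks one element past the end
// of the slice. Only call it with a string literal; for any other slice the
// read past the end is not guaranteed to be valid.
bool literalHasTrailingNul(string lit)
{
    return lit.ptr[lit.length] == '\0';
}

void main()
{
    // A plain literal can be handed to C directly.
    printf("Hello, World!\n");

    // The terminator is stored after the literal but is not part of .length.
    string s = "abc";
    assert(s.length == 3);
    assert(strlen(s.ptr) == 3);
    assert(literalHasTrailingNul("abc"));

    // For a char array built at run time, do not rely on a terminator being
    // there; toStringz returns a pointer to a zero-terminated string.
    char[] buf = ['H', 'i'];
    assert(strlen(toStringz(buf)) == 2);
    printf("%s\n", toStringz(buf));
}
```

The cases Don lists as "rather random" (concatenations and char array literals) are deliberately left out, since their behaviour is exactly what is under discussion.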
Maybe your proposal is correct. I think the key idea is the *polysemous typed
string literal*.

Based on the Ideal D Interpreter in my brain, the organized rules would be as
follows.

1-1) At the semantic level, D should have just one polysemous string literal,
     which is "an array of char".
1-2) At the token level, D has two representations of the polysemous string
     literal: "str" and ['s','t','r'].

2) The polysemous string literal is implicitly convertible to [wd]?char[] and
   immutable([wd]?char)[]. (I think const([wd]?char)[] is not needed, because
   immutable([wd]?char)[] is implicitly convertible to it.)

3) The result of concatenating polysemous literals is still polysemous, but
   its representation depends on both sides of the operator.

   "str" ~ "str";          // "strstr"
   "str" ~ ['s','t','r'];  // ['s','t','r','s','t','r']
   "str" ~ 's';            // "strs"
   ['s','t','r'] ~ 's';    // ['s','t','r','s']
   "str" ~ null;           // "str"
   ['s','t','r'] ~ null;   // ['s','t','r']

4) After semantic analysis _and_ optimization, a polysemous string literal
   that is represented as
4-1) "str" is typed as immutable([wd]?char)[] (the char type depends on the
     literal suffix).
4-2) ['s','t','r'] is typed as ([wd]?char)[] (the char type depends on the
     common type of its elements).

5) In the object file generation phase, a string literal that is typed as
5-1) immutable([wd]?char)[] is stored in the executable and implicitly
     terminated with \0.
5-2) [wd]?char[] is stored in the executable as the original image and
     implicitly 'dup'ed at runtime.

----
Additionally, in the following cases, both concatenations should generate
polysemous string literals at CT and RT, because after concatenating chars
and char arrays, the newly allocated strings are *purely immutable* values
and implicitly convertible to mutable.

immutable char ic = 'a';
pragma(msg, typeof(['s', 't', ic, 'r']));   // prints const(char)[]
immutable(char)[] s = ['s', 't', ic, 'r'];  // BUT, should be allowed

char mc = 'a';
pragma(msg, typeof("st"~mc~"r"));  // prints const(char)[]
char[] s = "st"~mc~"r";            // BUT, should be allowed

Kenji Hara