On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
On 19/10/12 16:07, foobar wrote:
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:

We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup

That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4

Come on, "assuming the code points are valid". It says so 4 lines above!

It isn't the same.
Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they include the high bits that indicate the length of each char).
\u makes dchars.

"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.

Yes, \u requires code points rather than the code units of a specific UTF encoding and, as you correctly point out, it takes four hex digits and not two. This is a very reasonable choice to prevent or reduce Unicode encoding errors.
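
To make the code point versus code unit distinction concrete, here is a minimal sketch, assuming a reasonably recent D compiler with Phobos. The helper name utf8Bytes is made up for illustration; std.string.representation is the library call that exposes the bytes behind a string.

```d
import std.stdio : writefln;
import std.string : representation;

// Hypothetical helper: the UTF-8 code units (raw bytes) stored in a D string.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    // \u00A1 names the code point U+00A1; a `string` stores it as UTF-8.
    auto bytes = utf8Bytes("\u00A1");

    // One code point, but two non-zero bytes: C2 A1.
    assert(bytes.length == 2);
    assert(bytes[0] == 0xC2 && bytes[1] == 0xA1);

    // Print the bytes in hex for inspection.
    writefln("%(%02X %)", bytes);
}
```

So the single byte A1 on its own is not what "\u00A1" stores, which is the difference being argued about above.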

http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters."

I _already_ said that I consider this a major semantic bug, as it violates the principle of least surprise: the programmer expects the D string types, which are Unicode according to the spec, to, well, actually contain _valid_ Unicode and _not_ arbitrary binary data. Given the above, the design of \u makes perfect sense for _strings_: you can use _valid_ code points (not code units) in hex form.

General-purpose binary data (i.e. _not_ UTF-encoded Unicode text), as I also _already_ said, should IMO be stored either as ubyte[] or, better yet, in its own type that ensures the correct invariants for the data, be it audio, video, or just a different text encoding.

In neither case is the hex string relevant, IMO. In the former it potentially violates the type's invariant, and in the latter we already have array literals.
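
As a rough sketch of the array-literal alternative (the name payload and its byte values are purely illustrative, borrowed from the A1/B2/C3 digits used earlier in the thread):

```d
// Hypothetical binary payload: three arbitrary bytes, not meant to be text.
// An array literal states the intent directly; no string type is involved.
immutable ubyte[] payload = [0xA1, 0xB2, 0xC3];

void main()
{
    assert(payload.length == 3);
    assert(payload[0] == 0xA1 && payload[2] == 0xC3);
}
```

Nothing here claims to be Unicode, so there is no string invariant to violate.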

Using a malformed _string_ to initialize a ubyte[] is, IMO, simply less readable. What did that article call such features, "WAT"?

I just re-checked and, to clarify, string literals support _three_ hex escape sequences:
\x__ - a single byte (two hex digits)
\u____ - a code point given as four hex digits
\U________ - a code point given as eight hex digits

So raw bytes _can_ be directly specified and I hope the compiler still verifies the string literal is valid Unicode.
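
For what it's worth, here is a small sketch of how the escapes land in a UTF-8 string, plus a run-time validity check. It assumes Phobos' std.utf.validate, which throws UTFException on malformed input; the helper isValidUtf8 is a made-up name, and this shows a library check at run time, not what the compiler itself verifies.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Length of a `string` counts UTF-8 code units, not code points.
    assert("\x41".length == 1);       // one byte, given directly
    assert("\u00A1".length == 2);     // U+00A1 encodes as two bytes
    assert("\U0001F600".length == 4); // U+1F600 encodes as four bytes

    assert(isValidUtf8([0xC2, 0xA1]));        // the encoding of U+00A1
    assert(!isValidUtf8([0xA1, 0xB2, 0xC3])); // stray continuation bytes
}
```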

