On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
On 19/10/12 16:07, foobar wrote:
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
We can still have both (assuming the code points are
valid...):
string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4
lines above!
It isn't the same.
Hex strings are the raw bytes, e.g. UTF-8 code units (i.e. they
include the high bits that indicate the length of each encoded
character).
\u makes dchars.
"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two
non-zero bytes.
Yes, \u takes code points, not the code units of a specific UTF
encoding, and you are correct in pointing out that it takes four
hex digits, not two.
This is a very reasonable choice to prevent/reduce Unicode
encoding errors.
http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex
data. The hex data need not form valid UTF characters."
I _already_ said that I consider this a major semantic bug, as it
violates the principle of least surprise: the programmer expects
that the D string types, which are Unicode according to the spec,
actually contain _valid_ Unicode and _not_ arbitrary binary data.
Given the above, the design of \u makes perfect sense for
_strings_ - you can use _valid_ code points (not code units) in
hex form.
For general-purpose binary data (i.e. _not_ UTF-encoded Unicode
text) I also _already_ said that it should, IMO, be stored either
as ubyte[] or, better yet, as its own type that ensures the
correct invariants for the data, be it audio, video, or just a
different text encoding.
In neither case is the hex string relevant, IMO. In the former it
potentially violates the type's invariant, and in the latter we
already have array literals.
Using a malformed _string_ to initialize a ubyte[] is, IMO, simply
less readable. What did that article call such features? "WAT".
I just re-checked, and to clarify, string literals support _three_
escape sequences:
\x__ - a single byte
\u____ - a code point given as four hex digits
\U________ - a code point given as eight hex digits
So raw bytes _can_ be specified directly, and I hope the compiler
still verifies that the string literal is valid Unicode.