On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
On 19/10/12 16:07, foobar wrote:
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
We can still have both (assuming the code points are
valid...):
string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4
lines above!
It isn't the same.
Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they
include the high bits that indicate the length of each char).
\u makes dchars.
"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two
non-zero bytes.
Yes, \u takes a code point, not the code units of a specific UTF
encoding, and as you correctly point out it is written with four
hex digits, not two.
This is a very reasonable choice to prevent/reduce Unicode
encoding errors.
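
To make the code point versus code unit distinction concrete, here is a minimal sketch. The helper name utf8Bytes is made up for illustration; representation and count are the Phobos functions from std.string and std.utf.

```d
import std.stdio : writefln;
import std.string : representation;
import std.utf : count;

/// Hypothetical helper: the UTF-8 code units stored in a string.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    string s = "\u00A1"; // one code point, U+00A1

    auto bytes = utf8Bytes(s);
    assert(count(s) == 1);     // one code point...
    assert(bytes.length == 2); // ...stored as two UTF-8 code units
    assert(bytes[0] == 0xC2 && bytes[1] == 0xA1);

    // Dump the stored bytes in hex.
    writefln("%(%02X %)", bytes);
}
```

So the compiler does the encoding for you: you write the code point and the string holds whatever code units UTF-8 needs for it.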
http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data.
The hex data need not form valid UTF characters."
I _already_ said that I consider this a major semantic bug, as it
violates the principle of least surprise: the programmer expects
the D string types, which are Unicode according to the spec, to,
well, actually contain _valid_ Unicode and _not_ arbitrary binary
data.
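
A rough sketch of what I mean, assuming nothing beyond std.utf.validate from Phobos. The helper isValidUtf8 is a name I made up, and the cast from raw bytes stands in for a hex string literal so the example does not depend on how a given compiler version treats x"...":

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // A lone 0xA1 is a UTF-8 continuation byte with no lead byte.
    immutable(ubyte)[] raw = [0xA1];
    string bad = cast(string) raw; // the type says "string", the content is not UTF-8

    assert(!isValidUtf8(bad));
    assert(isValidUtf8("\u00A1")); // the \u form is always well-formed
}
```

The type system is happy with both values, which is exactly the surprise: nothing stops the malformed one from reaching code that assumes valid Unicode.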
Given the above, the design of \u makes perfect sense for
_strings_ - you can use _valid_ code-points (not code units) in
hex form.
General-purpose binary data (i.e. _not_ UTF-encoded Unicode
text), as I also _already_ said, should IMO be stored either as
ubyte[] or, better yet, in its own type that ensures the correct
invariants for the kind of data, be it audio, video, or just a
different text encoding.
In neither case is the hex string relevant, IMO. In the former it
potentially violates the type's invariant, and in the latter we
already have array literals.
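
As an illustration only, here is the array-literal route plus a dedicated type. Pcm16 and its even-length rule are invented for this example and are not taken from any library:

```d
import std.exception : enforce;

/// Hypothetical wrapper for 16-bit PCM audio kept as raw bytes.
struct Pcm16
{
    immutable(ubyte)[] bytes;

    this(immutable(ubyte)[] bytes)
    {
        // The type's invariant: every sample occupies exactly two bytes.
        enforce(bytes.length % 2 == 0, "16-bit PCM needs an even byte count");
        this.bytes = bytes;
    }

    size_t sampleCount() const
    {
        return bytes.length / 2;
    }
}

void main()
{
    // Plain binary data: an array literal, no string involved.
    immutable(ubyte)[] raw = [0xA1, 0xB2, 0xC3, 0xD4];

    auto pcm = Pcm16(raw);
    assert(pcm.sampleCount == 2);
}
```

The literal says "these are bytes" right at the declaration, and the wrapper rejects data that does not fit its format instead of passing it around under a string type.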
Using a malformed _string_ to initialize a ubyte[] is, IMO, simply
less readable. What did that article call such features, "WAT"?