On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
On 19/10/12 16:07, foobar wrote:
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
We can still have both (assuming the code points are
valid...):
string foo = "\ua1\ub2\uc3"; // no .dup
That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4
Come on, "assuming the code points are valid". It says so 4
lines above!
It isn't the same.
Hex strings are the raw bytes, e.g. UTF-8 code units (i.e. they
include the high bits that indicate the length of each encoded
character).
\u makes dchars.
"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two
non-zero bytes.
Yes, \u takes code points, not the code units of a specific UTF
encoding, and you are correct in pointing out that it takes four
hex digits, not two.
This is a very reasonable choice to prevent/reduce Unicode
encoding errors.
http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex
data. The hex data need not form valid UTF characters."
I _already_ said that I consider this a major semantic bug, as it
violates the principle of least surprise: the programmer expects
that the D string types, which are Unicode according to the spec,
actually contain _valid_ Unicode and _not_ arbitrary binary data.
Given the above, the design of \u makes perfect sense for
_strings_ - you can use _valid_ code points (not code units) in
hex form.
For general-purpose binary data (i.e. _not_ UTF-encoded Unicode
text) I also _already_ said that it should, IMO, be stored either
as ubyte[] or, better yet, as its own type that ensures the
correct invariants for the data, be it audio, video, or just a
different text encoding.
In neither case is the hex string relevant, IMO. In the former it
potentially violates the type's invariant, and in the latter we
already have array literals.
Using a malformed _string_ to initialize a ubyte[] is, IMO, simply
less readable. What did that article call such features? "WAT".
I just re-checked, and to clarify, string literals support _three_
escape sequences:
\x__ - a single byte
\u____ - a code point given as four hex digits
\U________ - a code point given as eight hex digits
So raw bytes _can_ be specified directly, and I hope the compiler
still verifies that the string literal is valid Unicode.