Re: utf code unit sequence validity (non-)checking

Steven Schveighoffer Wed, 01 Dec 2010 14:30:20 -0800

On Wed, 01 Dec 2010 07:35:15 -0500, spir <denis.s...@gmail.com> wrote:

Hello,
I just noted that D's builtin *string types do not behave the same way when faced with invalid code unit sequences. For instance:
void main () {
    assert("hæ?" == "\x68\xc3\xa6\x3f");
    // Note: removing \xa6 thus makes invalid utf8.

    string s1 = "\x68\xc3\x3f";
    // ==> OK, accepted -- but write-ing indeed produces "h�?".

    dstring s4 = "\x68\xc3\x3f";
    // ==> compile-time Error: invalid UTF-8 sequence
}
I guess this is because, while converting from string to dstring, meaning while decoding code units to code points, D is forced to check sequence validity. But this is not needed, and not done, for a utf8 string. Am I right on this?

If yes, isn't it risky to leave utf8 strings (and wstrings?) unchecked? I mean, to have a concrete safety difference with dstrings? I know there are utf checking routines in the std lib, but for dstrings one does not need to call them explicitly.

(Note that this checking is done at compile-time for source code literals.)
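
For illustration, the explicit run-time check mentioned above might look like the sketch below. It assumes std.utf.validate and the exception name UTFException as found in current Phobos (older releases may spell the exception differently), and it assumes the compiler still accepts the invalid literal for a plain string, as described for s1.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

void main()
{
    // Same bytes as s1: \xc3 opens a two-byte sequence,
    // but \x3f is not a continuation byte.
    string s = "\x68\xc3\x3f";

    try
    {
        // validate() throws if s is not well-formed UTF-8.
        validate(s);
        writeln("valid UTF-8");
    }
    catch (UTFException e)
    {
        writeln("invalid UTF-8: ", e.msg);
    }
}
```

The point of the sketch is only that for string the check is opt-in: nothing forces a call to validate before the data is used.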

I agree, the compiler should verify all string literals are valid utf. Can you file a bugzilla enhancement if there isn't already one?


-Steve
