I'm writing a parser generator for ANTLR grammars and have come
across the rule
fragment Letter
    : [a-zA-Z$_] // these are below 0x7F
    | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
    | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    ;
at
https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158
This rule is converted into
Match m__Letter()
{
    return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
               not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
               seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
}
given suitable definitions of alt, rng, seq, ch, and not.
Compiling this with DMD fails with the errors
CtoLexer_parser.d 665 57 error invalid UTF character \U0000d800
CtoLexer_parser.d 665 67 error invalid UTF character \U0000dbff
CtoLexer_parser.d 666 28 error invalid UTF character \U0000d800
CtoLexer_parser.d 666 38 error invalid UTF character \U0000dbff
CtoLexer_parser.d 666 53 error invalid UTF character \U0000dc00
CtoLexer_parser.d 666 63 error invalid UTF character \U0000dfff
Doesn't DMD support these Unicode code points yet?
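
A possible workaround, sketched under the assumption that the errors come from the '\uXXXX' literal syntax (all six rejected values lie in the surrogate range U+D800 to U+DFFF) and not from the numeric values themselves, is to spell the surrogate bounds as integers cast to wchar. The helper names isHighSurrogate and isLowSurrogate below are hypothetical stand-ins for illustration, not the rng used in the generated parser:

```d
// Assumption: DMD rejects surrogate code points only when they are written
// as '\uXXXX' character literals; integer values cast to wchar are accepted.
enum wchar highFirst = cast(wchar) 0xD800; // first high (lead) surrogate
enum wchar highLast  = cast(wchar) 0xDBFF; // last high (lead) surrogate
enum wchar lowFirst  = cast(wchar) 0xDC00; // first low (trail) surrogate
enum wchar lowLast   = cast(wchar) 0xDFFF; // last low (trail) surrogate

// Hypothetical helpers: test a single UTF-16 code unit against each range.
bool isHighSurrogate(wchar c)
{
    return c >= highFirst && c <= highLast;
}

bool isLowSurrogate(wchar c)
{
    return c >= lowFirst && c <= lowLast;
}

void main()
{
    // U+1F600 lies above U+FFFF, so UTF-16 encodes it as a surrogate pair.
    wstring s = "\U0001F600"w;
    assert(s.length == 2);
    assert(isHighSurrogate(s[0]));
    assert(isLowSurrogate(s[1]));

    // An ordinary BMP letter is neither kind of surrogate.
    assert(!isHighSurrogate('a'));
    assert(!isLowSurrogate('a'));
}
```

If this holds, the generator could emit such casts, for example rng(cast(wchar) 0xD800, cast(wchar) 0xDBFF), whenever a range bound falls inside the surrogate block, provided rng is defined to take wchar arguments and the input is matched as UTF-16 code units.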