> Yes, that's somewhat problematic.  Making up "a byte CEF" would be
> wrong, though, because there is, by definition, no CCS to map, and
> we would be dangerously close to conflating it with the CES, too...
> ACR-CCS-CEF-CES.  Read the character model.  Understand the character
> model.  Embrace the character model.  Be the character model.  (And
> once you're it, read the relevant Unicode, XML, and Web standards.)
> 
> To highlight the difference between opaque numbers and characters,
> the above should really be:
> 
>       if ($buf =~ /\x47\x49\x46\x38\x39\x61\x08\x02/) { ... }
> 
> I think what needs to be done is that \xHH must not be encoded as a
> literal (as it is now, 'A' and \x41 are identical in ASCII), but
> instead as a regex node of its own, storing the code point.  Then
> the regex engine can try both the "right/new way" (the Unicode code
> point) and the "wrong/legacy way" (the native code point).
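
For illustration, the ambiguity above can already be seen today by
matching the same \xHH escape against a raw byte string and against
its decoded form.  This is only a rough sketch and assumes nothing
beyond the core Encode module:

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "\xC3\xA9";              # the two UTF-8 bytes for U+00E9 (e-acute)
    my $chars = decode('UTF-8', $bytes); # one character, code point 0xE9

    # The same escape means "byte 0xE9" against one string and
    # "character U+00E9" against the other:
    print "byte string:    ", ($bytes =~ /\xE9/ ? "match" : "no match"), "\n";
    print "decoded string: ", ($chars =~ /\xE9/ ? "match" : "no match"), "\n";

The byte string does not match, since no single byte in it is 0xE9,
while the decoded string does -- the kind of surprise the
code-point-carrying regex node would address.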

My suggestion is to add a binary mode, such as /xxxx/b. When binary mode
is in effect, only ASCII characters (0-127) still carry text properties:
\p{IsLower} will match only ASCII a to z, and code points 128-255 never
have any text property. All code points must be between 0 and 255, which
regcomp can easily check at compile time.
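
The /b flag itself is hypothetical, of course, but the compile-time
code point check can be sketched with a small, made-up helper
(assert_binary_pattern below is not a real API), and the ASCII-only
behaviour of \p{IsLower} is approximated here with an explicit
character class:

    use strict;
    use warnings;

    # Reject any \x{...} escape naming a code point above 255 -- roughly
    # the check the proposed binary mode would do in regcomp.  (\xHH can
    # never exceed 0xFF, so only the braced form needs checking.)
    sub assert_binary_pattern {
        my ($pat) = @_;
        while ($pat =~ /\\x\{([0-9A-Fa-f]+)\}/g) {
            die "\\x{$1} is above 255, not allowed in binary mode\n"
                if hex($1) > 255;
        }
        return $pat;
    }

    my $buf = "\x47\x49\x46\x38\x39\x61\x08\x02";   # GIF89a signature bytes

    # In binary mode \p{IsLower} would match only a-z; today that has
    # to be spelled as an explicit ASCII class instead:
    my $pat = assert_binary_pattern('\x47\x49\x46[0-9a-z]*');
    print "signature prefix matched\n" if $buf =~ /$pat/;

    # A pattern reaching above 255 would be rejected up front:
    eval { assert_binary_pattern('\x{263A}') };
    print "rejected: $@" if $@;

Run as is, this should print the matched message and then the
rejection for \x{263A}.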

A dedicated binary mode would simplify many issues, and the resulting
regexes would be very readable. We could make binary mode exclusive
with text mode, i.e. a regex must be either binary or text, but not
both. (I am not sure a mixed mode would really be useful.)
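
To see why exclusivity is attractive, here is a mixed pattern that is
legal today: half of it speaks about raw byte values, the other half
about character properties (illustrative only; again, the /b flag does
not exist):

    use strict;
    use warnings;

    my $data = "\x02\x00caf\x{e9}";   # two "binary" bytes followed by text

    # Legal today, but the pattern straddles both worlds.  Under the
    # proposal it would have to be written as either a binary pattern
    # or a text pattern, not both.
    if ($data =~ /\x02\x00\p{IsLower}+/) {
        print "matched byte prefix plus lowercase text\n";
    }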

Hong
