Is there any good reason for UTS#18 'Unicode Regular Expressions' to express its requirements in terms of codepoints rather than scalar values?
I was initially worried by RL1.1 requiring that one be able to specify surrogate codepoints in a pattern. It would not be compliant for an application to reject such patterns as syntactically or semantically incorrect! RL1.1 seemed to prohibit compliant regular expression engines that only handled well-formed UTF-8 strings.

Furthermore, consider attempting to handle CESU-8 text as a sequence of UTF-8 code units. The code unit sequence for U+10000, corresponding to the UTF-16 code unit sequence D800 DC00, will be ED A0 80 ED B0 80. If one follows the lead of the 'best practice' for processing ill-formed UTF-8 code unit sequences given in TUS Section 5.22, this will be interpreted as *six* ill-formed sequences, ED, A0, 80, ED, B0, and 80, because ED may only be followed by a byte in the range 80..9F. I am not aware of any recommendation as to how to interpret these sequences as codepoints.

While being able to specify a search for surrogate codepoint U+D800 might be useful when dealing with ill-formed UTF-16 Unicode sequences, UTS#18 Section 1.7, which discusses requirement RL1.7, states that there is no requirement for a one-codepoint pattern such as \u{D800} to match a UTF-16 Unicode string consisting of just one code unit with the value 0xD800. The convenient, possibly intended, consequence of this is that the RL1.1 requirement to allow patterns to specify surrogate codepoints can be satisfied by simply treating them as unmatchable; for example, such a 1-character RE could be treated as the empty Unicode set [\p{gc=Lo} && \p{gc=Mn}].

Now, I suppose one might want to specify a match for ill-formed (in context) UTF-8 code unit subsequences such as E0 80 (not a valid initial subsequence) and E0 A5 (lacking a trailing byte), but as matching is not required, I don't see the point in UTS#18 being changed to ask for an appropriate syntax to be added.

Richard.
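
P.S. To make the CESU-8 and E0 80 / E0 A5 cases concrete, here is a rough Python sketch of how I understand the maximal-subpart splitting to work. The names `_lead_info` and `split_utf8` are made up for illustration, and the second-byte ranges are transcribed from the well-formed UTF-8 byte sequence table in TUS as I read it, so treat this as a sketch under those assumptions rather than a reference implementation.

```python
def _lead_info(b):
    """Return (sequence length, second-byte low, second-byte high) for a
    lead byte, or None if b cannot start a multi-byte sequence."""
    if 0xC2 <= b <= 0xDF:
        return 2, 0x80, 0xBF
    if b == 0xE0:
        return 3, 0xA0, 0xBF   # excludes non-shortest forms
    if b == 0xED:
        return 3, 0x80, 0x9F   # excludes surrogates D800..DFFF
    if 0xE1 <= b <= 0xEF:
        return 3, 0x80, 0xBF
    if b == 0xF0:
        return 4, 0x90, 0xBF   # excludes non-shortest forms
    if 0xF1 <= b <= 0xF3:
        return 4, 0x80, 0xBF
    if b == 0xF4:
        return 4, 0x80, 0x8F   # excludes values above U+10FFFF
    return None                # stray trail byte, C0, C1, or F5..FF


def split_utf8(data):
    """Split bytes into (chunk, is_well_formed) pairs, where each
    ill-formed chunk is a maximal subpart or a single stray byte."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII is always well-formed
            out.append((data[i:i + 1], True))
            i += 1
            continue
        info = _lead_info(b)
        if info is None:                  # byte that cannot start a sequence
            out.append((data[i:i + 1], False))
            i += 1
            continue
        length, lo, hi = info
        j = i + 1
        while j < i + length and j < n:
            # The second byte has a lead-specific range; later bytes use 80..BF.
            low, high = (lo, hi) if j == i + 1 else (0x80, 0xBF)
            if not low <= data[j] <= high:
                break
            j += 1
        out.append((data[i:j], j - i == length))
        i = j
    return out


def ill_formed_chunks(hex_bytes):
    """Convenience wrapper: hex string in, list of ill-formed chunks (as hex) out."""
    return [chunk.hex().upper()
            for chunk, ok in split_utf8(bytes.fromhex(hex_bytes)) if not ok]
```

With this sketch, `ill_formed_chunks("EDA080EDB080")` splits the CESU-8 form of U+10000 into six one-byte chunks, `ill_formed_chunks("E080")` gives two chunks because 80 is outside the A0..BF range that E0 requires, and `ill_formed_chunks("E0A5")` gives a single two-byte chunk that is merely truncated.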