On Mon, 2 Jun 2014 11:29:09 +0200 Mark Davis ☕️ <m...@macchiato.com> wrote:
> > \uD808\uDF45 specifies a sequence of two codepoints.
>
> That is simply incorrect.

The above is in the sample notation of UTS #18 Version 17 Section 1.1. From what I can make out, the corresponding Java notation would be \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in Java, or whether they are even acceptable. The only thing UTS #18 RL1.7 permits them to match in Java is lone surrogates, but I don't know if Java complies.

All UTS #18 says for sure about regular expressions matching code units is that they don't satisfy RL1.1, though Section 1.7 appears to ban them when it says, "A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units". Perhaps it's a fundamental requirement of something other than UTS #18.

I thought matching parts of characters in terms of their canonical equivalences was awkward enough, without having the additional option of matching some of the code units!

Richard.
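
The open question about what \x{D808} and \x{DF45} match in Java could be checked empirically. Below is a minimal sketch of such a probe, assuming Java 7 or later (where java.util.regex accepts the \x{h...h} form). The class name SurrogateProbe and the method probe are hypothetical, chosen only for illustration. The sketch takes no position on what RL1.7 requires, and no expected output is given for the surrogate-only patterns, since that is exactly what is unknown here and it may depend on the runtime.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SurrogateProbe {
    // U+12345 spelled as its UTF-16 surrogate pair in a Java string literal.
    static final String SAMPLE = "\uD808\uDF45";

    // Whole-string match result as "true" or "false",
    // or "rejected" if the regex engine refuses to compile the pattern.
    static String probe(String regex, String input) {
        try {
            return String.valueOf(Pattern.compile(regex).matcher(input).matches());
        } catch (PatternSyntaxException e) {
            return "rejected";
        }
    }

    public static void main(String[] args) {
        // The sample is two UTF-16 code units but a single code point.
        System.out.println("code units:  " + SAMPLE.length());
        System.out.println("code points: " + SAMPLE.codePointCount(0, SAMPLE.length()));

        // Patterns are passed to the regex engine with their backslashes intact.
        String[] patterns = {
            "\\x{12345}",          // the code point written directly
            "\\uD808\\uDF45",      // two consecutive regex-level Unicode escapes
            "\\x{D808}\\x{DF45}",  // the notation discussed above
            "\\x{D808}",           // high surrogate alone
            "\\x{DF45}",           // low surrogate alone
            "."                    // any single character
        };
        for (String p : patterns) {
            System.out.println(p + " against the pair -> " + probe(p, SAMPLE));
        }

        // The same single-surrogate patterns against genuinely lone surrogates.
        System.out.println("\\x{D808} against a lone high surrogate -> "
                + probe("\\x{D808}", "\uD808"));
        System.out.println("\\x{DF45} against a lone low surrogate -> "
                + probe("\\x{DF45}", "\uDF45"));
    }
}
```

Using matches() rather than find() keeps the test strict: a pattern reports "true" only if it consumes the entire input. That distinguishes a pattern that matches the pair as a whole from one that merely matches one of its code units.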