On Sat, 31 May 2014 19:28:27 -0700 Markus Scherer <markus....@gmail.com> wrote:
> On Sat, May 31, 2014 at 1:59 AM, Richard Wordingham
> <richard.wording...@ntlworld.com> wrote:
>
> > Bear in mind that a pattern \uD808 shall not match anything in a
> > well-formed Unicode string.
>
> Depends. See the definitions of Unicode strings vs. UTF strings.

D80: Unicode string: A code unit sequence containing code units of a
particular Unicode encoding form...

D85 Well-formed: A Unicode code unit sequence that purports to be in a
Unicode encoding form is called well-formed if and only if it does
follow the specification of that Unicode encoding form.

How does a Unicode string purport anything?

> > \uD808\uDF45 specifies a sequence of two
> > codepoints.

> Implementations that use Unicode 16-bit strings will usually treat
> this as one supplementary code point.
> In Java, there is no other way to escape one.

In which case, Java does *not* supply 'basic Unicode support' as
defined by UTS#18 Version 17 - see just before Section 1.1.1 therein.
An engine that matches code unit by code unit does not comply with
RL1.7. This makes sense in so far as it provides for consistent
results across UTF-encodings for Unicode strings that could once have
been reversibly converted. (A 32-bit Unicode string <D808, DF45>
converted to a 16-bit Unicode string and back would become <12345>.)
Now that that conversion should not preserve lone surrogates
(separately both C10 together with D93 and TUS Section 5.22), it makes
less sense.

However, I can think of one major objection to a regular expression
engine using 16-bit Unicode strings treating every supplementary point
as a sequence of two surrogate points. While it might be acceptable
for a lone surrogate to match \P{L} (codepoints that are not letters),
it would not be acceptable for every supplementary point to match
\P{L}\P{L} or even \p{Any}\p{Any}.

Richard.
_______________________________________________
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode
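For anyone wanting to try the \P{L}\P{L} objection against a real
engine, here is a minimal sketch using java.util.regex, assuming a
reasonably recent JDK (Java 7 or later). The class name
SurrogateRegexSketch and the helper matchesWhole are made up for
illustration; only the Pattern, Matcher and String calls are the
library's own. It checks whether U+12345, held in a 16-bit string as
<D808, DF45>, is matched as one code point or as two surrogate code
points.

```java
import java.util.regex.Pattern;

public class SurrogateRegexSketch {
    // U+12345 as a Java (UTF-16) string: the surrogate pair <D808, DF45>.
    static final String CUNEIFORM = "\uD808\uDF45";

    // Hypothetical helper: true if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Two 16-bit code units, but a single code point.
        System.out.println("code units:  " + CUNEIFORM.length());
        System.out.println("code points: "
                + CUNEIFORM.codePointCount(0, CUNEIFORM.length()));
        System.out.printf("code point:  U+%X%n", CUNEIFORM.codePointAt(0));

        // Does "any character" consume the pair as one unit or as two?
        System.out.println(".           " + matchesWhole(".", CUNEIFORM));
        System.out.println("..          " + matchesWhole("..", CUNEIFORM));

        // Letter property: one letter, or two non-letters?
        System.out.println("\\p{L}       "
                + matchesWhole("\\p{L}", CUNEIFORM));
        System.out.println("\\P{L}\\P{L}  "
                + matchesWhole("\\P{L}\\P{L}", CUNEIFORM));

        // The two escapes written out in the regex itself.
        System.out.println("escaped pair "
                + matchesWhole("\\uD808\\uDF45", CUNEIFORM));
    }
}
```

If the engine works in code points, "." and \p{L} should match the
whole string while ".." and \P{L}\P{L} should not. This only probes
well-formed input; it says nothing about how a lone \uD808 in the
pattern is treated.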