2015-10-20 2:07 GMT+02:00 Richard Wordingham <richard.wording...@ntlworld.com>:

> Now, as we know, UTF-32 does not handle the full range of Unicode code
> points;

???
All valid UTFs handle the full range of valid Unicode code points. This includes UTF-32 as well as UTF-16 and UTF-8 (and their variants).

> it only handles scalar values.

???
UTFs allow encoding ANY valid scalar value (scalar values are bijectively associated with a subset of the valid code points). However, they do not allow encoding surrogates (which are valid code points but are not assigned any scalar value, so they are not valid in any valid UTF).

Clearly you are still confusing code points, code units and scalar values.

> In the discussion of UTS#18
> RL1.7, my objections did result in the addition of:
>
> "Note: It is permissible, but not required, to match an isolated
> surrogate code point (such as \u{D800}), which may occur in Unicode
> Strings. See Unicode String in the Unicode glossary."
>
> I'm not sure that that text loosely associated with RL1.7 gets round
> Requirement RL1.1, which still reads:
>
> "To meet this requirement, an implementation shall supply a mechanism
> for specifying any Unicode code point (from U+0000 to U+10FFFF), using
> the hexadecimal code point representation."
>

I'm also puzzled about how such a regexp would really match any input text if that input text has to use a valid UTF. The regexp "\u{D800}" will likely match only lone surrogates (in any UTF), not a surrogate with the same value that is correctly paired to encode a supplementary code point.

Note that even with **valid** UTF-8 text, U+D800 cannot occur. If you remove the "valid" restriction, U+D800 may be present, including immediately before U+DC00, but the two will not form a valid pair: they are still lone surrogates in this case (they are paired, and encode a supplementary code point, only if the text uses UTF-16). There are no valid surrogate pairs in valid UTF-8 or valid UTF-32, so if surrogates appear there, they are all "lone" surrogates.

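
To make the matching question concrete, here is a minimal sketch in Python 3. The language is only my choice for illustration, and the names (lone_high, is_valid_utf8_text and so on) are hypothetical. Python's str is a sequence of code points rather than UTF-16 code units, so it shows one possible behaviour, not what UTS#18 requires of every engine:

```python
import re

# A pattern consisting of one isolated surrogate code point,
# i.e. the \u{D800} case discussed above.
lone_high = re.compile("\ud800")

# In a Python str, U+10000 is a single code point, not a
# D800/DC00 pair of UTF-16 code units.
supplementary = "\U00010000"
with_lone = "a\ud800b"

matches_supplementary = lone_high.search(supplementary) is not None
matches_lone = lone_high.search(with_lone) is not None


def is_valid_utf8_text(s: str) -> bool:
    """Return True if s can be encoded as valid UTF-8 (no surrogates)."""
    try:
        s.encode("utf-8")  # the strict encoder rejects U+D800..U+DFFF
        return True
    except UnicodeEncodeError:
        return False


print(matches_supplementary, matches_lone)  # False True
print(is_valid_utf8_text(supplementary), is_valid_utf8_text(with_lone))  # True False
```

So in this model the pattern only ever matches a lone surrogate, and any text containing one is already outside what a strict UTF-8 encoder accepts.
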
If you blindly convert such ill-formed text from UTF-8 or UTF-32 to UTF-16, the invalid text could become valid, and new valid supplementary code points will appear unexpectedly. That is why lone surrogates cannot be part of any valid UTF: they break the bijection.
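
A small sketch of that effect, again in Python 3 and again only as an illustration. I am assuming the "blind" conversion behaves like Python's surrogatepass error handler, which copies surrogates through without validation; strict_encode_fails is a hypothetical helper name:

```python
# Two lone surrogate code points, as they could appear in
# ill-formed UTF-8 or UTF-32 data.
lone_pair = "\ud800\udc00"


def strict_encode_fails(s: str, encoding: str) -> bool:
    """Return True if the strict encoder refuses to encode s."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


# Valid UTF-8 and UTF-32 cannot carry these code points.
print(strict_encode_fails(lone_pair, "utf-8"))   # True
print(strict_encode_fails(lone_pair, "utf-32"))  # True

# "Blind" conversion: surrogatepass writes the surrogates as they are.
ill_formed_utf8 = lone_pair.encode("utf-8", "surrogatepass")
utf16 = ill_formed_utf8.decode("utf-8", "surrogatepass").encode(
    "utf-16-le", "surrogatepass"
)

# The UTF-16 bytes are now well formed and decode to ONE code point.
reread = utf16.decode("utf-16-le")
print(len(lone_pair), len(reread))  # 2 1
print(reread == "\U00010000")       # True
```

Two code points went in and one different code point came out, which is exactly the broken bijection: the round trip through UTF-16 no longer returns the original sequence.
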