While working on these bugs, we also discussed how surrogate characters are handled in XeTeX. Surrogate characters are the 2048 code points (U+D800 to U+DFFF) that UTF-16 uses to encode characters with code points above U+FFFF (65535): a pair of them, one high surrogate followed by one low surrogate, makes up one Unicode character. They're not meant to be used in isolation, even though they have code points like other characters (they're not just byte sequences).
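To make the pairing concrete, here is a minimal Python sketch of the standard UTF-16 arithmetic. It is only an illustration, not XeTeX's actual code, and the function names are made up for this message.

```python
# Illustration only: standard UTF-16 surrogate arithmetic.
# The names below are hypothetical and are not taken from XeTeX.

HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trailing) surrogates


def is_surrogate(cp):
    """Return True if cp is one of the 2048 surrogate code points."""
    return HIGH_START <= cp <= LOW_END


def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into a single code point."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("not a high surrogate: %04X" % high)
    if not LOW_START <= low <= LOW_END:
        raise ValueError("not a low surrogate: %04X" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    print("U+%04X" % combine_surrogates(0xD835, 0xDC00))
```

For the pair D835 DC00 this gives U+1D400, and a lone surrogate passed in the wrong position raises an error instead of being silently accepted.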
Right now, XeTeX allows isolated surrogate characters, and also combines sequences such as ^^^^d835^^^^dc00 into one Unicode character. We want to flag the former case but are not sure how: should we make the characters invalid (catcode 15), or map them to the standard "unknown" character, the replacement character U+FFFD? The latter case is nastier and should definitely be forbidden: the ^^ notation should only be used for "proper" characters, so instead of the above, the code point of the resulting Unicode character should be written directly, in this case ^^^^^1d400.

Any thoughts?

Best,
Arthur