The default behavior for unassigned characters is to treat them like base characters, so if one is followed by a combining mark, the pair forms a default grapheme cluster, which is not appropriate here.
Surrogates are not characters (so they cannot have any character properties), but they are assigned and so don't have "default" properties (which are meant only for *unassigned* code points). I still think that it is safer to treat them, for text segmentation purposes, as pure isolates, i.e. exactly like basic controls such as U+0000 NUL, or like U+FFFD REPLACEMENT CHARACTER, which is typically used as a visible placeholder for various errors. For normalisation purposes they should also have combining class 0 (i.e. act as blockers against reordering for canonical equivalence), and should not be treated as "transparent" (discarded and bypassed as if those surrogates were not present at all).

2015-10-04 19:50 GMT+02:00 Markus Scherer <markus....@gmail.com>:
> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.
> markus
>
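
To make the normalisation point above concrete, here is a rough sketch of my own (not taken from any specification) using Python's standard `unicodedata` module. It assumes a Python 3 `str`, which can hold an unpaired surrogate as a single code point, and it assumes U+50005 is still unassigned in the Unicode data your Python ships with. The helper names `describe` and `nfd` are just illustrative.

```python
import unicodedata

ACUTE = "\u0301"            # COMBINING ACUTE ACCENT, combining class 230
GRAVE_BELOW = "\u0316"      # COMBINING GRAVE ACCENT BELOW, combining class 220
LONE_SURROGATE = "\ud800"   # unpaired high surrogate
UNASSIGNED = "\U00050005"   # code point from the quoted message (assumed unassigned)


def describe(ch):
    """Return (general category, canonical combining class) for one code point."""
    return unicodedata.category(ch), unicodedata.combining(ch)


def nfd(s):
    """Canonical decomposition, which includes canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


# Two adjacent marks with different non-zero classes: reordering is expected
# to put the lower class (220) before the higher one (230).
reordered = nfd("a" + ACUTE + GRAVE_BELOW)

# Same marks with a lone surrogate between them: a class-0 code point should
# act as a blocker, so the original order is expected to survive.
blocked = nfd("a" + ACUTE + LONE_SURROGATE + GRAVE_BELOW)

# ascii() is used because printing a lone surrogate directly can raise
# UnicodeEncodeError on many terminals.
print(describe(LONE_SURROGATE), describe(UNASSIGNED))
print(ascii(reordered))
print(ascii(blocked))
```

This only shows the "blocker" behaviour (combining class 0) as one implementation exposes it; it says nothing about how a segmentation library would classify the surrogate, which is the separate question discussed above.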