2015-10-05 19:11 GMT+02:00 Ken Whistler <kenwhist...@att.net>:

> However, it would be reasonable (and permitted) for an API to actually
> report a default value for a surrogate code point (i.e., treating it more
> or less like the reserved code point U+50005 that Marcus mentioned).
>
Unassigned (reserved) code points, when followed by an assigned combining mark, would still be treated by default as starters of a combining sequence. This is not (IMHO) desirable for lone surrogates, which would be better handled in isolation, independently of what follows them. My opinion is that they should be treated like newline controls, so that a combining mark after one is also separated into a defective combining sequence without any starter (e.g. 000A 0302 creates two clusters, and this should be the same for D800 0302; D800 will have no defined glyph to render, but the glyph for U+FFFD may be displayed, or just a ".notdef" tofu box).

Now for break opportunities: those lone surrogates should not create a newline or paragraph break opportunity, but they may create a word break opportunity to allow their easy separation and selection by a double-click on this tofu in an editor (they may even create a syllable break opportunity before and after them, to allow wrapping long lines there). Those adaptations, however, are not described at all in the annexes on text segmentation.

So those surrogates (which are permanently assigned) could have their own code point properties more formally defined. In my opinion, handling them like U+0000 is much better than handling them like U+50005, which should stay reserved and be handled as a standard starter with default combining class 0.

Also, those lone surrogates should be Bidi-neutral (imagine they occur in the middle of some Arabic text: they should probably not change the direction of the surrounding text and should not alter the embedding context).
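
To make the clustering behaviour argued for above concrete, here is a minimal Python sketch. It is not the UAX #29 algorithm, only a toy splitter based on General_Category, and the helper names (`is_isolated`, `is_mark`, `clusters`) are hypothetical. The one non-standard assumption is that lone surrogates (Cs) are put in the same "always isolated" group as controls, while unassigned code points (Cn) are left as ordinary starters.

```python
import unicodedata

def is_isolated(ch):
    # Characters that never take a following combining mark in this sketch.
    # Including "Cs" (surrogates) here is the proposal under discussion,
    # an assumption of this example and not standardized behaviour.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp", "Cs")

def is_mark(ch):
    # Combining marks: nonspacing, spacing, and enclosing.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def clusters(text):
    """Split text into toy 'clusters', returned as lists of hex code points."""
    out = []
    for ch in text:
        if out and is_mark(ch) and not is_isolated(out[-1][-1]):
            # The mark attaches to the preceding cluster.
            out[-1].append(ch)
        else:
            # Start a new cluster; a mark landing here is a defective
            # combining sequence with no starter.
            out.append([ch])
    return [[f"{ord(c):04X}" for c in cluster] for cluster in out]

print(clusters("\n\u0302"))          # newline control, then combining circumflex
print(clusters("\ud800\u0302"))      # lone high surrogate, then the same mark
print(clusters("\U00050005\u0302"))  # unassigned code point, then the same mark
```

Under these assumptions, 000A 0302 and D800 0302 each split into two clusters, while 50005 0302 stays together as one combining sequence with the reserved code point as its starter.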