On 06/03/2016 11:13 PM, Steven Schveighoffer wrote:
No, but I like the idea of preserving the erroneous character you tried to convert.
Makes sense.
But is there an invalid wchar? I looked through the Wikipedia article on UTF-16, and it didn't seem to say there was one. If we use U+FFFD, that signifies a coding problem but is still a valid code point. However, a wchar in the D800 - DBFF range that is not followed by a code unit in the DC00 - DFFF range is an invalid sequence. D throws if it encounters such a thing.
The Unicode FAQ has an answer to this exact question, but it also only says that "[u]npaired surrogates are invalid" [1].
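To make "unpaired surrogate" concrete, here is a rough sketch of the check, written in Python rather than D and working on a plain list of 16-bit code units. The function names are made up for illustration, and this is not how D's own UTF handling is implemented.

```python
def is_high_surrogate(unit):
    # High (leading) surrogates occupy D800..DBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy DC00..DFFF.
    return 0xDC00 <= unit <= 0xDFFF


def find_unpaired_surrogates(units):
    """Return the indices of code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high_surrogate(unit):
            # A high surrogate is valid only when a low surrogate follows.
            if i + 1 < len(units) and is_low_surrogate(units[i + 1]):
                i += 2
                continue
            bad.append(i)
        elif is_low_surrogate(unit):
            # A low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad
```

So a single wchar from the surrogate range is never valid on its own, which makes it a candidate for an "invalid wchar" value.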
It also mentions "noncharacters" which are "permanently reserved [...] for internal use". "For example, they might be used internally as a particular kind of object placeholder in a string." [2] - Not too bad.
And then there is the replacement character, of course. "[U]sed to replace an incoming character whose value is unknown or unrepresentable in Unicode" [3].
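For comparison, a small sketch of the other two options, again in Python and with hypothetical names: a noncharacter test following the Unicode definition (U+FDD0..U+FDEF plus the last two code points of every plane), and a lossy UTF-16 decode that substitutes U+FFFD for each unpaired surrogate. It only illustrates the trade-off and is not anyone's actual implementation.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def decode_utf16_lossy(units):
    """Decode 16-bit code units to code points, using U+FFFD for bad units."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid surrogate pair: combine into a supplementary code point.
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            # Unpaired surrogate: the original value is lost here.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(unit)
            i += 1
    return out
```

Note that the U+FFFD route throws away the erroneous code unit, whereas U+FFFD itself is an ordinary valid code point and not a noncharacter.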
[1] http://www.unicode.org/faq/utf_bom.html#utf16-7
[2] http://www.unicode.org/faq/private_use.html#noncharacters
[3] http://www.fileformat.info/info/unicode/char/0fffd/index.htm