Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

>> I'm probably missing something here, but I don't agree that it's OK
>> for a consumer of UTF-16 to accept an unpaired surrogate without
>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>> dealt with.
>
> The question is whether you want every library that handles strings
> perform the equivalent of a citizen's arrest, or whether you architect
> things that the gatekeepers (border control) police the data stream.

If you can have upstream libraries check for unpaired surrogates at the time they convert UTF-16 to Unicode code points, then your point is well taken, because then the downstream libraries are no longer dealing with UTF-16, but with code points.

Doing conversion and validation at different stages isn't a great idea; that's how character encodings get involved with security problems. Corrigendum #1 closed the door on interpretation of invalid UTF-8 sequences. I'm not sure why the approach to handling UTF-16 should be any different.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
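
P.S. To make "check at the time of conversion" concrete, here is a minimal sketch in Python of a gatekeeper that validates and converts in a single pass. The function name `utf16_to_code_points` and its `errors` parameter are made up for this illustration and are not taken from any particular library; the input is assumed to be a sequence of 16-bit code units as integers.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def utf16_to_code_points(units, errors="strict"):
    """Convert UTF-16 code units to code points, validating as we go.

    Hypothetical helper for illustration only.
    errors="strict"  -> raise ValueError on an unpaired surrogate
    errors="replace" -> emit U+FFFD for each unpaired surrogate
    """
    units = list(units)
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
        elif not (0xDC00 <= u <= 0xDFFF):
            # Ordinary BMP code unit: maps to itself.
            out.append(u)
            i += 1
            continue
        # Reaching here means an unpaired high or low surrogate.
        if errors == "strict":
            raise ValueError(f"unpaired surrogate 0x{u:04X} at index {i}")
        out.append(REPLACEMENT)
        i += 1
    return out
```

With something like this at the border, downstream code receives only code points and never sees a lone surrogate: either the conversion fails outright, or the offending code unit has already become U+FFFD.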