Not silently! Even if this removal is required in order to go on editing, the user must be notified of it, because it may occur in unedited parts of the file (and it may be a sign that the document is not fully plain text, in which case the user should not save the edited file). If it is caused by a quirk in the user input (a defect of the input mode or keyboard layout), there should be a notification as well.
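
To make "notify, do not drop" concrete, here is a minimal sketch in Python. It assumes the text is available as a sequence of UTF-16 code units; the names `find_lone_surrogates`, `LoneSurrogateError` and `check_plain_text` are hypothetical and only serve the illustration. It locates lone surrogates and reports their positions, leaving the data itself untouched:

```python
from typing import List, Sequence


def find_lone_surrogates(units: Sequence[int]) -> List[int]:
    """Return the indices of unpaired surrogates in UTF-16 code units."""
    lone = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the complete pair
                continue
            lone.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            lone.append(i)
        i += 1
    return lone


class LoneSurrogateError(ValueError):
    """Hypothetical error type: the input is not well-formed UTF-16."""


def check_plain_text(units: Sequence[int]) -> Sequence[int]:
    """Return the input unchanged, or raise if it has lone surrogates.

    Nothing is dropped or replaced; the caller decides how to notify
    the user (or whether to reject the whole text).
    """
    lone = find_lone_surrogates(units)
    if lone:
        raise LoneSurrogateError(
            "lone surrogate(s) at code unit index(es) %s" % lone
        )
    return units
```

An editor could call `find_lone_surrogates` when loading a file and show a warning listing the positions, while a protocol handler that requires plain text could call `check_plain_text` and reject the whole input on error.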
But for a general-purpose editor that allows editing any file, including binary ones (e.g. Emacs), it is best NOT to drop those lone surrogates at all, and effectively to treat them in isolation for ALL purposes: the DELETE key should not delete more than the lone surrogate itself. It may be necessary to adjust the cursor position after the deletion if the editor does not support placing the cursor in the middle of a combining sequence, but a LONE surrogate followed by a combining character should still be treated as two separate clusters, and it should be possible to place the cursor or a selection boundary between the lone surrogate and the combining mark.

Note that file formats containing both binary parts and plain-text parts do exist, e.g. media files that end with a plain-text section for metadata or for an XML data signature: it is safe to edit that final part in a text editor, provided that the editor does not silently change the encoding of the binary part.

In summary, I do not like the idea of silently dropping lone surrogates in editors. If the editor needs to drop them because it cannot safely handle binary parts, the notification will tell the user either to choose another editor, or to select another, appropriate file encoding in which the file can be edited safely. The user should not save the file blindly, as it would be silently corrupted. Doing otherwise would be a security issue.

And this remark extends to all other protocols that take plain-text input: lone surrogates should not be dropped silently (unless this is explicitly requested, for example in a maintenance cleanup or repair). If a lone surrogate breaks further processing, the only safe option is to reject the whole text, and to report the error if text data is required but is now missing.

2015-10-05 13:50 GMT+02:00 Martin J.
Dürst <due...@it.aoyama.ac.jp>:

> On 2015/10/05 04:30, Asmus Freytag (t) wrote:
>
>> On 10/4/2015 6:02 AM, Richard Wordingham wrote:
>>
>>> In the absence of a specific tailoring, is the combination of a lone
>>> surrogate and a combining mark a user-perceived character? Does a lone
>>> surrogate constitute a user-perceived character?
>>
>> In an editing interface, a lone surrogate should be a user perceived
>> character, as otherwise you won't be able to manually delete it.
>> Markus suggests that it be treated like an unassigned code point.
>
> In an editing tool (of which an editing interface is a part of), a lone
> surrogate should just be removed! Apparently, that's what happens in
> Richard's case, but only eventually.
>
> Regards, Martin.