On 10/4/2015 12:38 PM, Richard Wordingham wrote:
> On Sun, 4 Oct 2015 10:50:43 -0700 Markus Scherer <markus....@gmail.com> wrote:
>
>> I would not spend any time specifying intricate rules for unpaired surrogates in 16-bit strings, or out-of-range values in 32-bit strings. Most processing will treat them like unassigned characters, like U+50005, with only default behaviors.
>
> The core problem here is that many editors will not allow one to delete just a non-initial character from a grapheme cluster. I fear there may be editors that don't even allow one to delete the final character. This may not be a problem when one works with a small set of grapheme clusters, as in French or German, or possibly even Vietnamese, but becomes a problem when working with such a large set that the notion of them being user-perceived characters strains credulity.

The problem you are trying to solve is to allow editing on the code point level, or, if you will, the keystroke level. Generally, there will be a sweet spot for each language (and each user) with respect to what to erase or undo. For sequences that belong to a given language, you can pick the behavior that makes most sense in them, but for lone surrogates, by definition you are dealing with broken text that doesn't follow any conventions. It should also be something that doesn't occur commonly.

So, for all of those reasons, I see no particular problem with giving that a "generic" behavior, which could be that of deleting the entire combining sequence; especially if your interface normally deletes sequences as a unit. If it never treats sequences as units, then I would in fact question why this should be different for surrogates. But in any case, the minimal requirement on an editor is that it lets you delete (and then retype) enough text to get it back to an uncorrupted state.
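
To make the "minimal requirement" concrete, here is a rough Python sketch of code-point-level deletion over a buffer of UTF-16 code units. It is only an illustration under stated assumptions, not how any particular editor works: the names `lone_surrogate_indices` and `backspace` are hypothetical, the buffer is modelled as a plain list of integers, and grapheme clusters are deliberately ignored, so a lone surrogate is always removed on its own.

```python
from typing import List


def is_high(u: int) -> bool:
    """True for a high (leading) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    """True for a low (trailing) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def lone_surrogate_indices(units: List[int]) -> List[int]:
    """Return the indices of unpaired surrogate code units."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            i += 2  # well-formed pair, skip both halves
            continue
        if is_high(u) or is_low(u):
            lone.append(i)
        i += 1
    return lone


def backspace(units: List[int], caret: int) -> int:
    """Delete one code point before the caret in place; return the new caret.

    A well-formed surrogate pair is removed as a unit. Any other code
    unit, including a lone surrogate, is removed by itself.
    """
    if caret <= 0:
        return 0
    start = caret - 1
    if start >= 1 and is_low(units[start]) and is_high(units[start - 1]):
        start -= 1  # take the leading half of the pair as well
    del units[start:caret]
    return start


# 'a', a lone high surrogate, U+0301 COMBINING ACUTE ACCENT, then U+1F600.
text = [0x0061, 0xD800, 0x0301, 0xD83D, 0xDE00]
damaged_at = lone_surrogate_indices(text)
```

With this buffer, three calls to `backspace` starting from the end remove the surrogate pair, then the combining mark, then the lone surrogate, after which `lone_surrogate_indices` finds nothing left to repair. An interface that deletes whole combining sequences would instead remove the lone surrogate and the mark together, which is the "generic" behavior described above.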
> A stray U+50005 before a combining mark would also be fiddly to get rid of, but even if the editor does not allow the entry of arbitrary scalar values, a user might fix the problem by creating an HTML file containing the character and then copying the character from the HTML file to a find and replace command. This trick is unlikely to work for a lone surrogate.

Catch-22 here. In filtering input to the dialog to prevent it from being used to corrupt text, you prevent it from being used to repair text. Interesting.

A./
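
One possible reason the copy-from-a-file trick breaks down, offered as an illustration rather than as what was necessarily meant above: an unassigned scalar value such as U+50005 can be written to and read back from a UTF-8 file, whereas a lone surrogate has no well-formed UTF-8 encoding at all. The short Python sketch below shows the difference; the helper name `can_round_trip_utf8` is hypothetical.

```python
def can_round_trip_utf8(s: str) -> bool:
    """Return True if s survives strict UTF-8 encoding and decoding."""
    try:
        return s.encode("utf-8").decode("utf-8") == s
    except UnicodeEncodeError:
        # Python's strict UTF-8 codec refuses to encode surrogates.
        return False


unassigned = "\U00050005"  # unassigned, but a valid Unicode scalar value
lone = "\ud800"            # lone high surrogate, not a scalar value

unassigned_ok = can_round_trip_utf8(unassigned)
lone_ok = can_round_trip_utf8(lone)
```

Here `unassigned_ok` is `True` and `lone_ok` is `False`. A dialog that accepts only well-formed text therefore cannot be handed the very code unit the user wants to search for, which is the Catch-22 noted above.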