On Friday, 3 June 2016 at 10:05:11 UTC, Walter Bright wrote:
On 6/3/2016 1:05 AM, H. S. Teoh via Digitalmars-d wrote:
However, this meant that some precomposed characters were "redundant": they represented character + diacritic combinations that could equally well be expressed separately. Normalization was the inevitable consequence.
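
For readers who want to see the redundancy concretely, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are only illustrative; the choice of é as the example character is an assumption, not something from the original post.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The two strings render the same but differ code point by code point.
raw_equal = precomposed == decomposed

# Normalizing to a common form (NFC composes, NFD decomposes) makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
```

Without the normalization step, a plain comparison treats the two spellings as different strings.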

It is not inevitable. Simply disallow the 2 codepoint sequences - the single one has to be used instead.

There is precedent. Some characters can be encoded with more than one UTF-8 sequence, and the longer sequences were declared invalid. Simple.
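
The UTF-8 precedent can be checked with a short sketch. It assumes Python's built-in strict UTF-8 decoder, which rejects overlong forms; `decodes_as_utf8` is a hypothetical helper name used only for illustration.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

shortest = b"\x2f"      # '/' (U+002F) in its one-byte form
overlong = b"\xc0\xaf"  # the same code point padded into a two-byte form

print(decodes_as_utf8(shortest))
print(decodes_as_utf8(overlong))
```

The one-byte form is accepted and the two-byte form is rejected, so each code point has exactly one valid UTF-8 encoding.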

I.e. have the normalization up front when the text is created rather than everywhere else.

I don't think it would work (or at least, the analogy doesn't hold). It would mean that you can't add new precomposed characters, because doing so would make previously valid sequences invalid.
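
To make the objection concrete, here is a toy sketch of the "disallow the 2 codepoint sequence" rule. Everything in it is hypothetical: the tables `V1` and `V2`, the `is_valid` helper, and the imagined future precomposed character (a private-use code point stands in for it). It is not how Unicode actually works, only an illustration of the versioning problem.

```python
COMBINING_ACUTE = "\u0301"

# Hypothetical table for "version 1": pairs that already have a precomposed form.
V1 = {("e", COMBINING_ACUTE): "\u00e9"}

# Hypothetical "version 2": a new precomposed character is added for q + acute.
# U+E000 (private use) is only a placeholder for that imagined character.
V2 = {**V1, ("q", COMBINING_ACUTE): "\ue000"}

def is_valid(text: str, table: dict) -> bool:
    """Under the proposed rule, text is invalid if it spells out any pair
    for which the table defines a precomposed character."""
    return not any((a, b) in table for a, b in zip(text, text[1:]))

old_text = "q" + COMBINING_ACUTE  # written while version 1 was current

print(is_valid(old_text, V1))
print(is_valid(old_text, V2))
```

The same stored text is valid under the first table and invalid under the second, which is the breakage described above. UTF-8 did not have this problem because its set of overlong forms was fixed once and never grows.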
