I'm going to answer some of Peter's points, leaving aside the interesting digressions into Java subclassing etc. that have developed later in the discussion.

At 04:19 AM 10/15/03 -0700, Peter Kirk wrote:
I note the following text from section 5.13, p.127, of the Unicode standard v.4:

Canonical equivalence must be taken into account in rendering multiple accents, so that any two canonically equivalent sequences display as the same.

This statement goes to the core of Unicode. If it is followed, it guarantees that normalizing a string does not change its appearance (and therefore it remains the 'same' string as far as the user is concerned).


The word "must" is used here. But this is part of the "Implementation Guidelines" chapter which is generally not normative. Should this sentence with "must" be considered mandatory, or just a recommendation although in certain cases a "particularly important" one?

If you read the conformance requirements you deduce that any normalized or unnormalized form of a string must represent the same 'content' on interchange. However, the designers of the standard wanted to make even specialized uses, such as 'reveal character codes', explicitly conformant. Therefore you are free to show a user whether a string is precomposed or composed of combining characters, e.g. by using a different font color for each character code.
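
To make that concrete, here is a small Python sketch (illustrative only, using the standard unicodedata module): the two spellings below differ at the code point level, so a 'reveal character codes' display may legitimately show them differently, yet they are canonically equivalent and normalize to the same string.

    import unicodedata

    precomposed = "\u00E9"     # LATIN SMALL LETTER E WITH ACUTE
    decomposed  = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT

    # Different code point sequences...
    print(precomposed == decomposed)             # False
    print([hex(ord(c)) for c in precomposed])    # ['0xe9']
    print([hex(ord(c)) for c in decomposed])     # ['0x65', '0x301']

    # ...but canonically equivalent: both normalize to the same string.
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))   # True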


The guidelines are concerned with the average case: displaying the characters as *text*.

[The use of the word 'must' in a guideline is always awkward, since that word has such a strong meaning in the normative part of the standard.]

Rendering systems should handle any of the canonically equivalent orders of combining
marks. This is not a performance issue: The amount of time necessary to reorder combining
marks is insignificant compared to the time necessary to carry out other work required
for rendering.

The interesting digressions on string libraries aside, the statement made here is in the context of the tasks needed for rendering. If you take a rendering library and add a normalization pass on the front of it, you'll be hard-pressed to notice a difference in performance, especially for any complex scripts.
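
As a sketch of what such a pass looks like (Python; render_glyphs is a hypothetical stand-in for whatever entry point the rendering library actually exposes):

    import unicodedata

    def render_glyphs(s: str) -> None:
        # Stand-in for the real rendering library entry point.
        print("rendering:", [hex(ord(c)) for c in s])

    def render_text(s: str) -> None:
        # Normalization pass in front of the renderer: canonically
        # equivalent inputs reach the shaping code in one predictable
        # order of combining marks, at a cost that is negligible next
        # to the rendering work itself.
        render_glyphs(unicodedata.normalize("NFC", s))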


So we conclude: "rendering any string as if it was normalized" is *not* a performance issue.

However, from the other messages on this thread we conclude: normalizing *every* string, *every time* it gets touched, *is* a performance issue.

A few things: Unicode provides data that allow a 'Normalization Quick Check' to be performed, which simply determines whether there is anything in a string that might be affected by normalization. (For example, nothing in this e-mail message is affected by normalization, no matter to which form, since it's all in ASCII.)
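
To illustrate the ASCII case in Python: every ASCII string is already in all four normalization forms, so a cheap isascii() test alone rules out any work.

    import unicodedata

    msg = "nothing in this e-mail message is affected by normalization"
    assert msg.isascii()
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # ASCII text is unchanged by normalization to any form.
        assert unicodedata.normalize(form, msg) == msg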

With a quick check like that you should be able to reduce the cost of normalization dramatically, unless your data needs normalization throughout. Even then, if there is a chance that the data is already normalized, verifying that it is normalized is faster than normalizing it (since verification doesn't re-order anything).
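
In Python (3.8+), the 'verify first, normalize only if needed' idea looks roughly like this; unicodedata.is_normalized can often answer from the quick-check data without doing a full normalization:

    import unicodedata

    def ensure_nfc(s: str) -> str:
        # Verification is the cheap case: an already-normalized string
        # is returned untouched, and only strings that fail the check
        # pay the full decompose/re-order/compose price.
        if unicodedata.is_normalized("NFC", s):
            return s
        return unicodedata.normalize("NFC", s)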

Then, after that, as others have pointed out, if you can keep track of a string's normalized state, either by recordkeeping or by having interfaces inside which the data is guaranteed to be normalized, you cut your costs further.
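
One way to picture the recordkeeping variant (names purely illustrative): a thin wrapper that remembers whether its string is already known to be in NFC, so normalization happens at most once no matter how often the text is touched.

    import unicodedata

    class TrackedText:
        # Illustrative only: a string plus a "known to be NFC" flag.
        def __init__(self, s: str):
            self._s = s
            self._is_nfc = unicodedata.is_normalized("NFC", s)

        def nfc(self) -> str:
            # Normalize at most once; after that the flag lets every
            # later caller skip the work entirely.
            if not self._is_nfc:
                self._s = unicodedata.normalize("NFC", self._s)
                self._is_nfc = True
            return self._s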

A./


