I just remarked that the collation weight 0000 has absolutely NO utility anywhere in the algorithm.
For example, UTR #10, section 3.3.1, gives the collation element [.0000.0021.0002] for COMBINING GRAVE ACCENT. However it can also be simply [.0021.0002], for a simple reason: secondary or tertiary weights are necessarily LOWER than any primary weight (for conformance reasons): any tertiary weight < any secondary weight < any primary weight. (The set of all weights across all levels is fully partitioned into disjoint intervals, in that same order; so weights are sorted by decreasing level, then by increasing weight, in all cases.) This also means that we never need to handle 0000 weights when creating sort keys from multiple collation elements, as we can easily detect that the [.0021.0002] given above starts with a secondary weight, 0021, and not a primary weight. Likewise we don't need any 0000 level separator in the sort key. This allows more interesting optimizations, and shorter sort keys.

What this means is that we can safely implement the UCA using basic substitutions, e.g. with a function like string.gsub() in Lua, which can use a "map" (a table) to translate source (binary) strings or regexps into target (binary) strings. For a 3-level collation, you then need only 3 calls to string.gsub() to compute any sort key:
- the first, :gsub(mapNormalize), decomposes the source text into collation elements and can perform reordering to enforce a normalized order (possibly tuned for the tailored locale) using basic regexps;
- the second, :gsub(mapTertiary), substitutes each collation element by its "intermediary" collation element plus its tertiary weight;
- the third, :gsub(mapSecondary), substitutes each "intermediary" collation element by its primary weight plus its secondary weight.

The "intermediary" collation elements are just like source text, except that the finest-level differences are eliminated, i.e. each source collation element string is replaced by the collation element string that has the smallest collation weights at the erased level. They only need to be encoded so that their code units compare HIGHER than any secondary or tertiary weight. How to do that:
- reserve the weight range between .0000 (yes! not just .0001) and .001E for the secondary and tertiary weights, and make sure that all intermediary collation elements use only code units at or above .0020 (this means that they can remain encoded in their existing UTF form!);
- reserve the weight .001F for the case where you don't want to use secondary differences (like letter case) and want to demote them to tertiary differences.

This will be used in the second mapping, which decomposes source collation elements into "intermediary" collation elements + tertiary weight. You may then decide to leave the tertiary weights in the substituted string, or, because gsub() finds matches from left to right, to accumulate the tertiary weights into a separate buffer, so that the substitution itself still returns a valid UTF string containing only "intermediary" collation elements (with all tertiary differences erased). You can repeat the process with the next gsub() to return the primary collation elements (still in UTF form) and, separately, the secondary weights (also accumulated in a separate buffer). Now there remain only 3 strings:
- one contains only the primary collation elements (still in UTF form, but using code units always at or above 0020);
- another contains only the secondary weights (between MINSECONDARYWEIGHT and 001E);
- another contains only the tertiary weights (between 0000 and MINSECONDARYWEIGHT-1).
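To make the two weight-splitting passes concrete, here is a minimal Lua sketch, under loud assumptions: toy weights (NOT real DUCET data), Latin-1 input so that one byte is one collation element, the MINSECONDARYWEIGHT of 0010 used in the next paragraph, and hypothetical names (splitLevel, mapTertiary, mapSecondary, sortkey). The normalization pass is omitted.

  -- Tertiary weights live in [00..0F]: \1 = lowercase, \2 = uppercase.
  -- Secondary weights live in [10..1E]: \16 = no accent, \17 = acute.
  -- Primary elements stay at or above \32, i.e. remain plain text.

  -- Generic pass: each element maps to lowerLevelElement..weight; the
  -- element stays in the returned string while the split-off weight
  -- accumulates in a buffer (gsub scans left to right, so the buffer
  -- stays in source order).
  local function splitLevel(s, map)
    local weights = {}
    local kept = s:gsub(".", function(c)
      local m = assert(map[c], "unmapped character")
      weights[#weights + 1] = m:sub(-1)   -- last byte = split-off weight
      return m:sub(1, -2)                 -- remaining lower-level element
    end)
    return kept, table.concat(weights)
  end

  local mapTertiary = {  -- case is the tertiary difference
    a = "a\1", A = "a\2", e = "e\1", E = "e\2",
    ["\233"] = "\233\1", ["\201"] = "\233\2",  -- Latin-1 é, É
  }
  local mapSecondary = {  -- the acute accent is the secondary difference
    a = "a\16", e = "e\16", ["\233"] = "e\17",
  }

  local function sortkey(s)
    local inter, tertiary    = splitLevel(s, mapTertiary)
    local primary, secondary = splitLevel(inter, mapSecondary)
    -- no \0 level separators needed: [20..] > [10..1E] > [00..0F]
    return primary .. secondary .. tertiary
  end

  print(sortkey("ae") < sortkey("aE"))     --> true (tertiary difference)
  print(sortkey("aE") < sortkey("a\233"))  --> true (secondary outweighs it)

Plain byte-wise comparison of the resulting keys then gives ae < aE < aé, as expected: the accent (a secondary difference) outweighs the case (a tertiary one).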
For the rest I will assume that MINSECONDARYWEIGHT is 0010, so:
* primary weights are encoded with one or more code units in [0020..] (multiple code units are possible if you reserve some of these code units as prefixes of longer sequences);
* secondary weights are encoded with one or more code units in [0010..001E] (same remark about multiple code units if you need them);
* tertiary weights are encoded with one or more code units in [0000..000F] (same remark about multiple code units if you need them).

The last gsub() will only reorder the primary collation elements, to remap them into a suitable binary order (it will be a simple bijective permutation, except that the target does not have to use multiple code units, but a single one, when there are contractions). It's always possible to make this permutation generate integers at or above 0020, so the resulting weights remain encodable with UTF-8 as if they were source text.

And to return the sort key, all you need is to concatenate:
* the string containing all primary weights, encoded with code units in [0020..], then
* the string containing the secondary weights, encoded with code units in [0010..001E], then
* the string containing the tertiary weights, encoded with code units in [0000..000F].
You don't need to insert ANY [0000] as a level separator in the final sort key, because each concatenated part of the final sort key respects the well-formedness constraint WF2 of the UCA. You may choose not to use tertiary weights encoded with [0000] code units if you want the final string containing the sort key to be null-terminated.

In summary:
* There's no longer any special role given to [0000] in the UCA. More compaction is possible for storing the mapping of source collation element strings (in their original UTF encoding) to strings of collation weights (themselves still encoded with a UTF!).
* Any tailored collation (except those requiring preprocessing that applies specific reorderings, possibly made by substitution with one or more regexps applied in a defined order) is just specified by one map per collation level, mapping source UTF strings (or regexps) to their replacement strings of collation weights.
* You are free to choose the UTF used for the source strings and for the collation weights (these UTFs may be different, or both may be UTF-8). If you use a conforming UTF, the only code units you cannot use are those in [D800..DFFF], reserved for surrogates.
* Normal string library packages can be used to implement the UCA, even those that can only work with texts encoded in a valid UTF.
* Given that the resulting sort keys are valid UTF, they are displayable: in many circumstances, the initial part of the string (containing primary weights only) will display as the normal UTF encoding of readable text; if there are additional secondary or tertiary weights after it, because they are represented with C0 controls, you may still display them using a notation like \xNN (you only need to escape '\' if it appears as a literal in the readable part of the sort key containing the primary weights). Both the final permutation and this display convention are sketched after this summary.
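Continuing the same toy Lua setup, here is a sketch of that last permutation pass and of the \xNN display convention; the Swedish-style tailoring and the names (mapPrimary, remapPrimary, show) are illustrative assumptions, and handling a contraction would additionally require matching the longer source element before its prefix.

  -- Swedish-style tailoring: å < ä < ö just after z, while their Latin-1
  -- bytes sort ä (\228) < å (\229) < ö (\246). Targets stay >= \32.
  local mapPrimary = { ["\229"] = "\123", ["\228"] = "\124", ["\246"] = "\125" }

  local function remapPrimary(s)
    return (s:gsub(".", mapPrimary))  -- unmapped bytes pass through unchanged
  end

  -- without the remap this would be false, since the byte 229 > 228:
  print(remapPrimary("\229") < remapPrimary("\228"))  --> true (å before ä)

  -- A finished key is valid "text": only its trailing weights are C0
  -- controls, so they can be shown with a \xNN notation.
  local function show(key)
    return (key:gsub("%c", function(c)
      return string.format("\\x%02X", c:byte())
    end))
  end

  print(show("ae\16\17\1\1"))  --> ae\x10\x11\x01\x01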
Note: Isolated surrogates found in a non-conforming source string need to be preprocessed if you want to accept them in a collator:
- You can do that by preprocessing [0000] and [D800..DFFF] into [0000] followed by a single code unit in [0020..], so that each forms a single collation element [0000][0020..]: use [0000][0020] as the collation element representing a source [0000], and insert a single [0000] before each isolated surrogate, which you then replace by a code unit in [0800..0FFF]. The result will be a conforming UTF string, on which your collator will return valid UTF strings of weights.
- If you don't want to have any [0000] within sort keys, you can instead preprocess the source string by reencoding [0000] into [0001][0020], [0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into [0001] followed by a code unit in [0800..0FFF]. Here also the result will be a conforming UTF string, on which your collator will return valid UTF strings of weights.
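As a byte-level illustration of the second variant, here is a hedged Lua sketch assuming the input is UTF-8 in which isolated surrogates appear as their 3-byte WTF-8 forms (lead byte ED, second byte A0..BF); preprocess is a hypothetical name, and Lua 5.2+ is assumed so that patterns may contain "\0".

  local function preprocess(s)
    -- escape U+0001 first, so the \1 bytes introduced below stay unambiguous
    s = s:gsub("\1", "\1\33")  -- U+0001 -> [0001][0021]
    s = s:gsub("\0", "\1\32")  -- U+0000 -> [0001][0020]
    -- a lone surrogate U+D800+k becomes [0001][0800+k]: in UTF-8 that is
    -- just the ED lead byte rewritten to E0, the low 12 bits are unchanged
    s = s:gsub("\237([\160-\191][\128-\191])", "\1\224%1")
    return s
  end

  -- the result is conforming UTF-8, ready to feed to the collator:
  assert(preprocess("a\0b") == "a\1\32b")

The surrogate rewrite works because U+D800+k and U+0800+k differ only in the bits carried by the UTF-8 lead byte, so replacing ED by E0 reencodes the code point while keeping its two trailing bytes intact.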

