2013/3/16 Richard Wordingham <richard.wording...@ntlworld.com>:
> If you start with my start = low, continuation = high scheme, you can
> convert it in an order-preserving manner to a no-prefix scheme by
> the following simple transform:
>
> If a simple weight precedes a continuation weight, add 0·8 ('·' is
> serving as the hexadecimal point) to it.
>
> Thus 21, 21 81 and 21 81 81 become 21, 21·8 81 and 21·8 81·8 81. If
> you don't like semi-integral values, discard the high bit and double,
> yielding 42, 43 02 and 43 03 02. You may recognise your non-final v.
> final scheme! (I replaced '80' by '81' to avoid confusing zeroes.)
>
> If you're still not convinced, please show me what goes wrong.
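The quoted transform can be sketched in a few lines of Python (the function name is mine; weights are bytes with start < 0x80 and continuation >= 0x80, and "add 0·8, discard the high bit, double" collapses to one integer step):

```python
def to_prefix_free(weights):
    """Convert a start/continuation weight sequence (start < 0x80,
    continuation >= 0x80) into a prefix-free, order-preserving code:
    drop the high bit, double, and add 1 when a continuation follows."""
    out = []
    for i, w in enumerate(weights):
        cont_follows = i + 1 < len(weights) and weights[i + 1] >= 0x80
        out.append(((w & 0x7F) << 1) | (1 if cont_follows else 0))
    return out

# Reproduces the quoted examples:
# [0x21]             -> [0x42]
# [0x21, 0x81]       -> [0x43, 0x02]
# [0x21, 0x81, 0x81] -> [0x43, 0x03, 0x02]
```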
OK, so you're adding 1 extra bit, i.e. you're changing your start/continuation encoding scheme into a different encoding that suppresses codes that are prefixes of another. That is exactly what I described (you use a variable number of bits), except that your scheme is highly suboptimal compared to Huffman coding or optimal arithmetic coding. For those, you can gather frequency statistics (precomputed from an initial text corpus in the language for which the collation will be used, and padded with extra low-frequency entries representing the weights of all other collation elements not found in your corpus) in order to generate a dictionary mapping ordered weights to bit sequences with the same relative order. With this dictionary (statically generated for the collator, so it is not a cost when computing individual collation elements or comparing texts) you can then directly convert all weights into codes without having to check for codes that may be prefixes of another.

Note that Huffman and arithmetic coding generate a dictionary of patterns with a variable number of bits. If you're not generating collation keys but just comparing two texts, the bit stuffing is not needed: you just compare the values of the codes coming from the dictionary. These codes should then be stored in the dictionary as left-aligned integers, padded with zero bits to fill the code unit; the dictionary indicates how many bits are needed when generating collation keys, but this information can safely be ignored when just comparing two plain texts.

I've used this technique to generate extremely small collation keys, which are of the same order of storage size as the original plain text (UTF-8 encoded) but in fact almost always shorter than the plain text!
If you use this encoding with the collation level set to its maximum (which preserves all code point distinctions), it becomes another encoding of the plain text itself, fully reversible to UTF-8, except that it is binary sortable. This means the encoding can also be used to store the plain text, without needing to duplicate it in the store alongside the collation keys (provided that the static dictionary is preserved): storing the collation keys is enough. The scheme becomes a valid UTF, but one that is more compact than UTF-8 and binary-sortable.
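Both properties, binary sortability and reversibility, can be demonstrated with a toy three-weight dictionary (the code table here is made up; any order-preserving, prefix-free dictionary behaves the same way):

```python
# Hypothetical order-preserving, prefix-free code for three weights.
CODE = {1: "0", 2: "10", 3: "11"}
DECODE = {v: k for k, v in CODE.items()}

def encode(weights):
    """Concatenate the per-weight codes into one bit string."""
    return "".join(CODE[w] for w in weights)

def decode(bits):
    """Reversibility: walk the bit string, emitting a weight at each
    complete code (unambiguous because no code prefixes another)."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in DECODE:
            out.append(DECODE[cur])
            cur = ""
    assert cur == ""          # a well-formed encoding ends on a code boundary
    return out

seqs = [(1,), (1, 3), (2,), (3, 1, 2)]
# Plain binary comparison of the encodings sorts exactly like the
# weight sequences themselves, and every encoding round-trips.
assert sorted(seqs, key=encode) == sorted(seqs)
assert all(decode(encode(s)) == list(s) for s in seqs)
```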