There is a family of interesting algorithms that sort lexicographically
on strings of bits or bytes rather than by comparing items pairwise with
some kind of customizable comparison function on individual items.

Optimal comparison sorts such as heapsort, quicksort, mergesort,
library sort, and sorting with a red-black tree are often described as
having O(N lg N) runtime, but this is only true if we assume
individual comparisons take constant time.  In fact, an individual
lexicographical comparison can take up to O(N) time, where N is the
total size of the input in bits; consider comparing
“aaaaaaaaaaaaaaaab” to “aaaaaaaaaaaaaaaac”.  Consequently their
average runtimes are slightly worse than O(N lg N), and their
worst-case runtimes are much worse.

Radix sorting algorithms, by contrast, can operate in O(N) time!  This
sounds like an extraordinary claim, since they still must do at least
lg M bit comparisons per item if you have M unique items; but the M
unique items must themselves average at least lg M bits in length, so
the total size N of the input is at least about M lg M bits.  So
they’re still O(M lg M) in the number of items, but O(N) in the number
of bits in the input.

(If you have nonunique input items, you can count the instances of
each unique item in linear time while you sort the unique items.)

There are similar advantages associated with radix trees (the “tries”
invented by Fredkin), in particular PATRICIA trees (called “crit-bit”
trees by Bernstein).

The trouble with radix-sorting algorithms is that you have to encode
your data in a form where lexicographical comparison results in the
order that you want.  For most kinds of data, this is fairly easy, but
not all.  

There's an excellent article by [Guttman and Rosler][] about the
practicalities of generating string keys to sort various types of data
in Perl, so that you don't have to use a written-in-Perl comparison
function. This is also applicable to the problem of sorting arbitrary
data with radix sort.

[Guttman and Rosler]: http://www.sysarch.com/Perl/sort_paper.html
 (A Fresh Look at Efficient Perl Sorting, Uri Guttman and Larry Rosler, from I 
think 1999)

There's also the [`strxfrm` function in the standard C library]
[strxfrm], which generates such a string key specifically for
locale-specific comparison of human-language strings (so that you sort
accented characters into the place where they are traditionally sorted).
P.J. Plauger wrote [an article in 1992 on how to implement it] [Plauger]
with a DSL to add new locales; this [seems to be similar to the way
glibc does it] [glibc] — see e.g. the [Swedish locale collation
definition] [Swedish] that uClibc shares with glibc.

[strxfrm]: http://www.manpagez.com/man/3/strxfrm/

[Plauger]: http://drdobbs.com/tools/199902389?pgno=1 
 (“All sorts of sorts”, Computer Language magazine, January 1992, reprinted in 
Programming on Purpose III and the Dr. Dobb’s Journal blog)

[glibc]: 
http://repo.or.cz/w/glibc.git/blame/4b10dd6c1959577f57850ca427a94fe22b9f3299:/string/strxfrm.c

[Swedish]: 
http://cristi.indefero.net/p/uClibc-cristi/source/tree/3e8c046f98134be9c6ddd954141b6b13bf919199/extra/locale/collation/sv_SE

In particular, ratios --- rational numbers --- are fairly easy to define
a custom comparison function on, but encoding them for lexicographical
comparison seems difficult.  [Jules Jacobs proposed][] using the
position of a rational number in the Stern-Brocot tree.  Unfortunately,
his encoding is not self-delimiting, and it is unusably inefficient;
for example, it encodes the integer N as a sequence of N 1 bits.
In response to [Jacobs’ query on Math Overflow] [], Gjergji Zaimi said,
“It is easy to prove that for any countable linearly ordered set there
is an order preserving injection to the rationals. This can be proven by
enumerating the base set and then specifying the values of the mapping
by induction.”  So, in a sense, solving the radix-sortable
encoding problem for rational numbers solves the problem for all
countably infinite linearly-ordered sets, but Zaimi’s construction,
while constructive, is even less efficient than Jacobs’s.

Nevertheless, Jacobs and Zaimi proved that it is always theoretically
possible to sort elements of a countable linearly-ordered set with a
radix sort.  Now it only remains to discover whether it’s possible to do
so efficiently.

[Jules Jacobs proposed]: 
http://www.reddit.com/r/programming/comments/g6a4s/the_american_flag_sort_an_efficient_inplace_radix/c1lbkjg
 (Reddit comment in or near March 2011 by user julesjacobs, beginning “Oh, I 
just thought of a method :)”)

[Jacobs’ query on Math Overflow]: 
http://mathoverflow.net/questions/58917/injections-to-binary-sequences-that-preserve-order
 (Math Overflow question on 2011-03-19, “Injections to binary sequences that 
preserve order”, by Jules)

The rest of this post describes an encoding of arbitrary-precision
rational numbers that is self-delimiting, lexicographically ordered,
and fairly compact.  It is CPU-efficient enough that you could
practically write programs with it, but it is probably two orders of
magnitude less efficient as a numerical representation than raw binary.

Lexicographical comparison 
---------------------------

Let <=> be the comparison operator, returning one of {"<", "=", ">"}.
Given a definition for it on some atomic type, we extend it to operate
on sequences of that type “lexicographically”, as follows:

Let F = (A0 <=> B0). Then:

    ([] <=> [])        = "="
    ([] <=> [B0, ...]) = "<"
    ([A0, ...] <=> []) = ">"
    ([A0, A1, ...] <=> [B0, B1, ...]) = ([A1, ...] <=> [B1, ...]) if F = "=";
                                      = F otherwise.
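
Here is a minimal sketch of those rules in Python (the names lex_cmp,
atom_cmp, and num_cmp are mine, for illustration only):

    def lex_cmp(a, b, atom_cmp):
        """Extend the atomic comparison atom_cmp to sequences, lexicographically."""
        if not a and not b:
            return "="
        if not a:
            return "<"
        if not b:
            return ">"
        f = atom_cmp(a[0], b[0])
        return lex_cmp(a[1:], b[1:], atom_cmp) if f == "=" else f

    def num_cmp(x, y):
        return "<" if x < y else ">" if x > y else "="

    assert lex_cmp([1, 2], [1, 3], num_cmp) == "<"
    assert lex_cmp([1, 2], [1, 2, 0], num_cmp) == "<"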

A function f is “order-preserving” if

    (f(N) <=> f(M)) = (N <=> M)

for all N, M in the domain of f.  If f’s range consists of some kind
of sequence, we’ll call f a “lexicographically sortable encoding”.

Self-delimiting
---------------

An encoding f is “self-delimiting” if no string it produces is a
prefix of any other string it produces, i.e. there are no N, M, and i
such that f(N) = f(M)[:i] and N != M.  This means that you can
concatenate other crap onto the end of it and still decode it, and
then you know where the encoding ends and the other crap begins, which
is handy if the other crap has useful data in it.

In particular, this is useful in two practical cases:

1. Sequence encoding: if you have a sequence of items in the domain of
   f, you can produce an encoding of the sequence by concatenating the
   encodings of the items.  Furthermore, this new encoding function is
   lexicographically sortable iff f is.

2. Packing into fields: modern computers can do rapid parallelized
   lexicographical comparison of 32-bit or 64-bit chunks, but to do
   that conveniently, the chunks need to be stored in memory aligned
   at least to byte boundaries, and probably to 32- or 64-bit
   boundaries.  That means you need to append some crap to pad
   bitstrings out to fill entire bytes or 32-bit or 64-bit boundaries,
   and it’s desirable if that can’t cause comparisons to return the
   wrong answer or the encoding to become undecodable.

The definition above doesn’t imply that it’s necessarily
computationally feasible or even decidable to find the end of the
bitstring produced by f, but it is easy for all the encodings I
consider here.

Encoding of whole numbers: E
----------------------------

I found a lovely lexicographically sortable encoding of the whole or
natural numbers 1, 2, 3, etc., as bitstrings {0,1}* --- all of them,
not just those less than some arbitrary limit such as 2**32.  It is
self-delimiting, and you can construct it as follows.

First calculate b(N) = floor(lg N), where lg N is log N / log 2.  This
b is one less than the number of bits in the binary encoding of N:
b(1) = 0, b(2) = 1, b(3) = 1, b(4) = 2, etc.

Let p(N) = "1" * b(N) || "0" --- b(N) 1 bits, followed by a 0 bit.

Let e(N) be the last b(N) bits of the binary encoding of N.  This will
leave off a single 1 bit from the beginning of the binary encoding.

Let E(N), the encoding of N, be p(N) || e(N).

Encodings of the first few whole numbers follow, with spaces inserted
for clarity:

    1 -> 0             9 -> 111 0 001
    2 -> 1 0 0        10 -> 111 0 010
    3 -> 1 0 1        11 -> 111 0 011
    4 -> 11 0 00      12 -> 111 0 100
    5 -> 11 0 01      13 -> 111 0 101
    6 -> 11 0 10      14 -> 111 0 110
    7 -> 11 0 11      15 -> 111 0 111
    8 -> 111 0 000    16 -> 1111 0 0000
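
Here is the same construction as a Python sketch; bin(n)[3:] drops
Python’s "0b" prefix along with the leading 1 bit:

    def b(n): return n.bit_length() - 1   # floor(lg n)
    def p(n): return "1" * b(n) + "0"     # b(n) one bits, then a zero bit
    def e(n): return bin(n)[3:]           # binary of n without its leading 1 bit
    def E(n): return p(n) + e(n)

    assert [E(n) for n in (1, 2, 3, 4, 8)] == ["0", "100", "101", "11000", "1110000"]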

If b(N) = b(M), then p(N) = p(M), so (E(N) <=> E(M)) = (e(N) <=> e(M))
= (N <=> M), which is what we wanted.

If b(N) != b(M), assume without loss of generality that b(N) < b(M),
so N < M.  Both p(N) and p(M) begin with b(N) 1 bits, but then they
differ: bit number b(N) is a 0 in p(N) and a 1 in p(M).  Therefore
p(N) <=> p(M) gives us "<", as does E(N) <=> E(M), as does N <=> M.

You could argue that this encoding is not particularly efficient,
since it uses 2b(N)+1 bits, while the binary encoding of the number
would only use b(N)+1 bits.  Surely, you might say, there must be a
less voluminous way to delimit the numbers.  However, all such
questions of coding efficiency depend on the probability distribution
assumed for the numbers to be encoded, and that distribution is
necessarily nonuniform.  I assert without proof that this encoding is
nearly optimal for the case where the numbers are distributed such
that a fraction 2**-M of them have M bits, for each natural number M,
i.e. half of them are 1, a quarter of them are 2 or 3, an eighth of
them are in the range 4-7, etc.

Because the encoding is self-delimiting, you can bitwise-concatenate a
sequence of number representations, and lexicographical comparisons on
the resulting bitstrings will map to lexicographical comparisons on
the underlying sequences of numbers.
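
As a quick sanity check (a spot check, not a proof), you can verify by
brute force that E is order-preserving and prefix-free over a small
range; note that Python’s comparison of strings of "0" and "1"
characters is exactly the bitwise lexicographical comparison:

    def E(n): return "1" * (n.bit_length() - 1) + "0" + bin(n)[3:]

    codes = {n: E(n) for n in range(1, 1025)}
    for n in codes:
        for m in codes:
            assert (codes[n] < codes[m]) == (n < m)              # order-preserving
            assert n == m or not codes[m].startswith(codes[n])   # prefix-free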

Encoding of nonnegative numbers: EN
-----------------------------------

Add 1, then encode as above:

    EN(N) = E(N+1)

That’s lexicographically sortable because adding 1 is order-preserving.

For example, EN(2) = E(3) = 101.

Reversing the lexicographical order of an encoding on bitstrings
----------------------------------------------------------------

Complement the bits. If it was self-delimiting before, it’s still
self-delimiting.

Encoding of integers: ES
------------------------

    ES(N) =   "1" || EN(N) if N >= 0
            ~("1" || E(N)) otherwise

If negative, encode the absolute value as a whole number as above,
prepend a 1 bit, and complement.  E.g.:

    -1 -> 1 -> 0     -> 10     -> 01
    -2 -> 2 -> 100   -> 1100   -> 0011
    -3 -> 3 -> 101   -> 1101   -> 0010
    -4 -> 4 -> 11000 -> 111000 -> 000111

If nonnegative, encode as a nonnegative number as above and prepend a
1 bit.  E.g.:

    0 -> 1 -> 0     -> 10
    1 -> 2 -> 100   -> 1100
    2 -> 3 -> 101   -> 1101
    3 -> 4 -> 11000 -> 111000

The leading 1 bit ensures that negative numbers precede nonnegative
numbers.  The bit complementation puts the negative numbers in reverse
order, so that -1 is the greatest of all of them.

So this encoding is lexicographically sortable.
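
Here is a Python sketch of EN and ES, redefining E so the snippet
stands alone; comp is the bit-complement operation written ~ above:

    def E(n):  return "1" * (n.bit_length() - 1) + "0" + bin(n)[3:]
    def EN(n): return E(n + 1)
    def comp(bits): return bits.translate(str.maketrans("01", "10"))   # bitwise ~
    def ES(n): return "1" + EN(n) if n >= 0 else comp("1" + E(-n))

    assert [ES(n) for n in (-2, -1, 0, 1)] == ["0011", "01", "10", "1100"]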

Encoding of whole numbers or infinity: EI
-----------------------------------------

Infinity should compare greater than any whole number.  That’s easy
enough: encode the pair (1, N) for each finite whole number N, giving
00, 0100, 0101, 011000, etc., and encode the single number 2 for
infinity, giving 100.  However, this uses up a whole bit, devoting
essentially half of the code space to infinity.

If infinity is not quite so probable, you can use a sort of
bit-stuffing: first encode the whole number, and then insert a 0 bit
into its encoding if it begins with too many 1 bits; then use a longer
sequence of 1 bits to represent infinity.

Let B, the number of leading 1 bits that triggers stuffing, be 2.

Let F = E(N) if N is finite.

    EI(N) = "1" * B || "0" || F[B:] if F starts with "1" * B;
         or F otherwise, if N is finite;
         or "1" * (B+1) if N is infinite,

where F[B:] is F with its first B bits removed.

This encoding is still lexicographically ordered and self-delimiting.

This encoding leaves untouched the encoding of the numbers 1, 2, and
3, while adding a single bit to higher numbers, and encodes infinity
in three bits.  Under our earlier assumption about the distribution of
numbers to be encoded, one-quarter of the numbers to encode are
greater than 3, so this encoding adds a quarter of a bit per number;
so it is optimal when infinity occurs with probability 1/8.  You can
adjust B to a greater or lower value if infinity is more or less
probable.
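
Here is a Python sketch of EI with B = 2, using Python’s math.inf to
stand in for infinity:

    from math import inf

    B = 2
    def E(n): return "1" * (n.bit_length() - 1) + "0" + bin(n)[3:]

    def EI(n):
        if n == inf:
            return "1" * (B + 1)              # "111"
        f = E(n)
        if f.startswith("1" * B):
            return "1" * B + "0" + f[B:]      # stuff a 0 after the first B one bits
        return f

    assert [EI(n) for n in (1, 2, 3, 4, inf)] == ["0", "100", "101", "110000", "111"]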

Encoding of continued fractions: EF
-----------------------------------

A regular continued fraction consists of a sequence of quotients
[Q0; Q1, Q2, ...], representing the number

          1
    Q0 + ---------------
                1
          Q1 + ---------
                Q2 + ...


A rational number has a finite sequence of such quotients.  For
example, 53/24 is [2; 4, 1, 4], i.e.

         1
    2 + -------------
              1
         4 + --------
                   1
              1 + ---
                   4

You can see from this representation that [3; ...] is a larger number than 
[2; 4, 1, 4], but [2; 5, ...] is smaller, while [2; 4, 2, ...] is a
larger number again.  In general, you can compare continued fractions
by comparing their strings of quotients lexicographically, but
reversing the sense of the inequality if the first unequal quotient is
at an odd offset (counting the initial quotient Q0 as offset 0).

You can eliminate the case where lexicographical comparison runs out
of quotients in one continued fraction or the other by appending a
terminating infinity to the list, yielding e.g. 
[2, 4, 1, 4, infinity].  This also makes the encoding self-delimiting,
since you can’t have an infinity in the middle; nothing after it would
matter.  So you can encode a sequence of continued fractions by
concatenating their individual representations, e.g. 
[53, infinity, 24, infinity, 2, 4, 1, 4, infinity, 0, 2, infinity]
represents [53, 24, 53/24, 1/2].  In mathematical terms it is somewhat
dubious to treat infinity like a number, but it is okay for our purposes
here.
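
As a concrete illustration, here is a Python sketch (the names
quotients and cf_cmp are mine) that computes the quotient list of a
rational number, with the terminating infinity, and then compares two
rationals by the odd-offset rule above:

    from fractions import Fraction
    from math import floor, inf

    def quotients(x):
        """Continued-fraction quotients of a rational, with a terminating infinity."""
        out = []
        while True:
            q = floor(x)
            out.append(q)
            if x == q:
                return out + [inf]
            x = 1 / (x - q)

    def cf_cmp(x, y):
        for i, (a, b) in enumerate(zip(quotients(x), quotients(y))):
            if a != b:
                less = a < b
                if i % 2 == 1:            # odd offset: reverse the sense
                    less = not less
                return "<" if less else ">"
        return "="

    assert quotients(Fraction(53, 24)) == [2, 4, 1, 4, inf]
    assert quotients(Fraction(-53, 24)) == [-3, 1, 3, 1, 4, inf]
    assert cf_cmp(Fraction(53, 24), Fraction(9, 4)) == "<"   # 53/24 < 9/4 = [2; 4]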

So we can construct a lexicographically sortable encoding of rational
numbers as sequences of integers by negating the odd-offset quotients;
53/24 becomes [2, -4, 1, -4, infinity].  This can give rise to
negative infinities as well.

Unfortunately this gives us an encoding on sequences of integers and
positive and negative infinities, but we want an encoding as
bitstrings.  We could hack a negative infinity onto the signed integer
encoding ES, but this is somewhat wasteful: all the quotients except
the first one are positive until we invert their signs, so the sign
bits are redundant.

Instead, let’s encode the first quotient, which must be finite but
could be zero or negative, using ES, and then encode the others ---
which are whole numbers or infinity --- with EI, giving us a sequence
of bitstrings:

    [2, 4, 1, 4, infinity] -> ["1101", "110000", "0", "110000", "111"]

and then complement the odd-index items in the sequence to reverse the
direction of their comparison:

    ["1101", "001111", "0", "001111", "111"]

and finally concatenate into a bitstring, shown here with spaces for clarity:

    1101 0011 1100 0111 1111

This encodes 53/24 in 20 bits, which is probably about as compact as
you can expect an arbitrary-precision rational-number format to be for
a number like 53/24, which contains 11 bits of precision already.

As another example, -53/24 is [-3; 1, 3, 1, 4], which encodes as

    0010 1 101 1 110000 000

which is 18 bits.

In short:

    EF([A0, ...]) = ES(A0) || EF1([...])

    EF1([])        = ~EI(infinity)
    EF1([A1, ...]) = ~(EI(A1) || EF1([...]))
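
Putting it all together, here is a self-contained Python sketch of EF.
Per the recurrences above, EF1’s base case supplies the terminating
infinity itself, so EF takes the quotient list without it; the asserts
reproduce the 20-bit encoding of 53/24 and the 18-bit encoding of
-53/24 worked out above:

    B = 2
    def comp(bits): return bits.translate(str.maketrans("01", "10"))   # bitwise ~
    def E(n):  return "1" * (n.bit_length() - 1) + "0" + bin(n)[3:]
    def EN(n): return E(n + 1)
    def ES(n): return "1" + EN(n) if n >= 0 else comp("1" + E(-n))
    def EI(n):
        f = E(n)
        return "1" * B + "0" + f[B:] if f.startswith("1" * B) else f
    EI_INF = "1" * (B + 1)                   # EI(infinity)

    def EF1(qs):
        return comp(EI_INF) if not qs else comp(EI(qs[0]) + EF1(qs[1:]))

    def EF(qs):                              # qs: quotients, without the trailing infinity
        return ES(qs[0]) + EF1(qs[1:])

    assert EF([2, 4, 1, 4])     == "11010011110001111111"   # 53/24, 20 bits
    assert EF([-3, 1, 3, 1, 4]) == "001011011110000000"     # -53/24, 18 bits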

Collected definitions without explanation
-----------------------------------------

    EF([A0, ...]) = ES(A0) || EF1([...])

    EF1([])        = ~EI(infinity)
    EF1([A1, ...]) = ~(EI(A1) || EF1([...]))

    EI(N) = "1" * B || "0" || F[B+1:] if F starts with "1" * B;
         or F otherwise if N is finite;
         or "1" * (B+1) if N is infinite,
      where F = E(N) if N is finite.

    ES(N) =   "1" || EN(N) if N >= 0
            ~("1" || E(N)) otherwise

    EN(N) = E(N+1)

    E(N) = p(N) || e(N)

    p(N) = "1" * b(N) || "0"

    e(N) = the last b(N) bits of the binary encoding of N

    b(N) = floor(lg N), where lg N is log N / log 2.

Some properties of this encoding
--------------------------------

Is the encoding optimal, or at least close to optimal?  Well, that
depends on the probability distribution of continued-fraction
quotients, which in turn depends in a complicated way on the
probability distribution of the rational numbers that are being
encoded as continued fractions.  

Absent a probability distribution over the rational numbers, the best
we can do is use empirical tests and look for redundancy.

Empirically it seems to do pretty well on the numbers I’ve tried it
on, and I don’t see how it could blow up too badly.  At worst,
e.g. for large integers, it’s slightly over a factor of 2 worse than a
specialized encoding for numbers like that.

There *is* a little bit of waste in the last quotient before the
infinity, which cannot be a 1.  You would think that maybe you could
subtract 1 from it before encoding, but that wrecks comparisons
between e.g. [0; 2] (one half) and [0; 2, 2] (two fifths).

If you have the liberty to produce bits of the encoding lazily, you
can use it to represent irrational numbers as well as rational ones.
However, in that case, equality becomes formally undecidable.