Better directory hashing for cyrus-imapd-2.0.x

mills Mon, 27 Nov 2000 19:31:49 -0800
I'm in the process of setting up a cyrus server for about 40,000 users.
Of these, about 3/4 have userids that begin with the string `um',
followed by an abbreviation of the person's last name, possibly
followed by a 1 to 3 digit number.  The rest have userids derived
from their last names and initials.  Now, cyrus 2.0.x uses the first
character of the userid to select a hash directory in a variety of
places in the file tree, in order to avoid degrading UFS performance
by having a large number of files in a directory.  This clearly will
not work well with my set of userids, and may be less than optimal
at other sites as well.

I did a bit of searching, and found a better directory hashing
algorithm described by Nick Christenson on the qpopper mailing list.
Like many good things, it originates from Knuth.  A test program
that illustrates this algorithm is appended below.  Nick said that
X and Y should be small primes.  I chose 3 and 5 for them.  I chose
a value of 23 for PRIME to give a valid comparison to the existing
cyrus hash scheme.  The algorithm is simple and seems reasonably
fast.  Here is a comparison of the distribution of userid files for
the two hash schemes:

First Character:

        u       m       s       ...     i       x       q
        34652   869     767     ...     86      47      29

New Algorithm:

        N       U       T       ...     I       G       F
        1997    1946    1944    ...     1845    1822    1787

Even disregarding the `um' userids, the existing scheme yields quite
an uneven distribution.  The new one, in contrast, has only about 10%
difference between the highest and lowest categories.

Below is the test program.  I did not experiment with altering the
three prime numbers, although increasing PRIME would likely improve
performance with a larger number of users by creating more hash
directories.  I'd like to suggest that a scheme like this be added to
cyrus 2.0.x.  In addition, if functions that need read-only access to
the files below the hash directories could fail silently if the hash
directory did not exist, and if functions that create the files below
the hash directories could silently create the hash directories, there
would be no need to pre-allocate all of them.


================================================================
/* dirhash: directory hash scheme */
/* Suggested by Nick Christenson <[EMAIL PROTECTED]> */

#include <stdio.h>

enum {
    X = 3,
    Y = 5,
    PRIME = 23
}

dirhash(char *name) {
    unsigned char *pt;
    unsigned int n;

    n = 0;
    pt = (unsigned char *)name;
    while (*pt) {
        n = ((n << X) ^ (n >> Y)) ^ *pt;
        ++pt;
    }
    return n % PRIME;
}

main(int argc, char **argv) {
    char buf[255];

    if (argc > 1) {
        printf("%c/%s\n", dirhash(argv[1]) + '@', argv[1]);
    }
    else {
        while (gets(buf)) {
            printf("%c/%s\n", dirhash(buf) + '@', buf);
        }
    }
    return 0;
}

/**/
================================================================



-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Better directory hashing for cyrus-imapd-2.0.x

Reply via email to