Re: Better directory hashing for cyrus-imapd-2.0.x

Lawrence Greenfield Tue, 05 Dec 2000 22:07:35 -0800
This sort of directory hash is very easy to implement in Cyrus; a
better directory hash than the one included wasn't chosen since we
wanted something simple that worked well enough for us.

Unfortunately, due to extreme laziness on my part, the directory hash
function used is used in multiple places.  While any one place can be
replaced without the others, you probably will want to replace them
all.

In 2.0, look for

mailbox.c:
  mailbox_hash_mbox()
  mailbox_hash_quota()

mboxlist.c:
  mboxlist_hash_usersubs()

seen_db.c:
  getpath()

It's similar functions in 1.6.

If you'd like to consolidate things into one macro in a header file
that can be used from each of these, I'd gladly accept the patch and
we'd be able to change the hash function (at compile time) more
easily.

Larry

   From: [EMAIL PROTECTED]
   Date: Mon, 27 Nov 2000 19:33:03 -0600 (CST)

   I'm in the process of setting up a cyrus server for about 40,000 users.
   Of these, about 3/4 have userids that begin with the string `um',
   followed by an abbreviation of the person's last name, possibly
   followed by a 1 to 3 digit number.  The rest have userids derived
   from their last names and initials.  Now, cyrus 2.0.x uses the first
   character of the userid to select a hash directory in a variety of
   places in the file tree, in order to avoid degrading UFS performance
   by having a large number of files in a directory.  This clearly will
   not work well with my set of userids, and may be less than optimal
   at other sites as well.

   I did a bit of searching, and found a better directory hashing
   algorithm described by Nick Christenson on the qpopper mailing list.
   Like many good things, it originates from Knuth.  A test program
   that illustrates this algorithm is appended below.  Nick said that
   X and Y should be small primes.  I chose 3 and 5 for them.  I chose
   a value of 23 for PRIME to give a valid comparison to the existing
   cyrus hash scheme.  The algorithm is simple and seems reasonably
   fast.  Here is a comparison of the distribution of userid files for
   the two hash schemes:

   First Character:

           u    m       s       ...     i       x       q
           34652        869     767     ...     86      47      29

   New Algorithm:

           N    U       T       ...     I       G       F
           1997 1946    1944    ...     1845    1822    1787

   Even disregarding the `um' userids, the existing scheme yields quite
   an uneven distribution.  The new one, in contrast, has only about 10%
   difference between the highest and lowest categories.

   Below is the test program.  I did not experiment with altering the
   three prime numbers, although increasing PRIME would likely improve
   performance with a larger number of users by creating more hash
   directories.  I'd like to suggest that a scheme like this be added to
   cyrus 2.0.x.  In addition, if functions that need read-only access to
   the files below the hash directories could fail silently if the hash
   directory did not exist, and if functions that create the files below
   the hash directories could silently create the hash directories, there
   would be no need to pre-allocate all of them.


   ================================================================
   /* dirhash: directory hash scheme */
   /* Suggested by Nick Christenson <[EMAIL PROTECTED]> */

   #include <stdio.h>

   enum {
       X = 3,
       Y = 5,
       PRIME = 23
   }

   dirhash(char *name) {
       unsigned char *pt;
       unsigned int n;

       n = 0;
       pt = (unsigned char *)name;
       while (*pt) {
           n = ((n << X) ^ (n >> Y)) ^ *pt;
           ++pt;
       }
       return n % PRIME;
   }

   main(int argc, char **argv) {
       char buf[255];

       if (argc > 1) {
           printf("%c/%s\n", dirhash(argv[1]) + '@', argv[1]);
       }
       else {
           while (gets(buf)) {
               printf("%c/%s\n", dirhash(buf) + '@', buf);
           }
       }
       return 0;
   }

   /**/
   ================================================================



   -- 
   -Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Re: Better directory hashing for cyrus-imapd-2.0.x

Reply via email to