[EMAIL PROTECTED] wrote:

Wizards,

Those who've answered my previous questions about hashes have been most helpful. I especially appreciated John W. Kennedy's excellent card catalogue cabinet analogy.

I have one last (I hope) question on the topic:I understand from previous answers that not all data sets will hash "politely." Some will have a very nice one-item-to-one-bucket outcome, while others can have several, even many, items in just a few buckets. And that the worst case could be "10,000 items all in one bucket." Is there a way of determining how well my data goes into its hash? As an example of my question:

        $h{$_} = $_ for 'a'..'zzz';
        print scalar( %h ), "\n";

prints out

        14056/32768

This says that the 18,278 values generated used only 14,056 of the reserved 32,768 buckets. I understand that permutations would hash to the same key (er, the same bucket?), for example "cab" and "abc" (although I suspect this would depend on the hashing algorithm, wouldn't it?). My question here is simply if there's a way to see how "well-behaved" my data set is. Some of my scalars come out to:

        59/128 -- for 69 values
        78/128 -- for 120 values

In fact, none of them come out "okay" -- that is, x buckets for x values.

I guess a second question might be whether the answer to the first is even meaningful--at least insofar as this has any effect on performance. If it doesn't, I suppose the question's almost pointless, except as a point of knowledge.

In order to really know how effective a hash algorithm is, it is necessary to have statistics on chain lengths. I don't believe Perl gives you that, but here is a program that will simulate it for you.

use strict;
use warnings;
use integer;

# Initialize the buckets to n zeros.
sub init_buckets($$) {
        my ($buckets, $size) = @_;
        @$buckets = ();
        foreach (1..$size) {
                push @$buckets, 0;
        }
}

# Calculate the raw hash value using the documented Perl algorithm.
sub rawhash ($) {
        my ($key) = @_;
        my $hash = 0;
        for (my $i = 0; $i < length $key; ++$i) {
                $hash = $hash * 33 + ord(substr($_, $i, 1));
        }
        return $hash + ($hash >> 5);
}

# Simulate adding (or re-adding) the data.
sub add_key ($$) {
        my ($buckets, $key) = @_;
        my $hash = rawhash ($key);
        ++$$buckets[rawhash ($key) & $#$buckets];
}

# All hashes start with a minimum of eight buckets. Initialize them.
my $size = 8;
my @buckets = ();
init_buckets [EMAIL PROTECTED], $size;

# Since we are not creating a real hash, create an array for use in
# re-hashing.
my @save = ();

# Prepare a count of items added.
my $items = 0;

# This is where the work is done, reading input.
while (<>) {
        # If we have more data than buckets, double the number of
        # buckets and re-hash all the existing data.
        if (++$items > $size) {
                $size *= 2;
                init_buckets [EMAIL PROTECTED], $size;
                foreach (@save) {
                        add_key [EMAIL PROTECTED], $_;
                }
        }
        # Save the input for re-hashing.
        push @save, $_;
        # Simulate adding the new item.
        add_key [EMAIL PROTECTED], $_;
}

# Show the result.
print join ', ', @buckets;




--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
  -- Charles Williams.  "Judgement at Chelmsford"

_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Reply via email to