Hello everybody,

We had a conversation with Ali Cehreli about the right way of hashing values / objects. This came up in the context of D's associative arrays, but it applies to any language as well.

One of the points was, we have the following class:

==
class Student // class representing a student
{
    string firstName; // the first name of the student
    string lastName; // the last name of the student

    this(string firstName, string lastName)
    {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s, so we override the toHash method of the Student class:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        //...
    }
}
==

The original solution proposed was:

==
    override size_t toHash() const
    {
        return (typeid(firstName).getHash(&firstName) +
                typeid(lastName).getHash(&lastName));
    }
==

However, with this solution we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean"), because addition is commutative. We ran an experiment on the performance of D's associative arrays when the hash function is poorly designed (returning the same hash for too many values): performance degrades dramatically (see the end of the post for details).
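
One standard way to get an order-sensitive combination without any allocation is to mix the two field hashes with a multiplier, as is commonly done for hash codes in other languages. This variant was not part of the original conversation, so take it as a sketch (the factor 31 is an arbitrary odd prime, a conventional choice):

==
    override size_t toHash() const
    {
        // mixing with a multiplier makes the field order matter, so
        // ("Jean", "Martin") and ("Martin", "Jean") hash differently
        size_t h = typeid(firstName).getHash(&firstName);
        return h * 31 + typeid(lastName).getHash(&lastName);
    }
==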

So here is the second solution that was proposed, which avoids giving two different names the same hash:

==
    override size_t toHash() const
    {
        auto h = firstName ~ ":" ~ lastName; // we suppose that ":" is not a valid character in a name
        return typeid(h).getHash(&h);
    }
==

A ":" was added so new Student("Martin", "Jean") gets a different hash than new Student("Mar", "tinJean")).
Let's call the following solution the "second bis solution":

==
    override size_t toHash() const
    {
        auto h = firstName ~ lastName;
        return typeid(h).getHash(&h);
    }
==

This solution supposes that a name pair like ("Mar", "tinJean") colliding with ("Martin", "Jean") is strange enough to be negligible in practice. However, both solutions imply a string concatenation (and thus an allocation) every time the hash is needed. Ali agreed that concatenating strings each time would indeed be inefficient and suggested caching the value (third solution):

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash; // 0 means "not computed yet"

    override size_t toHash() const
    {
        if (cachedHash == 0)
        {
            auto h = firstName ~ ":" ~ lastName;
            // cast away const to store the lazily computed value
            // (a genuine hash of 0 would simply be recomputed each call)
            (cast(Student) this).cachedHash = typeid(h).getHash(&h);
        }
        return cachedHash;
    }
}
==

This solution doesn't compromise on the "quality" of the hash and is fast after the first use, but it has to store the hash in every instance. The hash is also computed anew for each object used as a key in an associative array, even if the hash for that specific name has already been calculated in another object.
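
If we assume the name fields never change after construction (which the caching already implicitly assumes), the hash can instead be computed eagerly in the constructor; this sidesteps both the const issue and the use of 0 as a sentinel. A sketch, not from the original exchange:

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash;

    this(string firstName, string lastName)
    {
        this.firstName = firstName;
        this.lastName = lastName;
        // the fields are fixed from here on, so hash once up front
        auto h = firstName ~ ":" ~ lastName;
        cachedHash = typeid(h).getHash(&h);
    }

    override size_t toHash() const
    {
        return cachedHash;
    }
}
==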

Questions are:
 - What is the most efficient solution, and in which cases?
 - Is it preferable to try to make the hash unique (second bis solution), or to make the hash cheaper to evaluate and accept that, in rare cases, two different objects get the same hash (second solution)?
 - Is solution 3 good practice? Is it preferable to the other solutions? When?
 - Is solution 1 preferable to the other solutions, given that first name / last name inversions are quite "rare"?
 - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, i.e. arbitrary objects rather than names, which have particularities such as "first names are rarely last names and vice versa", or for other specific situations. The questions are oriented toward associative arrays in D, but they still apply to other uses of hashes.
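
For the general case, a field-by-field combiner avoids the allocation entirely. Here is a sketch, assuming a druntime recent enough to expose object.hashOf with a seed parameter:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        // each field's hash seeds the next, so the field order matters
        // and no intermediate string is allocated
        size_t h = hashOf(firstName);
        return hashOf(lastName, h);
    }
}
==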


For those who are interested, here is Ali's experiment on the performance of associative arrays in D depending on the design of the hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
    int hour;
    int minute;
    int second;

    static size_t opCmpCount = 0;
    static size_t opEqualsCount = 0;
    static size_t toHashCount = 0;

    override equals_t opEquals(Object o) const
    {
        ++opEqualsCount;

        auto rhs = cast(const Clock)o;

        return (rhs &&
                (hour == rhs.hour) &&
                (minute == rhs.minute) &&
                (second == rhs.second));
    }

    override int opCmp(Object o) const
    {
        ++opCmpCount;

        /* Taking advantage of the automatically-maintained
         * order of the types. */
        if (typeid(this) != typeid(o)) {
            return typeid(this).opCmp(typeid(o));
        }

        auto rhs = cast(const Clock)o;
        /* No need to check whether rhs is null, because it is
         * known on this line that it has the same type as o. */

        return (hour != rhs.hour
                ? hour - rhs.hour
                : (minute != rhs.minute
                   ? minute - rhs.minute
                   : second - rhs.second));
    }

    override string toString() const
    {
        return format("%02s:%02s:%02s", hour, minute, second);
    }

    this(int hour, int minute, int second)
    {
        this.hour = hour;
        this.minute = minute;
        this.second = second;
    }

    override hash_t toHash() const
    {
        ++toHashCount;
        return hour + minute + second; // method 1
        // return hour + 60 * minute + 3600 * second; // method 2
    }

    static void printCounts()
    {
        writefln("opEquals: %8s", opEqualsCount);
        writefln("opCmp   : %8s", opCmpCount);
        writefln("toHash  : %8s", toHashCount);
    }
}

Clock randomClock()
{
    return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
    enum count = 10_000;
    Clock[Clock] aa;

    foreach (i; 0 .. count) {
        auto entry = randomClock();
        aa[entry] = entry;
    }

    foreach (i; 0 .. count) {
        auto result = (randomClock() in aa);
    }

    Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that the associative arrays in gdc and dmd don't use the opEquals method, we see that method 1 implies approx. 856 times more comparisons than method 2. I calculated that method 1 gives 139 different hashes for 80063 different elements, which means each hash stands for an average of approx. 576 elements. Method 2 gives exactly one hash per element. This is admittedly an extreme case; results might be quite different for a hash function that only "sometimes" maps 2 elements to the same hash.
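
To see how narrow method 1's hash range is, one can enumerate every possible clock and count the distinct sums. A small sketch, not part of the original experiment:

==
import std.stdio;

void main()
{
    bool[int] seen;
    foreach (h; 0 .. 24)
        foreach (m; 0 .. 60)
            foreach (s; 0 .. 60)
                seen[h + m + s] = true; // method 1's hash value

    // all 24 * 60 * 60 = 86_400 possible clocks fall into a very
    // small set of sums (at most 142, since 0 <= h + m + s <= 141)
    writefln("%s distinct hashes for %s clocks", seen.length, 24 * 60 * 60);
}
==

Method 2, by contrast, is injective over the valid ranges: hour + 60 * minute is always below 3600, so two different (hour, minute, second) triples can never produce the same value.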
