Hello everybody,

We had a conversation with Ali Cehreli about the right way of hashing values / objects. This came up in the context of D's associative arrays, but it applies to any language as well.

One of the points was, we have the following class:

==
class Student // class representing a student
{
    string firstName; // the first name of the student
    string lastName; // the last name of the student

    this(string firstName, string lastName)
    {
        this.firstName = firstName;
        this.lastName = lastName;
    }
}
==

Let s be a Student:

==
auto s = new Student("Jean", "Martin");
==

We want to be able to get the hash of s, so we override the toHash method of the Student class:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        //...
    }
}
==

The original solution proposed was:

==
    override size_t toHash() const
    {
        return (typeid(firstName).getHash(&firstName) +
                typeid(lastName).getHash(&lastName));
    }
==

However, with this solution we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean"), because addition is commutative. We ran an experiment on the performance of D's associative arrays when the hash function is poorly designed (returning the same hash for too many values): performance degrades dramatically (see the end of the post for details).
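
One standard way to get an order-sensitive combination without any allocation is to mix the two field hashes with a multiplier, as is commonly done for hash codes in other languages. This variant was not part of the original conversation, so take it as a sketch (the factor 31 is an arbitrary odd prime, a conventional choice):

==
    override size_t toHash() const
    {
        // mixing with a multiplier makes the field order matter, so
        // ("Jean", "Martin") and ("Martin", "Jean") hash differently
        size_t h = typeid(firstName).getHash(&firstName);
        return h * 31 + typeid(lastName).getHash(&lastName);
    }
==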

So here is the second solution that was proposed, which avoids giving two different names the same hash:

==
    override size_t toHash() const
    {
        auto h = firstName ~ ":" ~ lastName; // we suppose that ":" is not a valid character in a name
        return typeid(h).getHash(&h);
    }
==

A ":" was added so new Student("Martin", "Jean") gets a different hash than new Student("Mar", "tinJean")).
Let's call the following solution the "second bis solution":

==
    override size_t toHash() const
    {
        auto h = firstName ~ lastName;
        return typeid(h).getHash(&h);
    }
==

This solution supposes that a name pair like ("Mar", "tinJean") colliding with ("Martin", "Jean") is strange enough to be negligible in practice. However, both solutions imply a string concatenation (and thus an allocation) every time the hash is needed. Ali agreed that concatenating strings each time would indeed be inefficient and suggested caching the value (third solution):

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash; // 0 means "not computed yet"

    override size_t toHash() const
    {
        if (cachedHash == 0)
        {
            auto h = firstName ~ ":" ~ lastName;
            // cast away const to store the lazily computed value
            // (a genuine hash of 0 would simply be recomputed each call)
            (cast(Student) this).cachedHash = typeid(h).getHash(&h);
        }
        return cachedHash;
    }
}
==

This solution doesn't compromise on the "quality" of the hash and is fast after the first use, but it has to store the hash in every instance. The hash is also computed anew for each object used as a key in an associative array, even if the hash for that specific name has already been calculated in another object.
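
If we assume the name fields never change after construction (which the caching already implicitly assumes), the hash can instead be computed eagerly in the constructor; this sidesteps both the const issue and the use of 0 as a sentinel. A sketch, not from the original exchange:

==
class Student
{
    string firstName;
    string lastName;
    size_t cachedHash;

    this(string firstName, string lastName)
    {
        this.firstName = firstName;
        this.lastName = lastName;
        // the fields are fixed from here on, so hash once up front
        auto h = firstName ~ ":" ~ lastName;
        cachedHash = typeid(h).getHash(&h);
    }

    override size_t toHash() const
    {
        return cachedHash;
    }
}
==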

Questions are:
 - What is the most efficient solution, and in which cases?
 - Is it preferable to try to make the hash unique (second bis solution), or to make the hash cheaper to evaluate and accept that, in rare cases, two different objects get the same hash (second solution)?
 - Is solution 3 good practice? Is it preferable to the other solutions? When?
 - Is solution 1 preferable to the other solutions, given that first name / last name inversions are quite "rare"?
 - How does solution 1 compare to solution 2 bis?

It would also be interesting to have ideas for the general case, i.e. arbitrary objects rather than names, which have particularities such as "first names are rarely last names and vice versa", or for other specific situations. The questions are oriented toward associative arrays in D, but they still apply to other uses of hashes.
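
For the general case, a field-by-field combiner avoids the allocation entirely. Here is a sketch, assuming a druntime recent enough to expose object.hashOf with a seed parameter:

==
class Student
{
    string firstName;
    string lastName;

    override size_t toHash() const
    {
        // each field's hash seeds the next, so the field order matters
        // and no intermediate string is allocated
        size_t h = hashOf(firstName);
        return hashOf(lastName, h);
    }
}
==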


For those who are interested, here is Ali's experiment on the performance of associative arrays in D depending on the design of the hash function (I took the liberty of modifying it slightly).

==
import std.stdio;
import std.random;
import std.string;

class Clock
{
    int hour;
    int minute;
    int second;

    static size_t opCmpCount = 0;
    static size_t opEqualsCount = 0;
    static size_t toHashCount = 0;

    override equals_t opEquals(Object o) const
    {
        ++opEqualsCount;

        auto rhs = cast(const Clock)o;

        return (rhs &&
                (hour == rhs.hour) &&
                (minute == rhs.minute) &&
                (second == rhs.second));
    }

    override int opCmp(Object o) const
    {
        ++opCmpCount;

        /* Taking advantage of the automatically-maintained
         * order of the types. */
        if (typeid(this) != typeid(o)) {
            return typeid(this).opCmp(typeid(o));
        }

        auto rhs = cast(const Clock)o;
        /* No need to check whether rhs is null, because it is
         * known on this line that it has the same type as o. */

        return (hour != rhs.hour
                ? hour - rhs.hour
                : (minute != rhs.minute
                   ? minute - rhs.minute
                   : second - rhs.second));
    }

    override string toString() const
    {
        return format("%02s:%02s:%02s", hour, minute, second);
    }

    this(int hour, int minute, int second)
    {
        this.hour = hour;
        this.minute = minute;
        this.second = second;
    }

    override hash_t toHash() const
    {
        ++toHashCount;
        return hour + minute + second; // method 1
        // return hour + 60 * minute + 3600 * second; // method 2
    }

    static void printCounts()
    {
        writefln("opEquals: %8s", opEqualsCount);
        writefln("opCmp   : %8s", opCmpCount);
        writefln("toHash  : %8s", toHashCount);
    }
}

Clock randomClock()
{
    return new Clock(uniform(0, 24), uniform(0, 60), uniform(0, 60));
}

void main()
{
    enum count = 10_000;
    Clock[Clock] aa;

    foreach (i; 0 .. count) {
        auto entry = randomClock();
        aa[entry] = entry;
    }

    foreach (i; 0 .. count) {
        auto result = (randomClock() in aa);
    }

    Clock.printCounts();
}
==

We both get the following results for method 1:

opEquals:        0
opCmp   :  1438438
toHash  :    20000


and the following results for method 2:

opEquals:        0
opCmp   :     1681
toHash  :    20000

Apart from the fact that the associative arrays in gdc and dmd don't use the opEquals method, we see that method 1 implies approx. 856 times more comparisons than method 2. I calculated that method 1 gives 139 different hashes for 80063 different elements, which means each hash stands for an average of approx. 576 elements. Method 2 gives exactly one hash per element. This is admittedly an extreme case; results might be quite different for a hash function that only "sometimes" maps 2 elements to the same hash.
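
To see how narrow method 1's hash range is, one can enumerate every possible clock and count the distinct sums. A small sketch, not part of the original experiment:

==
import std.stdio;

void main()
{
    bool[int] seen;
    foreach (h; 0 .. 24)
        foreach (m; 0 .. 60)
            foreach (s; 0 .. 60)
                seen[h + m + s] = true; // method 1's hash value

    // all 24 * 60 * 60 = 86_400 possible clocks fall into a very
    // small set of sums (at most 142, since 0 <= h + m + s <= 141)
    writefln("%s distinct hashes for %s clocks", seen.length, 24 * 60 * 60);
}
==

Method 2, by contrast, is injective over the valid ranges: hour + 60 * minute is always below 3600, so two different (hour, minute, second) triples can never produce the same value.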
