On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:

==
    override size_t toHash() const
    {
        return (typeid(firstName).getHash(&firstName) +
                typeid(lastName).getHash(&lastName));
    }
==


Isn't the real problem the addition? You want to mix the bits together in a consistent way that is not commutative, so that the order of the fields matters. I've seen things like:

result = 37 * first + 17 * second
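
A minimal sketch of that weighted mix in D, applied to the Student struct from the original question (the struct and its fields come from that example; the bare `size_t toHash() const` signature and the use of the 37/17 multipliers here are my simplification, not library code):

```d
import std.stdio;

struct Student
{
    string firstName;
    string lastName;

    size_t toHash() const
    {
        // Hash each field the same way the original snippet does.
        size_t first  = typeid(firstName).getHash(&firstName);
        size_t second = typeid(lastName).getHash(&lastName);
        // Weighted, order-dependent mix: swapping the two fields
        // changes the result, unlike plain addition.
        return 37 * first + 17 * second;
    }
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    // With plain addition these two would collide; with the weighted
    // mix they almost certainly differ.
    writeln(s1.toHash(), " vs ", s2.toHash());
}
```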

I've incorporated this functionality in a toHash that can be mixed in. Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnf...@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs at the moment, but the hashing could be extended to support classes. The mixin provides toHash, opEquals and opCmp, which allows the following code for your example. Below, the hash codes for s1 and s2 are different, the instances are not equal, and s1 is less than s2, which is what you want.

import std.stdio;
import opmix.mix;

struct Student // struct representing a student
{
  mixin(HashSupport);
  string firstName; // the first name of the student
  string lastName; // the last name of the student
}

void main() {
  auto s1 = Student("Jean", "Martin");
  auto s2 = Student("Martin", "Jean");
  writeln(s1.toHash(), " vs ", s2.toHash());
  writeln(s1 == s2);
  writeln(s1 < s2);
}


However, with this solution, we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean"). We ran an experiment on the performance of associative arrays in D when the hash function is not well designed (returning the same hash for too many values). When the hash function returns the same hash for too many values, performance degrades dramatically (see the end of the post for more information).
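
The degradation that experiment measured is easy to sketch: a key type whose toHash is constant forces every entry of a built-in associative array into one bucket, so lookups degenerate into linear scans. The key type, element count, and timing setup below are mine for illustration, not from the original experiment:

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio;

// Worst-case key: every instance hashes to the same value, so all
// entries collide and equality comparisons do the real work.
struct ConstHashKey
{
    int id;
    size_t toHash() const nothrow @safe { return 0; }
    bool opEquals(ref const ConstHashKey other) const nothrow @safe
    {
        return id == other.id;
    }
}

void main()
{
    enum n = 20_000;

    bool[ConstHashKey] bad;
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        bad[ConstHashKey(i)] = true; // each insert rescans one bucket
    writeln("constant hash: ", sw.peek);

    bool[int] good; // int keys get a well-distributed default hash
    sw.reset();
    foreach (i; 0 .. n)
        good[i] = true;
    writeln("int keys:      ", sw.peek);
}
```

The first loop is quadratic in the number of entries, the second roughly linear, which matches the behavior the experiment describes.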

This makes sense and is not particular to D.

Ali agreed that concatenating strings each time would indeed be inefficient. He thought we might cache the value (third solution):

Caching the hash code is interesting, and on large classes it could save you real time. But isn't the signature shown const? Would a caching version compile? Also, what if you change the strings afterward - you get a stale hash. I suppose you could limit writes to properties and null the cached hash code on any write.
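
One way the "properties plus invalidation" idea could look in D. All names here are illustrative (none come from opmix); it sidesteps the const question raised above by simply not marking toHash const, since it mutates the cache:

```d
import std.typecons : Nullable;

struct CachedStudent
{
    private string _firstName;
    private string _lastName;
    private Nullable!size_t _cachedHash;

    @property string firstName() const { return _firstName; }
    @property void firstName(string v)
    {
        _firstName = v;
        _cachedHash.nullify(); // invalidate on write
    }

    @property string lastName() const { return _lastName; }
    @property void lastName(string v)
    {
        _lastName = v;
        _cachedHash.nullify(); // invalidate on write
    }

    size_t toHash() // not const: lazily fills the cache
    {
        if (_cachedHash.isNull)
            _cachedHash = 37 * typeid(_firstName).getHash(&_firstName)
                        + 17 * typeid(_lastName).getHash(&_lastName);
        return _cachedHash.get;
    }
}

void main()
{
    CachedStudent s;
    s.firstName = "Jean";
    s.lastName = "Martin";
    s.toHash();
    assert(!s._cachedHash.isNull); // cache populated by first call
    s.firstName = "Martin";
    assert(s._cachedHash.isNull);  // write nulled the cached hash
}
```

Because writes are funneled through the property setters, a stale hash can never be observed; the cost is one branch per toHash call and the loss of const.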

Questions are :
 - what is the most efficient solution, and in which case ?

Avoiding string concatenation is good. I think a single pass over all the important data (which in most cases is all the data) is the goal.

This seems to be an extreme case; we might expect completely different results from a function that only "sometimes" gives the same hash to two elements.

Very true. If I increase to 10_000_000 elements, I see opCmp:1913600 for the smart method and (several minutes later) opCmp:907216764 for the simple addition (method 1). In that case you know something special about the data and can take advantage of it. If I run the example using the mixin(HashSupport), it does opCmp:7793499, which is about 4 times as many compares as the smart one. The simple addition does 474 times as many compares as the smart one, so it is clearly very bad. So, if you know something special about the data - for instance, that it can easily be converted into a single number such as seconds - by all means take advantage of that. But remember: the next time you add a field to the class, if you don't have some "automated" way of pulling in info from all fields, you need to revisit the hash (and opCmp and opEquals).

Thanks
Dan

