On Wednesday, 7 November 2012 at 06:38:32 UTC, Raphaël Jakse wrote:

==
    override size_t toHash() const
    {
        return (typeid(firstName).getHash(&firstName) +
                typeid(lastName).getHash(&lastName));
    }
==


Isn't the real problem the addition? You want to mix the bits together in a consistent way that is not commutative, so that the order of the fields matters. I've seen things like:

result = 37 * first + 17 * second
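
A minimal sketch of that weighted mix in D, applied to the Student struct from the original question (the struct and its fields come from that example; the bare `size_t toHash() const` signature and the use of the 37/17 multipliers here are my simplification, not library code):

```d
import std.stdio;

struct Student
{
    string firstName;
    string lastName;

    size_t toHash() const
    {
        // Hash each field the same way the original snippet does.
        size_t first  = typeid(firstName).getHash(&firstName);
        size_t second = typeid(lastName).getHash(&lastName);
        // Weighted, order-dependent mix: swapping the two fields
        // changes the result, unlike plain addition.
        return 37 * first + 17 * second;
    }
}

void main()
{
    auto s1 = Student("Jean", "Martin");
    auto s2 = Student("Martin", "Jean");
    // With plain addition these two would collide; with the weighted
    // mix they almost certainly differ.
    writeln(s1.toHash(), " vs ", s2.toHash());
}
```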

I've incorporated this functionality in a toHash that can be mixed in. Any feedback on it would be great. See:

http://forum.dlang.org/post/tbjrqaogxegtyexnf...@forum.dlang.org
https://github.com/patefacio/d-help/blob/master/d-help/opmix/mix.d

It only supports structs at the moment, but the hashing could be extended to support classes. The mixin provides toHash, opEquals and opCmp, which allows the following code for your example. Below, the hash codes for s1 and s2 are different, the instances are not equal, and s1 is less than s2, which is what you want.

import std.stdio;
import opmix.mix;

struct Student // struct representing a student
{
  mixin(HashSupport);
  string firstName; // the first name of the student
  string lastName; // the last name of the student
}

void main() {
  auto s1 = Student("Jean", "Martin");
  auto s2 = Student("Martin", "Jean");
  writeln(s1.toHash(), " vs ", s2.toHash());
  writeln(s1 == s2);
  writeln(s1 < s2);
}


However, with this solution, we get the same hash for new Student("Jean", "Martin") and new Student("Martin", "Jean"). We ran an experiment on the performance of associative arrays in D when the hash function is not well designed (returning the same hash for too many values). When the hash function returns the same hash for too many values, performance degrades dramatically (see the end of the post for more information).
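
The degradation that experiment measured is easy to sketch: a key type whose toHash is constant forces every entry of a built-in associative array into one bucket, so lookups degenerate into linear scans. The key type, element count, and timing setup below are mine for illustration, not from the original experiment:

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio;

// Worst-case key: every instance hashes to the same value, so all
// entries collide and equality comparisons do the real work.
struct ConstHashKey
{
    int id;
    size_t toHash() const nothrow @safe { return 0; }
    bool opEquals(ref const ConstHashKey other) const nothrow @safe
    {
        return id == other.id;
    }
}

void main()
{
    enum n = 20_000;

    bool[ConstHashKey] bad;
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        bad[ConstHashKey(i)] = true; // each insert rescans one bucket
    writeln("constant hash: ", sw.peek);

    bool[int] good; // int keys get a well-distributed default hash
    sw.reset();
    foreach (i; 0 .. n)
        good[i] = true;
    writeln("int keys:      ", sw.peek);
}
```

The first loop is quadratic in the number of entries, the second roughly linear, which matches the behavior the experiment describes.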

This makes sense and is not particular to D.

Ali agreed that concatenating strings each time would indeed be inefficient. He thought we might cache the value (third solution):

Caching the hash code is interesting, and on large classes it could save you real time. But isn't the signature shown const? Would a caching version compile? Also, what if you change the strings afterward - you get a stale hash. I suppose you could limit writes to properties and null the cached hash code on any write.
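
One way the "properties plus invalidation" idea could look in D. All names here are illustrative (none come from opmix); it sidesteps the const question raised above by simply not marking toHash const, since it mutates the cache:

```d
import std.typecons : Nullable;

struct CachedStudent
{
    private string _firstName;
    private string _lastName;
    private Nullable!size_t _cachedHash;

    @property string firstName() const { return _firstName; }
    @property void firstName(string v)
    {
        _firstName = v;
        _cachedHash.nullify(); // invalidate on write
    }

    @property string lastName() const { return _lastName; }
    @property void lastName(string v)
    {
        _lastName = v;
        _cachedHash.nullify(); // invalidate on write
    }

    size_t toHash() // not const: lazily fills the cache
    {
        if (_cachedHash.isNull)
            _cachedHash = 37 * typeid(_firstName).getHash(&_firstName)
                        + 17 * typeid(_lastName).getHash(&_lastName);
        return _cachedHash.get;
    }
}

void main()
{
    CachedStudent s;
    s.firstName = "Jean";
    s.lastName = "Martin";
    s.toHash();
    assert(!s._cachedHash.isNull); // cache populated by first call
    s.firstName = "Martin";
    assert(s._cachedHash.isNull);  // write nulled the cached hash
}
```

Because writes are funneled through the property setters, a stale hash can never be observed; the cost is one branch per toHash call and the loss of const.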

Questions are :
 - what is the most efficient solution, and in which case ?

Avoiding string concatenation is good. I think a single pass over all the important data (which in most cases is all the data) is the goal.

This seems to be an extreme case; we might expect completely different results from a function that only "sometimes" gives the same hash to two elements.

Very true. If I increase to 10_000_000 elements, I see opCmp:1913600 for the smart method and (several minutes later) opCmp:907216764 for the simple addition (method 1). In that case you know something special about the data and can take advantage of it. If I run the example using the mixin(HashSupport), it does opCmp:7793499, which is about 4 times as many compares as the smart one. The simple addition does 474 times as many compares as the smart one, so it is clearly very bad. So, if you know something special about the data - for instance, that it can easily be converted into a single number such as seconds - by all means take advantage of that. But remember: the next time you add a field to the class, if you don't have some "automated" way of pulling in info from all fields, you need to revisit the hash (and opCmp and opEquals).

Thanks
Dan

