I'm having a little trouble understanding Mahout's minhash implementation.
Please correct me if I'm wrong, but the general intent of minhash is to
evaluate the similarity of two sparse feature vectors based on the features
they have in common, not necessarily the value of those features (as the
values are often 1 or 0 and 0 values simply aren't tracked in the sparse
vector). So given a space of 10 dimensions, if Jack had features 4 and 6
and Jill had features 5 and 6, Jacks vector would look something like {4:1,
6:1} and Jill's would like like {5:1, 6:1}. Since they have 1/3 total
features in common, their Jaccard coefficient is 1/3. Also, given K random
hash functions, we would expect about a third of them to return a minimum
value for each of the three keys 4, 5 and 6 and thus about a third of them
would also return the same minimum value for {4, 6} and {5, 6} (i.e. the
third that return a minimum hash value for the key 6). That's my basic
English explanation of the purpose of minhash -- again, somebody please
correct me if I'm wrong.
Given that understanding, can somebody explain why Mahout's minhash
implentation is hasing the values from the feature vectors rather than the
keys?
See the following code from MinHashMapper.java
for (int i = 0; i < numHashFunctions; i++) {
for (Vector.Element ele : featureVector) {
/// Shouldn't the following line say ele.index();
int value = (int) ele.get();
bytesToHash[0] = (byte) (value >> 24);
bytesToHash[1] = (byte) (value >> 16);
bytesToHash[2] = (byte) (value >> 8);
bytesToHash[3] = (byte) value;
int hashIndex = hashFunction[i].hash(bytesToHash);
if (minHashValues[i] > hashIndex) {
minHashValues[i] = hashIndex;
}
}
}
The code in TestMinHashClustering also seems to written with the expectation
that minhash should be hashing the values rather than the keys. Am I
reading something wrong here? Is this the intended use? Are we supposed to
be putting the feature ids into the double value fields of the feature
vectors?
Thanks,
Jeff