[
https://issues.apache.org/jira/browse/MAHOUT-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979307#action_12979307
]
Ankur commented on MAHOUT-579:
------------------------------
> However, the first and the second hash functions are different, so, it
> doesn't mean the two users are similar even if minhash1 of A equals to
> minhash2 of B
This does mean that there is some degree of similarity between A and B.
What we are trying to do with the hashing is to reduce the dimensionality
(features here) of an item while trying to preserve its proximity to its
similar items.
More the degree of similarity (i.e more common features) higher are the chances
of hash collision increasing the probability of two items agreeing on hash
signature(s).
The actual code goes like this :-
// output the cluster information
for (int i = 0; i < numHashFunctions; i++) {
StringBuilder clusterIdBuilder = new StringBuilder();
for (int j = 0; j < keyGroups; j++) {
clusterIdBuilder.append(minHashValues[(i + j) %
numHashFunctions]).append('-');
}
String clusterId = clusterIdBuilder.toString();
clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
Text cluster = new Text(clusterId);
What we do here is to concatenate multiple consecutive hash-signatures together
in a sliding window fashion to form cluster-ids so that average similarity of
items in a particular cluster is higher. (Note:- Clusters with very low
membership are discarded in reducer).
So 2 items that are highly similar to each other will appear together in more
clusters while the ones that are less similar, will appear together in fewer
clusters.
In your example, A & B look 100% equal as per Jaccard coefficient similarity
measure. So its expected that they appear together in more than one cluster.
A little bit of theory behind Minhash is explained here -
https://issues.apache.org/jira/browse/MAHOUT-344?focusedCommentId=12916491&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12916491
Also you can take a quick look at the PPT explaining some concepts here -
www.stanford.edu/class/cs276b/handouts/minhash.pdf
> group Id should be included in clusterId for MinHash clustering
> ---------------------------------------------------------------
>
> Key: MAHOUT-579
> URL: https://issues.apache.org/jira/browse/MAHOUT-579
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.4
> Reporter: Forest Tan
> Assignee: Ankur
> Fix For: 0.5
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Current implementation of MinHash clustering use N groups of hash value as
> clusterid, e.g., 10003226-1109023
> And the code(MinHashMapper.java) is as following:
> for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
> {
> StringBuilder clusterIdBuilder = new StringBuilder();
> for (int j = 0; (j < this.keyGroups) && (i + j <
> this.numHashFunctions); j++)
> {
> clusterIdBuilder.append(this.minHashValues[(i +
> j)]).append('-');
> }
> String clusterId = clusterIdBuilder.toString();
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> Text cluster = new Text(clusterId);
> Writable point;
> if (this.debugOutput)
> point = new VectorWritable(featureVector.clone());
> else
> {
> point = new Text(item.toString());
> }
> context.write(cluster, point);
> }
> For example, when KEY_GROUPS=1, NUM_HASH_FUNCTIONS=2, and minhash result is:
> userid, minhash1, minhash2
> A, 100, 200
> B, 200, 100
> the clustering result will be:
> clusterid, userid
> 100, A
> 200, A
> 200, B
> 100, B
> And user A, B will be in the same cluster 100 and 200.
> However, the first and the second hash functions are different, so, it
> doesn't mean the two users are similar even if minhash1 of A equals to
> minhash2 of B.
> The fix is easy, just change the line
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> to
> clusterId = clusterId + i;
> After the fix, the clustering result will be:
> clusterid, userid
> 100-0, A
> 200-1, A
> 200-0, B
> 100-1, B
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.