[jira] Commented: (MAHOUT-579) group Id should be included in clusterId for MinHash clustering

Forest Tan (JIRA) Sun, 09 Jan 2011 17:22:10 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979446#action_12979446
 ]


Forest Tan commented on MAHOUT-579:
-----------------------------------

Hi Ankur, I totally agree that it is necessary to "concatenate multiple 
consecutive hash-signatures together in a sliding window fashion to form 
cluster-ids". 
What I am talking about is a different problem: even if the minhash values 
of two vectors differ under every hash function, the two vectors can still 
end up in the same cluster with the current code.

For example, the feature vectors of users A and B are:
user A: (1,2,3) 
user B: (4,5,6)
Clearly, the Jaccard similarity coefficient of A and B is zero.
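
The zero similarity can be checked with a minimal sketch like the one below. 
The class and method names (JaccardExample, jaccard) are made up for 
illustration and are not part of Mahout; the two sets are the feature 
vectors given above.

```java
import java.util.HashSet;
import java.util.Set;

final class JaccardExample {
    /** Jaccard coefficient |A n B| / |A u B|; taken to be 0 when both sets are empty. */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<Integer> userA = Set.of(1, 2, 3);
        Set<Integer> userB = Set.of(4, 5, 6);
        // The sets share no element, so the coefficient is 0 / 6.
        System.out.println(jaccard(userA, userB)); // prints 0.0
    }
}
```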

Suppose there are two hash functions:
HF1: x%4
HF2: (x+1)%4

So,
minhash(user A, HF1)=1
minhash(user B, HF1)=0
minhash(user A, HF2)=0
minhash(user B, HF2)=1
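
These four values can be reproduced with a small sketch. ToyMinHash and 
minHash are hypothetical names used only here; the hash functions are the 
toy HF1 and HF2 above, not the ones Mahout actually uses.

```java
import java.util.function.IntUnaryOperator;

final class ToyMinHash {
    /** Smallest hash value over all features of one vector. */
    static int minHash(int[] features, IntUnaryOperator hash) {
        int min = Integer.MAX_VALUE;
        for (int feature : features) {
            min = Math.min(min, hash.applyAsInt(feature));
        }
        return min;
    }

    public static void main(String[] args) {
        IntUnaryOperator hf1 = x -> x % 4;       // HF1 from the example
        IntUnaryOperator hf2 = x -> (x + 1) % 4; // HF2 from the example
        int[] userA = {1, 2, 3};
        int[] userB = {4, 5, 6};

        System.out.println(minHash(userA, hf1)); // prints 1
        System.out.println(minHash(userB, hf1)); // prints 0
        System.out.println(minHash(userA, hf2)); // prints 0
        System.out.println(minHash(userB, hf2)); // prints 1
    }
}
```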

Using HF1, the minhash values of users A and B are different (1 and 0).
Using HF2, the minhash values of users A and B are different (0 and 1).
But users A and B will still belong to the same clusters: 
If KEY_GROUPS=1, users A and B will both belong to the cluster with 
clusterid=0 and the cluster with clusterid=1. 
If KEY_GROUPS=2, users A and B will both belong to the cluster with 
clusterid=0-1 and the cluster with clusterid=1-0.

I think this clustering result is not the expected one.
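
The KEY_GROUPS=1 case can be illustrated with the following sketch. It is a 
simplified stand-in for the cluster-id loop quoted from MinHashMapper.java 
in the issue description, not the Mahout code itself; the class and method 
names are hypothetical, and the minhash values are hard-coded from the 
example above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class ClusterIdCollision {
    /** Builds cluster ids from consecutive groups of minhash values, without any group index. */
    static Set<String> clusterIds(int[] minHashValues, int keyGroups) {
        if (keyGroups < 1) {
            throw new IllegalArgumentException("keyGroups must be >= 1");
        }
        Set<String> ids = new LinkedHashSet<>();
        for (int i = 0; i < minHashValues.length; i += keyGroups) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < keyGroups && i + j < minHashValues.length; j++) {
                sb.append(minHashValues[i + j]).append('-');
            }
            sb.setLength(sb.length() - 1); // drop the trailing '-'
            ids.add(sb.toString());
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] userA = {1, 0}; // minhash under HF1, HF2
        int[] userB = {0, 1};
        Set<String> shared = new LinkedHashSet<>(clusterIds(userA, 1));
        shared.retainAll(clusterIds(userB, 1));
        // Both users land in clusters "1" and "0" although no hash function agrees on them.
        System.out.println(shared); // prints [1, 0]
    }
}
```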

> group Id should be included in clusterId for MinHash clustering
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-579
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-579
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Forest Tan
>            Assignee: Ankur
>             Fix For: 0.5
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The current implementation of MinHash clustering uses N groups of hash 
> values as the clusterid, e.g., 10003226-1109023.
> The code (MinHashMapper.java) is as follows:
> for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
> {
>     StringBuilder clusterIdBuilder = new StringBuilder();
>     for (int j = 0; (j < this.keyGroups) && (i + j < this.numHashFunctions); j++)
>     {
>         clusterIdBuilder.append(this.minHashValues[(i + j)]).append('-');
>     }
>     String clusterId = clusterIdBuilder.toString();
>     clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
>     Text cluster = new Text(clusterId);
>     Writable point;
>     if (this.debugOutput)
>     {
>         point = new VectorWritable(featureVector.clone());
>     }
>     else
>     {
>         point = new Text(item.toString());
>     }
>     context.write(cluster, point);
> }
> For example, when KEY_GROUPS=1 and NUM_HASH_FUNCTIONS=2, and the minhash 
> result is:
> userid, minhash1, minhash2
> A, 100, 200
> B, 200, 100
> the clustering result will be:
> clusterid, userid
> 100, A
> 200, A
> 200, B
> 100, B
> Users A and B will then both be in clusters 100 and 200. 
> However, the first and second hash functions are different, so the fact 
> that minhash1 of A equals minhash2 of B does not mean the two users are 
> similar.
> The fix is easy: just change the line
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> to
> clusterId = clusterId + i;
> After the fix, the clustering result will be:
> clusterid, userid
> 100-0, A
> 200-1, A
> 200-0, B
> 100-1, B
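
The effect of the proposed change can be sketched outside Hadoop as follows. 
This is only an illustration under stated assumptions: GroupedClusterId and 
clusterIds are hypothetical names, and the appended value is the loop 
variable i (the index of the first hash function in the group), exactly as 
in the suggested line "clusterId = clusterId + i".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class GroupedClusterId {
    /** Builds cluster ids that end with the starting hash index of their group. */
    static List<String> clusterIds(int[] minHashValues, int keyGroups) {
        if (keyGroups < 1) {
            throw new IllegalArgumentException("keyGroups must be >= 1");
        }
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < minHashValues.length; i += keyGroups) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < keyGroups && i + j < minHashValues.length; j++) {
                sb.append(minHashValues[i + j]).append('-');
            }
            sb.append(i); // keep the trailing '-' and append the group's start index
            ids.add(sb.toString());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> idsA = clusterIds(new int[] {100, 200}, 1);
        List<String> idsB = clusterIds(new int[] {200, 100}, 1);
        System.out.println(idsA); // prints [100-0, 200-1]
        System.out.println(idsB); // prints [200-0, 100-1]
        System.out.println(Collections.disjoint(idsA, idsB)); // prints true
    }
}
```

With KEY_GROUPS=1 the start index and the group number coincide; for larger 
KEY_GROUPS the suffix is the start index, which still distinguishes groups.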

