[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cristi Prodan updated MAHOUT-344:
---------------------------------

    Status: Patch Available  (was: Open)

Thank you guys for all the encouragement and advice. 

I'm committing my first patch for MinHash clustering. The patch contains the 
following:

- in core - minhash:
 * MinHashMapRed - removed the need for a distributed hash; each mapper 
generates the same hash functions using the same seed (as per instructions 
from Ankur).
 * RandomLinearHashFunction - added another random linear hash function of 
the form h(x) = (a*x + b) mod p. p should be as large as possible 
(> 10000000) and prime (the prime check is not done yet, but I'm committing 
in this form due to some time restrictions). A rough sketch of the idea is 
below.
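
To make the seeding idea concrete, here is a minimal sketch (class and 
method names are illustrative, not the exact ones in the patch):

import java.util.Random;

/**
 * Sketch of a random linear hash function h(x) = (a*x + b) mod p.
 * Every mapper builds the same family from the same seed, so no
 * hash state has to be distributed.
 */
public class RandomLinearHash {
  // Assumed modulus: a prime just above 10^7 (picking the prime is
  // still TODO in the patch).
  private static final long P = 10000019L;
  private final long a; // in [1, p-1]
  private final long b; // in [0, p-1]

  RandomLinearHash(Random rnd) {
    this.a = 1 + rnd.nextInt((int) P - 1);
    this.b = rnd.nextInt((int) P);
  }

  public long hash(long x) {
    long h = (a * x + b) % P;
    return h < 0 ? h + P : h; // guard against negative x
  }

  /** Same seed in every mapper => identical family everywhere. */
  public static RandomLinearHash[] family(int count, long seed) {
    Random rnd = new Random(seed);
    RandomLinearHash[] fns = new RandomLinearHash[count];
    for (int i = 0; i < count; i++) {
      fns[i] = new RandomLinearHash(rnd);
    }
    return fns;
  }
}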

- in examples - minhash directory:
 * DisplayMinHash - contains an example of running MinHash, with the 
options commented. It's basically the main function from MinHashMapRed.
 * PrepareDataset - this class converts the last.fm dataset suggested above 
into a format readable by the MinHash algorithm. It also shows a progress 
bar with the percent done :) . 
For the future I believe all the code in the algorithm should take a more 
generalized form and use the Vector classes used by Mahout; users could 
then either write their own version against a Vector interface or create a 
tool that converts their dataset to the vector format the code understands. 
MurmurHash is used by PrepareDataset to hash the strings denoting users (in 
the original last.fm dataset) to integers; a sketch of this step follows 
the list.
 * TestClusterQuality - reads a clustered file generated by the MinHash 
algorithm and computes the mean similarity for each cluster, aggregated 
over all clusters (see the sketch after this list).
In each cluster the mean is computed as:
SUM(similarity(item_i, item_j)) / TOTAL_SIMILARITIES, for i != j, 
where TOTAL_SIMILARITIES = n! / (k! * (n - k)!), with n = the number of 
items in the cluster and k = 2, i.e. the number of distinct pairs, 
n * (n - 1) / 2. For a 2-item cluster this reduces to the single pairwise 
similarity.
The aggregated mean is then the mean of these per-cluster values. 
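
The user-id hashing done by PrepareDataset boils down to something like 
this (a sketch only - it assumes Mahout's MurmurHash utility exposes a 
hash(byte[], int) method; adjust the package and signature to the actual 
class):

import java.nio.charset.Charset;
import org.apache.mahout.math.MurmurHash; // assumed location

public class UserIdHasher {
  private static final int SEED = 1; // fixed seed keeps ids stable across runs

  /** Maps a last.fm user name string to a stable integer id. */
  public static int toIntId(String userName) {
    byte[] bytes = userName.getBytes(Charset.forName("UTF-8"));
    return MurmurHash.hash(bytes, SEED);
  }
}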
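
And the quality computation in TestClusterQuality is essentially the 
following (a sketch assuming Jaccard similarity between item sets, not the 
exact patch code):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClusterQualitySketch {

  /** Jaccard similarity: |intersection| / |union| of two item sets. */
  static double similarity(Set<Integer> a, Set<Integer> b) {
    Set<Integer> inter = new HashSet<Integer>(a);
    inter.retainAll(b);
    Set<Integer> union = new HashSet<Integer>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }

  /** Mean over all distinct pairs: SUM(sim) / (n * (n - 1) / 2). */
  static double clusterMean(List<Set<Integer>> cluster) {
    int n = cluster.size();
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        sum += similarity(cluster.get(i), cluster.get(j));
      }
    }
    return sum / (n * (n - 1) / 2.0);
  }

  /** Aggregated mean: the mean of the per-cluster means. */
  static double aggregatedMean(List<List<Set<Integer>>> clusters) {
    double sum = 0.0;
    for (List<Set<Integer>> c : clusters) {
      sum += clusterMean(c);
    }
    return sum / clusters.size();
  }
}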

As an example, take the following input:

1       1       2       3       4       5
2       1       2       3       4       6
3       7       6       3       8       9
4       7       8       9       6       1
5       5       6       7       8       9
6       8       7       5       6

The first column contains the user IDs. The rest of each line lists the 
items preferred (browsed, listened to) by that user. The contents of each 
cluster below use the same format. 

With this input we get the following output (parameters: 20 hash functions, 
4 keygroups (hash indices in a bucket), minimum of 2 items per cluster):

CLUSTER ID -- 2359983695385880352354530253637788 (items=2)      
=========================================
2       1       2       3       4       6
1       1       2       3       4       5
 
CLUSTER ID -- 236643825172184878353970117486898894 (items=2)    
=========================================
4       7       8       9       6       1
3       7       6       3       8       9
 
CLUSTER ID -- 35606006580772015548743126287496777 (items=2)     
=========================================
6       8       7       5       6
5       5       6       7       8       9
 
CLUSTER ID -- 38797144231157365543316465389702468 (items=2)     
=========================================
6       8       7       5       6
5       5       6       7       8       9

The aggregated average over these clusters is 0.6793650793650793.

I'm now testing on the last.fm dataset. The problem I currently encounter 
is that the clustered output file is quite big, but I'm working on tuning 
the parameters. 


> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of 
> high-dimensional data. The essence of the technique is to hash each item 
> using multiple independent hash functions such that the probability of 
> collision of similar items is higher. Multiple such hash tables can then 
> be constructed to answer near-neighbor queries efficiently.
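
For readers new to the technique, computing a MinHash signature boils down 
to something like this (an illustrative sketch reusing the RandomLinearHash 
class above; similar sets agree on each signature position with probability 
equal to their Jaccard similarity):

import java.util.Set;

public class MinHashSketch {

  /**
   * For each hash function, keep the minimum hash value over all
   * items in the set; the resulting vector is the MinHash signature.
   */
  static long[] signature(Set<Integer> items, RandomLinearHash[] fns) {
    long[] sig = new long[fns.length];
    for (int i = 0; i < fns.length; i++) {
      long min = Long.MAX_VALUE;
      for (int item : items) {
        min = Math.min(min, fns[i].hash(item));
      }
      sig[i] = min;
    }
    return sig;
  }
}

Grouping a few signature positions together (the keygroups parameter used 
in the example run above) yields the bucket keys under which similar items 
collide.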

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
