[
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cristi Prodan updated MAHOUT-344:
---------------------------------
Status: Patch Available (was: Open)
Thank you guys for all the encouragement and advices.
I'm committing my first patch for MinHah clustering. The patch contain the
following things:
- in core - minhash:
* MinHashMapRed - removed the distributed hash need; Each mapper generates the
same hash functions using the same seed (as per instructions from Ankur).
* RandomLinearHashFunction - added another random linear hash function in
form: h( x ) = ax + b mod p . p will be as big as possible > 10000000 and it
should be prime(not done yet, but committing in this form due to some time
restrictions) .
- in examples - minhash directory:
* DisplayMinHash - contains an example of running min-hash, with the options
commented. It's basically the main function from MinHashMapRed.
* PrepareDataset - this class offers the ability to convert the last-fm
database suggested above in a format readable by the MinHash algorithm. It also
shows a progress bar with the percent done :) .
For the future I believe that all the code in the algorithm should take a more
generalized form and use the Vectors classes used by Mahout, then the users
could either write their own version with a Vector interface or create a tool
that converts their ds to the vector format the code will know.
MurmurHash is used by PrepareDataset to hash the strings which denoted users
(in the original last_fm dataset) - to integers.
* TestClusterQuality - gets a clustered file, generated by the minhash
algorithm and computes the average for each cluster aggregated over all
clusters.
In each cluster the mean is computed by:
SUM (similarity (item_i, item_j)) / TOTAL_SIMILARITIES, for i != j .
TOTAL_SIMILARITIES = n! / k! * (n -k)!, n = total number of items in cluster, k
= 2.
The aggregated mean is the mean of all these values.
As an example. Having the following input:
1 1 2 3 4 5
2 1 2 3 4 6
3 7 6 3 8 9
4 7 8 9 6 1
5 5 6 7 8 9
6 8 7 5 6
The first column are the users. For each user, on the lines we have the items
preferred (browsed, listened to) by him. Same format in the contents of each
cluster bellow.
we get the following output(PARAMETERS: 20 hash functions, 4 keygroups (hash
indices in a bucket), 2 - minimum items within cluster):
CLUSTER ID -- 2359983695385880352354530253637788 (items=2)
=========================================
2 1 2 3 4 6
1 1 2 3 4 5
CLUSTER ID -- 236643825172184878353970117486898894 (items=2)
=========================================
4 7 8 9 6 1
3 7 6 3 8 9
CLUSTER ID -- 35606006580772015548743126287496777 (items=2)
=========================================
6 8 7 5 6
5 5 6 7 8 9
CLUSTER ID -- 38797144231157365543316465389702468 (items=2)
=========================================
6 8 7 5 6
5 5 6 7 8 9
The aggregated average over theses clusters is 0.6793650793650793.
I'm now testing on the last_fm dataset. The problem I currently encounter is
that the size for the clustered file is kind of big, but I'm working on tuning
the params.
> Minhash based clustering
> -------------------------
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.3
> Reporter: Ankur
> Assignee: Ankur
> Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high
> dimensional data. The essence of the technique is to hash each item using
> multiple independent hash functions such that the probability of collision of
> similar items is higher. Multiple such hash tables can then be constructed
> to answer near neighbor type of queries efficiently.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.