[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ankur updated MAHOUT-344:
-------------------------
Attachment: MAHOUT-344-v4.patch
Finally some action from my side ;-)
1. HashFunction is now an interface with a single method - hash().
2. Implementations of the different hash functions have moved to a HashFactory,
which also provides a factory method for fetching a HashFunction of a requested
type (linear, polynomial, murmur).
3. Minhash mapper/reducer code cleaned up quite a bit.
4. Added options for minimum vector size and hashType.
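To make items 1 and 2 concrete, here is a rough sketch of what a single-method
HashFunction interface plus a factory could look like. This is not the code in
the patch: the hash(byte[]) signature, the HashType enum, the create() method
and the seed parameter are all assumptions made for illustration, and the
MURMUR branch is only a stand-in that applies the MurmurHash3 32-bit finalizer
mix rather than a full MurmurHash.

```java
import java.util.Random;

/** Hypothetical single-method interface (signature is an assumption). */
interface HashFunction {
    int hash(byte[] bytes);
}

/** Hypothetical factory; names and signatures are illustrative only. */
final class HashFactory {
    enum HashType { LINEAR, POLYNOMIAL, MURMUR }

    private static final long PRIME = 2147483647L; // Mersenne prime 2^31 - 1

    private HashFactory() { }

    /** Returns a hash function of the requested type, parameterised by seed. */
    static HashFunction create(HashType type, long seed) {
        Random rnd = new Random(seed);
        final long a = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // non-zero coefficient
        final long b = rnd.nextInt(Integer.MAX_VALUE);
        final long c = rnd.nextInt(Integer.MAX_VALUE);
        switch (type) {
            case LINEAR:
                // h(x) = (a*x + b) mod p
                return bytes -> (int) ((a * fold(bytes) + b) % PRIME);
            case POLYNOMIAL:
                // h(x) = (a*x^2 + b*x + c) mod p, reduced term by term to avoid overflow
                return bytes -> {
                    long x = fold(bytes);
                    long x2 = (x * x) % PRIME;
                    return (int) (((a * x2) % PRIME + (b * x) % PRIME + c) % PRIME);
                };
            case MURMUR:
                // Stand-in only: MurmurHash3 fmix32 applied to the folded key,
                // not a complete MurmurHash implementation. May return negatives.
                final int s = (int) seed;
                return bytes -> fmix32(s ^ (int) fold(bytes));
            default:
                throw new IllegalArgumentException("Unknown hash type: " + type);
        }
    }

    /** Folds a byte array into a value in [0, PRIME). */
    private static long fold(byte[] bytes) {
        long x = 0;
        for (byte v : bytes) {
            x = (x * 31 + (v & 0xff)) % PRIME;
        }
        return x;
    }

    /** MurmurHash3 32-bit finalizer mix. */
    private static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

With something along these lines, the mapper would only need to ask for, say,
HashFactory.create(HashFactory.HashType.LINEAR, seed) once per hash function
and call hash() on each feature, without knowing which family is in use.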
Pending tasks:
1. Fix the unit test case.
2. Fix the example code over the Last.fm dataset.
3. Add Javadoc documentation.
I hope to complete the above tasks by EOD tomorrow and submit a new patch.
> Minhash based clustering
> -------------------------
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.3
> Reporter: Ankur
> Assignee: Ankur
> Fix For: 0.4
>
> Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch,
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of
> high-dimensional data. The essence of the technique is to hash each item using
> multiple independent hash functions so that similar items are more likely to
> collide. Multiple such hash tables can then be constructed to answer
> near-neighbor queries efficiently.
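For readers new to the idea in the description above, a minimal sketch of
minhash signatures follows. It is not taken from any of the attached patches:
the MinHashSketch class, its method names and the choice of linear hash
functions modulo a prime are assumptions for illustration. Each item set is
reduced to the minimum hash value under each of several hash functions, and
the fraction of matching signature positions estimates the Jaccard similarity
of the two sets.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative minhash signature computation (hypothetical class). */
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // Mersenne prime 2^31 - 1

    private final long[] a; // multipliers, one per hash function
    private final long[] b; // offsets, one per hash function

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // non-zero
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    /** Signature = per-function minimum of (a*x + b) mod p over all items. */
    long[] signature(int[] items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = Math.floorMod((long) item, PRIME);
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of positions where the two signatures agree. */
    static double estimateSimilarity(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }
}
```

In a clustering job, groups of signature values would typically be
concatenated into a cluster key, so that items sharing a key land in the same
bucket; how the patch forms those keys is not shown here.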