[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ankur updated MAHOUT-344:
-------------------------
Attachment: MAHOUT-344-v7.patch
Sean,
Thanks for agreeing to merge the style and latest code changes. I'll study
your style changes and hopefully do better next time :-). Code changes should
only be in 'LastfmDataConverter' & 'LastfmClusterEvaluator'.
The updated patch fixes the Javadoc comments and adds support for converting
the LastFM 1K users dataset
(http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-1K.html)
into vector format for running MinHash.
Sadly, the changes in seed generation that we discussed did not help much, and
adding RandomUtil.getRandom(11) was causing testLinearMinHashMRJob() to fail
consistently :-( . So I reverted the code change in HashFactory.java.
> Minhash based clustering
> -------------------------
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.3
> Reporter: Ankur
> Assignee: Ankur
> Fix For: 0.4
>
> Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch,
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch, MAHOUT-344-v5.patch,
> MAHOUT-344-v6.patch, MAHOUT-344-v7.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of
> high-dimensional data. The essence of the technique is to hash each item
> using multiple independent hash functions such that similar items collide
> with higher probability than dissimilar ones. Multiple such hash tables can
> then be constructed to answer near-neighbor queries efficiently.
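
For readers unfamiliar with the technique summarized above, here is a minimal
sketch of a MinHash signature in plain Java. It is not the code in the attached
patch: the class name MinHashSketch, the linear hash family (a*x + b) mod p, and
the seed handling are all assumptions made for illustration. Each hash function
keeps the minimum hash value seen over an item set, and the fraction of
matching signature positions approximates the Jaccard similarity of two sets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration class; not part of Mahout or the MAHOUT-344 patch.
final class MinHashSketch {
    // Mersenne prime 2^31 - 1, used as the modulus of the linear hash family.
    private static final long PRIME = 2147483647L;

    private final long[] a; // multipliers, one per hash function
    private final long[] b; // offsets, one per hash function

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Returns one minimum hash value per hash function for the given item set.
    long[] signature(Set<Integer> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = Math.floorMod((long) item, PRIME);
            for (int i = 0; i < a.length; i++) {
                // a[i] and x are both below 2^31, so the product fits in a long.
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions where two signatures agree; estimates Jaccard similarity.
    static double estimateSimilarity(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }
}
```

Under these assumptions, two users with largely overlapping item sets agree on
most signature positions, so grouping users by a few concatenated signature
values tends to place similar users in the same cluster.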
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.