[ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851416#action_12851416
 ] 

Cristi Prodan commented on MAHOUT-344:
--------------------------------------

I ran the code on the last.fm data set (2.). Due to the nature of the data set
I had to write a small program that converts it to the format used by the
algorithm. I also clustered similar users instead of songs (the
transformation I mentioned above was easier to do that way), since I wanted to
see how the algorithm runs. I used MurmurHash to map artists to integer hashes,
which can then be used by the min-hash algorithm.
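
A minimal sketch of that conversion step, not the actual program I ran. It
assumes a hypothetical input of one "user<TAB>artist" pair per line and hashes
the artist name; the class and method names are made up for illustration, and
the hash is an inline MurmurHash3 (32-bit) so the example is self-contained,
which is not necessarily the MurmurHash variant used with the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ArtistIdConverter {

    // MurmurHash3, x86 32-bit variant, written inline to avoid dependencies.
    public static int murmur3(byte[] data, int seed) {
        int h = seed;
        int len = data.length;
        int nblocks = len / 4;
        for (int i = 0; i < nblocks; i++) {
            int k = (data[4 * i] & 0xff)
                  | ((data[4 * i + 1] & 0xff) << 8)
                  | ((data[4 * i + 2] & 0xff) << 16)
                  | ((data[4 * i + 3] & 0xff) << 24);
            k *= 0xcc9e2d51;
            k = Integer.rotateLeft(k, 15);
            k *= 0x1b873593;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        // Mix in the remaining 0..3 tail bytes (intentional fall-through).
        int tail = nblocks * 4;
        int k = 0;
        switch (len & 3) {
            case 3: k ^= (data[tail + 2] & 0xff) << 16;
            case 2: k ^= (data[tail + 1] & 0xff) << 8;
            case 1: k ^= (data[tail] & 0xff);
                    k *= 0xcc9e2d51;
                    k = Integer.rotateLeft(k, 15);
                    k *= 0x1b873593;
                    h ^= k;
        }
        // Finalization: force all bits of the hash to avalanche.
        h ^= len;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Maps an artist name to an integer id (seed 0 is an arbitrary choice).
    public static int artistId(String artist) {
        return murmur3(artist.getBytes(StandardCharsets.UTF_8), 0);
    }

    // Groups "user<TAB>artist" lines into one set of artist ids per user.
    // Lines without a tab are skipped.
    public static Map<String, Set<Integer>> toUserVectors(List<String> lines) {
        Map<String, Set<Integer>> users = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue;
            }
            users.computeIfAbsent(parts[0], u -> new TreeSet<>())
                 .add(artistId(parts[1]));
        }
        return users;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "user_1\tRadiohead", "user_1\tBjork", "user_2\tRadiohead");
        System.out.println(toUserVectors(lines));
    }
}
```

Each user ends up as a set of integers, which is the kind of input a min-hash
step can consume. Distinct artists can collide on the same 32-bit id, which I
assume is acceptable for clustering purposes.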

Related to this, I would like to ask how Mahout usually deals with this
kind of situation:
- Is there a standard input format for each clustering algorithm, or does the
input format follow the same rules for all algorithms, with users
writing their own conversion tools?
- Would it be OK if I attached code to the examples dir that shows an example
run of min-hash clustering? (It would first convert the data set format
accordingly.)

@Ankur: you are using a DistributedCache for sharing the hash functions. That
requires the distributed file to be on HDFS, as far as I know. I believe it
would be nice to have a flag or something that allows storing the hash
functions on the "normal" file system, for testing purposes. What do you think?

Sorry, everybody, if I'm doing something wrong; this is the first time I'm
contributing to an open source project.

> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that the probability of collision of 
> similar items is higher. Multiple such hash tables can then be constructed  
> to answer near neighbor type of queries efficiently.
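
The description above can be illustrated with a small sketch. This is not the
code from MAHOUT-344-v1.patch; the class name, the linear hash family
(a * x + b) mod p and the parameter choices are assumptions made for the
example. It computes a min-hash signature per item set and estimates Jaccard
similarity as the fraction of signature positions that agree.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {

    // Mersenne prime 2^31 - 1, used as the modulus of the hash family.
    private static final long PRIME = 2147483647L;

    private final long[] a;
    private final long[] b;

    // Draws numHashes random hash functions h_i(x) = (a_i * x + b_i) mod PRIME.
    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt((int) (PRIME - 1)); // in [1, PRIME - 1]
            b[i] = rnd.nextInt((int) PRIME);           // in [0, PRIME - 1]
        }
    }

    // Signature position i holds the minimum of h_i over all items in the set.
    // An empty set yields Long.MAX_VALUE in every position.
    public long[] signature(Set<Integer> items) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int item : items) {
            long x = (item & 0xffffffffL) % PRIME; // treat the id as unsigned
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME; // stays below 2^63
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of positions where two signatures agree; this approximates
    // the Jaccard similarity of the underlying sets.
    public static double estimateSimilarity(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                matches++;
            }
        }
        return (double) matches / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(64, 42L);
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 6));
        System.out.println(
            estimateSimilarity(mh.signature(u1), mh.signature(u2)));
    }
}
```

Items whose signatures agree in some group of positions would then be placed
in the same cluster or hash bucket; that grouping step is left out here.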

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
