[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cristi Prodan updated MAHOUT-344: --------------------------------- Status: Patch Available (was: Open) Thank you guys for all the encouragement and advices. I'm committing my first patch for MinHah clustering. The patch contain the following things: - in core - minhash: * MinHashMapRed - removed the distributed hash need; Each mapper generates the same hash functions using the same seed (as per instructions from Ankur). * RandomLinearHashFunction - added another random linear hash function in form: h( x ) = ax + b mod p . p will be as big as possible > 10000000 and it should be prime(not done yet, but committing in this form due to some time restrictions) . - in examples - minhash directory: * DisplayMinHash - contains an example of running min-hash, with the options commented. It's basically the main function from MinHashMapRed. * PrepareDataset - this class offers the ability to convert the last-fm database suggested above in a format readable by the MinHash algorithm. It also shows a progress bar with the percent done :) . For the future I believe that all the code in the algorithm should take a more generalized form and use the Vectors classes used by Mahout, then the users could either write their own version with a Vector interface or create a tool that converts their ds to the vector format the code will know. MurmurHash is used by PrepareDataset to hash the strings which denoted users (in the original last_fm dataset) - to integers. * TestClusterQuality - gets a clustered file, generated by the minhash algorithm and computes the average for each cluster aggregated over all clusters. In each cluster the mean is computed by: SUM (similarity (item_i, item_j)) / TOTAL_SIMILARITIES, for i != j . TOTAL_SIMILARITIES = n! / k! * (n -k)!, n = total number of items in cluster, k = 2. The aggregated mean is the mean of all these values. As an example. Having the following input: 1 1 2 3 4 5 2 1 2 3 4 6 3 7 6 3 8 9 4 7 8 9 6 1 5 5 6 7 8 9 6 8 7 5 6 The first column are the users. For each user, on the lines we have the items preferred (browsed, listened to) by him. Same format in the contents of each cluster bellow. we get the following output(PARAMETERS: 20 hash functions, 4 keygroups (hash indices in a bucket), 2 - minimum items within cluster): CLUSTER ID -- 2359983695385880352354530253637788 (items=2) ========================================= 2 1 2 3 4 6 1 1 2 3 4 5 CLUSTER ID -- 236643825172184878353970117486898894 (items=2) ========================================= 4 7 8 9 6 1 3 7 6 3 8 9 CLUSTER ID -- 35606006580772015548743126287496777 (items=2) ========================================= 6 8 7 5 6 5 5 6 7 8 9 CLUSTER ID -- 38797144231157365543316465389702468 (items=2) ========================================= 6 8 7 5 6 5 5 6 7 8 9 The aggregated average over theses clusters is 0.6793650793650793. I'm now testing on the last_fm dataset. The problem I currently encounter is that the size for the clustered file is kind of big, but I'm working on tuning the params. > Minhash based clustering > ------------------------- > > Key: MAHOUT-344 > URL: https://issues.apache.org/jira/browse/MAHOUT-344 > Project: Mahout > Issue Type: Bug > Components: Clustering > Affects Versions: 0.3 > Reporter: Ankur > Assignee: Ankur > Attachments: MAHOUT-344-v1.patch > > > Minhash clustering performs probabilistic dimension reduction of high > dimensional data. The essence of the technique is to hash each item using > multiple independent hash functions such that the probability of collision of > similar items is higher. Multiple such hash tables can then be constructed > to answer near neighbor type of queries efficiently. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.