On Oct 26, 2011, at 10:51 AM, Suneel Marthi wrote:

> I am still trying to fully understand the MinHash algorithm, and I got the
> same results as below when running the MinHashDriver.
>
> I have a use case wherein I need to determine the content similarity of 2
> documents, like what's been described in Andrei Broder's paper 'Identifying
> and Filtering Near-Duplicate Documents'
> (http://dl.acm.org/citation.cfm?id=736184).
>
> I started dissecting the clusters generated by Mahout's MinHashDriver to
> compare document content equality and to determine how accurate the
> clustering was.
> I do see that the first 2 files from the output below were put in the same
> cluster 106460162-207863047, though the actual text content in both the
> files is different. How?

What do the vectors of these look like? If I remember correctly, some of those files don't have much actual text in them, such that I wonder if they are more or less empty. Running now to check.

>
> I am assuming that the NGram attribute was set to the default value of 1 when
> creating the tf-idf vectors from sequence files.
>
> Suneel
>
>
> ________________________________
> From: Grant Ingersoll <gsing...@apache.org>
> To: user@mahout.apache.org
> Sent: Tuesday, October 25, 2011 5:55 AM
> Subject: Re: MinHash Clustering in Mahout
>
>
> On Oct 19, 2011, at 11:38 AM, Varun Thacker wrote:
>
>> I was trying to run the MinHash algorithm on the Reuters data set, so I did
>> the following before running MinHashDriver
>>
>> - Get the Reuters dataset
>> - Run org.apache.lucene.benchmark.utils.ExtractReuters to generate
>>   reuters-out from reuters-sgm (the downloaded archive)
>> - Run seqdirectory to convert reuters-out to SequenceFile format
>> - Run seq2sparse to convert SequenceFiles to sparse vector format
>>
>> I used these instructions from the K-means clustering wiki page.
>>
>> This is the command I used to run MinHashDriver
>>
>> ./mahout org.apache.mahout.clustering.minhash.MinHashDriver --input
>> /home/varun/mahout/sparse/tfidf-vectors/ -o /home/varun/mahout/minhash
>>
>> The output file looks something like this:
>>
>> 106460162-207863047    /reut2-015.sgm-653.txt
>> 106460162-207863047    /reut2-021.sgm-7.txt
>> 106460162-207863047    /reut2-013.sgm-307.txt
>> 106460162-207863047    /reut2-013.sgm-306.txt
>> 106460162-207863047    /reut2-014.sgm-786.txt
>> 106460162-207863047    /reut2-013.sgm-304.txt
>> 106460162-207863047    /reut2-013.sgm-303.txt
>> 106460162-207863047    /reut2-021.sgm-230.txt
>> 106460162-207863047    /reut2-012.sgm-548.txt
>> 106460162-207863047    /reut2-020.sgm-161.txt
>> 106460162-207863047    /reut2-021.sgm-553.txt
>> 106460162-207863047    /reut2-013.sgm-299.txt
>> 106460162-207863047    /reut2-015.sgm-284.txt
>> 106460162-207863047    /reut2-013.sgm-996.txt
>> 106460162-207863047    /reut2-021.sgm-441.txt
>> 106460162-207863047    /reut2-013.sgm-298.txt
>> 106460162-207863047    /reut2-013.sgm-995.txt
>> 106460162-207863047    /reut2-015.sgm-521.txt
>> 106460162-207863047    /reut2-020.sgm-162.txt
>> 106460162-207863047    /reut2-020.sgm-163.txt
>> 106460162-207863047    /reut2-013.sgm-296.txt
>> ...
>> ...
>>
>>
>> Is this the correct way of running MinHash?
>>
>> If yes, then I would update the wiki page
>> https://cwiki.apache.org/confluence/display/MAHOUT/Minhash+Clustering with
>> the instructions.
>>
>> Otherwise, could someone tell me what I am doing wrong?
>
> I haven't looked into the code, but I get similar outputs, so I assume it is
> working. Might be good to incorporate this into the build-reuters.sh as well
> as try it on some other input.
>
> -Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
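
On the question in this thread of how files with different text can land under the same cluster ID, here is a minimal Python sketch of min-hashing. It is not Mahout's code. It assumes documents are reduced to sets of integer term ids, and that a cluster key is built by joining a few consecutive min-hash values with a hyphen, which is only a guess based on the shape of IDs such as 106460162-207863047. The hash family, PRIME, and every function name below are hypothetical and chosen for illustration.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, used as the modulus of the toy hash family


def make_hash_params(num_hashes, seed=42):
    """Draw (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, params):
    """For each hash function, keep the smallest hash over the feature set."""
    if not features:
        raise ValueError("cannot min-hash an empty feature set")
    return [min((a * x + b) % PRIME for x in features) for a, b in params]


def cluster_keys(signature, key_groups=2):
    """Join key_groups consecutive min-hash values (wrapping around) into keys.

    Assumption: this mimics IDs like '106460162-207863047'; the real
    key-building rule in MinHashDriver may differ.
    """
    n = len(signature)
    return ["-".join(str(signature[(i + j) % n]) for j in range(key_groups))
            for i in range(n)]


def jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    params = make_hash_params(10)
    short_a = {1, 2, 3}          # a nearly empty document
    short_b = {1, 2, 3, 4}       # another short document, one extra term
    keys_a = set(cluster_keys(minhash_signature(short_a, params)))
    keys_b = set(cluster_keys(minhash_signature(short_b, params)))
    print("Jaccard:", jaccard(short_a, short_b))
    print("shared cluster keys:", sorted(keys_a & keys_b))
```

Two sets agree on a single min-hash value with probability roughly equal to their Jaccard similarity, so they share a two-value key with roughly the square of that. Very short documents that contain mostly the same few terms have a high Jaccard similarity even when a reader would call their text different, which would be consistent with the suspicion above that some of these files are more or less empty.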