Hi,

Can anyone please point me to a good data set on which I might try SimHash clustering?

Thank you,
Cristi

On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan <prodan.crist...@gmail.com> wrote:
> Hello again,
>
> First of all, thank you all for taking the time to answer my ideas. Based on
> your thoughts, I have been digging around, and the project I would very
> much like to propose is a MapReduce implementation of the simhash algorithm.
> Thank you Ted for the great idea! I'm keen to start implementing a new
> algorithm for Mahout, one that seems manageable and doable during the summer
> period and that has a clear scope. I have noticed quite a few comments
> advising against taking on a project that would be too big, and I intend to
> follow them.
>
> After taking a deep look at the article pointed out by Ted, as well as
> other articles and similar ideas (Charikar's algorithm, Ankur's patch for
> MinHash), I've decided to write some notes on the algorithm and a schedule
> for implementing the idea during the competition. I kindly ask for your
> feedback on this proposal.
>
>
> 1. Problem and outline of simhash
> ---------------------------------------
>
> Detecting similar files and classifying documents requires complex
> heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm
> [1] relies on computing a hash function that hashes similar files to similar
> values. File similarities are then determined by comparing the
> pre-sorted hash key values (reducing the complexity to O(n log n)).
> To further improve the detection mechanism, the algorithm also stores
> some auxiliary data used to compute the hash keys. This is used
> as a second filter, i.e. after the hash key comparison indicates that two
> files are potentially similar.
>
> Properties of the similarity function and the algorithm:
> - very similar files map to very similar or even identical hash keys;
> - the distance between keys should be some measure of the difference between
> files.
> This would lead to keys proportional to file size, which would
> create false positives. Some auxiliary data will provide an easy and
> efficient way of refining the similarity detection.
> - the metric used will be a binary metric (simhash operates at byte level);
> - given a similarity metric, there needs to be a threshold to determine how
> close within the metric files need to be to count as similar.
>
> I would like my implementation of the simhash algorithm to answer two
> questions:
> - retrieving a set of files similar to a given file;
> - retrieving all pairs of similar files.
>
>
> 2. simhash implementation
> ------------------------------
> Choose 16 8-bit tags - the bytes used for pattern matching. The idea is to
> count the occurrences of these tags within the file under processing.
>
>
> 2.1 Computing the hash function for each file
> -------------------------------------------------
> for each directory
>   for each file within the directory
>     - scan through the file and look for matches for each tag;
>     - the detection window is moved one bit at a time;
>     - if a match is found
>       - set a skip_counter and then decrement it for the following bits;
>     - a count of matches is kept for each tag, and these counts are stored
>       as the SUM_TABLE_ENTRIES;
>     - the KEY for the current file is computed as a function of the
>       SUM_TABLE_ENTRIES (a linear combination of the sums) [#1];
>     - after this processing, we store:
>       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
>
> The key function could also be implemented to take file extensions
> into account.
>
>
> 2.2 Finding similarities
> ----------------------------
> for each FILE
>   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
>   for each SIMILAR_FILE
>     if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
>       discard the query
>     else
>       compute the distance between SIMILAR_FILE's and FILE's
>       SUM_TABLE_ENTRIES
>       (i.e.
the sum of the absolute values of the differences between
>       their entries);
>       if the computed distance is equal to 0 or within a tolerance, then
>         "the files are identical"
>
>
>
> 3. Distributing simhash on Hadoop
> -------------------------------------
> The algorithm above is well suited for a MapReduce implementation.
>
> 1. Map
> In this phase we compute the hash for a file.
> It outputs (File, simhash_key).
>
> 2. Reduce
> Once every item has a simhash, group every file with the same simhash into a
> cluster, sorted by the simhash key (the key is an integer [#1]).
>
> QUESTIONS SO FAR:
> 1. For the distributed version, I believe there should be a preprocessing
> phase where data about the files is stored on HDFS, in a proper vector
> format? Would it be better to store this in HBase, Cassandra or something
> similar?
> 2. Would it be better in the reduce phase to actually eliminate the
> duplicates and create a single sorted-by-hash file?
>
>
> 4. Implementation schedule
> ------------------------------
>
> *** May 24 - Jul 12:
> * Week 1: write documentation for the algorithm on the wiki and start
> planning how the objects would interact (classes, methods etc.).
> * Week 2,3,4: write the algorithm for hashing the files.
> * Week 5: thorough testing.
> * Week 6: document the MapReduce implementation - 2 days. Organize the data
> structures - 3 days.
> * Week 7,8,9: write the distributed MapReduce implementation of the
> algorithm along with tests. Test as much as possible on different datasets.
>
> *** Jul 16 - Aug 9:
> * Week 10,11,12,13: continue with tests on different datasets for the
> distributed implementation. If time permits, I would also like to make
> graphs and publish the results of these tests.
>
> I'm very open to suggestions regarding this plan, so any observation is
> highly appreciated.
>
> Thank you again for taking the time to read this.
>
> Regards,
> Cristi.
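As a sanity check on the pseudocode in sections 2.1 and 2.2, here is a minimal single-machine sketch in Python (the eventual Mahout version would of course be Java). The 16 tag values, the skip-counter handling, the key weights, and the tolerance defaults are illustrative assumptions, not values taken from the simhash paper:

```python
# Sketch of sections 2.1/2.2: per-file tag counting into SUM_TABLE_ENTRIES,
# a key as a linear combination of the sums, and the two-stage lookup.
# TAGS, weights, and tolerances below are arbitrary illustrative choices.

TAGS = [0x5C, 0xA3, 0x1F, 0xE8, 0x42, 0x99, 0x07, 0xD6,
        0x3B, 0x71, 0xC4, 0x2E, 0x8D, 0x50, 0xBA, 0x69]

def sum_table_entries(data):
    """Move an 8-bit detection window one bit at a time and count matches
    for each tag; after a match, a skip counter skips the matched bits."""
    counts = [0] * len(TAGS)
    nbits = len(data) * 8
    skip = 0
    for pos in range(nbits - 7):
        if skip:
            skip -= 1
            continue
        byte_idx, bit_off = divmod(pos, 8)
        window = data[byte_idx] << 8
        if byte_idx + 1 < len(data):
            window |= data[byte_idx + 1]
        window = (window >> (8 - bit_off)) & 0xFF  # 8 bits starting at `pos`
        for i, tag in enumerate(TAGS):
            if window == tag:
                counts[i] += 1
                skip = 7  # decrement over the remaining bits of the match
                break
    return counts

def simhash_key(entries):
    # A linear combination of the sum table entries (weights are assumed).
    return sum((i + 1) * c for i, c in enumerate(entries))

def similar_files(query, corpus, key_tol=2, size_ratio=2.0, dist_tol=3):
    """Two-stage filter: key range first, then file size, then the sum of
    absolute differences between SUM_TABLE_ENTRIES."""
    hits = []
    for rec in corpus:
        if abs(rec["key"] - query["key"]) > key_tol:
            continue  # key outside the tolerance range
        if max(rec["file_size"], query["file_size"]) > \
                size_ratio * min(rec["file_size"], query["file_size"]):
            continue  # file sizes differ a lot: discard
        dist = sum(abs(a - b) for a, b in
                   zip(rec["entries"], query["entries"]))
        if dist <= dist_tol:
            hits.append(rec["file_name"])
    return hits
```

For example, a file that is a byte-for-byte copy of the query produces identical SUM_TABLE_ENTRIES, passes all three filters, and is reported as identical.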
>
>
> [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity
> Detection, December 13, 2007.

On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:

> Minhash clustering is important for duplicate detection. You can also
> do simhash clustering
> <http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>, which
> might be simpler to implement. I can imagine an implementation where
> the map generates the simhash and emits multiple copies keyed on part of the
> hash. Near duplicates would thus be grouped together in the reduce, where an
> in-memory duplicate detection could be done.
>
> Sean's intuitions about project size are generally better than mine, but if
> you are strict about not increasing scope, I think you could
> succeed with this project.
>
> On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan <prodan.crist...@gmail.com> wrote:
>
>> IDEA 1 - MinHash clustering
>> ---------------------------
>> The first idea came after taking a look at Google's paper on collaborative
>> filtering for their news system [2]. In that paper, I looked at MinHash
>> clustering.
>> My first question is: is MinHash clustering considered cool? If yes, then
>> I would like to take a stab at implementing it.
>> The paper also describes the implementation in a MapReduce style. Since
>> this is only a suggestion I will not elaborate very much on the solution
>> now. I would like to ask you whether this might be considered a good choice
>> (i.e. important for the framework to have something like this) and whether
>> this is a big enough project.
>>
>
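Ted's map/reduce sketch quoted above can be simulated in memory: the map emits several copies of each document keyed on pieces of its simhash, so that (by pigeonhole) any two 64-bit hashes within Hamming distance 3 differ in at most 3 of 4 16-bit pieces, share at least one piece, and therefore meet in the same reduce group. The 4-piece split and the distance threshold here are illustrative assumptions:

```python
# In-memory simulation of the map / shuffle / reduce phases Ted describes:
# map emits multiple copies keyed on part of the hash, reduce does pairwise
# in-memory duplicate detection within each group.
from collections import defaultdict
from itertools import combinations

NUM_PIECES = 4  # assumption: split a 64-bit simhash into 4 x 16-bit pieces

def map_phase(doc_id, simhash):
    """Emit one copy of the document per piece, keyed on (index, piece)."""
    for i in range(NUM_PIECES):
        piece = (simhash >> (16 * i)) & 0xFFFF
        yield (i, piece), (doc_id, simhash)

def reduce_phase(groups, max_hamming=3):
    """Within each group, compare all pairs by Hamming distance."""
    near_dupes = set()
    for docs in groups.values():
        for (id_a, h_a), (id_b, h_b) in combinations(docs, 2):
            if bin(h_a ^ h_b).count("1") <= max_hamming:
                near_dupes.add(tuple(sorted((id_a, id_b))))
    return near_dupes

def find_near_duplicates(docs):
    """docs: {doc_id: 64-bit simhash}. The shuffle step groups map output."""
    groups = defaultdict(list)
    for doc_id, h in docs.items():
        for key, value in map_phase(doc_id, h):
            groups[key].append(value)
    return reduce_phase(groups)
```

The design point is that unrelated documents rarely share a 16-bit piece, so each reduce group stays small enough for the quadratic in-memory comparison to be cheap.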