Thanks Robin, I will try to have a look at that.

Cristi.
On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil <robin.a...@gmail.com> wrote:

> Why don't you try it on 20 newsgroups? There are about 17-18 unique
> topics and a couple of overlapping ones. You can easily find issues
> with the clustering code on that dataset. Once it's done you can try
> bigger datasets like Wikipedia.
>
> Robin
>
> On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan
> <prodan.crist...@gmail.com> wrote:
>
> > Hi,
> >
> > Can anyone please point me to a good data set on which I might try
> > SimHash clustering?
> > Thank you,
> >
> > Cristi
> >
> > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan
> > <prodan.crist...@gmail.com> wrote:
> >
> > > Hello again,
> > >
> > > First of all, thank you all for taking the time to answer my
> > > ideas. Based on your thoughts, I have been digging around, and the
> > > project I would very much like to propose is a MapReduce
> > > implementation of the simhash algorithm. Thank you Ted for the
> > > great idea! I'm keen on starting to implement a new algorithm for
> > > Mahout, one which seems manageable and doable during the summer
> > > period and which has a clear scope. I have noticed quite a few
> > > comments advising against taking on a project that would be too
> > > big, and I intend to follow that advice.
> > >
> > > After taking a deep look at the article pointed out by Ted, as
> > > well as other articles and similar ideas (Charikar's algorithm,
> > > Ankur's patch for MinHash), I've decided to write some notes on
> > > the algorithm and a schedule for implementing the idea during the
> > > competition. I kindly ask for your feedback on this proposal.
> > >
> > >
> > > 1. Problem and outline of simhash
> > > ---------------------------------
> > >
> > > Detecting similar files and classifying documents requires complex
> > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash
> > > algorithm [1] relies on computing a hash function that hashes
> > > similar files to similar values. File similarities are then
> > > determined by comparing the pre-sorted hash key values (reducing
> > > the complexity to O(n log n)). To further improve the detection
> > > mechanism, the algorithm also stores some auxiliary data used to
> > > compute the hash keys. This serves as a second filter, applied
> > > after a hash key comparison indicates that two files are
> > > potentially similar.
> > >
> > > Properties of the similarity function and the algorithm:
> > > - very similar files map to very similar, or even the same, hash
> > >   key;
> > > - the distance between keys should be some measure of the
> > >   difference between files. On its own this would lead to keys
> > >   proportional to file sizes and would create false positives, so
> > >   the auxiliary data provides an easy and efficient way of
> > >   refining the similarity detection;
> > > - the metric used is a binary metric (simhash operates at byte
> > >   level);
> > > - given a similarity metric, there needs to be a threshold
> > >   determining how close within the metric files need to be to
> > >   count as similar.
> > >
> > > I would like my implementation of the simhash algorithm to answer
> > > two questions:
> > > - retrieving the set of files similar to a given file;
> > > - retrieving all pairs of similar files.
> > >
> > >
> > > 2. simhash implementation
> > > -------------------------
> > > Choose 16 8-bit tags - the bytes used for pattern matching. The
> > > idea is to count the occurrences of these tags within the file
> > > under processing.
> > >
> > >
> > > 2.1 Computing the hash function for each file
> > > ---------------------------------------------
> > > for each directory
> > >   for each file within directory
> > >     - scan through the file and look for matches for each tag;
> > >     - the detection window is moved one bit at a time;
> > >     - if a match is found, set a skip_counter and then decrement
> > >       it for the following bits;
> > >     - a count of matches is kept for each tag, and these counts
> > >       are stored as SUM_TABLE_ENTRIES;
> > >     - the KEY for the current file is computed as a function of
> > >       the SUM_TABLE_ENTRIES (a linear combination of the sums)
> > >       [#1];
> > >     - after this processing, we store:
> > >       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
> > >
> > > The key function could also be implemented to take file extensions
> > > into account.
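Below is a rough Java sketch of the per-file step in 2.1, just to make
the idea concrete. The tag values, the skip distance and the key
coefficients are illustrative guesses, not taken from the proposal or
from [1]:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SimHashSketch {

      // 16 arbitrary 8-bit tags; a real implementation would choose
      // these deliberately (or randomly).
      private static final int[] TAGS = {
          0x35, 0x4A, 0x5F, 0x63, 0x77, 0x8B, 0x92, 0xA6,
          0xB1, 0xC4, 0xD8, 0xE2, 0xF0, 0x1D, 0x2E, 0x09
      };

      // Assumed: after a match, skip this many bit positions.
      private static final int SKIP_BITS = 8;

      /** SUM_TABLE_ENTRIES: per-tag match counts, window moved one bit at a time. */
      static int[] sumTableEntries(byte[] data) {
        int[] counts = new int[TAGS.length];
        int skip = 0;
        for (long bit = 0; bit + 8 <= (long) data.length * 8; bit++) {
          if (skip > 0) { skip--; continue; }
          int window = windowAt(data, bit);
          for (int t = 0; t < TAGS.length; t++) {
            if (window == TAGS[t]) {
              counts[t]++;
              skip = SKIP_BITS;  // the skip_counter from the pseudocode
              break;
            }
          }
        }
        return counts;
      }

      // The 8 bits starting at bitOffset, as an int in [0, 255].
      private static int windowAt(byte[] data, long bitOffset) {
        int byteIdx = (int) (bitOffset / 8);
        int shift = (int) (bitOffset % 8);
        int hi = data[byteIdx] & 0xFF;
        int lo = (byteIdx + 1 < data.length) ? data[byteIdx + 1] & 0xFF : 0;
        return ((hi << shift) | (lo >>> (8 - shift))) & 0xFF;
      }

      /** KEY: an (illustrative) linear combination of the sum table. */
      static long key(int[] sumTable) {
        long key = 0;
        for (int t = 0; t < sumTable.length; t++) {
          key += (t + 1L) * sumTable[t];
        }
        return key;
      }

      public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + " -> key=" + key(sumTableEntries(data)));
      }
    }

Sorting files by this key is what brings the candidate search down to
O(n log n), as described in section 1.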
> > > 2.2 Finding similarities
> > > ------------------------
> > > for each FILE
> > >   SIMILAR_FILES = files with keys within a tolerance range of
> > >                   KEY(FILE);
> > >   for each SIMILAR_FILE
> > >     if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
> > >       discard the candidate
> > >     else
> > >       compute the distance between SIMILAR_FILE's and FILE's
> > >       SUM_TABLE_ENTRIES (i.e. the sum of the absolute values of
> > >       the differences between their entries)
> > >       if the computed distance is within a tolerance or is equal
> > >       to 0 then
> > >         "the files are considered similar (identical if the
> > >         distance is 0)"
> > >
> > >
> > > 3. Distributing simhash on Hadoop
> > > ---------------------------------
> > > The algorithm above is well suited to a MapReduce implementation.
> > >
> > > 1. Map
> > > In this phase we compute the hash for a file.
> > > It outputs (File, simhash_key).
> > >
> > > 2. Reduce
> > > Once every item has a simhash, group every file with the same
> > > simhash into a cluster, sorted by the simhash key (the key is an
> > > integer [#1]).
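To make the MapReduce phase concrete, here is a skeletal and purely
hypothetical Hadoop job. Note that it emits (simhash_key, file) rather
than (File, simhash_key), so that the framework itself groups files
with equal keys into a single reduce call:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SimHashJob {

      /** Map: input lines are file paths; output is (simhash key, path). */
      public static class SimHashMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text filePath, Context ctx)
            throws IOException, InterruptedException {
          byte[] data = readBytes(filePath.toString());
          // Reuses the SimHashSketch helpers sketched earlier.
          long key = SimHashSketch.key(SimHashSketch.sumTableEntries(data));
          ctx.write(new LongWritable(key), filePath);
        }
      }

      /** Reduce: files sharing a simhash key arrive together as one cluster. */
      public static class ClusterReducer
          extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> files,
                              Context ctx)
            throws IOException, InterruptedException {
          StringBuilder cluster = new StringBuilder();
          for (Text f : files) {
            cluster.append(f).append(' ');
          }
          // The second filter from 2.2 (SUM_TABLE_ENTRIES distance)
          // could be applied in memory to the cluster members here.
          ctx.write(key, new Text(cluster.toString().trim()));
        }
      }

      // Illustrative only: reads a local file; a real job would read
      // from HDFS via org.apache.hadoop.fs.FileSystem.
      static byte[] readBytes(String path) throws IOException {
        return java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path));
      }
    }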
> > > QUESTIONS SO FAR:
> > > 1. For the distributed version, I believe there should be a
> > > preprocessing phase in which data about the files is stored on
> > > HDFS in a proper vector format? Would it be better to store this
> > > in HBase, Cassandra or something similar?
> > > 2. Would it be better in the reduce phase to actually eliminate
> > > the duplicates and create a single sorted-by-hash file?
> > >
> > >
> > > 4. Implementation schedule
> > > --------------------------
> > >
> > > *** May 24 - Jul 12:
> > > * Week 1: write documentation for the algorithm on the wiki and
> > >   start planning how the objects would interact (classes, methods
> > >   etc.).
> > > * Weeks 2,3,4: write the algorithm for hashing the files.
> > > * Week 5: test thoroughly.
> > > * Week 6: document the MapReduce implementation - 2 days; organize
> > >   the data structures - 3 days.
> > > * Weeks 7,8,9: write the distributed MapReduce implementation of
> > >   the algorithm along with tests. Test as much as possible on
> > >   different datasets.
> > >
> > > *** Jul 16 - Aug 9:
> > > * Weeks 10,11,12,13: continue testing the distributed
> > >   implementation on different datasets. Also, if time permits, I
> > >   would like to make graphs and publish the results of the tests
> > >   carried out.
> > >
> > > I'm very open to suggestions regarding this plan, so any
> > > observation is highly appreciated.
> > >
> > > Thank you again for taking the time to read this.
> > >
> > > Regards,
> > > Cristi.
> > >
> > >
> > > [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity
> > > Detection, December 13, 2007.
> > >
> > > On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:
> > >
> > > > Minhash clustering is important for duplicate detection. You can
> > > > also do simhash clustering
> > > > <http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>,
> > > > which might be simpler to implement. I can imagine an
> > > > implementation where the map generates the simhash and emits
> > > > multiple copies keyed on part of the hash. Near duplicates would
> > > > thus be grouped together in the reduce, where an in-memory
> > > > duplicate detection could be done.
> > > >
> > > > Sean's intuitions about project size are generally better than
> > > > mine, but if you are strict about not increasing scope, I think
> > > > you could succeed with this project.
> > > >
> > > > On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan
> > > > <prodan.crist...@gmail.com> wrote:
> > > >
> > > >> IDEA 1 - MinHash clustering
> > > >> ---------------------------
> > > >> The first idea came after taking a look at Google's paper on
> > > >> collaborative filtering for their news system [2]. In that
> > > >> paper, I looked at MinHash clustering.
> > > >> My first question is: is MinHash clustering considered cool? If
> > > >> yes, then I would like to take a stab at implementing it.
> > > >> The paper also describes the implementation in a MapReduce
> > > >> style. Since this is only a suggestion, I will not elaborate
> > > >> very much on the solution now. I would like to ask you whether
> > > >> this might be considered a good choice (i.e. important for the
> > > >> framework to have something like this) and whether this is a
> > > >> big enough project.
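For comparison with the simhash sketches above, the MinHash signature
behind IDEA 1 can also be sketched in a few lines of Java. The hash
mixing below is a toy illustration, not the scheme used in the Google
News paper [2]:

    import java.util.Random;
    import java.util.Set;

    public class MinHashSketch {

      /**
       * For each of numHashes random hash functions, keep the minimum
       * hashed value over the item set. Sets that share many items
       * agree on many signature slots, so the signature (or a band of
       * it) can serve as a cluster key.
       */
      static long[] signature(Set<Integer> items, int numHashes, long seed) {
        Random rnd = new Random(seed);
        long[] sig = new long[numHashes];
        for (int h = 0; h < numHashes; h++) {
          long a = rnd.nextLong() | 1L;  // odd multiplier
          long b = rnd.nextLong();
          long min = Long.MAX_VALUE;
          for (int item : items) {
            long v = a * item + b;       // simple illustrative mixing
            if (v < min) {
              min = v;
            }
          }
          sig[h] = min;
        }
        return sig;
      }
    }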