Hi There are couple of questions I would like to ask.
1. What type of clustering would you like me to use? Is K-Means good enough? 2.Can you tell me more about the map reduce code that you would have written. Or first do I need to implement that as well using Hadoop? Thanking You Tanya On Thu, Apr 1, 2010 at 1:21 PM, Cristian Prodan <prodan.crist...@gmail.com>wrote: > Thanks Robin, I will try have a look at that. > > Cristi. > > On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil <robin.a...@gmail.com> wrote: > > > Why dont you try it on 20 newsgroups. There are about 17-18 unique topics > > and couple of overlapping ones. You can easily find issues with the > > clustering code with that dataset. Once its done you can try bigger > > datasets > > like wikipedia > > > > Robin > > > > On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan > > <prodan.crist...@gmail.com>wrote: > > > > > Hi, > > > > > > Can anyone please point me a good data set on which I might try SimHash > > > clustering ? > > > Thank you, > > > > > > Cristi > > > > > > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan > > > <prodan.crist...@gmail.com>wrote: > > > > > > > Hello again, > > > > > > > > First of all, thank you all for taking time to answer my ideas. Based > > on > > > > your thoughts, I have been digging around, and the project I would > > very > > > > much like to propose is a MapReduce implementation of the simhash > > > algorithm. > > > > Thank you Ted for the great idea! I'm keen on starting implementing a > > new > > > > algorithm for Mahout, which seems manageable and doable during the > > summer > > > > period and which has a clear scope. I have noticed quite a few > > comments, > > > > advising about not taking a project which would be to big and I > intend > > to > > > > follow them. > > > > > > > > After taking a deep look at the article pointed out by Ted, as well > as > > > > other articles and simlar ideas (Charikar's algorithm, Ankur's patch > > for > > > > Minhash), I've decided to write some notes on the algorithm and a > > > schedule > > > > for implemeting the idea during the competition. I kindly ask for > your > > > > feedback on this proposal. > > > > > > > > > > > > 1. Problem and outline of simhash > > > > --------------------------------------- > > > > > > > > Detecting similar files and cliassifying documents requires complex > > > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash > > > algorithm > > > > [1] relies on computing a hash function that hashes similar files to > > > similar > > > > values. The file similarities would then be determined by comparing > the > > > > pre-sorted hash key values (and reducing the complexity to O(n log > n)). > > > > In order to futher improve the detection mechanism, the algorithm > will > > > also > > > > store some auxiliary data, used to compute the hash keys. This will > be > > > used > > > > as a second filter, i.e. after the hash key comparison indicates that > > two > > > > files are potentiallly similar. > > > > > > > > Properties for the similarity function and the algorithm: > > > > - very similar files map to very similar or even the same hash key; > > > > - distance between keys should be some measure of the difference > > between > > > > files. This would lead to keys proportional to file sizes and this > > would > > > > create false positives. Some auxiliary data will provide an easy and > > > > efficient way of refining the similarity detection. > > > > - the metric used will be a binary metric (simhash operats at byte > > > level). > > > > - given a similarity metric, there needs to be a threshold to > determine > > > how > > > > close within the metric files need to be to count as similar. > > > > > > > > I would like my implementation of the simhash algorithm to answer two > > > > questions: > > > > - retrieving a set of files, similar to a given file; > > > > - retrieving all pairs of similar files. > > > > > > > > > > > > 2. simhash implementation > > > > ------------------------------ > > > > Chose 16, 8-bit tags - the bytes used for pattern matching. The idea > is > > > to > > > > count the occurances of these tags within the file under processing. > > > > > > > > > > > > 2.1 Computing the hash function for each file > > > > ------------------------------------------------- > > > > for each directory > > > > for each file within directory > > > > - scan through the file and look for matches for each tag; > > > > - detection window is moved one bit at a time > > > > - if a match if found > > > > - set a skip_counter and then decrement it for the following > > bits; > > > > - a count of matches is stored for each tag, and these are stored > > as > > > > SUM_TABLE_ENTRIES > > > > - the KEY for the current file is computed as a function of the > > > > SUM_TABLE_ENTRIES (a linear combination of the sum) [#1] > > > > - after this processing, we store: > > > > | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES | > > > > > > > > The key function could also be implemented to take into account file > > > > extensions. > > > > > > > > > > > > 2.2 Finding similarities > > > > ---------------------------- > > > > for each FILE > > > > SIMILAR_FILES = files with keys within a tollerance range of > > KEY(FILE); > > > > for each SIMILAR_FILE > > > > if file_size(SIMILAR_FILE) differes a lot from file_size(FILE) > > > > discard the query > > > > else > > > > compute the distance between SIMLAR_FILE and FILE ' s > > > > SUM_TABLE_ENTRIES > > > > (i.e. "the sum of the absolute values of the diffs > between > > > > their entries) > > > > if the computed distance is within a tollerance or is > > equal > > > to > > > > 0 then > > > > "The files are identical" > > > > > > > > > > > > > > > > 3. Distributing simhash on hadoop > > > > ------------------------------------- > > > > The algorithm above is very suited for a MapReduce implementation. > > > > > > > > 1. Map > > > > In this phase we do the computation of the hash for a file. > > > > It outputs (File, simhash_key). > > > > > > > > 2. Reduce > > > > Once every item has a simhash, group everyfile with the same simhash > > into > > > a > > > > cluster, sorted by the simhashkey (the key is an integer [#1]) . > > > > > > > > QUESTIONS SO FAR: > > > > 1. For the distributed version, I believe there should be a > > prepocessing > > > > phase where data about files should be stored on HDFS, in a proper > > vector > > > > format ? Would it be better to store this in HBase, Cassandra or > > > something > > > > similar ? > > > > 2. Would it be better in the reduce phase to actually eliminate the > > > > duplicates and create a single sorted-by-hash file ? > > > > > > > > > > > > 4. Implementation schedule > > > > ------------------------------ > > > > > > > > *** May 24 - Jul 12: > > > > * Week 1: write documentation for the algorithm on the wiki and start > > > > planning the objects would interact (classes, methods etc.). * Week > > > 2,3,4: > > > > write the algorithm for hashing of the files. > > > > * Week 5: Thoroughly test > > > > * Week 6: Document the MapReduce implementation - 2 days. Organize > the > > > data > > > > structures 3 - days. > > > > * Week 7,8,9: Write the distributed MapReduce Implementation of the > > > > algorithm along with tests. Test as much as possible on different > > > datasets. > > > > > > > > *** Jul 16 - Aug 9: > > > > * Week 10, 11, 12, 13: Continue with tests on different datasets for > > the > > > > distributed implementation. Also if time permits, I would like to > make > > > > graphs and publish results with the carried tests. > > > > > > > > I'm very open to suggestions regarding this plan, so any observation > is > > > > highly appreciated. > > > > > > > > Thank you again for taking time to read this. > > > > > > > > Regards, > > > > Cristi. > > > > > > > > > > > > [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity > > > > Detection, December 13, 2007. > > > > On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote: > > > > > > > > > Minhash clustering is important for duplicate detection. You can > > also > > > > > do simhash > > > > > clustering< > > > > http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf > >which > > > > > might be simpler to implement. I can imagine an implementation > where > > > > > the map generates the simhash and emits multiple copies keyed on > part > > > of > > > > the > > > > > hash. Near duplicates would thus be grouped together in the reduce > > > where > > > > an > > > > > in-memory duplicate detection could be done. > > > > > > > > > > Sean's intuitions about project size are generally better than > mine, > > > but > > > > if > > > > > you limit you are strict about not increasing scope, I think you > > could > > > > > succeed with this project. > > > > > > > > > > On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan < > > > > prodan.crist...@gmail.com>wrote: > > > > > > > > > >> IDEA 1 - MinHash clustering > > > > >> --------------------------- > > > > >> The first idea come after taking a look at Google 's paper on > > > > collaborative > > > > >> filtering for their news system[2]. In that paper, I looked at > > MinHash > > > > >> clustering. > > > > >> My first question is: is MinHash clustering considered cool ? If > > yes, > > > > than > > > > >> I would like to take a stab at implementing it. > > > > >> The paper also describes the implementation in a MapReduce style. > > > Since > > > > >> this is only a suggestion I will not elaborate very much on the > > > solution > > > > >> now. I would like to ask you weather this might be considered a > good > > > > choice > > > > >> (i.e. important for the framework to have something like this) and > > if > > > > this > > > > >> is a big enough project. > > > > >> > > > > > > > > > > > > > >