[GSoC] Proposal to implement SimHash clustering on MapReduce
------------------------------------------------------------

                 Key: MAHOUT-365
                 URL: https://issues.apache.org/jira/browse/MAHOUT-365
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Cristi Prodan


Application for Google Summer of Code 2010 - Mahout Project

Student: Cristian Prodan

1. Synopsis

I will add a MapReduce implementation of the SimHash clustering algorithm to 
the Mahout project. This algorithm provides an efficient way of finding 
similar or identical files in a large collection of files. 


2. Project

Storage capacities keep growing, and it becomes difficult to organize and 
manage the resulting file systems. It is easy to lose track of identical 
copies or older versions of files in a directory structure. Duplicate 
detection technologies do not extend well when files are not identical. A 
typical idea for detecting similar files is to map the features of a file 
into a high-dimensional space and then use distance within that space as a 
measure of similarity. This may not be feasible, since comparing all pairs 
gives the algorithm O(n^2) complexity. If these file-to-vector mappings are 
instead reduced to a one-dimensional space, the data points can be sorted in 
O(n log n) time - a big increase in detection speed. 

I will implement the SimHash algorithm presented in detail in [1]. The idea 
is the following: using a hash function that hashes similar files to similar 
values, file similarity can be determined simply by comparing pre-sorted hash 
key values. 
I will implement a family of similarity hash functions which do this, as 
described in [1]; a minimal sketch of one such function follows the list 
below. Furthermore, performance will be enhanced by storing auxiliary data 
used to compute the hash keys. This data will be used as a second filter 
after a hash key comparison indicates that two files are potentially 
similar. 

Properties of the similarity function and the algorithm:
- very similar files map to very similar, or even the same, hash key;
- the distance between keys should be some measure of the difference between 
files. Keys that simply grow with file size would create false positives, so 
the auxiliary data mentioned above provides an easy and efficient way of 
refining the similarity detection;
- the metric used will be a binary metric (simhash operates at the byte 
level);
- given a similarity metric, there needs to be a threshold to determine how 
close within the metric files need to be in order to count as similar. 

From a distributed point of view, the algorithm above is well suited for a 
MapReduce implementation. The sketch is the following:

I. MAP phase
In this phase we compute the hash for each file, along with additional info 
which serves as a second filter in similarity detection. 
The mapper outputs (simhash_key, file), so that the shuffle groups files by 
key.

II. REDUCE phase
Once every file has a simhash, group all files with the same simhash into a 
cluster, sorted by the simhash key (the key is an integer). 
* The similarity check can be done in main memory. 
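
A hedged sketch of these two phases against Hadoop's 
org.apache.hadoop.mapreduce API could look as follows. It assumes an input 
format, not shown here, that presents each whole file as a single record 
(file name as key, raw bytes as value), and it reuses the simhash function 
sketched in section 2; both are illustrative assumptions rather than settled 
design.

{code:java}
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SimHashJob {

  /** MAP: compute each file's simhash key and emit (simhash_key, file). */
  public static class SimHashMapper
      extends Mapper<Text, BytesWritable, IntWritable, Text> {
    @Override
    protected void map(Text fileName, BytesWritable content, Context ctx)
        throws IOException, InterruptedException {
      byte[] bytes = Arrays.copyOf(content.getBytes(), content.getLength());
      // Emitting the key first lets Hadoop's shuffle sort and group by it.
      ctx.write(new IntWritable(SimpleSimHash.simhash(bytes)), fileName);
    }
  }

  /** REDUCE: files arriving under the same key form one candidate cluster. */
  public static class ClusterReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> files, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder cluster = new StringBuilder();
      for (Text file : files) {
        cluster.append(file).append(' ');
      }
      // The in-memory second-filter similarity check would refine this
      // candidate cluster before it is emitted.
      ctx.write(key, new Text(cluster.toString().trim()));
    }
  }
}
{code}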


3. Deliverables:

1. A batch tool for sim-hashing files. The hashes will be stored either on 
the local file system (for small tests), on HDFS, or in HBase. 
2. A tool for answering the following types of queries (see the sketch after 
this list):
- retrieving the set of files similar to a given file;
- retrieving all pairs of similar files.
There will be an option to run this as a standalone program, for tests or 
smaller-scale purposes.
3. Documentation and unit tests for the written code. 
4. Getting-started tutorials for SimHash.
5. Demos of SimHash applied to the 20newsgroups and Wikipedia data sets.
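
As an illustration of the first query type, the sketch below answers "files 
similar to a given file" as a range scan over the pre-computed keys. The 
in-memory TreeMap stands in for the HDFS/HBase-backed store, and the 
tolerance parameter is an assumption of mine; both would be replaced by the 
real index in the final tool.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SimilarFileQuery {

  // Sorted index from simhash key to the files that hash to it.
  private final NavigableMap<Integer, List<String>> index = new TreeMap<>();

  public void add(int simhashKey, String fileName) {
    index.computeIfAbsent(simhashKey, k -> new ArrayList<>()).add(fileName);
  }

  /** Returns candidate files whose key is within +/- tolerance of queryKey. */
  public List<String> similarTo(int queryKey, int tolerance) {
    List<String> candidates = new ArrayList<>();
    // Sorted keys make this a cheap range scan; integer overflow at the
    // range ends is ignored in this sketch.
    for (List<String> files : index
        .subMap(queryKey - tolerance, true, queryKey + tolerance, true)
        .values()) {
      candidates.addAll(files); // the second filter would refine these
    }
    return candidates;
  }
}
{code}

Answering the second query type (all pairs of similar files) would then be a 
single linear sweep over the same sorted keys.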


4. Benefits for the Mahout community:

1) A distributed tool for efficient similarity detection of files in large 
datasets. Algorithms for detecting similar files can also be useful for 
classification purposes and as an aid to search. The next release of Mahout 
would then contain this functionality. 
2) The tool can also be used at smaller scale, by running it in a 
non-distributed mode on small datasets.
3) Good tutorials on how to work with this tool. 


5. Roadmap

1).  Community Bonding Period (April 21st to May 24th)
- 1 week: Familiarize myself with Mahout, Hadoop, and the existing clustering 
algorithms;
- 2 weeks: Understand the data structures (Vector, ClusterBase, Writable and 
other interfaces and classes used in Mahout) and start initial planning of 
the system in terms of interactions and data structures;
- 1 week: Set up a Hadoop cluster of 2-3 nodes;
- During this time, I plan to speak with my mentor and ask him very specific 
questions about my plans for the implementation. 

2).  May 24th to Jul 12th
- Week 1: Detailed planning on how the objects would interact (classes, 
methods, the Vector interfaces I plan to use etc.) and ask feedback from the 
mentor. 
- Week 2: Write the code for hashing files (a batch process which will work 
as an indexer), according to the algorithm, in a TDD style. At the end, tests 
will accompany the code. The results will be stored on the local file system 
or HDFS.
- Week 3: Half a week: test the program in "single" (non-distributed) mode. 
Start interacting with HBase if time permits. 
- Week 4: Interact with HBase; the algorithm will store the hashes in HBase 
for fast retrieval. 
- Week 5: Test the algorithm on different datasets: local file system, 
20newsgroups and Wikipedia.
- Week 6, 7: Add the distributed MapReduce implementation (based on Hadoop) and 
write unit tests for it;
- Week 8: Continue with unit tests and test the distributed implementation on 
the data sets for local file system, HDFS and HBase. Fix outstanding bugs. 
Analyze results. 
- Week 9: Continue with testing and bug fixing and make sure everything is 
ready for mid-term evaluation;

Mid-term evaluation period.

3). Jul 16th - Aug 9th
- Week 10: Extend the batch tool with different options (different hash 
function families, threshold for similarity). Experiment with different hash 
function families; give the user an option to specify hash functions. 
- Week 11: Finish the demos for the 20newsgroups and Wikipedia datasets. 
Write some tests for them. Publish the results from the demos on the wiki.
- Week 12: Write the getting-started guides on how to use the SimHash tool;
- Week 13: Finish everything, fix outstanding bugs, and clean up and document 
undocumented code.  


6. Biography 

My name is Cristi Prodan, I am 23 years old and currently a 2nd-year student 
pursuing an MSc degree in Computer Science. During the past year, I have been 
interested in the study of machine learning, mainly Recommender Systems and 
clustering techniques. I had heard about Hadoop and Mahout and found the idea 
of working with them appealing, should I ever have the chance. My 
dissertation paper is on Recommender Systems, and I will use Mahout at its 
core. Since I heard that the Apache Software Foundation would participate in 
this year's edition of Google Summer of Code, I have been eager to try my 
hand at contributing to the Mahout project. Being a subscriber to the list 
since December 2009, I started to interact with the community and present my 
ideas. I was mostly interested in the MinHash algorithm [2] (which may also 
be used for similarity detection), whose implementation was started by a 
contributor. After analyzing his code and seeking advice from the other 
members, I submitted my first patch to Mahout (on JIRA, MAHOUT-344).
During this time I have also skimmed through the wiki and [3].
At university, I have worked extensively with Java. I am also a big fan of 
Ruby and Python, and I have recently started digging into Erlang. In the past 
two years, in my free time, I have done some freelancing, working on various 
web applications. 


7. References

[1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection, 
December 13, 2007.

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
Personalization: Scalable Online Collaborative Filtering, WWW 2007.

[3] Sean Owen, Robin Anil - Mahout in Action. Manning (MEAP), 2010. 


------------------------------------------------------
This proposal is partly inspired by that of Konstantin Kafer 
[http://drupal.org/files/application.pdf]
