Hi,

Can anyone please point me to a good data set on which I might try SimHash clustering?

Thank you,
Cristi

On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan <prodan.crist...@gmail.com> wrote:
> Hello again,
>
> First of all, thank you all for taking the time to answer my ideas. Based on
> your thoughts, I have been digging around, and the project I would very
> much like to propose is a MapReduce implementation of the simhash algorithm.
> Thank you Ted for the great idea! I'm keen to start implementing a new
> algorithm for Mahout, one that seems manageable and doable during the summer
> period and that has a clear scope. I have noticed quite a few comments
> advising against taking on a project that would be too big, and I intend to
> follow them.
>
> After taking a deep look at the article pointed out by Ted, as well as
> other articles and similar ideas (Charikar's algorithm, Ankur's patch for
> MinHash), I've decided to write some notes on the algorithm and a schedule
> for implementing the idea during the competition. I kindly ask for your
> feedback on this proposal.
>
>
> 1. Problem and outline of simhash
> ---------------------------------------
>
> Detecting similar files and classifying documents requires complex
> heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm
> [1] relies on computing a hash function that hashes similar files to similar
> values. File similarities are then determined by comparing the
> pre-sorted hash key values (reducing the complexity to O(n log n)).
> To further improve the detection mechanism, the algorithm also stores
> some auxiliary data used to compute the hash keys. This is used
> as a second filter, i.e. after the hash key comparison indicates that two
> files are potentially similar.
>
> Properties of the similarity function and the algorithm:
> - very similar files map to very similar or even identical hash keys;
> - the distance between keys should be some measure of the difference between
> files.
> This would lead to keys proportional to file size, which would
> create false positives. Some auxiliary data will provide an easy and
> efficient way of refining the similarity detection.
> - the metric used will be a binary metric (simhash operates at byte level);
> - given a similarity metric, there needs to be a threshold to determine how
> close within the metric files need to be to count as similar.
>
> I would like my implementation of the simhash algorithm to answer two
> questions:
> - retrieving a set of files similar to a given file;
> - retrieving all pairs of similar files.
>
>
> 2. simhash implementation
> ------------------------------
> Choose 16 8-bit tags - the bytes used for pattern matching. The idea is to
> count the occurrences of these tags within the file under processing.
>
>
> 2.1 Computing the hash function for each file
> -------------------------------------------------
> for each directory
>   for each file within the directory
>     - scan through the file and look for matches for each tag;
>     - the detection window is moved one bit at a time;
>     - if a match is found
>       - set a skip_counter and then decrement it for the following bits;
>     - a count of matches is kept for each tag, and these counts are stored
>       as the SUM_TABLE_ENTRIES;
>     - the KEY for the current file is computed as a function of the
>       SUM_TABLE_ENTRIES (a linear combination of the sums) [#1];
>     - after this processing, we store:
>       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
>
> The key function could also be implemented to take file extensions
> into account.
>
>
> 2.2 Finding similarities
> ----------------------------
> for each FILE
>   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
>   for each SIMILAR_FILE
>     if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
>       discard the query
>     else
>       compute the distance between SIMILAR_FILE's and FILE's
>       SUM_TABLE_ENTRIES
>       (i.e.
the sum of the absolute values of the differences between
>       their entries);
>       if the computed distance is equal to 0 or within a tolerance, then
>         "the files are identical"
>
>
>
> 3. Distributing simhash on Hadoop
> -------------------------------------
> The algorithm above is well suited for a MapReduce implementation.
>
> 1. Map
> In this phase we compute the hash for a file.
> It outputs (File, simhash_key).
>
> 2. Reduce
> Once every item has a simhash, group every file with the same simhash into a
> cluster, sorted by the simhash key (the key is an integer [#1]).
>
> QUESTIONS SO FAR:
> 1. For the distributed version, I believe there should be a preprocessing
> phase where data about the files is stored on HDFS, in a proper vector
> format? Would it be better to store this in HBase, Cassandra or something
> similar?
> 2. Would it be better in the reduce phase to actually eliminate the
> duplicates and create a single sorted-by-hash file?
>
>
> 4. Implementation schedule
> ------------------------------
>
> *** May 24 - Jul 12:
> * Week 1: write documentation for the algorithm on the wiki and start
> planning how the objects would interact (classes, methods etc.).
> * Week 2,3,4: write the algorithm for hashing the files.
> * Week 5: thorough testing.
> * Week 6: document the MapReduce implementation - 2 days. Organize the data
> structures - 3 days.
> * Week 7,8,9: write the distributed MapReduce implementation of the
> algorithm along with tests. Test as much as possible on different datasets.
>
> *** Jul 16 - Aug 9:
> * Week 10,11,12,13: continue with tests on different datasets for the
> distributed implementation. If time permits, I would also like to make
> graphs and publish the results of these tests.
>
> I'm very open to suggestions regarding this plan, so any observation is
> highly appreciated.
>
> Thank you again for taking the time to read this.
>
> Regards,
> Cristi.
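As a sanity check on the pseudocode in sections 2.1 and 2.2, here is a minimal single-machine sketch in Python (the eventual Mahout version would of course be Java). The 16 tag values, the skip-counter handling, the key weights, and the tolerance defaults are illustrative assumptions, not values taken from the simhash paper:

```python
# Sketch of sections 2.1/2.2: per-file tag counting into SUM_TABLE_ENTRIES,
# a key as a linear combination of the sums, and the two-stage lookup.
# TAGS, weights, and tolerances below are arbitrary illustrative choices.

TAGS = [0x5C, 0xA3, 0x1F, 0xE8, 0x42, 0x99, 0x07, 0xD6,
        0x3B, 0x71, 0xC4, 0x2E, 0x8D, 0x50, 0xBA, 0x69]

def sum_table_entries(data):
    """Move an 8-bit detection window one bit at a time and count matches
    for each tag; after a match, a skip counter skips the matched bits."""
    counts = [0] * len(TAGS)
    nbits = len(data) * 8
    skip = 0
    for pos in range(nbits - 7):
        if skip:
            skip -= 1
            continue
        byte_idx, bit_off = divmod(pos, 8)
        window = data[byte_idx] << 8
        if byte_idx + 1 < len(data):
            window |= data[byte_idx + 1]
        window = (window >> (8 - bit_off)) & 0xFF  # 8 bits starting at `pos`
        for i, tag in enumerate(TAGS):
            if window == tag:
                counts[i] += 1
                skip = 7  # decrement over the remaining bits of the match
                break
    return counts

def simhash_key(entries):
    # A linear combination of the sum table entries (weights are assumed).
    return sum((i + 1) * c for i, c in enumerate(entries))

def similar_files(query, corpus, key_tol=2, size_ratio=2.0, dist_tol=3):
    """Two-stage filter: key range first, then file size, then the sum of
    absolute differences between SUM_TABLE_ENTRIES."""
    hits = []
    for rec in corpus:
        if abs(rec["key"] - query["key"]) > key_tol:
            continue  # key outside the tolerance range
        if max(rec["file_size"], query["file_size"]) > \
                size_ratio * min(rec["file_size"], query["file_size"]):
            continue  # file sizes differ a lot: discard
        dist = sum(abs(a - b) for a, b in
                   zip(rec["entries"], query["entries"]))
        if dist <= dist_tol:
            hits.append(rec["file_name"])
    return hits
```

For example, a file that is a byte-for-byte copy of the query produces identical SUM_TABLE_ENTRIES, passes all three filters, and is reported as identical.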
>
>
> [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity
> Detection, December 13, 2007.

On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:

> Minhash clustering is important for duplicate detection. You can also
> do simhash clustering
> <http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>, which
> might be simpler to implement. I can imagine an implementation where
> the map generates the simhash and emits multiple copies keyed on part of the
> hash. Near duplicates would thus be grouped together in the reduce, where an
> in-memory duplicate detection could be done.
>
> Sean's intuitions about project size are generally better than mine, but if
> you are strict about not increasing scope, I think you could
> succeed with this project.
>
> On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan <prodan.crist...@gmail.com> wrote:
>
>> IDEA 1 - MinHash clustering
>> ---------------------------
>> The first idea came after taking a look at Google's paper on collaborative
>> filtering for their news system [2]. In that paper, I looked at MinHash
>> clustering.
>> My first question is: is MinHash clustering considered cool? If yes, then
>> I would like to take a stab at implementing it.
>> The paper also describes the implementation in a MapReduce style. Since
>> this is only a suggestion I will not elaborate very much on the solution
>> now. I would like to ask you whether this might be considered a good choice
>> (i.e. important for the framework to have something like this) and whether
>> this is a big enough project.
>>
>
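Ted's map/reduce sketch quoted above can be simulated in memory: the map emits several copies of each document keyed on pieces of its simhash, so that (by pigeonhole) any two 64-bit hashes within Hamming distance 3 differ in at most 3 of 4 16-bit pieces, share at least one piece, and therefore meet in the same reduce group. The 4-piece split and the distance threshold here are illustrative assumptions:

```python
# In-memory simulation of the map / shuffle / reduce phases Ted describes:
# map emits multiple copies keyed on part of the hash, reduce does pairwise
# in-memory duplicate detection within each group.
from collections import defaultdict
from itertools import combinations

NUM_PIECES = 4  # assumption: split a 64-bit simhash into 4 x 16-bit pieces

def map_phase(doc_id, simhash):
    """Emit one copy of the document per piece, keyed on (index, piece)."""
    for i in range(NUM_PIECES):
        piece = (simhash >> (16 * i)) & 0xFFFF
        yield (i, piece), (doc_id, simhash)

def reduce_phase(groups, max_hamming=3):
    """Within each group, compare all pairs by Hamming distance."""
    near_dupes = set()
    for docs in groups.values():
        for (id_a, h_a), (id_b, h_b) in combinations(docs, 2):
            if bin(h_a ^ h_b).count("1") <= max_hamming:
                near_dupes.add(tuple(sorted((id_a, id_b))))
    return near_dupes

def find_near_duplicates(docs):
    """docs: {doc_id: 64-bit simhash}. The shuffle step groups map output."""
    groups = defaultdict(list)
    for doc_id, h in docs.items():
        for key, value in map_phase(doc_id, h):
            groups[key].append(value)
    return reduce_phase(groups)
```

The design point is that unrelated documents rarely share a 16-bit piece, so each reduce group stays small enough for the quadratic in-memory comparison to be cheap.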