Why don't you try it on 20 Newsgroups? There are about 17-18 distinct
topics and a couple of overlapping ones. You can easily find issues with
the clustering code with that dataset. Once it's done you can try bigger
datasets like Wikipedia.

Robin

On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan
<prodan.crist...@gmail.com> wrote:

> Hi,
>
> Can anyone please point me to a good data set on which I might try
> SimHash clustering?
> Thank you,
>
> Cristi
>
> On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan
> <prodan.crist...@gmail.com> wrote:
>
> > Hello again,
> >
> > First of all, thank you all for taking the time to answer my ideas. Based
> > on your thoughts, I have been digging around, and the project I would very
> > much like to propose is a MapReduce implementation of the simhash
> > algorithm. Thank you, Ted, for the great idea! I'm keen on implementing a
> > new algorithm for Mahout; this one seems manageable and doable during the
> > summer period, and it has a clear scope. I have noticed quite a few
> > comments advising against taking on a project that would be too big, and I
> > intend to follow them.
> >
> > After taking a deep look at the article pointed out by Ted, as well as
> > other articles and similar ideas (Charikar's algorithm, Ankur's patch for
> > MinHash), I've decided to write some notes on the algorithm and a schedule
> > for implementing the idea during the competition. I kindly ask for your
> > feedback on this proposal.
> >
> >
> > 1. Problem and outline of simhash
> > ---------------------------------------
> >
> > Detecting similar files and classifying documents requires complex
> > heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm
> > [1] relies on computing a hash function that hashes similar files to
> > similar values. File similarities are then determined by comparing the
> > pre-sorted hash key values, reducing the complexity to O(n log n).
> > To further improve the detection mechanism, the algorithm also stores some
> > auxiliary data used to compute the hash keys. This is used as a second
> > filter, i.e. after the hash key comparison indicates that two files are
> > potentially similar.
> >
> > Properties of the similarity function and the algorithm:
> > - very similar files map to very similar, or even the same, hash key;
> > - the distance between keys should be some measure of the difference
> > between files. On its own this tends to make keys proportional to file
> > size, which creates false positives; the auxiliary data provides an easy
> > and efficient way of refining the similarity detection;
> > - the metric used is a binary metric (simhash operates at byte level);
> > - given a similarity metric, there needs to be a threshold determining how
> > close within the metric files must be to count as similar.
> >
> > I would like my implementation of the simhash algorithm to answer two
> > kinds of queries:
> > - retrieving the set of files similar to a given file;
> > - retrieving all pairs of similar files.
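> >
> > To support the first query over the pre-sorted keys, here is a minimal
> > sketch; the class, its method names and the use of an int key are my own
> > illustrative assumptions:
> >
> > import java.util.Arrays;
> >
> > public class KeyIndex {
> >
> >   private final int[] sortedKeys; // pre-sorted simhash keys
> >
> >   public KeyIndex(int[] keys) {
> >     this.sortedKeys = keys.clone();
> >     Arrays.sort(this.sortedKeys); // the O(n log n) sort, done once
> >   }
> >
> >   /** Index bounds [from, to) of keys within +/- tolerance of key. */
> >   public int[] range(int key, int tolerance) {
> >     int from = lowerBound(sortedKeys, key - tolerance);
> >     int to = lowerBound(sortedKeys, key + tolerance + 1);
> >     return new int[] { from, to };
> >   }
> >
> >   /** First index whose value is >= target (binary search). */
> >   private static int lowerBound(int[] a, int target) {
> >     int lo = 0, hi = a.length;
> >     while (lo < hi) {
> >       int mid = (lo + hi) >>> 1;
> >       if (a[mid] < target) lo = mid + 1; else hi = mid;
> >     }
> >     return lo;
> >   }
> > }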
> >
> >
> > 2. simhash implementation
> > ------------------------------
> > Choose 16 8-bit tags - the bytes used for pattern matching. The idea is to
> > count the occurrences of these tags within the file under processing.
> >
> >
> > 2.1 Computing the hash function for each file
> > -------------------------------------------------
> > for each directory
> >   for each file within the directory
> >     - scan through the file and look for matches for each tag;
> >     - the detection window is moved one bit at a time;
> >     - if a match is found, set a skip_counter and decrement it over the
> >       following bits, so overlapping matches are not counted;
> >     - the count of matches for each tag is stored as SUM_TABLE_ENTRIES;
> >     - the KEY for the current file is computed as a function of the
> >       SUM_TABLE_ENTRIES (a linear combination of the entries) [#1];
> >     - after this processing, we store:
> >       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
> >
> > The key function could also be implemented to take file extensions into
> > account.
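> >
> > Below is a minimal Java sketch of this scan. The 16 tag byte values, the
> > skip distance and the weights of the linear combination are placeholder
> > choices of mine, not values fixed by the algorithm:
> >
> > import java.io.IOException;
> > import java.nio.file.Files;
> > import java.nio.file.Paths;
> >
> > public class SimHashSketch {
> >
> >   // 16 8-bit tags used for pattern matching (arbitrary sample values).
> >   private static final int[] TAGS = {
> >       0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88,
> >       0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xF0, 0x0F
> >   };
> >   private static final int SKIP = 8; // bits to skip after a match
> >
> >   /** Slides an 8-bit window one bit at a time, counting tag matches. */
> >   public static int[] sumTableEntries(byte[] data) {
> >     int[] counts = new int[TAGS.length];
> >     int skip = 0;
> >     long totalBits = (long) data.length * 8;
> >     for (long bit = 0; bit + 8 <= totalBits; bit++) {
> >       if (skip > 0) { skip--; continue; }
> >       int window = windowAt(data, bit);
> >       for (int t = 0; t < TAGS.length; t++) {
> >         if (window == TAGS[t]) {
> >           counts[t]++;
> >           skip = SKIP; // the skip_counter from the pseudocode above
> >           break;
> >         }
> >       }
> >     }
> >     return counts;
> >   }
> >
> >   /** Reads the 8 bits starting at the given bit offset (MSB first). */
> >   private static int windowAt(byte[] data, long bitOffset) {
> >     int window = 0;
> >     for (int i = 0; i < 8; i++) {
> >       long b = bitOffset + i;
> >       int bit = (data[(int) (b / 8)] >> (7 - (int) (b % 8))) & 1;
> >       window = (window << 1) | bit;
> >     }
> >     return window;
> >   }
> >
> >   /** KEY as a linear combination of the SUM_TABLE_ENTRIES [#1]. */
> >   public static int key(int[] sumTable) {
> >     int key = 0;
> >     for (int t = 0; t < sumTable.length; t++) {
> >       key += (t + 1) * sumTable[t]; // illustrative weights 1..16
> >     }
> >     return key;
> >   }
> >
> >   public static void main(String[] args) throws IOException {
> >     byte[] data = Files.readAllBytes(Paths.get(args[0]));
> >     System.out.println(args[0] + " -> key " + key(sumTableEntries(data)));
> >   }
> > }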
> >
> >
> > 2.2 Finding similarities
> > ----------------------------
> > for each FILE
> >   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
> >   for each SIMILAR_FILE
> >     if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
> >       discard the pair
> >     else
> >       compute the distance between the SUM_TABLE_ENTRIES of SIMILAR_FILE
> >       and FILE (i.e. the sum of the absolute values of the differences
> >       between their entries);
> >       if the computed distance is 0, the files are considered identical;
> >       if it is within a tolerance, they are considered similar.
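> >
> > A minimal sketch of this second-level filter; the tolerance and the
> > file-size cutoff below are illustrative values I picked, not part of the
> > algorithm:
> >
> > public class SimilarityFilter {
> >
> >   private static final int TOLERANCE = 10;      // max sum-table distance
> >   private static final double SIZE_RATIO = 2.0; // max file-size ratio
> >
> >   /** Sum of absolute differences between two sum tables. */
> >   public static int distance(int[] a, int[] b) {
> >     int d = 0;
> >     for (int i = 0; i < a.length; i++) {
> >       d += Math.abs(a[i] - b[i]);
> >     }
> >     return d;
> >   }
> >
> >   /** The second filter, applied after the key comparison. */
> >   public static boolean similar(long sizeA, long sizeB,
> >                                 int[] sumsA, int[] sumsB) {
> >     // Discard the pair outright if the file sizes differ a lot.
> >     double ratio = (double) Math.max(sizeA, sizeB)
> >         / Math.max(1, Math.min(sizeA, sizeB));
> >     if (ratio > SIZE_RATIO) {
> >       return false;
> >     }
> >     return distance(sumsA, sumsB) <= TOLERANCE;
> >   }
> > }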
> >
> >
> >
> > 3. Distributing simhash on Hadoop
> > -------------------------------------
> > The algorithm above is well suited to a MapReduce implementation.
> >
> > 1. Map
> > In this phase we compute the simhash key of a file.
> > It outputs (simhash_key, file), keyed on the hash so that the reduce
> > phase can group by it.
> >
> > 2. Reduce
> > Once every item has a simhash, group every file with the same simhash
> > into a cluster, sorted by the simhash key (the key is an integer [#1]).
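> >
> > A rough sketch of these two phases, reusing the hypothetical
> > SimHashSketch class from section 2.1 and assuming an input format (not
> > shown) that delivers each file as one (file name, contents) record:
> >
> > import java.io.IOException;
> > import java.util.Arrays;
> >
> > import org.apache.hadoop.io.BytesWritable;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapreduce.Mapper;
> > import org.apache.hadoop.mapreduce.Reducer;
> >
> > public class SimHashJob {
> >
> >   /** Map: compute the simhash key of a file, emit (key, file name). */
> >   public static class SimHashMapper
> >       extends Mapper<Text, BytesWritable, IntWritable, Text> {
> >     @Override
> >     protected void map(Text fileName, BytesWritable contents, Context ctx)
> >         throws IOException, InterruptedException {
> >       byte[] data =
> >           Arrays.copyOf(contents.getBytes(), contents.getLength());
> >       int key = SimHashSketch.key(SimHashSketch.sumTableEntries(data));
> >       ctx.write(new IntWritable(key), fileName);
> >     }
> >   }
> >
> >   /** Reduce: files with the same simhash key arrive grouped, and the
> >    *  integer keys arrive sorted by the framework; emit each group as
> >    *  one candidate cluster. */
> >   public static class ClusterReducer
> >       extends Reducer<IntWritable, Text, IntWritable, Text> {
> >     @Override
> >     protected void reduce(IntWritable key, Iterable<Text> files,
> >         Context ctx) throws IOException, InterruptedException {
> >       StringBuilder cluster = new StringBuilder();
> >       for (Text f : files) {
> >         if (cluster.length() > 0) {
> >           cluster.append(',');
> >         }
> >         cluster.append(f.toString());
> >       }
> >       ctx.write(key, new Text(cluster.toString()));
> >     }
> >   }
> > }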
> >
> > QUESTIONS SO FAR:
> > 1. For the distributed version, I believe there should be a preprocessing
> > phase where data about the files is stored on HDFS, in a proper vector
> > format? Would it be better to store this in HBase, Cassandra or something
> > similar?
> > 2. Would it be better in the reduce phase to actually eliminate the
> > duplicates and create a single sorted-by-hash file?
> >
> >
> > 4. Implementation schedule
> > ------------------------------
> >
> > *** May 24 - Jul 12:
> > * Week 1: write documentation for the algorithm on the wiki and start
> > planning how the objects would interact (classes, methods etc.).
> > * Weeks 2-4: write the algorithm for hashing the files.
> > * Week 5: test thoroughly.
> > * Week 6: document the MapReduce implementation (2 days); organize the
> > data structures (3 days).
> > * Weeks 7-9: write the distributed MapReduce implementation of the
> > algorithm, along with tests. Test as much as possible on different
> > datasets.
> >
> > *** Jul 16 - Aug 9:
> > * Weeks 10-13: continue testing the distributed implementation on
> > different datasets. If time permits, I would also like to make graphs and
> > publish results from the tests carried out.
> >
> > I'm very open to suggestions regarding this plan, so any observations are
> > highly appreciated.
> >
> > Thank you again for taking time to read this.
> >
> > Regards,
> > Cristi.
> >
> >
> > [1] Caitlin Sadowski and Greg Levin. SimHash: Hash-based Similarity
> > Detection. December 13, 2007.
> > On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:
> >
> > > Minhash clustering is important for duplicate detection.  You can also
> > > do simhash clustering
> > > (http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf),
> > > which might be simpler to implement.  I can imagine an implementation
> > > where the map generates the simhash and emits multiple copies keyed on
> > > part of the hash.  Near duplicates would thus be grouped together in the
> > > reduce, where an in-memory duplicate detection could be done.
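> > >
> > > For illustration, a rough sketch of that multiple-emit idea; the 64-bit
> > > hash width, the 16-bit bands, and the (document id, simhash) record
> > > layout are arbitrary assumptions, not something the paper prescribes:
> > >
> > > import java.io.IOException;
> > >
> > > import org.apache.hadoop.io.LongWritable;
> > > import org.apache.hadoop.io.Text;
> > > import org.apache.hadoop.mapreduce.Mapper;
> > >
> > > // Assumes each input record is (document id, 64-bit simhash).
> > > public class BandedSimHashMapper
> > >     extends Mapper<Text, LongWritable, LongWritable, Text> {
> > >   @Override
> > >   protected void map(Text doc, LongWritable simhash, Context ctx)
> > >       throws IOException, InterruptedException {
> > >     long hash = simhash.get();
> > >     // Emit one copy per 16-bit band; near duplicates that agree on
> > >     // any band meet in the same reduce group, where an in-memory
> > >     // duplicate check can compare full hashes.
> > >     for (int band = 0; band < 4; band++) {
> > >       long bits = (hash >>> (band * 16)) & 0xFFFFL;
> > >       // Prefix the band number so bands never collide across groups.
> > >       ctx.write(new LongWritable(((long) band << 16) | bits), doc);
> > >     }
> > >   }
> > > }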
> > >
> > > Sean's intuitions about project size are generally better than mine,
> > > but if you are strict about not increasing scope, I think you could
> > > succeed with this project.
> > >
> > > On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan
> > > <prodan.crist...@gmail.com> wrote:
> > >
> > >> IDEA 1 - MinHash clustering
> > >> ---------------------------
> > >> The first idea came after taking a look at Google's paper on
> > >> collaborative filtering for their news system [2]. In that paper, I
> > >> looked at MinHash clustering.
> > >> My first question is: is MinHash clustering considered cool? If yes,
> > >> then I would like to take a stab at implementing it.
> > >> The paper also describes the implementation in a MapReduce style. Since
> > >> this is only a suggestion, I will not elaborate very much on the
> > >> solution now. I would like to ask you whether this might be considered
> > >> a good choice (i.e. important for the framework to have something like
> > >> this) and whether this is a big enough project.
> > >>
> >
> >
>
