Thanks Robin, I will try to have a look at that.

Cristi.
On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil <robin.a...@gmail.com> wrote:

> Why don't you try it on 20 newsgroups? There are about 17-18 unique
> topics and a couple of overlapping ones. You can easily find issues
> with the clustering code on that dataset. Once it's done you can try
> bigger datasets like Wikipedia.
>
> Robin
>
> On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan
> <prodan.crist...@gmail.com> wrote:
>
> > Hi,
> >
> > Can anyone please point me to a good data set on which I might try
> > SimHash clustering?
> > Thank you,
> >
> > Cristi
> >
> > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan
> > <prodan.crist...@gmail.com> wrote:
> >
> > > Hello again,
> > >
> > > First of all, thank you all for taking the time to answer my
> > > ideas. Based on your thoughts, I have been digging around, and the
> > > project I would very much like to propose is a MapReduce
> > > implementation of the simhash algorithm. Thank you Ted for the
> > > great idea! I'm keen on starting to implement a new algorithm for
> > > Mahout, one which seems manageable and doable during the summer
> > > period and which has a clear scope. I have noticed quite a few
> > > comments advising against taking on a project that would be too
> > > big, and I intend to follow that advice.
> > >
> > > After taking a deep look at the article pointed out by Ted, as
> > > well as other articles and similar ideas (Charikar's algorithm,
> > > Ankur's patch for MinHash), I've decided to write some notes on
> > > the algorithm and a schedule for implementing the idea during the
> > > competition. I kindly ask for your feedback on this proposal.
> > >
> > >
> > > 1. Problem and outline of simhash
> > > ---------------------------------
> > >
> > > Detecting similar files and classifying documents requires complex
> > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash
> > > algorithm [1] relies on computing a hash function that hashes
> > > similar files to similar values. File similarities are then
> > > determined by comparing the pre-sorted hash key values (reducing
> > > the complexity to O(n log n)). To further improve the detection
> > > mechanism, the algorithm also stores some auxiliary data used to
> > > compute the hash keys. This serves as a second filter, applied
> > > after a hash key comparison indicates that two files are
> > > potentially similar.
> > >
> > > Properties of the similarity function and the algorithm:
> > > - very similar files map to very similar, or even the same, hash
> > >   key;
> > > - the distance between keys should be some measure of the
> > >   difference between files. On its own this would lead to keys
> > >   proportional to file sizes and would create false positives, so
> > >   the auxiliary data provides an easy and efficient way of
> > >   refining the similarity detection;
> > > - the metric used is a binary metric (simhash operates at byte
> > >   level);
> > > - given a similarity metric, there needs to be a threshold
> > >   determining how close within the metric files need to be to
> > >   count as similar.
> > >
> > > I would like my implementation of the simhash algorithm to answer
> > > two questions:
> > > - retrieving the set of files similar to a given file;
> > > - retrieving all pairs of similar files.
> > >
> > >
> > > 2. simhash implementation
> > > -------------------------
> > > Choose 16 8-bit tags - the bytes used for pattern matching. The
> > > idea is to count the occurrences of these tags within the file
> > > under processing.
> > >
> > >
> > > 2.1 Computing the hash function for each file
> > > ---------------------------------------------
> > > for each directory
> > >   for each file within directory
> > >     - scan through the file and look for matches for each tag;
> > >     - the detection window is moved one bit at a time;
> > >     - if a match is found, set a skip_counter and then decrement
> > >       it for the following bits;
> > >     - a count of matches is kept for each tag, and these counts
> > >       are stored as SUM_TABLE_ENTRIES;
> > >     - the KEY for the current file is computed as a function of
> > >       the SUM_TABLE_ENTRIES (a linear combination of the sums)
> > >       [#1];
> > >     - after this processing, we store:
> > >       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
> > >
> > > The key function could also be implemented to take file extensions
> > > into account.
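Below is a rough Java sketch of the per-file step in 2.1, just to make
the idea concrete. The tag values, the skip distance and the key
coefficients are illustrative guesses, not taken from the proposal or
from [1]:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SimHashSketch {

      // 16 arbitrary 8-bit tags; a real implementation would choose
      // these deliberately (or randomly).
      private static final int[] TAGS = {
          0x35, 0x4A, 0x5F, 0x63, 0x77, 0x8B, 0x92, 0xA6,
          0xB1, 0xC4, 0xD8, 0xE2, 0xF0, 0x1D, 0x2E, 0x09
      };

      // Assumed: after a match, skip this many bit positions.
      private static final int SKIP_BITS = 8;

      /** SUM_TABLE_ENTRIES: per-tag match counts, window moved one bit at a time. */
      static int[] sumTableEntries(byte[] data) {
        int[] counts = new int[TAGS.length];
        int skip = 0;
        for (long bit = 0; bit + 8 <= (long) data.length * 8; bit++) {
          if (skip > 0) { skip--; continue; }
          int window = windowAt(data, bit);
          for (int t = 0; t < TAGS.length; t++) {
            if (window == TAGS[t]) {
              counts[t]++;
              skip = SKIP_BITS;  // the skip_counter from the pseudocode
              break;
            }
          }
        }
        return counts;
      }

      // The 8 bits starting at bitOffset, as an int in [0, 255].
      private static int windowAt(byte[] data, long bitOffset) {
        int byteIdx = (int) (bitOffset / 8);
        int shift = (int) (bitOffset % 8);
        int hi = data[byteIdx] & 0xFF;
        int lo = (byteIdx + 1 < data.length) ? data[byteIdx + 1] & 0xFF : 0;
        return ((hi << shift) | (lo >>> (8 - shift))) & 0xFF;
      }

      /** KEY: an (illustrative) linear combination of the sum table. */
      static long key(int[] sumTable) {
        long key = 0;
        for (int t = 0; t < sumTable.length; t++) {
          key += (t + 1L) * sumTable[t];
        }
        return key;
      }

      public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(args[0] + " -> key=" + key(sumTableEntries(data)));
      }
    }

Sorting files by this key is what brings the candidate search down to
O(n log n), as described in section 1.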
> > > 2.2 Finding similarities
> > > ------------------------
> > > for each FILE
> > >   SIMILAR_FILES = files with keys within a tolerance range of
> > >                   KEY(FILE);
> > >   for each SIMILAR_FILE
> > >     if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
> > >       discard the candidate
> > >     else
> > >       compute the distance between SIMILAR_FILE's and FILE's
> > >       SUM_TABLE_ENTRIES (i.e. the sum of the absolute values of
> > >       the differences between their entries)
> > >       if the computed distance is within a tolerance or is equal
> > >       to 0 then
> > >         "the files are considered similar (identical if the
> > >         distance is 0)"
> > >
> > >
> > > 3. Distributing simhash on Hadoop
> > > ---------------------------------
> > > The algorithm above is well suited to a MapReduce implementation.
> > >
> > > 1. Map
> > > In this phase we compute the hash for a file.
> > > It outputs (File, simhash_key).
> > >
> > > 2. Reduce
> > > Once every item has a simhash, group every file with the same
> > > simhash into a cluster, sorted by the simhash key (the key is an
> > > integer [#1]).
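To make the MapReduce phase concrete, here is a skeletal and purely
hypothetical Hadoop job. Note that it emits (simhash_key, file) rather
than (File, simhash_key), so that the framework itself groups files
with equal keys into a single reduce call:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SimHashJob {

      /** Map: input lines are file paths; output is (simhash key, path). */
      public static class SimHashMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text filePath, Context ctx)
            throws IOException, InterruptedException {
          byte[] data = readBytes(filePath.toString());
          // Reuses the SimHashSketch helpers sketched earlier.
          long key = SimHashSketch.key(SimHashSketch.sumTableEntries(data));
          ctx.write(new LongWritable(key), filePath);
        }
      }

      /** Reduce: files sharing a simhash key arrive together as one cluster. */
      public static class ClusterReducer
          extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> files,
                              Context ctx)
            throws IOException, InterruptedException {
          StringBuilder cluster = new StringBuilder();
          for (Text f : files) {
            cluster.append(f).append(' ');
          }
          // The second filter from 2.2 (SUM_TABLE_ENTRIES distance)
          // could be applied in memory to the cluster members here.
          ctx.write(key, new Text(cluster.toString().trim()));
        }
      }

      // Illustrative only: reads a local file; a real job would read
      // from HDFS via org.apache.hadoop.fs.FileSystem.
      static byte[] readBytes(String path) throws IOException {
        return java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path));
      }
    }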
> > > QUESTIONS SO FAR:
> > > 1. For the distributed version, I believe there should be a
> > > preprocessing phase in which data about the files is stored on
> > > HDFS in a proper vector format? Would it be better to store this
> > > in HBase, Cassandra or something similar?
> > > 2. Would it be better in the reduce phase to actually eliminate
> > > the duplicates and create a single sorted-by-hash file?
> > >
> > >
> > > 4. Implementation schedule
> > > --------------------------
> > >
> > > *** May 24 - Jul 12:
> > > * Week 1: write documentation for the algorithm on the wiki and
> > >   start planning how the objects would interact (classes, methods
> > >   etc.).
> > > * Weeks 2,3,4: write the algorithm for hashing the files.
> > > * Week 5: test thoroughly.
> > > * Week 6: document the MapReduce implementation - 2 days; organize
> > >   the data structures - 3 days.
> > > * Weeks 7,8,9: write the distributed MapReduce implementation of
> > >   the algorithm along with tests. Test as much as possible on
> > >   different datasets.
> > >
> > > *** Jul 16 - Aug 9:
> > > * Weeks 10,11,12,13: continue testing the distributed
> > >   implementation on different datasets. Also, if time permits, I
> > >   would like to make graphs and publish the results of the tests
> > >   carried out.
> > >
> > > I'm very open to suggestions regarding this plan, so any
> > > observation is highly appreciated.
> > >
> > > Thank you again for taking the time to read this.
> > >
> > > Regards,
> > > Cristi.
> > >
> > >
> > > [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity
> > > Detection, December 13, 2007.
> > >
> > > On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:
> > >
> > > > Minhash clustering is important for duplicate detection. You can
> > > > also do simhash clustering
> > > > <http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf>,
> > > > which might be simpler to implement. I can imagine an
> > > > implementation where the map generates the simhash and emits
> > > > multiple copies keyed on part of the hash. Near duplicates would
> > > > thus be grouped together in the reduce, where an in-memory
> > > > duplicate detection could be done.
> > > >
> > > > Sean's intuitions about project size are generally better than
> > > > mine, but if you are strict about not increasing scope, I think
> > > > you could succeed with this project.
> > > >
> > > > On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan
> > > > <prodan.crist...@gmail.com> wrote:
> > > >
> > > >> IDEA 1 - MinHash clustering
> > > >> ---------------------------
> > > >> The first idea came after taking a look at Google's paper on
> > > >> collaborative filtering for their news system [2]. In that
> > > >> paper, I looked at MinHash clustering.
> > > >> My first question is: is MinHash clustering considered cool? If
> > > >> yes, then I would like to take a stab at implementing it.
> > > >> The paper also describes the implementation in a MapReduce
> > > >> style. Since this is only a suggestion, I will not elaborate
> > > >> very much on the solution now. I would like to ask you whether
> > > >> this might be considered a good choice (i.e. important for the
> > > >> framework to have something like this) and whether this is a
> > > >> big enough project.
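For comparison with the simhash sketches above, the MinHash signature
behind IDEA 1 can also be sketched in a few lines of Java. The hash
mixing below is a toy illustration, not the scheme used in the Google
News paper [2]:

    import java.util.Random;
    import java.util.Set;

    public class MinHashSketch {

      /**
       * For each of numHashes random hash functions, keep the minimum
       * hashed value over the item set. Sets that share many items
       * agree on many signature slots, so the signature (or a band of
       * it) can serve as a cluster key.
       */
      static long[] signature(Set<Integer> items, int numHashes, long seed) {
        Random rnd = new Random(seed);
        long[] sig = new long[numHashes];
        for (int h = 0; h < numHashes; h++) {
          long a = rnd.nextLong() | 1L;  // odd multiplier
          long b = rnd.nextLong();
          long min = Long.MAX_VALUE;
          for (int item : items) {
            long v = a * item + b;       // simple illustrative mixing
            if (v < min) {
              min = v;
            }
          }
          sig[h] = min;
        }
        return sig;
      }
    }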