Re: My ideas for GSoC 2010

Tanya Gupta Fri, 02 Apr 2010 02:15:31 -0700

Hi

There are couple of questions I would like to ask.


1. What type of clustering would you like me to use? Is K-Means good enough?

2.Can you tell me more about the map reduce code that you would have
written. Or first do I need to implement that as well using Hadoop?

Thanking You
Tanya



On Thu, Apr 1, 2010 at 1:21 PM, Cristian Prodan
<prodan.crist...@gmail.com>wrote:

> Thanks Robin, I will try have a look at that.
>
> Cristi.
>
> On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil <robin.a...@gmail.com> wrote:
>
> > Why dont you try it on 20 newsgroups. There are about 17-18 unique topics
> > and couple of overlapping ones. You can easily find issues with the
> > clustering code with that dataset. Once its done you can try bigger
> > datasets
> > like wikipedia
> >
> > Robin
> >
> > On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan
> > <prodan.crist...@gmail.com>wrote:
> >
> > > Hi,
> > >
> > > Can anyone please point me a good data set on which I might try SimHash
> > > clustering ?
> > > Thank you,
> > >
> > > Cristi
> > >
> > > On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan
> > > <prodan.crist...@gmail.com>wrote:
> > >
> > > > Hello again,
> > > >
> > > > First of all, thank you all for taking time to answer my ideas. Based
> > on
> > > > your thoughts, I have been digging around, and  the project I would
> > very
> > > > much like to propose is a MapReduce implementation of the simhash
> > > algorithm.
> > > > Thank you Ted for the great idea! I'm keen on starting implementing a
> > new
> > > > algorithm for Mahout, which seems manageable and doable during the
> > summer
> > > > period and which has a clear scope. I have noticed quite a few
> > comments,
> > > > advising about not taking a project which would be to big and I
> intend
> > to
> > > > follow them.
> > > >
> > > > After taking a deep look at the article pointed out by Ted, as well
> as
> > > > other articles and simlar ideas (Charikar's algorithm, Ankur's patch
> > for
> > > > Minhash), I've decided to write some notes on the algorithm and a
> > > schedule
> > > > for implemeting the idea during the competition.  I kindly ask for
> your
> > > > feedback on this proposal.
> > > >
> > > >
> > > > 1. Problem and outline of simhash
> > > > ---------------------------------------
> > > >
> > > > Detecting similar files and cliassifying documents requires complex
> > > > heuristics and/or O(n^2) pair-wise computations [1]. The simhash
> > > algorithm
> > > > [1] relies on computing a hash function that hashes similar files to
> > > similar
> > > > values. The file similarities would then be determined by comparing
> the
> > > > pre-sorted hash key values (and reducing the complexity to O(n log
> n)).
> > > > In order to futher improve the detection mechanism, the algorithm
> will
> > > also
> > > > store some auxiliary data, used to compute the hash keys. This will
> be
> > > used
> > > > as a second filter, i.e. after the hash key comparison indicates that
> > two
> > > > files are potentiallly similar.
> > > >
> > > > Properties for the similarity function and the algorithm:
> > > > - very similar files map to very similar or even the same hash key;
> > > > - distance between keys should be some measure of the difference
> > between
> > > > files. This would lead to keys proportional to file sizes and this
> > would
> > > > create false positives. Some auxiliary data will provide an easy and
> > > > efficient way of refining the similarity detection.
> > > > - the metric used will be a binary metric (simhash operats at byte
> > > level).
> > > > - given a similarity metric, there needs to be a threshold to
> determine
> > > how
> > > > close within the metric files need to be to count as similar.
> > > >
> > > > I would like my implementation of the simhash algorithm to answer two
> > > > questions:
> > > > - retrieving a set of files, similar to a given file;
> > > > - retrieving all pairs of similar files.
> > > >
> > > >
> > > > 2. simhash implementation
> > > > ------------------------------
> > > > Chose 16, 8-bit tags - the bytes used for pattern matching. The idea
> is
> > > to
> > > > count the occurances of these tags within the file under processing.
> > > >
> > > >
> > > > 2.1 Computing the hash function for each file
> > > > -------------------------------------------------
> > > > for each directory
> > > >  for each file within directory
> > > >     - scan through the file and look for matches for each tag;
> > > >     - detection window is moved one bit at a time
> > > >     - if a match if found
> > > >        - set a skip_counter and then decrement it for the following
> > bits;
> > > >     - a count of matches is stored for each tag, and these are stored
> > as
> > > > SUM_TABLE_ENTRIES
> > > >     - the KEY for the current file is computed as a function of the
> > > > SUM_TABLE_ENTRIES (a linear combination of the sum) [#1]
> > > >     - after this processing, we store:
> > > >       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
> > > >
> > > > The key function could also be implemented to take into account file
> > > > extensions.
> > > >
> > > >
> > > > 2.2 Finding similarities
> > > > ----------------------------
> > > > for each FILE
> > > >   SIMILAR_FILES = files with keys within a tollerance range of
> > KEY(FILE);
> > > >   for each SIMILAR_FILE
> > > >        if file_size(SIMILAR_FILE) differes a lot from file_size(FILE)
> > > >              discard the query
> > > >         else
> > > >              compute the distance between SIMLAR_FILE and FILE ' s
> > > > SUM_TABLE_ENTRIES
> > > >              (i.e. "the sum of the absolute values of the diffs
> between
> > > > their entries)
> > > >              if the computed distance is within a tollerance or is
> > equal
> > > to
> > > > 0 then
> > > >                   "The files are identical"
> > > >
> > > >
> > > >
> > > > 3. Distributing simhash on hadoop
> > > > -------------------------------------
> > > > The algorithm above is very suited for a MapReduce implementation.
> > > >
> > > > 1. Map
> > > > In this phase we do the computation of the hash for a file.
> > > > It outputs (File, simhash_key).
> > > >
> > > > 2. Reduce
> > > > Once every item has a simhash, group everyfile with the same simhash
> > into
> > > a
> > > > cluster, sorted by the simhashkey (the key is an integer [#1]) .
> > > >
> > > > QUESTIONS SO FAR:
> > > > 1. For the distributed version, I believe there should be a
> > prepocessing
> > > > phase where data about files should be stored on HDFS, in a proper
> > vector
> > > > format ? Would it be better to store this in HBase, Cassandra or
> > > something
> > > > similar ?
> > > > 2. Would it be better in the reduce phase to actually eliminate the
> > > > duplicates and create a single sorted-by-hash file ?
> > > >
> > > >
> > > > 4. Implementation schedule
> > > > ------------------------------
> > > >
> > > > *** May 24 - Jul 12:
> > > > * Week 1: write documentation for the algorithm on the wiki and start
> > > > planning the objects would interact (classes, methods etc.). * Week
> > > 2,3,4:
> > > > write the algorithm for hashing of the files.
> > > > * Week 5: Thoroughly test
> > > > * Week 6: Document the MapReduce implementation - 2 days. Organize
> the
> > > data
> > > > structures 3 - days.
> > > > * Week 7,8,9: Write the distributed MapReduce Implementation of the
> > > > algorithm along with tests. Test as much as possible on different
> > > datasets.
> > > >
> > > > *** Jul 16 - Aug 9:
> > > > * Week 10, 11, 12, 13: Continue with tests on different datasets for
> > the
> > > > distributed implementation. Also if time permits, I would like to
> make
> > > > graphs and publish results with the carried tests.
> > > >
> > > > I'm very open to suggestions regarding this plan, so any observation
> is
> > > > highly appreciated.
> > > >
> > > > Thank you again for taking time to read this.
> > > >
> > > > Regards,
> > > > Cristi.
> > > >
> > > >
> > > > [1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity
> > > > Detection, December 13, 2007.
> > > > On Mar 19, 2010, at 7:47 PM, Ted Dunning wrote:
> > > >
> > > > > Minhash clustering is important for duplicate detection.  You can
> > also
> > > > > do simhash
> > > > > clustering<
> > > > http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
> >which
> > > > > might be simpler to implement.  I can imagine an implementation
> where
> > > > > the map generates the simhash and emits multiple copies keyed on
> part
> > > of
> > > > the
> > > > > hash.  Near duplicates would thus be grouped together in the reduce
> > > where
> > > > an
> > > > > in-memory duplicate detection could be done.
> > > > >
> > > > > Sean's intuitions about project size are generally better than
> mine,
> > > but
> > > > if
> > > > > you limit you are strict about not increasing scope, I think you
> > could
> > > > > succeed with this project.
> > > > >
> > > > > On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan <
> > > > prodan.crist...@gmail.com>wrote:
> > > > >
> > > > >> IDEA 1 - MinHash clustering
> > > > >> ---------------------------
> > > > >> The first idea come after taking a look at Google 's paper on
> > > > collaborative
> > > > >> filtering for their news system[2]. In that paper, I looked at
> > MinHash
> > > > >> clustering.
> > > > >> My first question is: is MinHash clustering considered cool ? If
> > yes,
> > > > than
> > > > >> I would like to take a stab at implementing it.
> > > > >> The paper also describes the implementation in a MapReduce style.
> > > Since
> > > > >> this is only a suggestion I will not elaborate very much on the
> > > solution
> > > > >> now. I would like to ask you weather this might be considered a
> good
> > > > choice
> > > > >> (i.e. important for the framework to have something like this) and
> > if
> > > > this
> > > > >> is a big enough project.
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: My ideas for GSoC 2010

Reply via email to