Re: My ideas for GSoC 2010

2010-04-02 Thread Tanya Gupta
Hi

There are a couple of questions I would like to ask.

1. What type of clustering would you like me to use? Is K-Means good enough?

2. Can you tell me more about the MapReduce code that you would have
written? Or do I first need to implement that as well, using Hadoop?

Thank you,
Tanya



Re: My ideas for GSoC 2010

2010-04-01 Thread Cristian Prodan
Thanks Robin, I will try to have a look at that.

Cristi.


Re: My ideas for GSoC 2010

2010-03-31 Thread Robin Anil
Why don't you try it on 20 Newsgroups? There are about 17-18 unique topics
and a couple of overlapping ones. You can easily find issues with the
clustering code with that dataset. Once that's done you can try bigger datasets
like Wikipedia.

Robin


Re: My ideas for GSoC 2010

2010-03-31 Thread Cristian Prodan
Hi,

Can anyone please point me to a good data set on which I might try SimHash
clustering?
Thank you,

Cristi


Re: My ideas for GSoC 2010

2010-03-23 Thread cristi prodan
Hello again,

First of all, thank you all for taking the time to answer my ideas. Based on
your thoughts, I have been digging around, and the project I would very much
like to propose is a MapReduce implementation of the simhash algorithm. Thank
you Ted for the great idea! I'm keen on implementing a new algorithm for
Mahout, which seems manageable and doable during the summer period and which
has a clear scope. I have noticed quite a few comments advising against taking
a project which would be too big, and I intend to follow them.

After taking a deep look at the article pointed out by Ted, as well as other
articles and similar ideas (Charikar's algorithm, Ankur's patch for MinHash),
I've decided to write some notes on the algorithm and a schedule for
implementing the idea during the competition. I kindly ask for your feedback on
this proposal.


1. Problem and outline of simhash
---

Detecting similar files and classifying documents requires complex heuristics
and/or O(n^2) pair-wise computations [1]. The simhash algorithm [1] relies on
computing a hash function that hashes similar files to similar values. The file
similarities are then determined by comparing the pre-sorted hash key values
(reducing the complexity to O(n log n)).
To further improve the detection mechanism, the algorithm will also store some
auxiliary data, used to compute the hash keys. This will be used as a second
filter, i.e. after the hash key comparison indicates that two files are
potentially similar.

Properties for the similarity function and the algorithm:
- very similar files map to very similar or even the same hash key;
- the distance between keys should be some measure of the difference between
files. This would lead to keys proportional to file sizes, and this would
create false positives; some auxiliary data will provide an easy and efficient
way of refining the similarity detection.
- the metric used will be a binary metric (simhash operates at byte level).
- given a similarity metric, there needs to be a threshold to determine how
close within the metric files need to be to count as similar.

I would like my implementation of the simhash algorithm to answer two questions:
- retrieving a set of files similar to a given file;
- retrieving all pairs of similar files.


2. simhash implementation
--
Choose 16 8-bit tags - the bytes used for pattern matching. The idea is to
count the occurrences of these tags within the file under processing.


2.1 Computing the hash function for each file
-
for each directory
  for each file within directory
    - scan through the file and look for matches for each tag;
    - the detection window is moved one bit at a time;
    - if a match is found
        - set a skip_counter and then decrement it for the following bits;
    - a count of matches is kept for each tag; these counts are stored as
      SUM_TABLE_ENTRIES;
    - the KEY for the current file is computed as a function of the
      SUM_TABLE_ENTRIES (a linear combination of the sums) [#1];
    - after this processing, we store:
      | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |

The key function could also be implemented to take into account file 
extensions. 
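
To make this step concrete, here is a minimal sketch in Java of how the sum
table and the key could be computed. The tag values, the skip length and the
linear-combination weights are placeholder choices for illustration only, not
values fixed by the paper:

import java.util.Arrays;

public class SimhashSketch {

  // 16 8-bit tags used for pattern matching (placeholder values).
  private static final int[] TAGS = {
      0x01, 0x17, 0x2B, 0x3D, 0x45, 0x58, 0x6E, 0x7F,
      0x83, 0x94, 0xA6, 0xB2, 0xC9, 0xD0, 0xE7, 0xFC
  };

  // Slide an 8-bit window over the file one bit at a time, counting tag matches.
  public static int[] computeSumTable(byte[] data, int skipBits) {
    int[] sumTable = new int[TAGS.length];
    int skip = 0;
    for (long bit = 0; bit + 8 <= 8L * data.length; bit++) {
      if (skip > 0) {
        skip--;
        continue;
      }
      int window = windowAt(data, bit);
      for (int t = 0; t < TAGS.length; t++) {
        if (window == TAGS[t]) {
          sumTable[t]++;
          skip = skipBits;  // the skip_counter from the pseudocode
          break;
        }
      }
    }
    return sumTable;
  }

  // Extract the 8 bits starting at the given bit offset (MSB-first).
  private static int windowAt(byte[] data, long bitOffset) {
    int byteIdx = (int) (bitOffset / 8);
    int shift = (int) (bitOffset % 8);
    int hi = data[byteIdx] & 0xFF;
    int lo = byteIdx + 1 < data.length ? data[byteIdx + 1] & 0xFF : 0;
    return ((hi << shift) | (lo >> (8 - shift))) & 0xFF;
  }

  // KEY = a linear combination of the sum table entries (weights are placeholders).
  public static long computeKey(int[] sumTable) {
    long key = 0;
    for (int t = 0; t < sumTable.length; t++) {
      key += (t + 1L) * sumTable[t];
    }
    return key;
  }

  public static void main(String[] args) {
    byte[] file = "a small buffer standing in for real file contents".getBytes();
    int[] sums = computeSumTable(file, 7);
    System.out.println(Arrays.toString(sums) + " -> key " + computeKey(sums));
  }
}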


2.2 Finding similarities

for each FILE
   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
   for each SIMILAR_FILE
      if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
         discard the query
      else
         compute the distance between SIMILAR_FILE's and FILE's
         SUM_TABLE_ENTRIES (i.e. the sum of the absolute values of the
         diffs between their entries)
         if the computed distance is within a tolerance or is equal to 0
         then
            "The files are identical"
 


3. Distributing simhash on hadoop
-
The algorithm above is well suited to a MapReduce implementation.

1. Map
In this phase we compute the hash for a file.
It outputs (File, simhash_key).

2. Reduce
Once every item has a simhash, group every file with the same simhash into a
cluster, sorted by the simhash key (the key is an integer [#1]).
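
A rough sketch of the two phases against Hadoop's MapReduce API, reusing the
hypothetical SimhashSketch helpers from 2.1; the (file name -> raw bytes) input
format is an assumption, and a real job would need a suitable InputFormat:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SimhashJob {

  // Map: for each (file name, raw contents) record, emit (simhash key, file name).
  public static class SimhashMapper
      extends Mapper<Text, BytesWritable, LongWritable, Text> {
    @Override
    protected void map(Text fileName, BytesWritable contents, Context ctx)
        throws IOException, InterruptedException {
      byte[] data = Arrays.copyOf(contents.getBytes(), contents.getLength());
      long key = SimhashSketch.computeKey(SimhashSketch.computeSumTable(data, 7));
      ctx.write(new LongWritable(key), fileName);
    }
  }

  // Reduce: all files sharing a simhash key arrive together; emit them as one cluster.
  public static class ClusterReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> files, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder cluster = new StringBuilder();
      for (Text f : files) {
        if (cluster.length() > 0) {
          cluster.append(',');
        }
        cluster.append(f.toString());
      }
      ctx.write(key, new Text(cluster.toString()));
    }
  }
}

Grouping by the exact key is the simplest variant; grouping keys within a
tolerance range would need a coarser grouping key or a custom partitioner.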

QUESTIONS SO FAR:
1. For the distributed version, I believe there should be a preprocessing phase
where data about the files is stored on HDFS, in a proper vector format?
Would it be better to store this in HBase, Cassandra or something similar?
2. Would it be better in the reduce phase to actually eliminate the duplicates
and create a single sorted-by-hash file?


4. Implementation schedule
--

*** May 24 - Jul 12:
* Week 1: write documentation for the algorithm on the wiki and start planning
how the objects would interact (classes, methods etc.).
* Week 2, 3, 4: write the algorithm for hashing

Re: My ideas for GSoC 2010

2010-03-22 Thread Ankur C. Goel
Since Sean already answered IDEA-2, I'll reply to IDEA 1.

Minhash (and Shingling in general) are very efficient clustering techniques
that have traditionally been employed by search engines for near-duplicate
detection of web documents. They are known to be efficient and effective at
web-scale. They have also been applied successfully to "near-neighbor or
similar ..." type problems in recommendations, search, and various other
domains on text, audio and image data. So I think they are quite "cool".

I personally have experimented with them in recommendations and found them to
be surprisingly more effective than I anticipated, especially when dealing with
high-dimensional data. I have a basic implementation that I submitted to
Mahout; see https://issues.apache.org/jira/browse/MAHOUT-344. You are welcome
to work on it if you'd like.

Improving and integrating it with our recommender would make for a good GSoC
project IMO.
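
For anyone who hasn't seen the technique, a bare-bones MinHash looks roughly
like this - an illustration only, not the MAHOUT-344 code, and it assumes items
are represented as sets of non-negative integer feature IDs:

import java.util.Random;

public class MinHashExample {

  private static final int PRIME = 2147483647;  // 2^31 - 1, a Mersenne prime

  private final int[] seedsA;
  private final int[] seedsB;

  public MinHashExample(int numHashes, long randomSeed) {
    Random rnd = new Random(randomSeed);
    seedsA = new int[numHashes];
    seedsB = new int[numHashes];
    for (int i = 0; i < numHashes; i++) {
      seedsA[i] = 1 + rnd.nextInt(PRIME - 1);  // a != 0
      seedsB[i] = rnd.nextInt(PRIME);
    }
  }

  // Signature: for each hash function, the minimum hash over the item's set.
  public int[] signature(int[] featureIds) {
    int[] sig = new int[seedsA.length];
    for (int i = 0; i < sig.length; i++) {
      long min = Long.MAX_VALUE;
      for (int f : featureIds) {
        long h = ((long) seedsA[i] * f + seedsB[i]) % PRIME;
        min = Math.min(min, h);
      }
      sig[i] = (int) min;
    }
    return sig;
  }

  // The fraction of agreeing positions estimates the Jaccard similarity.
  public static double estimateJaccard(int[] s1, int[] s2) {
    int agree = 0;
    for (int i = 0; i < s1.length; i++) {
      if (s1[i] == s2[i]) {
        agree++;
      }
    }
    return (double) agree / s1.length;
  }
}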

Regards
-...@nkur


Re: My ideas for GSoC 2010

2010-03-19 Thread Ted Dunning
Minhash clustering is important for duplicate detection.  You can also do
simhash clustering, which might be simpler to implement.  I can imagine an
implementation where the map generates the simhash and emits multiple copies
keyed on part of the hash.  Near duplicates would thus be grouped together in
the reduce where an in-memory duplicate detection could be done.
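
Roughly something like this for the map side, assuming a 64-bit simhash has
already been computed per file and splitting it into four 16-bit bands (the
band count is just an illustrative choice):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit one copy of the record per 16-bit band of the simhash.
public class BandedSimhashMapper
    extends Mapper<Text, LongWritable, LongWritable, Text> {

  private static final int BANDS = 4;  // illustrative choice for a 64-bit hash

  @Override
  protected void map(Text fileName, LongWritable simhash, Context ctx)
      throws IOException, InterruptedException {
    long hash = simhash.get();
    for (int b = 0; b < BANDS; b++) {
      long band = (hash >>> (16 * b)) & 0xFFFFL;
      // Key on (band index, band value) so different bands never collide.
      ctx.write(new LongWritable(((long) b << 16) | band), fileName);
    }
  }
}

Two files that agree on any one band land in the same reduce group, where the
full hashes can be compared in memory.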

Sean's intuitions about project size are generally better than mine, but if
you are strict about not increasing scope, I think you could succeed with this
project.



Re: My ideas for GSoC 2010

2010-03-19 Thread Sean Owen
I think that's a fine project indeed. It sounds even a little
ambitious for a GSoC project. Understanding, implementing, and
parallelizing this approach is not trivial. If you want to propose it,
sure, but scaling it back a little is probably OK too. As always it's
best to propose a simple project that you can complete, document and
test, rather than a big project you half-finish.



My ideas for GSoC 2010

2010-03-19 Thread cristi prodan
Dear Mahout community,

My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student
pursuing an MSc degree in Computer Science.
I started studying machine learning in the past year, and during my research I
found out about the MapReduce model. Then I discovered Hadoop and Mahout. I was
very impressed by the power of these frameworks and their great potential. For
this reason I would like to submit a proposal for this year's Google Summer of
Code competition.

I have looked at the proposals made by Robin on JIRA
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021).
I have stopped at two ideas. I would like to ask for your help in deciding
which idea would be best to pick. Since I've never done GSoC before, I'm hoping
someone would advise on the size of the project (too small or too big for the
summer period) and, mostly, its importance for the Mahout framework. After
hearing your answers my intention is to fully focus on thorough research of a
single idea.


IDEA 1 - MinHash clustering
---
The first idea came after taking a look at Google's paper on collaborative
filtering for their news system [2]. In that paper, I looked at MinHash
clustering.
My first question is: is MinHash clustering considered cool? If yes, then I
would like to take a stab at implementing it.
The paper also describes the implementation in a MapReduce style. Since this is
only a suggestion I will not elaborate very much on the solution now. I would
like to ask you whether this might be considered a good choice (i.e. important
for the framework to have something like this) and if this is a big enough
project.

IDEA 2 - Additions to Taste Recommender
---
My second idea for this competition was to add some capabilities to the Taste
framework. I have reviewed a couple of papers from the Netflix contest winning
teams, read chapters 1 through 6 of [1] and looked into Taste's code.
My idea was to implement parallel prediction-blending support using linear
regression or some other machine learning method - but so far I haven't got to
a point where I would have a clear solution for this. I'm preparing my
dissertation paper on recommender systems and this was the first idea I got
when thinking about participating in GSoC. If you have any ideas on this and
want to share them, I would be very thankful.

Thank you in advance.

Best regards,
Cristi. 

BIBLIOGRAPHY:
---
[1] Owen, Anil - Mahout in Action. Manning, 2010. 

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
Personalization: Scalable Online Collaborative Filtering, WWW 2007.