GSOC 2010

2010-03-31 Thread Tanya Gupta
Hi

I would like a detailed project description for MAHOUT-328.

Thanking You
Tanya Gupta


Re: GSOC 2010

2010-03-31 Thread Robin Anil
Hi Tanya,
 MAHOUT-328 is just a general stub. There is no detailed project
description other than what is given there. The idea is that you propose
to implement a clustering algorithm in Mahout. Start here:
http://cwiki.apache.org/MAHOUT/gsoc.html. Browse through the wiki. Look at
what Mahout has at the moment: http://cwiki.apache.org/MAHOUT/algorithms.html.
There are a couple of algorithms missing from Mahout, like min-hash or
hierarchical clustering, or even a generic EM framework. I would suggest
you read carefully through the discussions on the mailing list using the
archives, then zero in on the algorithm you want to implement and propose
to implement it.

Robin


On Wed, Mar 31, 2010 at 10:27 PM, Tanya Gupta  wrote:

> Hi
>
> I would like a detailed project description for MAHOUT-328.
>
> Thanking You
> Tanya Gupta
>


[GSOC] 2010 Timelines

2010-04-03 Thread Grant Ingersoll
http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline

GSOC 2010 is here

2010-01-26 Thread Robin Anil
Greetings! Fellow GSOC alums, administrators and dear mentors, the next
edition is right here. Details are given in the link below.

https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f

Maybe we could identify key areas in Mahout that we need to develop, apart
from the ML implementations, and list them for students to see before they
start trickling in.

Some ideas:
Benchmarking Framework with EC2 wrappers
Commandline Console+Launcher like HBase and Hadoop
Online Tool/Query UI for Algorithms in Mahout (like CF)


Possible ideas (I have no idea what I am talking about here, but there are
nice problems to solve):
Improvements in Math?
How to tackle management of datasets?
Error Recovery if a job fails?


Robin


Application for GSOC 2010

2010-03-31 Thread Tanya Gupta
Hi

I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply?

Thanking You
Tanya


Re: [GSOC] 2010 Timelines

2010-04-06 Thread Robin Anil
2 days to go till the close of student submissions. A request to mentors to
provide feedback on all the queries on the list so that students can go and
work on tuning their proposals.

Robin

On Sat, Apr 3, 2010 at 10:50 PM, Grant Ingersoll wrote:

>
> http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline


Re: [GSOC] 2010 Timelines

2010-04-09 Thread Isabel Drost

Timeline including Apache internal deadlines:

http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC

Mentors, please also follow the link to the ranking explanation [1] 
for more information on how to rank student proposals.

Isabel

[1] 
http://cwiki.apache.org/confluence/display/COMDEVxSITE/Mentee+Ranking+Process




Re: GSOC 2010 is here

2010-02-01 Thread Isabel Drost
On Wed Robin Anil  wrote:
> Greetings! Fellow GSOC alums, administrators and dear mentors, the
> next edition is right here. Details are given in the link below.
> 
> https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f

Some additional notes to committers: 

First of all mentoring a GSoC student is a great experience, so if
you do have some cycles left, I would highly recommend participating in
GSoC as a mentor (thanks Grant for convincing me last year...).

We had several successful students here at Mahout in past GSoC years.
Each year there were strong proposals for projects within Mahout. As a
result, projects usually turn out to be interesting for both mentor
and student.

One final note: If there is anyone on this list who might be interested
in helping with general ASF GSoC logistics and administration tasks,
please have a look at the newly founded community development project
(d...@community.apache.org)

 
> Maybe we could identify key areas in Mahout which we need to develop
> apart from the ML implementations and list it down for students to
> see before they start trickling in.

And motivate students to come up with their own ideas and discuss them
on-list before submitting their proposals.


> Some ideas:
> Benchmarking Framework with EC2 wrappers

+1 I would love to see that.


> Commandline Console+Launcher like HBase and Hadoop

+1


> Online Tool/Query UI for Algorithms in Mahout (like CF)
> 
> 
> Possible ideas (I have no idea what I am talking about here, but there
> are nice problems to solve):
> Improvements in Math?
> How to tackle management of datasets?
> Error Recovery if a job fails?

How to tackle management of learned classification models?

Better tooling for Mahout integration? (Lucene for tokenization and
analysers? data import and export?)



Isabel


Re: GSOC 2010 is here

2010-02-01 Thread Robin Anil
Some more Wild and Wacky Ideas. They might be out of scope for GSOC, but
they are nice-to-have features for Mahout. I would like to encourage all of
you to put down your ideas here.

1. Data Visualization tool backed by HDFS/HBase for inspecting clusters,
topic models, etc.
  - It could have many map/reduce jobs which transform the clustering
output, aggregate things and produce interesting stats or visualizations of
the data
2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks are
taking in GSOC students)



Robin




On Mon, Feb 1, 2010 at 6:17 PM, Isabel Drost  wrote:

> Some additional notes to committers:
> [...]


Re: GSOC 2010 is here

2010-02-02 Thread Isabel Drost
On Mon Robin Anil  wrote:
> 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks
> are taking in GSOC students)

I guess one could easily split this one in two:

a) Using UIMA (whole pipeline or just the analysers if that is possible)
for data pre-processing before Mahout algorithms are run.

b) Making it easy to integrate Mahout algorithms (classification models
etc.) as UIMA annotators.

Isabel


Have Mahout applied GSOC 2010?

2010-03-08 Thread zhao zhendong
Hi

Robin told me Mahout is going to apply to GSOC 2010 as a mentoring
organization. Can anybody confirm? I would really appreciate this chance.

Thanks,

-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<


My ideas for GSoC 2010

2010-03-19 Thread cristi prodan
Dear Mahout community,

My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student 
pursuing an MSc degree in Computer Science. 
I started studying machine learning in the past year, and during my research 
I found out about the MapReduce model. Then I discovered Hadoop and Mahout. I 
was very impressed by the power of these frameworks and their great 
potential. For this reason I would like to submit a proposal for this year's 
Google Summer of Code competition. 

I have looked at the proposals made by Robin on JIRA 
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021).
 I have stopped at two ideas. I would like to ask for your help in deciding 
which idea would be best to pick. Since I've never done GSoC before, I'm 
hoping someone can advise on the size of each project (too small or too big 
for the summer period) and, mostly, its importance for the Mahout framework. 
After hearing your answers my intention is to fully focus on thorough 
research of a single idea.


IDEA 1 - MinHash clustering
---
The first idea came after taking a look at Google's paper on collaborative 
filtering for their news system [2]. In that paper, I looked at MinHash 
clustering. 
My first question is: is MinHash clustering considered cool? If yes, then I 
would like to take a stab at implementing it. 
The paper also describes the implementation in a MapReduce style. Since this 
is only a suggestion I will not elaborate very much on the solution now. I 
would like to ask you whether this might be considered a good choice (i.e. 
important for the framework to have something like this) and whether this is 
a big enough project.  

IDEA 2 - Additions to Taste Recommender
---
My second idea for this competition was to add some capabilities to the 
Taste framework. I have reviewed a couple of papers from the Netflix contest 
winning teams, read chapters 1 through 6 of [1] and looked into Taste's code. 
My idea was to implement parallel prediction blending support using linear 
regression or another machine learning method - but so far I haven't got to 
a point where I have a clear solution for this. I'm preparing my dissertation 
paper on recommender systems and this was the first idea I got when thinking 
about participating in GSoC. If you have any ideas on this and want to share 
them, I would be very thankful.
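
To be a bit more concrete about what I mean by blending, here is a toy
least-squares sketch (all names are illustrative, and it is not tied to
Taste's actual API): given two recommenders' predictions p1, p2 and the
true ratings r, fit weights w1, w2 minimizing sum (r - w1*p1 - w2*p2)^2.

public class TwoWayBlend {

  /**
   * Fit blend weights for two predictors by solving the 2x2 normal
   * equations of the least-squares problem min sum (r - w1*p1 - w2*p2)^2.
   * p1 and p2 are the two recommenders' predictions, r the true ratings.
   */
  public static double[] fit(double[] p1, double[] p2, double[] r) {
    double a11 = 0, a12 = 0, a22 = 0, b1 = 0, b2 = 0;
    for (int i = 0; i < r.length; i++) {
      a11 += p1[i] * p1[i];
      a12 += p1[i] * p2[i];
      a22 += p2[i] * p2[i];
      b1 += p1[i] * r[i];
      b2 += p2[i] * r[i];
    }
    double det = a11 * a22 - a12 * a12; // assumes predictors are not collinear
    double w1 = (a22 * b1 - a12 * b2) / det;
    double w2 = (a11 * b2 - a12 * b1) / det;
    return new double[] {w1, w2};
  }

  /** Blended prediction for a new (p1, p2) pair. */
  public static double predict(double[] w, double p1, double p2) {
    return w[0] * p1 + w[1] * p2;
  }
}

The parallel part would be accumulating the five sums above in a map/reduce
pass over the training data; blending more than two predictors just means
solving larger normal equations.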

Thank you in advance.

Best regards,
Cristi. 

BIBLIOGRAPHY:
---
[1] Owen, Anil - Mahout in Action. Manning, 2010. 

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News 
Personalization: Scalable Online Collaborative Filtering, WWW 2007.



Re: Application for GSOC 2010

2010-03-31 Thread Ted Dunning
File a JIRA issue with a detailed proposal of your project.  The community
will help work out details for your proposal and it will eventually be rated
and possibly selected.

Make sure you follow the guidelines for a proposal.  In particular, describe
the problem and your proposed work clearly, be reasonable about your
schedule and the scope of your project, make sure that there are good ways
to evaluate whether you have succeeded, and put in good milestones.

Keep in mind also that many good GSOC projects don't get selected for
support.  Mahout has had a strong policy in the past of helping students who
want to complete their project even without Google recognition.  I expect
that the Mahout community would be even more supportive of such independent
projects this year because we have grown and become more active.

On Wed, Mar 31, 2010 at 5:27 AM, Tanya Gupta  wrote:

> Hi
>
> I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply?
>
> Thanking You
> Tanya
>


Re: Application for GSOC 2010

2010-03-31 Thread Grant Ingersoll

On Mar 31, 2010, at 1:52 PM, Ted Dunning wrote:

> File a JIRA issue with a detailed proposal of your project.  The community
> will help work out details for your proposal and it will eventually be rated
> and possibly selected.

Note: you also need to put your issue into the GSOC application.  I believe it 
is now open.  JIRA is just an organizational tool for the ASF to interact with 
students and mentors.

> [...]




Mahout GSoC 2010: Association Mining

2010-04-09 Thread Neal Clark
Hello,

I just wanted to introduce myself. I am an MSc. Computer Science
student at the University of Victoria. My research over the past year
has been focused on developing and implementing an Apriori-based
frequent item-set mining algorithm for mining large data sets at low
support counts.

https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en

The main finding of the above report is that support levels as low as
0.001% on the webdocs (1.4GB) dataset can be efficiently calculated.
On a 100-core cluster all frequent k2 pairs can be calculated in
approximately 6 minutes.

I currently have an optimized k2 Hadoop implementation and algorithm
for generating frequent pairs, and I am extending my work to
item sets of any length. The analysis of the extended approach will be
complete within the next two weeks.

Would you be interested in moving forward with such an implementation
as a GSoC project? If so, any comments/feedback would be very much
appreciated. If you are interested I can create a proposal and submit
it to your issue tracker when it comes back online.

Thanks,

Neal.


Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Robin Anil
Committers from Mahout apply as mentors. The number of students that Mahout
will get will be a portion of the ASF quota. So if ASF is getting, say, 20
people, they will be divided among different projects. I have never been on
the mentor side of GSOC, but hope to be this year. Grant, Ted, Isabel and Ian
will be able to comment better on the topic of the ASF quota etc.

Short answer: yes, ASF will be applying as an organization and Mahouters
will be applying as mentors.

Regards
Robin


On Mon, Mar 8, 2010 at 11:36 PM, zhao zhendong wrote:

> Hi
>
> Robin told me Mahout is going to apply to GSOC 2010 as a mentoring
> organization. Can anybody confirm? I would really appreciate this chance.
>
> Thanks,
>
> Zhen-Dong Zhao (Maxim)
>


Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Ted Dunning
Apache is definitely going to participate.  If Mahout gets strong
candidates, we will probably get one or more slots.

On Mon, Mar 8, 2010 at 10:06 AM, zhao zhendong wrote:

> Robin told me Mahout is going to apply to GSOC 2010 as a mentoring
> organization. Can anybody confirm? I would really appreciate this chance.
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Grant Ingersoll
Yes, I intend to mentor this year, again.  See the archives for useful tips 
on applying.  We've got several "veterans" around here now that can give 
advice, too.  The number one thing I see go wrong with proposals that are 
rejected is that they try to take on too much.  (Hmm, maybe we could train a 
classifier...)

-Grant

On Mar 8, 2010, at 1:24 PM, Ted Dunning wrote:

> Apache is definitely going to participate.  If Mahout gets strong
> candidates, we will probably get one or more slots.
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve




Re: Have Mahout applied GSOC 2010?

2010-03-09 Thread zhao zhendong
Hi Robin, Ted and Grant,

Thank you very much.

To Grant:
One more thing, could you please give us a link to the "archives" you
mentioned?

Cheers,
Zhendong

On Tue, Mar 9, 2010 at 3:11 AM, Grant Ingersoll  wrote:

> Yes, I intend on Mentoring this year, again.  See the archives for useful
> tips on applying.  We've got several "veterans" around here now that can
> give advice, too.  The number one thing I see go wrong with proposals that
> are rejected is they try to take on too much.  (hmm, maybe we could train a
> classifier...)
>
> -Grant
>


-- 
-

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

>>>>>>><><><><><><><><<><>><><<<<<<


Re: Have Mahout applied GSOC 2010?

2010-03-09 Thread Grant Ingersoll

On Mar 9, 2010, at 12:27 PM, zhao zhendong wrote:

> Hi Robin, Ted and Grant,
> 
> Thank you very much.
> 
> To Grant:
> One more thing, could you please give us a link to the "archives" you
> mentioned?

There's a bunch of 'em, but my personal fav. is 
http://search.lucidimagination.com ;-)

Just search for GSOC and related synonyms on the Mahout list.

Re: My ideas for GSoC 2010

2010-03-19 Thread Sean Owen
I think that's a fine project indeed. It even sounds a little
ambitious for a GSoC project. Understanding, implementing, and
parallelizing this approach is not trivial. If you want to propose it,
sure, but scaling it back a little is probably OK too. As always it's
best to propose a simple project that you can complete, document and
test, rather than a big project you half-finish.

On Fri, Mar 19, 2010 at 1:34 PM, cristi prodan
 wrote:
> IDEA 2 - Additions to Taste Recommender
> ---
> My second idea for this competition was to add some capabilities to the 
> Taste framework. I have reviewed a couple of papers from the Netflix contest 
> winning teams, read chapters 1 through 6 of [1] and looked into Taste's 
> code. My idea was to implement parallel prediction blending support using 
> linear regression or another machine learning method - but so far I haven't 
> got to a point where I have a clear solution for this. I'm preparing my 
> dissertation paper on recommender systems and this was the first idea I got 
> when thinking about participating in GSoC. If you have any ideas on this 
> and want to share them, I would be very thankful.


Re: My ideas for GSoC 2010

2010-03-19 Thread Ted Dunning
Minhash clustering is important for duplicate detection.  You can also
do simhash clustering, which might be simpler to implement.  I can imagine
an implementation where the map generates the simhash and emits multiple
copies keyed on part of the hash.  Near duplicates would thus be grouped
together in the reduce, where an in-memory duplicate detection could be done.

Sean's intuitions about project size are generally better than mine, but if
you are strict about not increasing scope, I think you could succeed with
this project.
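
A rough sketch of that map side (purely illustrative: the simhash below is
a toy Charikar-style version, and the 16-bit banding constants are
arbitrary):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimhashBandingMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable id, Text doc, Context ctx)
      throws IOException, InterruptedException {
    long hash = simhash(doc.toString());
    // emit one copy per 16-bit band; near duplicates agree on at least
    // one band with high probability and meet in the same reduce group
    for (int band = 0; band < 4; band++) {
      long bits = (hash >>> (band * 16)) & 0xFFFFL;
      ctx.write(new Text(band + ":" + bits), doc);
    }
  }

  // toy Charikar-style simhash: sum signed contributions of token hashes
  private static long simhash(String text) {
    int[] v = new int[64];
    for (String token : text.split("\\s+")) {
      long h = token.hashCode() * 0x9E3779B97F4A7C15L; // cheap 64-bit mix
      for (int i = 0; i < 64; i++) {
        v[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
      }
    }
    long hash = 0L;
    for (int i = 0; i < 64; i++) {
      if (v[i] > 0) {
        hash |= 1L << i;
      }
    }
    return hash;
  }
}

The reduce side then only needs an in-memory pairwise check, e.g. hamming
distance between the full 64-bit hashes, within each (usually tiny) group.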

On Fri, Mar 19, 2010 at 6:34 AM, cristi prodan wrote:

> IDEA 1 - MinHash clustering
> ---
> The first idea came after taking a look at Google's paper on collaborative
> filtering for their news system [2]. In that paper, I looked at MinHash
> clustering.
> My first question is: is MinHash clustering considered cool? If yes, then
> I would like to take a stab at implementing it.
> The paper also describes the implementation in a MapReduce style. Since
> this is only a suggestion I will not elaborate very much on the solution
> now. I would like to ask you whether this might be considered a good choice
> (i.e. important for the framework to have something like this) and whether
> this is a big enough project.
>


Re: My ideas for GSoC 2010

2010-03-22 Thread Ankur C. Goel
Since Sean already answered IDEA 2, I'll reply to IDEA 1.

Minhash (and shingling in general) are very efficient clustering techniques 
that have traditionally been employed by search engines for near-duplicate 
detection of web documents. They are known to be efficient and effective at 
web scale. They have also been applied successfully to "near-neighbor or 
similar ..." types of problems in recommendations, search, and various other 
domains on text, audio and image data. So I think they are quite "cool".

I personally have experimented with them in recommendations and found them 
to be more effective than I anticipated, especially when dealing with 
high-dimensional data. I have a basic implementation that I submitted to 
Mahout; see https://issues.apache.org/jira/browse/MAHOUT-344. You are welcome 
to work on it if you'd like.

Improving and integrating it with our recommender would make for a good GSOC 
project, IMO.
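
To give a flavor of the core trick, here is a tiny self-contained sketch
(illustrative only, not the code in the MAHOUT-344 patch):

import java.util.Arrays;
import java.util.Random;

public class MinHash {

  private static final int PRIME = 2147483647; // 2^31 - 1

  private final int[] a;
  private final int[] b;

  public MinHash(int numHashes, long seed) {
    Random rnd = new Random(seed);
    a = new int[numHashes];
    b = new int[numHashes];
    for (int i = 0; i < numHashes; i++) {
      a[i] = 1 + rnd.nextInt(PRIME - 1); // random hash h_i(x) = (a*x + b) mod PRIME
      b[i] = rnd.nextInt(PRIME);
    }
  }

  /** Signature of a set of non-negative item ids (e.g. a user's items). */
  public int[] signature(int[] itemIds) {
    int[] sig = new int[a.length];
    Arrays.fill(sig, Integer.MAX_VALUE);
    for (int item : itemIds) {
      for (int i = 0; i < a.length; i++) {
        int h = (int) (((long) a[i] * item + b[i]) % PRIME);
        if (h < sig[i]) {
          sig[i] = h;
        }
      }
    }
    return sig;
  }
}

Two sets agree in any given signature position with probability equal to
their Jaccard similarity, so keying on a few concatenated positions buckets
similar users or documents together.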

Regards
-...@nkur

On 3/19/10 7:04 PM, "cristi prodan" wrote:

Dear Mahout community,
[...]



Re: My ideas for GSoC 2010

2010-03-23 Thread cristi prodan
Hello again,

First of all, thank you all for taking the time to answer my ideas. Based on 
your thoughts, I have been digging around, and the project I would very much 
like to propose is a MapReduce implementation of the simhash algorithm. Thank 
you, Ted, for the great idea! I'm keen on implementing a new algorithm for 
Mahout, which seems manageable and doable during the summer period and which 
has a clear scope. I have noticed quite a few comments advising against 
taking on a project which would be too big, and I intend to follow them. 

After taking a deep look at the article pointed out by Ted, as well as other 
articles and similar ideas (Charikar's algorithm, Ankur's patch for MinHash), 
I've decided to write some notes on the algorithm and a schedule for 
implementing the idea during the competition.  I kindly ask for your feedback 
on this proposal. 


1. Problem and outline of simhash
---

Detecting similar files and classifying documents requires complex heuristics 
and/or O(n^2) pair-wise computations [1]. The simhash algorithm [1] relies on 
computing a hash function that hashes similar files to similar values. The 
file similarities would then be determined by comparing the pre-sorted hash 
key values (reducing the complexity to O(n log n)). 
In order to further improve the detection mechanism, the algorithm will also 
store some auxiliary data, used to compute the hash keys. This will be used 
as a second filter, i.e. after the hash key comparison indicates that two 
files are potentially similar.

Properties for the similarity function and the algorithm:
- very similar files map to very similar or even the same hash key;
- the distance between keys should be some measure of the difference between 
files. On its own this would lead to keys proportional to file sizes and 
would create false positives. Some auxiliary data will provide an easy and 
efficient way of refining the similarity detection;
- the metric used will be a binary metric (simhash operates at byte level);
- given a similarity metric, there needs to be a threshold to determine how 
close within the metric files need to be to count as similar. 

I would like my implementation of the simhash algorithm to answer two 
questions:
- retrieving a set of files similar to a given file;
- retrieving all pairs of similar files.


2. simhash implementation
--
Choose 16 8-bit tags - the bytes used for pattern matching. The idea is to 
count the occurrences of these tags within the file being processed. 


2.1 Computing the hash function for each file
-
for each directory
   for each file within the directory
      - scan through the file and look for matches for each tag;
      - the detection window is moved one bit at a time
      - if a match is found
         - set a skip_counter and then decrement it for the following bits;
      - a count of matches is stored for each tag, and these counts are
        stored as SUM_TABLE_ENTRIES
      - the KEY for the current file is computed as a function of the
        SUM_TABLE_ENTRIES (a linear combination of the sums) [#1]
      - after this processing, we store:
        | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |

The key function could also be implemented to take into account file 
extensions. 
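
To make 2.1 (and the second filter from 2.2 below) concrete, a rough
sketch - the tag values and key weights are placeholders I would tune, and
the skip_counter handling is omitted:

public class SimhashSketch {

  // 16 8-bit tags used for pattern matching (placeholder values)
  private static final int[] TAGS = {
      0x55, 0xAA, 0x0F, 0xF0, 0x33, 0xCC, 0x3C, 0xC3,
      0x66, 0x99, 0x5A, 0xA5, 0x69, 0x96, 0x78, 0x87};

  /** Count, for each tag, its occurrences in the file's bit stream,
   *  moving the 8-bit detection window one bit at a time. */
  public static int[] sumTableEntries(byte[] file) {
    int[] counts = new int[TAGS.length];
    int lastStart = file.length * 8 - 8;
    for (int bit = 0; bit <= lastStart; bit++) {
      int window = windowAt(file, bit);
      for (int t = 0; t < TAGS.length; t++) {
        if (window == TAGS[t]) {
          counts[t]++;
        }
      }
    }
    return counts;
  }

  // the 8 bits starting at the given bit offset
  private static int windowAt(byte[] data, int bitOffset) {
    int byteIdx = bitOffset / 8;
    int shift = bitOffset % 8;
    int hi = data[byteIdx] & 0xFF;
    int lo = byteIdx + 1 < data.length ? data[byteIdx + 1] & 0xFF : 0;
    return ((hi << shift) | (lo >>> (8 - shift))) & 0xFF;
  }

  /** KEY as a linear combination of the SUM_TABLE_ENTRIES [#1];
   *  the weights here are illustrative only. */
  public static long key(int[] counts) {
    long key = 0;
    for (int t = 0; t < counts.length; t++) {
      key += (long) (t + 1) * counts[t];
    }
    return key;
  }

  /** Second filter from 2.2: L1 distance between two sum tables. */
  public static long distance(int[] s1, int[] s2) {
    long d = 0;
    for (int i = 0; i < s1.length; i++) {
      d += Math.abs(s1[i] - s2[i]);
    }
    return d;
  }
}

Section 2.2's first filter compares key() values of pre-sorted files, and
distance() implements the second filter.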


2.2 Finding similarities

for each FILE
   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
   for each SIMILAR_FILE
      if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
         discard the candidate
      else
         compute the distance between SIMILAR_FILE's and FILE's
         SUM_TABLE_ENTRIES (i.e. the sum of the absolute values of the
         diffs between their entries)
         if the computed distance is within a tolerance or equal to 0 then
            the files are considered similar (identical if the distance is 0)


3. Distributing simhash on Hadoop
-
The algorithm above is well suited for a MapReduce implementation. 

1. Map
In this phase we compute the hash for a file. 
It outputs (File, simhash_key).

2. Reduce
Once every item has a simhash, group every file with the same simhash into a 
cluster, sorted by the simhash key (the key is an integer [#1]). 

QUESTIONS SO FAR:
1. For the distributed version, I believe there should be a preprocessing 
phase where data about the files is stored on HDFS, in a proper vector 
format? Would it be better to store this in HBase, Cassandra or something 
similar?
2. Would it be better in the reduce phase to actually eliminate the 
duplicates and create a single sorted-by-hash file?


4. Implementation schedule
--

*** May 24 - Jul 12:
* Week 1: write documentation for the algorithm on the wiki and start
planning how the objects would interact (classes, methods, etc.).
* Week 2,3,4: write the algorithm for hashing

Mahout GSoC 2010 proposal: Association Mining

2010-03-28 Thread Lukáš Vlček
Hello,

I would like to apply for Mahout GSoC 2010. My proposal is to implement an
Association Mining algorithm utilizing the existing PFPGrowth implementation
(http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

As for the Association Mining, I would like to implement a very general
algorithm based on an old and established method called GUHA [1]. A short and
informative description of the GUHA method can be found here [2]. A very
exhaustive theoretical foundation can be found here [3]. Since the GUHA
method has been developing since the 60s and is very rich now, and given the
limited time during the GSoC program, it would be wise to reduce the scope of
the initial implementation.

There are two main topics that I would like to focus on during GSoC:

1) Enhancing and generalizing PFPGrowth:
Frequent pattern mining is usually part of each association mining task. In
Mahout there is an existing implementation of the PFPGrowth algorithm. I
would like to utilize this implementation, but it would be necessary to
enhance or generalize it first. The goal here would be to keep the current
functionality and performance of PFPGrowth but allow more generalized
input/output data and conditions. We will need the frequencies of very rare
patterns, so if it turns out that mining only the top K is a limiting factor,
we would need to relax this approach. We will also need to take into account
negative features (those missing in an individual transaction). Negative
features can either be directly supported by the implementation of the FP
algorithm or be coded at the feature level (deciding which approach is better
will be part of the GSoC program). It should be noted that for the GSoC
program we will narrow the scope of association mining antecedents and
succedents to conjunctions of data features only.

2) API for custom association mining functions based on the 4ft table:
Association mining in the GUHA sense means testing hypotheses on a four-fold
table (4ft) [see 2, item 5]. Many association functions have been proposed
for GUHA; some are based on statistical tests (for example the Fisher test or
the chi-square test), some are based on frequency analysis without explicitly
referring to statistical tests, but in both cases all frequencies from the
four-fold table are needed. Some tests/functions do not require all
frequencies; however, keeping this association mining implementation general
means that we are targeting all frequencies from the 4ft table. The goal here
would be to provide implementations of a few GUHA functions and to design an
API for custom functions based on the 4ft table (if a custom function can be
expressed using 4ft table frequencies, it should be very easy to implement it
for an association mining job).
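
As a sketch of what that API could look like (the names are illustrative,
not a committed design; the example quantifier is GUHA's classic founded
implication):

/** Four-fold contingency table for a rule A => S:
 *  a = #(A & S), b = #(A & not S), c = #(not A & S), d = #(not A & not S). */
final class FourFoldTable {
  final long a, b, c, d;

  FourFoldTable(long a, long b, long c, long d) {
    this.a = a; this.b = b; this.c = c; this.d = d;
  }
}

/** A quantifier decides whether a rule holds on a given 4ft table. */
interface Quantifier {
  boolean holds(FourFoldTable t);
}

/** GUHA's founded implication: confidence a/(a+b) >= p with at least
 *  `base` supporting rows. */
final class FoundedImplication implements Quantifier {
  private final double p;  // minimal confidence
  private final long base; // minimal support count

  FoundedImplication(double p, long base) {
    this.p = p;
    this.base = base;
  }

  @Override
  public boolean holds(FourFoldTable t) {
    return t.a >= base && t.a >= p * (t.a + t.b);
  }
}

A custom function would then just be another Quantifier implementation, and
the mining job would evaluate it on the 4ft table of every candidate rule.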

General use case scenario:
Before the association mining task is executed, one would have to decide
which features can appear on the left-hand side of a rule (antecedents) and
which on the right-hand side (succedents). It may be practical to limit the
number of features on both sides (rules with many features may not be very
useful). Then a specific test or function for the 4ft table will be
specified, with additional custom thresholds.

Note: The terminology used in GUHA theory is not unified. Various
researchers use different terminology. This may be a source of confusion.

My background:
I have been working as a full-time Java developer for 9 years. Currently, I
am working for JBoss (thus being paid to work on open source). A few
years ago I also started Ph.D. studies in Ostrava, Czech Republic, and
association mining is the topic I am focusing on right now. Implementing
association mining in the context of Mahout makes sense because it is one of
the areas of data mining not yet covered in Mahout at all. A MapReduce
implementation can be one possible way to tackle the challenge of mining
association rules from large data sets.

Regards,
Lukas Vlcek

[1]
http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
[2] http://www.cs.cas.cz/coufal/guha/index.htm
[3] http://www.uivt.cas.cz/~hajek/guhabook/index.html


Re: My ideas for GSoC 2010

2010-03-31 Thread Cristian Prodan
Hi,

Can anyone please point me to a good data set on which I might try SimHash
clustering?
Thank you,

Cristi

On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan
wrote:

> Hello again,
> [...]

Re: My ideas for GSoC 2010

2010-03-31 Thread Robin Anil
Why don't you try it on 20 newsgroups? There are about 17-18 unique topics
and a couple of overlapping ones. You can easily find issues with the
clustering code with that dataset. Once it's done you can try bigger datasets
like Wikipedia.

Robin

On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan
wrote:

> Hi,
>
> Can anyone please point me to a good data set on which I might try SimHash
> clustering?
> Thank you,
>
> Cristi

Re: My ideas for GSoC 2010

2010-04-01 Thread Cristian Prodan
Thanks Robin, I will try to have a look at that.

Cristi.

On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil  wrote:

> Why don't you try it on 20 newsgroups? There are about 17-18 unique topics
> and a couple of overlapping ones. You can easily find issues with the
> clustering code with that dataset. Once it's done you can try bigger
> datasets like Wikipedia.
>
> Robin
>

Re: My ideas for GSoC 2010

2010-04-02 Thread Tanya Gupta
Hi

There are a couple of questions I would like to ask.

1. What type of clustering would you like me to use? Is K-Means good enough?

2. Can you tell me more about the MapReduce code that you would have
written? Or do I first need to implement that as well, using Hadoop?

Thanking You
Tanya



On Thu, Apr 1, 2010 at 1:21 PM, Cristian Prodan
wrote:

> Thanks Robin, I will try to have a look at that.
>
> Cristi.
>

Re: Mahout GSoC 2010: Association Mining

2010-04-09 Thread Ted Dunning
Neal, I think that this might well be a useful contribution to Mahout but,
if I am not mistaken, the deadline for student proposals for GSoC has just
passed.

That likely means that making this contribution an official GSoC project is
not possible.  I am sure that the Mahout community would welcome you as a
contributor even without official Google status.  If you would like to do
this, go ahead and propose what you want to do (when JIRA comes back or just
by email discussion) and you can get started.

On Fri, Apr 9, 2010 at 2:11 PM, Neal Clark  wrote:

> Hello,
>
> [...]


Re: Mahout GSoC 2010: Association Mining

2010-04-10 Thread Robin Anil
Like Ted said, it's a bit late for a GSOC proposal, but I am excited at the
possibility of improving the frequent pattern mining package. Check out the
current Parallel FPGrowth implementation in the code; you can find more
explanation of its usage on the Mahout wiki. Apriori should be trivially
parallelizable without the extra memory problem of PFPGrowth and should
scale well for large datasets. You can contribute it separately from GSOC;
the Apache community always welcomes such contributions. The wiki should
help you get started on Mahout development, with correct code style and
practices.  Let me know if you have any doubts or thoughts.

Robin
On Sat, Apr 10, 2010 at 5:51 AM, Ted Dunning  wrote:

> Neal, I think that this might well be a useful contribution to Mahout but,
> if I am not mistaken, the deadline for student proposals for GSoC has just
> passed.
>
> That likely means that making this contribution an official GSoC project is
> not possible.  I am sure that the Mahout community would welcome you as a
> contributor even without official Google status.  If you would like to do
> this, go ahead and propose what you want to do (when JIRA comes back or
> just
> by email discussion) and you can get started.
>


Re: Mahout GSoC 2010: Association Mining

2010-04-13 Thread Neal Clark
I am interested in contributing. The next two weeks will be a little
busy, as it is end of term, but I would be more than happy to work
over the summer on this project.

Perhaps you can also give me some advice on how to accomplish a few
tasks. Currently I am using NLineInputFormat to create/configure the
necessary map tasks.  Each map task reads the entire transaction file
from Hadoop's DistributedCache. Since each MapTask processes a
portion of the count table, it can emit the final pairs and
confidence values without a reduce phase. No extra intermediary data
is created.

If the number of nodes and/or input data were sufficiently large it
would be useful to be able to both duplicate and split the input data
across the cluster. Adding a reduce phase is quite costly as the
candidate items and not just the frequent items must be sent to the
reducer. My tests have shown that there are at least an order of
magnitude more candidates than frequent items. However, it is trivial
to throw a reduce phase on the end of the job to aggregate the items
if necessary. This could be configured via a job parameter.
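
To make that concrete, here is a rough sketch of the map-only step as
described above (hypothetical class and property names, old mapred API; the
slice bookkeeping is simplified to one candidate pair per input line, and
only support is emitted):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** NLineInputFormat hands each task its slice of the candidate table; the
    whole transaction file is read once from the DistributedCache. */
class CandidateSliceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  private final List<String[]> transactions = new ArrayList<String[]>();
  private long minSupport;

  public void configure(JobConf job) {
    minSupport = job.getLong("apriori.min.support", 100); // made-up key
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in =
          new BufferedReader(new FileReader(cached[0].toString()));
      for (String line = in.readLine(); line != null; line = in.readLine()) {
        transactions.add(line.split("\\s+"));
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("cannot read cached transactions", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, DoubleWritable> out, Reporter reporter)
      throws IOException {
    String[] pair = value.toString().split("\\s+"); // one candidate per line
    long count = 0;
    for (String[] t : transactions) {
      if (contains(t, pair[0]) && contains(t, pair[1])) {
        count++;
      }
    }
    if (count >= minSupport) {
      // emit relative support; confidence would need the item counts too
      out.collect(value, new DoubleWritable(count / (double) transactions.size()));
    }
  }

  private static boolean contains(String[] t, String item) {
    for (String s : t) {
      if (item.equals(s)) return true;
    }
    return false;
  }
}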

Do you have any ideas on how to best distribute the configuration and
input data? I have considered writing a custom input split or input
format to split the input data however it is not clear how the
configuration data would be distributed. I suppose keeping the
NLineInputFormat and reading the transactions directly from HDFS is an
option but then we would lose the benefit of Hadoop's location
awareness.

Thanks,

Neal.

On Sat, Apr 10, 2010 at 1:28 AM, Robin Anil  wrote:
> Like Ted said, it's a bit late for a GSOC proposal, but I am excited at the
> possibility of improving the frequent pattern mining package. Check out the
> current Parallel FPGrowth implementation in the code; you can find more
> explanation of its usage on the Mahout wiki. Apriori should be trivially
> parallelizable without the extra memory problem of PFPGrowth and should
> scale well for large datasets. You can contribute it separately from GSOC,
> and the Apache community always welcomes such contributions. The wiki should
> help you get started on Mahout development, with the correct code style and
> practices.  Let me know if you have any doubts or thoughts.
>
> Robin


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
Hi Lukas,
   Sorry for being late in getting back to you on this.
Association rule mining is a great addition to FPGrowth. I am not sure I
understand the GUHA method well, but then again I understood Ted's LLR after
some deep reading. Could you put up an interesting example to help us
understand this method, maybe starting from a shopping cart transaction? A
great demo is a big plus for a GSOC project.

Robin

On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček  wrote:

> Hello,
>
> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> Association Mining algorithm utilizing existing PFPGrowth implementation (
> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
>
> As for the Association Mining, I would like to implement a very general
> algorithm based on an old and established method called GUHA [1]. A short
> and informative description of the GUHA method can be found here [2]. A
> very exhaustive theoretical foundation can be found here [3]. Since the
> GUHA method has been developed since the 60's and is very rich now, and
> given the limited time during the GSoC program, it would be wise to reduce
> the scope of the initial implementation.
>
> There are two main topics that I would like to focus on during GSoC:
>
> 1) Enhancing and generalizing PFPGrowth:
> Frequent pattern mining is usually part of each association mining task. In
> Mahout there is an existing implementation of the PFPGrowth algorithm. I
> would like to utilize this implementation, but it would be necessary to
> enhance or generalize it first. The goal here would be to keep the current
> functionality and performance of PFPGrowth but allow more generalized
> input/output data and conditions. We will need to get the frequencies of
> very rare patterns, so if it turns out that mining only the top K is a
> limiting factor then we will need to relax that approach. We will also need
> to take into account negative features (those missing in an individual
> transaction). Negative features can either be supported directly by the
> implementation of the FP algorithm or be coded at the feature level
> (deciding which approach is better will be part of the GSoC program). It
> should be noted that for the GSoC program we will narrow the scope of
> association mining antecedents and succedents to conjunctions of data
> features only.
>
> 2) API for custom association mining functions based on the 4ft table:
> Association mining in the GUHA sense means testing hypotheses on a
> four-fold table (4ft) [see 2, item 5]. Many association functions have been
> proposed for GUHA; some of them are based on statistical tests (for example
> the Fisher test or the Chi-square test), and some are based on frequency
> analysis without explicitly referring to statistical tests, but in both
> cases all frequencies from the four-fold table are needed. Some
> tests/functions do not require all frequencies; however, keeping this
> association mining implementation general means that we are targeting all
> frequencies from the 4ft. The goal here would be to provide implementations
> of a few GUHA functions and to design an API for custom functions based on
> the 4ft table (if a custom function can be expressed using 4ft table
> frequencies then it should be very easy to implement it for an association
> mining job).
>
> General use case scenario:
> Before the association mining task is executed, one would have to decide
> which features can be on the left hand side of the rule (antecedents) and
> which on the right hand side of the rule (succedents). It may be practical
> to limit the number of features on both sides (rules with many features may
> not be very useful). Then a specific test or function for the 4ft table
> will be specified with additional custom thresholds.
>
> Note: The terminology used in GUHA theory is not unified. Various
> researchers used different terminology. This may be a source of confusion.
>
> My background:
> I have been working as a full-time Java developer for 9 years. Currently, I
> am working for JBoss (thus being paid for working on open source). A few
> years ago I also started Ph.D. studies in Ostrava, Czech Republic and
> association mining is the topic I am focusing on right now. Implementing
> association mining in context of Mahout makes sense because it is one of
> the
> areas of data mining which is not yet covered in Mahout at all. MapReduce
> implementation can be one possible way how to tackle the challenge of
> mining
> association rules from large data.
>
> Regards,
> Lukas Vlcek
>
> [1]
> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
> [2] http://www.cs.cas.cz/coufal/guha/index.htm
> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html
>
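
For readers unfamiliar with GUHA, a minimal sketch of the 4ft-table API the
proposal describes could look like this (hypothetical names, not existing
Mahout code; founded implication is one classic GUHA quantifier):

/** Four-fold table: object counts by antecedent/succedent membership. */
class FourFoldTable {
  final long a; // antecedent holds, succedent holds
  final long b; // antecedent holds, succedent does not
  final long c; // antecedent fails, succedent holds
  final long d; // antecedent fails, succedent does not

  FourFoldTable(long a, long b, long c, long d) {
    this.a = a; this.b = b; this.c = c; this.d = d;
  }

  long n() { return a + b + c + d; }
}

/** A GUHA quantifier decides whether a rule holds on a given 4ft table. */
interface Quantifier {
  boolean holds(FourFoldTable t);
}

/** Founded implication: confidence a/(a+b) >= p and support a >= base. */
class FoundedImplication implements Quantifier {
  private final double p;
  private final long base;

  FoundedImplication(double p, long base) { this.p = p; this.base = base; }

  public boolean holds(FourFoldTable t) {
    return t.a >= base && t.a >= p * (t.a + t.b);
  }
}

Any quantifier expressible from a, b, c and d (Fisher, Chi-square, and so on)
would then plug into the same interface.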


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
PS: Current TopK FPGrowth is pretty tightly coupled. But it can easily be
refactored out, and even a vanilla implementation of FPGrowth is not
difficult to re-create by re-using the existing methods.
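
For reference, the heart of a vanilla FPGrowth is just the FP-tree build; a
minimal sketch of the insertion step (hypothetical code, with the
header-table links and the mining pass omitted):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FPNode {
  final String item;
  long count;
  final Map<String, FPNode> children = new HashMap<String, FPNode>();

  FPNode(String item) { this.item = item; }

  /** Insert one transaction, already filtered to frequent items and
      sorted by descending global frequency. */
  void insert(List<String> transaction, int from) {
    if (from == transaction.size()) return;
    String head = transaction.get(from);
    FPNode child = children.get(head);
    if (child == null) {
      child = new FPNode(head);
      children.put(head, child);
    }
    child.count++;
    child.insert(transaction, from + 1);
  }
}

Shared prefixes collapse into shared paths, which is where the memory saving
over raw transaction lists comes from.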

Robin


On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:

> Hi Lukas,
>    Sorry for being late in getting back to you on this.
> Association rule mining is a great addition to FPGrowth. I am not sure I
> understand the GUHA method well, but then again I understood Ted's LLR
> after some deep reading. Could you put up an interesting example to help us
> understand this method, maybe starting from a shopping cart transaction? A
> great demo is a big plus for a GSOC project.
>
> Robin
>
>
> On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček wrote:
>
>> Hello,
>>
>> I would like to apply for Mahout GSoC 2010. My proposal is to implement
>> Association Mining algorithm utilizing existing PFPGrowth implementation (
>> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Robin,

I think it does not make sense for me to catch up with the GSoC timeline now,
as I am quite busy with other stuff. However, I will develop the proposal for
Association Mining (or GUHA if you like) and keep this discussion going.
I am really interested in contributing an implementation to Mahout, but as
of now the GSoC timeline is not of any help to me.

Let me look at this in detail and I will get back to the Mahout community
with more details.

As for the use cases for association mining, a lot of examples can be found
in the literature. When it comes to missing or negative attributes of the
data (of the transaction) I think there are a lot of examples as well. One
example would be analysis of a click stream, where you can learn that people
who visit a negative comment on a product blog never enter the order form.
I am not saying this is the best example, but in general this is the essence
of it. You simply need to take all possible values of the transaction into
account, even those missing from the market basket. I also remember that
Simpson's paradox is often referred to in connection with this issue. As for
GUHA, its power is that it has a well developed theoretical background. This
for example means that it states a formalized framework for various analyses
that probably have their origin in psychology and similar soft sciences, and
"association-like functions" between data attributes can be expressed using
4ft table members and user thresholds.
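
As a made-up illustration of that click-stream case: out of 10,000 sessions,
suppose a = 50 saw the negative comment and submitted the order form, b = 950
saw it and did not, c = 2,700 did not see it and ordered, and d = 6,300 did
neither. The confidence of "saw comment => ordered" is a/(a+b) = 50/1,000 =
5%, against a baseline of c/(c+d) = 2,700/9,000 = 30%, so the 4ft table
exposes a strong negative association that mining only positive frequent
patterns would miss.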

The biggest challenge in implementing this would be the fact that the
analysis has to deal with all the data (not just the most frequent
patterns) and all combinations. It is very resource intensive.

I am thinking of starting a wiki page for Mahout about association mining
... this could help, what do you think?

Regards,
Lukas

On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil  wrote:

> PS: Current TopK FPGrowth is pretty tightly coupled. But it can easily be
> refactored out, and even a vanilla implementation of FPGrowth is not
> difficult to re-create by re-using the existing methods.
>
> Robin
>
>
> On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:
>
> > Hi Lukas,
> >    Sorry for being late in getting back to you on this.
> > Association rule mining is a great addition to FPGrowth. I am not sure I
> > understand the GUHA method well, but then again I understood Ted's LLR
> > after some deep reading. Could you put up an interesting example to help
> > us understand this method, maybe starting from a shopping cart
> > transaction? A great demo is a big plus for a GSOC project.
> >
> > Robin
> >
> >
> > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček  >wrote:
> >
> >> Hello,
> >>
> >> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> >> Association Mining algorithm utilizing existing PFPGrowth implementation
> >> (http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
Hi Lukáš,
It would have been great if you could have participated in GSOC; there is
still time left, and you still have your proposal in the GSOC system. Take
your time to decide, but if you choose not to participate, do remove the
application from the SoC website.

A wiki page for association mining would be a good start. The pattern mining
package needs to grow beyond just FPGrowth. Resource intensive operations on
large datasets are exactly what Mahout, using Hadoop, should do best. I can
help around the code as much as you like to make it more generic and
suitable for association mining.

Regards
Robin

On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček  wrote:

> Robin,
>
> I think it does not make sense for me to catch up with the GSoC timeline
> now, as I am quite busy with other stuff. However, I will develop the
> proposal for Association Mining (or GUHA if you like) and keep this
> discussion going. I am really interested in contributing an implementation
> to Mahout, but as of now the GSoC timeline is not of any help to me.

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Ted Dunning
Lukas,

The strongest alternative for this kind of application (and the normal
choice for large scale applications) is on-line gradient descent learning
with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
some outcome (click or purchase or signup) from a variety of large
vocabulary features.  As such association mining is usually just a
pre-processing step before actual learning is applied.  There is some
indication that an efficient sparse on-line gradient descent algorithm
applied to features and combinations could do just as well, especially if
the learning on combinations is applied in several passes.

These on-line algorithms have the virtue of being extremely fast and with
feature sharding, have substantial potential for parallel implementation.
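
For concreteness, one common shape of such an on-line update (logistic loss
with L_1 + L_2 penalties, the regularizers applied lazily to the touched
features only; a sketch, not Mahout code):

class OnlineLogisticSGD {
  private final double[] w;
  private final double lambda1; // L_1 strength
  private final double lambda2; // L_2 strength
  private final double eta;     // learning rate

  OnlineLogisticSGD(int dim, double lambda1, double lambda2, double eta) {
    this.w = new double[dim];
    this.lambda1 = lambda1;
    this.lambda2 = lambda2;
    this.eta = eta;
  }

  /** One on-line step on a sparse example; label y is 0 or 1. */
  void train(int[] idx, double[] val, int y) {
    double margin = 0;
    for (int k = 0; k < idx.length; k++) {
      margin += w[idx[k]] * val[k];
    }
    double p = 1.0 / (1.0 + Math.exp(-margin)); // logistic prediction
    double err = p - y;                         // dLoss/dMargin for log loss
    for (int k = 0; k < idx.length; k++) {
      int j = idx[k];
      // loss gradient plus L_2 shrinkage plus L_1 sub-gradient
      w[j] -= eta * (err * val[k] + lambda2 * w[j] + lambda1 * Math.signum(w[j]));
    }
  }
}

Feature sharding then roughly amounts to splitting the index range of w
across machines.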

What do you think about these two methods?  Can you compare them?

On Fri, Apr 9, 2010 at 4:26 AM, Lukáš Vlček  wrote:

>  One example would be analysis of a click stream, where you can learn that
> people who visit a negative comment on a product blog never enter the order
> form. I am not saying this is the best example, but in general this is the
> essence of it. You simply need to take all possible values of the
> transaction into account, even those missing from the market basket.
>
> The biggest challenge in implementing this would be the fact that the
> analysis has to deal with all the data (not just the most frequent
> patterns) and all combinations. It is very resource intensive.
>


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Ted,

do you think you can give some good links to papers or other resources about
the mentioned approaches? I would like to look at them after the weekend.
As far as I can see, association mining (and the GUHA method in its original
form) is not meant to be a predictive method but rather a data exploratory
method (although it has some kind of predictive power, this is not formally
supported in the theoretical background). However, comparing association
mining to other approaches could be a very interesting topic as well.

Best regards,
Lukas

On Fri, Apr 9, 2010 at 8:03 PM, Ted Dunning  wrote:

> Lukas,
>
> The strongest alternative for this kind of application (and the normal
> choice for large scale applications) is on-line gradient descent learning
> with an L_1 or L_1 + L_2 regularization.  The typical goal is to predict
> some outcome (click or purchase or signup) from a variety of large
> vocabulary features.  As such association mining is usually just a
> pre-processing step before actual learning is applied.  There is some
> indication that an efficient sparse on-line gradient descent algorithm
> applied to features and combinations could do just as well, especially if
> the learning on combinations is applied in several passes.
>
> These on-line algorithms have the virtue of being extremely fast and with
> feature sharding, have substantial potential for parallel implementation.
>
> What do you think about these two methods?  Can you compare them?


Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread zhao zhendong
 Hi all,
The updated proposal for GSoC 2010 is as follows, any comment is welcome.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout
Student: Zhen-Dong Zhao
Student e-mail: zha...@comp.nus.edu.sg
Student Major: Multimedia Information Retrieval / Computer Science
Student Degree: Master
Student Graduation: NUS’10
Organization: Hadoop

0 Abstract

Linear Support Vector Machines (SVMs) are quite useful in applications with
large-scale datasets or datasets with high dimensional features. This
proposal will port one of the best known linear SVM solvers, LIBLINEAR [1],
to Mahout with a unified interface shared with Pegasos [2], another linear
SVM solver that I have almost finished porting to Mahout. The two distinct
contributions would be: 1) introducing LIBLINEAR to Mahout; 2) a unified
interface for linear SVM classifiers.

1 Motivation

As one of the top 10 algorithms in the data mining community [3], the
Support Vector Machine is a very powerful machine learning tool, widely
adopted in the data mining, pattern recognition and information retrieval
domains.

The SVM training procedure is quite slow, however, especially in the case of
large-scale datasets. Several recent papers propose SVM solvers with a
linear kernel that can handle large-scale learning problems, for instance
LIBLINEAR [1] and Pegasos [2]. I have implemented a prototype of a linear
SVM classifier based on Pegasos [2] for Mahout (issue: MAHOUT-232).
Moreover, as the winner of the ICML 2008 large-scale learning challenge,
linear SVM track (http://largescale.first.fraunhofer.de/summary/), LIBLINEAR
[1] is supposed to be incorporated into Mahout too. Currently, the LIBLINEAR
package supports:

   - L2-regularized classifiers: L2-loss linear SVM, L1-loss linear SVM, and
     logistic regression (LR)
   - L1-regularized classifiers: L2-loss linear SVM and logistic regression
     (LR)


The main features of LIBLINEAR are the following:

   - Multi-class classification: 1) one-vs-the-rest, 2) Crammer & Singer
   - Cross validation for model selection
   - Probability estimates (logistic regression only)
   - Weights for unbalanced data

*All the functionalities are supposed to be implemented except probability
estimates and weights for unbalanced data* (if time permits, I would like to
do those as well).

2 Unified Interfaces

The linear SVM classifier based on the Pegasos package on Mahout can already
provide the following functionality: *(http://issues.apache.org/jira/browse/MAHOUT-232)*

   - Sequential Binary Classification (Two-class Classification), including
     sequential training and prediction;
   - Sequential Regression;
   - Parallel & Sequential Multi-Classification, including One-vs.-One and
     One-vs.-Others schemes.

Apparently, the functionality of the Pegasos package on Mahout and that of
LIBLINEAR are quite similar. As mentioned above, in this section I will
introduce a unified interface for linear SVM classifiers on Mahout, which
will incorporate both Pegasos and LIBLINEAR. The whole picture of the
interface is illustrated in Figure 1.

The unified interface has two main parts: 1) the dataset loader; 2) the
algorithms. I will introduce them separately.

*2.1 Data Handler*

The dataset can be stored on a personal computer or on a Hadoop cluster.
This framework provides a high performance Random Loader and a Sequential
Loader for accessing large-scale data.

 Figure 1: The framework of linear SVM on Mahout

*2.2 Sequential Algorithms*

The sequential algorithms will include binary classification and regression
based on Pegasos and LIBLINEAR behind the unified interface.
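
For reference, a self-contained sketch of the published Pegasos update [2]
(dense vectors for brevity; this is not the MAHOUT-232 code itself):

class PegasosBinarySVM {
  private final double[] w;
  private final double lambda;
  private long t = 0;

  PegasosBinarySVM(int dim, double lambda) {
    this.w = new double[dim];
    this.lambda = lambda;
  }

  /** One Pegasos step on example x with label y in {-1, +1}. */
  void train(double[] x, int y) {
    t++;
    double eta = 1.0 / (lambda * t); // step size 1/(lambda t)
    double margin = 0;
    for (int j = 0; j < w.length; j++) {
      margin += w[j] * x[j];
    }
    for (int j = 0; j < w.length; j++) {
      w[j] *= 1.0 - eta * lambda;                 // regularizer shrinkage
      if (y * margin < 1) w[j] += eta * y * x[j]; // hinge-loss sub-gradient
    }
    double norm = 0;                              // optional projection step
    for (double v : w) norm += v * v;
    norm = Math.sqrt(norm);
    double radius = 1.0 / Math.sqrt(lambda);
    if (norm > radius) {
      for (int j = 0; j < w.length; j++) w[j] *= radius / norm;
    }
  }
}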

*2.3 Parallel Algorithms*

It is widely accepted that parallelizing a binary SVM classifier is hard.
For multi-classification, however, a coarse-grained scheme (e.g. each Mapper
or Reducer owns one independent binary SVM classifier) can more easily
achieve a large improvement. Besides, cross validation for model selection
can also take advantage of such coarse-grained parallelism. I will introduce
a unified interface for all of them.
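
As a sketch of that coarse-grained scheme, with plain threads standing in for
the independent Mappers and reusing the hypothetical PegasosBinarySVM from
2.2 above (in-memory data only):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OneVsRestTrainer {
  PegasosBinarySVM[] train(final double[][] x, final int[] labels,
                           int numClasses, final double lambda)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(numClasses);
    List<Future<PegasosBinarySVM>> results =
        new ArrayList<Future<PegasosBinarySVM>>();
    for (int c = 0; c < numClasses; c++) {
      final int target = c;
      // one independent binary problem per class, like one map task each
      results.add(pool.submit(new Callable<PegasosBinarySVM>() {
        public PegasosBinarySVM call() {
          PegasosBinarySVM model = new PegasosBinarySVM(x[0].length, lambda);
          for (int i = 0; i < x.length; i++) {
            model.train(x[i], labels[i] == target ? +1 : -1);
          }
          return model;
        }
      }));
    }
    PegasosBinarySVM[] models = new PegasosBinarySVM[numClasses];
    for (int c = 0; c < numClasses; c++) {
      models[c] = results.get(c).get();
    }
    pool.shutdown();
    return models;
  }
}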

3 Biography:

I am a graduating master's student working on multimedia information
retrieval at the National University of Singapore. My research has involved
large-scale SVM classifiers.

I have worked with Hadoop and MapReduce for about a year, and I have
dedicated a lot of my spare time to the sequential SVM (Pegasos) on Mahout.

*(http://issues.apache.org/jira/browse/MAHOUT-232).* I have taken part in
setting up and maintaining a Hadoop cluster with around 70 nodes in our
group.

4 Timeline:

Weeks 1-4: Implement binary classifier

Weeks 5-6: Implement parallel multi-class classification and cross
validation for model selection

Weeks 7-8: Interface refactoring

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread Robin Anil
Hi Zhao,
  Some quick feedback.

1) Can you update the GSoC issue on the classifier with a Nabble link to this
thread, or one from any other aggregator?
2) I hope you will read the GSoC timeline more closely; there is a mid-term
evaluation, an end-term evaluation and some buffer time. You will have to
change your current timeline to accurately reflect that.

 I will post more queries about the design choices later.

Robin

On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong wrote:

>  Hi all,
> The updated proposal for GSoC 2010 is as follows, any comment is welcome.

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread zhao zhendong
Sure, I will revise it tonight.

Thanks, Robin.


On Fri, Mar 12, 2010 at 7:22 PM, Robin Anil  wrote:

> Hi Zhao,
>  Some quick feedback.
>
> 1) Can you update the GSoC issue on the classifier with a Nabble link to
> this thread, or one from any other aggregator?
> 2) I hope you will read the GSoC timeline more closely; there is a mid-term
> evaluation, an end-term evaluation and some buffer time. You will have to
> change your current timeline to accurately reflect that.
>
>  I will post more queries about the design choices later.
>
> Robin
>
> On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong  >wrote:
>
> >  Hi all,
> > The updated proposal for GSoC 2010 is as follows, any comment is welcome.

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread Ted Dunning
Also, this proposal should itself be a JIRA ticket with the GSOC tag
applied to it.  That will make it visible to the Apache-wide Summer of Code
administrators.

On Fri, Mar 12, 2010 at 3:57 AM, zhao zhendong wrote:

> Sure, I will revise it tonight.
>
> Thanks, Robin.

[GSoC 2010] Proposal to implement SimHash clustering for Mahout

2010-04-07 Thread Cristian Prodan
Hello,

I'm posting a draft for my proposal for this year's GSoC. I kindly ask for
your feedback on it.

I have also posted a JIRA ticket with it:
https://issues.apache.org/jira/browse/MAHOUT-365 .

Thank you in advance.
Cristi.

-

Application for Google Summer of Code 2010 - Mahout Project

Student: Cristian Prodan

1. Synopsis

I will add a map-reduce implementation of the SimHash clustering algorithm
to the Mahout project. This algorithm provides an efficient way of finding
similar/identical files in a large collection of files.


2. Project

Storage capacities keep growing and thus it is difficult to organize and
manage growing file systems. It is easy to lose track of identical copies
or older versions of the files in a directory structure. Duplicate detection
technologies do not extend well when files are not identical. A typical idea
for detecting similar files is to map the features of a file into a
high-dimensional space and then use distance within that space as a measure
of similarity. This may not be feasible since it involves O(n^2) algorithmic
complexity. If these file-to-vector mappings are reduced to a
one-dimensional space, then the data points can be sorted in O(n log n)
time - a big increase in detection speed.

I will implement the SimHash algorithm presented in detail in [1]. The idea
is the following: using a hash function that hashes similar files to similar
values, file similarity can be determined simply by comparing pre-sorted
hash key values.
I will implement a family of similarity hash functions which will do this,
as described in [1]. Furthermore, the performance will be enhanced by
storing auxiliary data used to compute the hash keys. This data will be used
as a second filter after a hash key comparison indicates that two files are
potentially similar.

Properties for the similarity function and the algorithm:
- very similar files map to very similar or even the same hash key;
- distance between keys should be some measure of the difference between
files. This would lead to keys proportional to file sizes and this would
create false positives. The auxiliary data mentioned above will provide an
easy and efficient way of refining the similarity detection.
- the metric used will be a binary metric (simhash operates at byte level).
- given a similarity metric, there needs to be a threshold to determine how
close within the metric files need to be to count as similar.

From a distributed point of view, the algorithm above is well suited for a
MapReduce implementation. The sketch is the following:

I. MAP phase
In this phase we compute the hash for a file along with additional info
which serves as a second filter in similarity detection.
It outputs (File, simhash_key).

II. REDUCE phase
Once every file has a simhash, group the files with the same simhash into a
cluster, sorted by the simhash key (the key is an integer).
* The similarity check can be done in main memory.
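
For concreteness, one common simhash construction (Charikar-style bit voting
over a 64-bit per-feature hash; a sketch, not necessarily the exact
byte-level scheme of [1]):

class SimHash {
  /** 64-bit simhash over a bag of features extracted from a file. */
  static long of(Iterable<String> features) {
    int[] votes = new int[64];
    for (String f : features) {
      long h = hash64(f);
      for (int bit = 0; bit < 64; bit++) {
        votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1; // per-bit vote
      }
    }
    long fingerprint = 0;
    for (int bit = 0; bit < 64; bit++) {
      if (votes[bit] > 0) fingerprint |= 1L << bit;
    }
    return fingerprint;
  }

  /** Hamming distance between fingerprints approximates dissimilarity. */
  static int distance(long x, long y) {
    return Long.bitCount(x ^ y);
  }

  /** Any decent 64-bit string hash will do; this is a simple polynomial one. */
  private static long hash64(String s) {
    long h = 1125899906842597L; // prime seed
    for (int i = 0; i < s.length(); i++) {
      h = 31 * h + s.charAt(i);
    }
    return h;
  }
}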


3. Deliverables:

1. A batch job for sim-hashing the files. The hashes will be stored either on
a normal file system (for small tests), on HDFS or on HBase.
2. A tool for answering the following types of queries:
- retrieving a set of files, similar to a given file;
- retrieving all pairs of similar files.
There will be the option to run this as a standalone program, for tests or
smaller scale purposes.
3. Documentation and unit tests for the written code.
4. Getting started tutorials for SimHash.
5. Demos for SimHash applied to 20newsgroups and Wikipedia data sets.


4. Benefits for the Mahout community:

1) A distributed tool for efficient similarity detection of files in large
datasets. Algorithms for detecting similar files can also be useful for
classification purposes and as an aid to search. The next release of Mahout
will contain this functionality.
2) This can also be used at a smaller scale, by running it in an
"un-distributed" mode for small datasets.
3) Good tutorials on how to work with this tool.


5. Roadmap

1).  Community Bonding Period (21st April to 24th May)
- 1 week: Familiarize myself with Mahout, Hadoop, and the existing clustering
algorithms;
- 2 weeks: Understand the data structures (Vector, ClusterBase, Writable and
other interfaces and classes used in Mahout) and start initial planning of
the system in terms of interactions and data structures.
- 1 week: Setup a hadoop cluster formed of 2-3 nodes;
- During this time, I plan to speak with my mentor and ask him very specific
questions about the plans I make for the implementation.

2).  May 24th to Jul 12th
- Week 1: Detailed planning on how the objects would interact (classes,
methods, the Vector interfaces I plan to use etc.) and ask for feedback from the
mentor.
- Week 2: Write the code for hashing files (a batch process which will work
as an indexer), according to the algorithm, in a TDD style. At the end,
tests will accompany the code.

[jira] Created: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Shannon Quinn (JIRA)
Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
--

 Key: MAHOUT-363
 URL: https://issues.apache.org/jira/browse/MAHOUT-363
 Project: Mahout
  Issue Type: Task
Reporter: Shannon Quinn


Proposal Title: EigenCuts spectral clustering implementation on map/reduce for 
Apache Mahout (addresses issue Mahout-328)

Student Name: Shannon Quinn

Student E-mail: mag...@gmail.com

Organization/Project:
Assigned Mentor:

Proposal Abstract:
Clustering algorithms are advantageous when the number of classes is not known
a priori. However, most techniques still require an explicit K to be chosen, 
and most spectral algorithms' use of piecewise constant approximation of 
eigenvectors breaks down when the clusters are tightly coupled. EigenCuts[1] 
solves both these problems by choosing an eigenvector to create a new cluster 
boundary and iterating until no more edges are cut.

Detailed Description

Clustering techniques are extremely useful unsupervised methods, particularly 
within my field of computational biology, for situations where the number (and 
often the characteristics as well) of classes expressed in the data are not 
known a priori. K-means is a classic technique which, given some K, attempts to 
label data points within a cluster as a function of their distance (e.g. 
Euclidean) from the cluster's mean, iterating to convergence.

Another approach is spectral clustering, which models the data as a weighted, 
undirected graph in some n-dimensional space, and creates a matrix M of 
transition probabilities between nodes. By computing the eigenvalues and 
eigenvectors of this matrix, most spectral clustering techniques take advantage 
of the fact that, for data with loosely coupled clusters, the K leading 
eigenvectors will identify the roughly piecewise constant regions in the data 
that correspond to clusters.

However, these techniques all suffer from drawbacks, the two most significant 
of which are having to choose an arbitrary K a priori, and in the situation of 
tightly coupled clusters where the piecewise constant approximation on the 
eigenvectors no longer holds.

The EigenCuts algorithm addresses both these issues. As a type of spectral 
clustering algorithm it works by constructing a Markov chain representation of 
the data and computing the eigenvectors and eigenvalues of the transition 
matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
half life of flow decay called eigenflow. By perturbing the weights between 
nodes, it can be observed where bottlenecks exist in the eigenflow's halflife, 
allowing for the identification of boundaries between clusters. Thus, this 
algorithm iterates until no more cuts between clusters need to be made, 
eliminating the need for an a priori K, and conferring the ability to separate
tightly coupled clusters.

The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
eigenvalues at each iterative step, incurring a large computational overhead. 
This problem can be adequately addressed within the map/reduce framework and on 
a Hadoop cluster by parallelizing the computation of each eigenvector and its 
associated eigenvalue. Apache Hama in particular, with its specializations in 
graph and matrix data, will be crucial in parallelizing the computations of 
transition matrices and their corresponding eigenvalues and eigenvectors at 
each iteration.
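
As a rough serial sketch of the spectral core being parallelized here: build
the row-stochastic transition matrix M = D^-1 W from the affinity matrix W,
then approximate a dominant eigenvector by power iteration. EigenCuts needs
more of the spectrum than this (e.g. via deflation or Lanczos), and the
map/reduce win comes from distributing the matrix-vector products.

class SpectralSketch {
  /** Row-normalizes an affinity matrix W into M = D^-1 W.
      Assumes every node has positive degree. */
  static double[][] transitionMatrix(double[][] w) {
    int n = w.length;
    double[][] m = new double[n][n];
    for (int i = 0; i < n; i++) {
      double degree = 0;
      for (int j = 0; j < n; j++) degree += w[i][j];
      for (int j = 0; j < n; j++) m[i][j] = w[i][j] / degree;
    }
    return m;
  }

  /** Power iteration: approximates the dominant eigenvector of m. */
  static double[] powerIteration(double[][] m, int steps) {
    int n = m.length;
    double[] v = new double[n];
    java.util.Arrays.fill(v, 1.0 / n);
    for (int s = 0; s < steps; s++) {
      double[] next = new double[n];
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          next[i] += m[i][j] * v[j];
        }
      }
      double norm = 0;
      for (double x : next) norm += x * x;
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) next[i] /= norm; // re-normalize each step
      v = next;
    }
    return v;
  }
}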

Since Dr Chennubhotla is currently a member of the faculty at the University of 
Pittsburgh, I have been in contact with him for the past few weeks, and we both 
envision and eagerly anticipate continued collaboration over the course of the 
summer and this project's implementation. He has thus far been instrumental in 
highlighting the finer points of the underlying theory, and coupled with my 
experience in and knowledge of software engineering, this is a project we are 
both extremely excited about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. 
It may not do much, but what it does will work and will have accompanying
unit tests with full coverage.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - 
familiarization with and dev deployments of Hadoop, Mahout, Hama; reading up on 
documentation, fine-tuning the project plan and requirements. This part will 
kick into high gear after May 6 (my last final exam and final academic 
obligation, prior to the actual graduation ceremony), but likely nothing before 
April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering 
algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows 
for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Modify k-means algorithm to spectral 
clustering. 

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853303#action_12853303
 ] 

Ted Dunning commented on MAHOUT-363:



Nice proposal.  Well written and well conceived.

One tiny suggestion is to omit Hama from your plan as it would just be a 
distraction for you.  The Hama project is pretty much independent of Mahout and 
there hasn't been any contribution in the H->M direction.



> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:
> Assigned Mentor:

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853306#action_12853306
 ] 

Shannon Quinn commented on MAHOUT-363:
--

Thank you for the feedback!

Cutting out Hama would certainly improve the feasibility of the project 
timeline and allow me to further refine the overall algorithm. I will 
absolutely follow your advice; I'll edit this ticket and my GSoC 
application. Thank you again!

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. He has thus far been 
> instrumental in highlighting the finer points of the underlying theory, and 
> coupled with my experience in and knowledge of software engineering, this is 
> a project we are both extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, func

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853308#action_12853308
 ] 

Jake Mannix commented on MAHOUT-363:


... and actually, there is no need for Hama, as distributed matrix computations 
can already be done in Mahout natively (in particular: finding eigenvectors and 
eigenvalues is already done as a Map/Reduce job).

This proposal seems to be at exactly the right level of additional work and 
utilization of our current set of primitive operations for a GSoC project.  I 
wish I had the time to help with mentoring this project, in fact.
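
As an aside for anyone following along: the kernel inside such an iterative 
eigensolver is a distributed matrix-times-vector product, which decomposes 
naturally into map and reduce steps. The sketch below is illustrative only - 
plain in-memory Java with hypothetical names, not Mahout's actual job code:

import java.util.ArrayList;
import java.util.List;

public class MatrixVectorSketch {

  // "Map" step: each task sees one row of A plus the current iterate v,
  // and emits the pair (rowIndex, dot(row, v)).
  static double[] mapRow(int rowIndex, double[] row, double[] v) {
    double dot = 0.0;
    for (int j = 0; j < row.length; j++) {
      dot += row[j] * v[j];
    }
    return new double[] { rowIndex, dot };
  }

  // "Reduce" step: collect the emitted pairs into the next iterate A*v.
  // Summing (rather than assigning) also covers the case where a single
  // row is split across several map tasks.
  static double[] reduce(List<double[]> emitted, int n) {
    double[] av = new double[n];
    for (double[] pair : emitted) {
      av[(int) pair[0]] += pair[1];
    }
    return av;
  }

  public static void main(String[] args) {
    double[][] a = { { 2, 1 }, { 1, 3 } };
    double[] v = { 1, 1 };
    List<double[]> emitted = new ArrayList<double[]>();
    for (int i = 0; i < a.length; i++) {
      emitted.add(mapRow(i, a[i], v));
    }
    double[] av = reduce(emitted, v.length);
    System.out.println(av[0] + ", " + av[1]); // prints 3.0, 4.0
  }
}

An eigensolver such as Lanczos essentially repeats this product (plus some 
small serial bookkeeping) once per iteration, which is why the map/reduce 
formulation fits so well.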

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. He has thus far been 
> instrumental in highlighting the finer points of the underlying theory, and 
> coupled with my experience in and knowledge of software engineering,

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Shannon Quinn (JIRA)
and eagerly anticipate continued collaboration over the course of the 
summer and this project's implementation. His guidance in highlighting the 
finer points of the underlying theory, coupled with my experience in 
software engineering, makes this a project we are both extremely excited 
about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. 
It may not do much, but what it does will work and will be accompanied by unit 
tests with full coverage.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - 
familiarization with and dev deployments of Hadoop, Mahout, Hama; reading up on 
documentation, fine-tuning the project plan and requirements. This part will 
kick into high gear after May 6 (my last final exam and final academic 
obligation, prior to the actual graduation ceremony), though likely nothing 
will happen before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering 
algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows 
for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Extend the k-means algorithm to spectral 
clustering. Integrate the map/reduce framework via Hama for calculating 
transition matrices and associated eigenvectors and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to 
EigenCuts. Fully parallelize with Hama. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with 
EigenCuts. Finalize interface for running the algorithm, selecting datasets and 
visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit 
tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational 
biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at 
Carnegie Mellon this coming fall. I have worked extensively with clustering 
techniques as a master's student, as a significant amount of my work has 
involved bioimage analysis within Dr Robert Murphy's lab. Previous work has 
involved using HMMs to detect patterns in tuberculosis genomes and using 
CLUSTALW to cluster those patterns according to geographic location. My 
master's thesis involves the use of matrix completion to infer linear 
transformations of proteins and their associated subcellular locations across 
varying cell conditions (drugs, cell lines, etc.). 

I unfortunately have limited experience with Apache Mahout and Hadoop, but with 
an undergraduate computer science degree from Georgia Tech and an internship 
with IBM ExtremeBlue behind me, I feel I am adept at picking up new frameworks 
quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for 
Spectral Clustering. NIPS 2002.


> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eig

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853409#action_12853409
 ] 

Robin Anil commented on MAHOUT-363:
---

Hi Shannon, did you take the time to explore the Mahout code? I believe the 
k-means implementation you are planning is already there; reusing it would 
shave 2 weeks off your GSoC :). Reading the code/wiki is a great exercise that 
will help you make your proposal more realistic.

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. He has thus far been 
> instrumental in highlighting the finer points of the underlying theory, and 
> coupled with my experience in and knowledge of software engineering, this is 
> a project we are both extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, functional 
> deliverable. I

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853415#action_12853415
 ] 

Shannon Quinn commented on MAHOUT-363:
--

Hi Robin, thanks for the suggestions! I have started playing with the code 
(though I should read the wiki more, excellent point). My rationale behind 
the k-means implementation was to build an initial framework for the EigenCuts 
algorithm, around which I could build a UI that accepted input and displayed 
output, and also to tap into the map/reduce framework. Both would work 
alongside a very simple clustering technique that would be easily implemented. 
Essentially, the goal of the first sprint is to get the UI and map/reduce 
integration running; the k-means algorithm is merely a test condition for 
those two features, given its ease of implementation.

That's just my explanation; if you feel otherwise I'm happy to adjust my 
proposal :)

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with 

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Shannon Quinn (JIRA)
anticipate continued collaboration over the course of the 
summer and this project's implementation. His guidance in highlighting the 
finer points of the underlying theory, coupled with my experience in 
software engineering, makes this a project we are both extremely excited 
about implementing.

Timeline

At the end of each sprint, there should be a concrete, functional deliverable. 
It may not do much, but what it does will work and will be accompanied by unit 
tests with full coverage.

Sprint 0 (April 26 - May 23): Work with mentor on any pre-coding tasks - 
familiarization with and dev deployments of Hadoop and Mahout; reading up on 
documentation, fine-tuning the project plan and requirements. This part will 
kick into high gear after May 6 (my last final exam and final academic 
obligation, prior to the actual graduation ceremony), though likely nothing 
will happen before April 29 (the day of my thesis defense).

Sprint 1 (2 weeks; May 24 - June 6): Implement basic k-means clustering 
algorithm and parallelize on Mahout over Hadoop. Preliminary interface allows 
for dataset selection and visualization of resulting clusters.

Sprint 2 (3 weeks; June 7 - 27): Extend the k-means algorithm to spectral 
clustering. Integrate the map/reduce framework via Mahout, taking advantage of 
its existing core calculations of transition matrices and associated 
eigenvectors and eigenvalues.

Sprint 3 (3 weeks; June 28 - July 18): Augment spectral clustering to 
EigenCuts. Fully parallelize with Mahout. Also, mid-term evaluations.

Sprint 4 (3 weeks; July 19 - August 8): Fix any remaining issues with 
EigenCuts. Finalize interface for running the algorithm, selecting datasets and 
visualizing results.

Sprint 5 (1 week; August 9 - 15): Tidy up documentation, write final unit 
tests, fix outstanding bugs.

Other Information

I am finishing up my last semester as a master's student in computational 
biology at Carnegie Mellon, prior to beginning the PhD program in CompBio at 
Carnegie Mellon this coming fall. I have worked extensively with clustering 
techniques as a master's student, as a significant amount of my work has 
involved bioimage analysis within Dr Robert Murphy's lab. Previous work has 
involved using HMMs to detect patterns in tuberculosis genomes and using 
CLUSTALW to cluster those patterns according to geographic location. My 
master's thesis involves the use of matrix completion to infer linear 
transformations of proteins and their associated subcellular locations across 
varying cell conditions (drugs, cell lines, etc.). 

I unfortunately have limited experience with Apache Mahout and Hadoop, but with 
an undergraduate computer science degree from Georgia Tech and an internship 
with IBM ExtremeBlue behind me, I feel I am adept at picking up new frameworks 
quickly.

References

[1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for 
Spectral Clustering. NIPS 2002.


> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854950#action_12854950
 ] 

Jake Mannix commented on MAHOUT-363:


Shannon, if you could simply add comments to this JIRA ticket instead of 
editing the original ticket itself, we'll be able to follow your thinking more 
easily.  Otherwise, we can't really see what has changed.

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. His guidance in highlighting 
> the finer points of the underlying theory, coupled with my experience in and 
> knowledge of software engineering, makes this is a project we are both 
> extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, functional 
> deliverable. It may not do much, but what it does wil

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855061#action_12855061
 ] 

Shannon Quinn commented on MAHOUT-363:
--

I apologize for that. I actually only changed some of the wording; the overall 
proposal structure wasn't changed. But I will certainly refrain from editing 
the ticket itself.

Are there any other suggestions for making the proposal more viable?

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. His guidance in highlighting 
> the finer points of the underlying theory, coupled with my experience in and 
> knowledge of software engineering, makes this is a project we are both 
> extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, functional 
> deliverable. It may not do much,

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855068#action_12855068
 ] 

Ted Dunning commented on MAHOUT-363:


{quote}
Are there any other suggestions for making the proposal more viable?
{quote}
I think that the proposal as it stands is very strong.

When and if it comes down to doing the work, I will be asking questions like 
whether it is possible to take several eigenvectors at a time instead of just 
one, and how this work relates to the stochastic decomposition work described 
at http://arxiv.org/abs/0909.4061v1 .  In particular, I wonder if you could 
re-use the randomized product in doing subsequent decompositions.

But these kinds of questions aren't necessarily the kind of thing that you need 
to answer in the proposal ... they are just how you get side-tracked after you 
start.  :-)
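
For the curious, "several eigenvectors at a time" can be as simple as block 
power (subspace) iteration started from a random block, which is also the 
flavor of the randomized methods in the paper above. The following is a 
hedged, in-memory sketch under my own naming - dense arrays, a symmetric 
input matrix assumed, and no claim to being Mahout code:

import java.util.Random;

public class SubspaceIterationSketch {

  // Dense product y = a * q, where a is n-by-n and q is n-by-k.
  static double[][] multiply(double[][] a, double[][] q) {
    int n = a.length, k = q[0].length;
    double[][] y = new double[n][k];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < k; j++)
        for (int l = 0; l < n; l++)
          y[i][j] += a[i][l] * q[l][j];
    return y;
  }

  // In-place modified Gram-Schmidt on the columns of y.
  static void orthonormalize(double[][] y) {
    int n = y.length, k = y[0].length;
    for (int j = 0; j < k; j++) {
      for (int p = 0; p < j; p++) {
        double dot = 0.0;
        for (int i = 0; i < n; i++) dot += y[i][p] * y[i][j];
        for (int i = 0; i < n; i++) y[i][j] -= dot * y[i][p];
      }
      double norm = 0.0;
      for (int i = 0; i < n; i++) norm += y[i][j] * y[i][j];
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) y[i][j] /= norm;
    }
  }

  // Returns an n-by-k orthonormal block whose columns approximate the
  // k leading eigenvectors of the symmetric matrix a.
  static double[][] leadingEigenvectors(double[][] a, int k, int iters) {
    Random rnd = new Random(42);
    double[][] q = new double[a.length][k];
    for (double[] row : q)
      for (int j = 0; j < k; j++) row[j] = rnd.nextGaussian();
    orthonormalize(q);
    for (int t = 0; t < iters; t++) {
      q = multiply(a, q);  // push the block toward the leading subspace
      orthonormalize(q);   // keep the columns from collapsing together
    }
    return q;
  }
}

The randomized methods in the paper sharpen exactly this idea: one or two 
such passes over the matrix, plus a small dense factorization, already give 
a very good approximation to the leading subspace.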



> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called eigenflow. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in the eigenflow's 
> halflife, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a prior K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticip

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-12 Thread Shannon Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856229#action_12856229
 ] 

Shannon Quinn commented on MAHOUT-363:
--

I've been getting a lot of good feedback, and after discussing my proposal at 
length with a few people, I'll be submitting changes to the proposal timeline, 
particularly regarding the k-means implementation.

Essentially, I'll cut out what is currently sprint 1, shifting sprints 2-4 back 
two weeks. I'll start immediately on implementing the EigenCuts algorithm, 
relying heavily on test-driven development and full-coverage unit testing to 
highlight any bugs in the EigenCuts implementation. Using the extra two weeks, 
I'll focus on providing more thorough documentation, including detailed specs 
on how to use the algorithm. Several PhD students and professors at CMU with 
whom I am working are interested in a map/reduce implementation of this 
algorithm and have massive datasets readily available; I'll post an example 
dataset, along with instructions on how to feed it to Mahout, and what the 
output should look like.

If there is extra time at the end, I can work on the user interface and data 
visualization.

The schedule aside, there are already a few points of the implementation that I 
can address in slightly greater detail. Regarding the decomposition of the 
similarity matrix and the computation of eigenvectors: this is the 
computationally difficult part, and it can be done in several different ways. A 
stochastic decomposition, as proposed in http://arxiv.org/abs/0909.4061v1 (via 
Ted Dunning), would be one of several approaches, though from a performance 
perspective it looks to be one of the best. For the mechanics of parallelizing 
EigenCuts, the paper from ECML/PKDD 2008, "Parallel spectral clustering" (via 
Isabel Drost), provides an excellent starting point, in addition to using 
Mahout's existing map/reduce eigensolver.

I hope this helps clarify some points and strengthen the overall feasibility of 
the proposal.
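
For reference, the quantities involved, in standard notation (this is my 
sketch of the Chennubhotla & Jepson formulation, so please check the details 
against the paper):

  M = D^{-1} W, \qquad D_{ii} = \sum_j W_{ij}

  \lambda_k^{\beta_k} = \tfrac{1}{2}
    \;\Longrightarrow\;
  \beta_k = \frac{\log(1/2)}{\log \lambda_k}

Here W is the edge-weight (similarity) matrix, D the diagonal degree matrix, 
and M the Markov transition matrix; an eigenflow with eigenvalue \lambda_k 
decays as \lambda_k^t, so \beta_k is its half-life. EigenCuts perturbs the 
weights W_{ij} and cuts the edges where the sensitivity 
\partial \beta_k / \partial W_{ij} reveals a bottleneck in the flow.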

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flow of probability by eigenvector, have an associated 
> half life of flow decay called e

[GSoC 2010] Requesting feedback on my proposal for implementing Neural Network with backpropagation learning

2010-04-06 Thread Zaid Md Abdul Wahab Sheikh
r gradient vector is written back to the FileSystem


** I propose to complete all of the following sub-tasks during GSoC 2010:

Implementation of the Backpropagation algorithm:
- Initialization of weights: using the Nguyen-Widrow algorithm to
select the initial range of starting weight values (a hedged sketch
appears after this list).
- Input, transfer and error functions: implement basic input functions
like WeightedSum and transfer functions like Sigmoid, Gaussian, tanh
and linear. Implement the sum-of-squares error function.
- Optimization methods to update the weights: (a) batch gradient
descent with momentum and a variable learning rate method [2]; (b) a
conjugate gradient method with Brent's line search.

Improving generalization:
- Validating the network to test for overfitting (Early stopping method)
- Regularization (weight decay method)

Create examples for:
- Classification: using the Abalone Data Set from UCI Machine Learning
Repository
- Classification, Regression: Breast Cancer Wisconsin (Prognostic) Data Set

If time permits, also implement:
- Resilient Backpropagation (RPROP)
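
As a reading aid, here is a hedged sketch of two of the pieces named in the 
list above: Nguyen-Widrow initialization and the batch gradient-descent 
update with momentum (with the weight-decay term from the regularization 
item folded in). This is my reconstruction of the cited methods under 
illustrative names, not the project's final code:

import java.util.Random;

public class BackpropSketch {

  // A typical transfer function from the list above.
  static double sigmoid(double x) {
    return 1.0 / (1.0 + Math.exp(-x));
  }

  // Nguyen-Widrow: draw weights uniformly in [-0.5, 0.5], then scale each
  // hidden unit's weight vector to length beta = 0.7 * h^(1/n), where
  // h = number of hidden units and n = number of inputs.
  static double[][] nguyenWidrow(int nInputs, int nHidden, Random rnd) {
    double beta = 0.7 * Math.pow(nHidden, 1.0 / nInputs);
    double[][] w = new double[nHidden][nInputs];
    for (int j = 0; j < nHidden; j++) {
      double norm = 0.0;
      for (int i = 0; i < nInputs; i++) {
        w[j][i] = rnd.nextDouble() - 0.5;
        norm += w[j][i] * w[j][i];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < nInputs; i++) {
        w[j][i] = beta * w[j][i] / norm;
      }
    }
    return w;
  }

  // Batch update: delta(t) = -eta * (grad + lambda * w) + alpha * delta(t-1),
  // where eta is the learning rate, alpha the momentum coefficient and
  // lambda the weight-decay coefficient.
  static void updateWeights(double[][] w, double[][] grad,
                            double[][] prevDelta,
                            double eta, double alpha, double lambda) {
    for (int j = 0; j < w.length; j++) {
      for (int i = 0; i < w[j].length; i++) {
        double delta = -eta * (grad[j][i] + lambda * w[j][i])
                       + alpha * prevDelta[j][i];
        w[j][i] += delta;
        prevDelta[j][i] = delta;
      }
    }
  }
}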


III. Week Plan with list of deliverables


(Till May 23rd, community bonding period)
Brainstorm with my mentor and the Apache Mahout community to come up
with an optimal design for an extensible Neural Network framework.
Code prototypes to identify potential problems and/or investigate new
solutions.
Deliverable: A detailed report or design document on how to implement
the basic Neural Network framework and the learning algorithms.

(May 24th, coding starts) Week 1:
Deliverables: Basic Neural network classes (Neuron, Connection,
Weight, Layer, LearningRule, NeuralNetwork etc) and the various input,
transfer and error functions mentioned previously.

(May 31st) Week 2 and Week 3:
Deliverable: Driver, Mapper, Combiner and Reducer classes with basic
functionality to run a feedforward Neural Network on Hadoop (no
training methods yet; weights are generated using the Nguyen-Widrow
algorithm).

(June 14th) Week 4:
Deliverable: Backpropagation algorithm using standard Batch Gradient descent.

(June 21st) Week 5:
Deliverables: Variable learning rate and momentum during batch
gradient descent. Support for validation tests. Do some larger-scale
tests.

(June 28th) Week 6:
Deliverable: Support for Early stopping and Regularization (weight
decay) during training.

(July 5th) Week 7 and Week 8:
Deliverable: Conjugate gradient method with Brent's line search algorithm.

(July 19th) Week 9:
Deliverable: Write unit tests. Do bigger scale tests for both batch
gradient descent and conjugate gradient method.

(July 26th) Week 10 and Week 11:
Deliverable: 2 examples of classification and regression on real-world
datasets from UCI Machine Learning Repository. More tests.

(August 9th, tentative 'pencils down' date) Week 12:
Deliverable: Wind up the work. Scrub code. Improved documentation,
tutorials (on the wiki) etc.

(August 16: Final evaluation)


IV. Additional Information

I am a final year Computer Science student at NIT Allahabad (India)
graduating in May. For my final year project/thesis, I am working on
Open Domain Question Answering. I participated in GSoC last year for
the Apertium machine translation system
(http://google-opensource.blogspot.com/2009/11/apertium-projects-first-google-summer.html).
I am familiar with the three major open-source Neural Network
frameworks in Java (JOONE, Encog and Neuroph), since I have used them
in past projects on fingerprint recognition and face recognition
(during a summer course on image and speech processing). My research
interests are machine learning and statistical natural language
processing, and I will be enrolling for a Ph.D. next semester (i.e.,
next fall) at the same institute.

I have no specific time constraints throughout the GSoC period. I will
devote a minimum of 6 hours every day to GSoC.
Time offset: UTC+5:30 (IST)


V. References

[1] Fast parallel off-line training of multilayer perceptrons, S
McLoone, GW Irwin - IEEE Transactions on Neural Networks, 1997
[2] Optimization of the backpropagation algorithm for training
multilayer perceptrons, W. Schiffmann, M. Joost and R. Werner, 1994
[3] Map-Reduce for Machine Learning on Multicore, Cheng T. Chu, Sang
K. Kim, Yi A. Lin, et al - in NIPS, 2006
[4] Neural networks for pattern recognition, CM Bishop - 1995 [BOOK]


[jira] Created: (MAHOUT-374) GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342)

2010-04-09 Thread Yinghua Hu (JIRA)
GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342) 
-

 Key: MAHOUT-374
 URL: https://issues.apache.org/jira/browse/MAHOUT-374
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
 Environment: Ubuntu 9.10
Reporter: Yinghua Hu




* Introduction

In recent years, multicore computers have become mainstream. However, the 
potential of multicore has not been fully exploited for machine learning 
because of the lack of good programming frameworks for multicore. Recently, 
Chu et al. [CHU07] adapted Google's map-reduce paradigm [DEA04] and implemented 
10 different machine learning algorithms on a multicore processor. Their 
results show almost double speedup on average with a dual processor. Their 
work has inspired a lot of interesting projects under Mahout at the Apache 
Software Foundation.

An artificial neural network (ANN) is a supervised learning tool inspired by 
the biological nervous system. It can capture and represent complex 
input/output relationships. A neural network has the capability to represent 
both linear and non-linear relationships by learning the relationships 
directly from the data being modeled. Artificial neural networks have been 
widely applied for classification, data processing and function approximation.

We propose to implement an artificial neural network with back-propagation 
under the map-reduce framework. Successful delivery should include a fast 
neural network customized for multicore computers.



* Methodology

In this section, we briefly introduce map-reduce and the back-propagation 
neural network.

- Map-reduce

Map-reduce [DEA04] is a programming model developed by Google. It gets its name 
from the map and reduce primitives in functional languages such as Lisp. Its 
inventors realized that most of their distributed computations involve applying 
a map operation and a reduce operation, where the map operation computes a set 
of intermediate key/value pairs for each logical record in the input, and the 
reduce operation is applied to all the values that share the same key in order 
to combine the derived data appropriately. The reduce process allows users to 
handle value lists too large to fit in memory.

Map-reduce makes it possible to have a simple interface enabling automatic 
parallelization and distribution of large-scale computation. The programming 
model can be applied on multicore personal computers as well as large clusters 
of commodity PCs.

Here is an example of map-reduce from [DEA04]. To count the number of 
occurrences of each word in a large collection of documents, the user can 
write something like the pseudocode below:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The map function above emits each word and its associated count (just one per 
occurrence in this example). The reduce function sums together all counts 
emitted for a particular word.

The map-reduce model has been successfully applied to a large variety of 
problems, such as Google's Web search service, sorting, data mining, etc. It 
helps to reduce the amount of data sent across the network. In addition, its 
ease of use makes parallel and distributed computing achievable even for 
inexperienced programmers.

[CHU07] provides a multicore map-reduce framework that is capable of 
implementing algorithms fitting the Statistical Query Model. The neural 
network is one of the ten machine learning algorithms fitting the requirement.
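
For concreteness, a hedged sketch of the summation form that makes this 
possible: any statistic that is a sum over training examples splits into 
per-shard partial sums (the map step) plus one final addition (the reduce 
step). The linear-unit gradient and all names below are illustrative 
assumptions, not code from [CHU07]:

import java.util.List;

public class SummationFormSketch {

  // Map: partial gradient of the squared error 0.5 * (w.x - y)^2, summed
  // over one shard of examples. Each example is encoded as
  // [x_0, ..., x_{d-1}, y].
  static double[] partialGradient(List<double[]> shard, double[] w) {
    double[] g = new double[w.length];
    for (double[] ex : shard) {
      double dot = 0.0;
      for (int i = 0; i < w.length; i++) dot += w[i] * ex[i];
      double err = dot - ex[w.length]; // prediction minus target
      for (int i = 0; i < w.length; i++) g[i] += err * ex[i];
    }
    return g;
  }

  // Reduce: the full-batch gradient is just the sum of the shard results.
  static double[] sum(List<double[]> partials, int d) {
    double[] g = new double[d];
    for (double[] p : partials)
      for (int i = 0; i < d; i++) g[i] += p[i];
    return g;
  }
}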

- Neural Network

The neural network is one of many modern technologies inspired by biology. It 
is a model that can perform simple computations such as linear regression as 
well as very complex nonlinear calculations. Without doubt, the neural network 
is one of the most popular machine learning methodologies.

The simplest artificial neural network is a single-layer perceptron as shown in 
Figure 1.  It contains a single layer of adaptive weights.  The output is 
calculated by applying an activation function f to the weighted sum of inputs:

http://www.cs.ucf.edu/~yhu/gsoc/formula1.JPG (1)
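
(Since formula (1) is only available as a linked image, here is the standard 
single-layer perceptron output it presumably shows - my reconstruction, with 
w_0 an optional bias term:

  y = f\Big( \sum_{j=1}^{d} w_j x_j + w_0 \Big)

where f is the activation function and the w_j are the adaptive weights.)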

http://www.cs.ucf.edu/~yhu/gsoc/fig1.JPG
Figure 1 - Single-layer Perceptron

The limitation of a single-layer perceptron is that it can only model linear 
relationships between an output and the inputs. This is overcome by 
multi-layer perceptrons, which contain several layers of adaptive weights. In 
Figure 2, a hidden layer was added to the single-layer perceptron of Figure 1 
to form a two-layer perceptron. If there is an activation function g for the fir