Re: Mahout GSoC 2010: Association Mining

2010-04-13 Thread Neal Clark
I am interested in contributing. The next two weeks will be a little busy, as it is end of term, but I would be more than happy to work over the summer on this project. Perhaps you can also give me some advice on how to accomplish a few tasks. Currently I am using NLineInputFormat to

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-12 Thread Shannon Quinn (JIRA)
the overall feasibility of the proposal. Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363 URL: https://issues.apache.org/jira/browse/MAHOUT-363 Project

Re: Mahout GSoC 2010: Association Mining

2010-04-10 Thread Robin Anil
Like Ted said, it's a bit late for a GSOC proposal, but I am excited at the possibility of improving the frequent pattern mining package. Check out the current Parallel FPGrowth implementation in the code; you can find more explanation of its usage on the Mahout wiki. Apriori should be trivially

Re: [GSOC] 2010 Timelines

2010-04-09 Thread Isabel Drost
Timeline including Apache internal deadlines: http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC Mentors, please also click on the ranking link to the ranking explanation [1] for more information on how to rank student proposals. Isabel [1]

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
like to apply for Mahout GSoC 2010. My proposal is to implement Association Mining algorithm utilizing existing PFPGrowth implementation (http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). As for the Association Mining I would like to implement a very general algorithm

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
plus for a GSOC project. Robin On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hello, I would like to apply for Mahout GSoC 2010. My proposal is to implement Association Mining algorithm utilizing existing PFPGrowth implementation

[jira] Created: (MAHOUT-374) GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342)

2010-04-09 Thread Yinghua Hu (JIRA)
GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342) - Key: MAHOUT-374 URL: https://issues.apache.org/jira/browse/MAHOUT-374 Project: Mahout

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Ted Dunning
Lukas, The strongest alternative for this kind of application (and the normal choice for large scale applications) is on-line gradient descent learning with an L_1 or L_1 + L_2 regularization. The typical goal is to predict some outcome (click or purchase or signup) from a variety of large

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Ted, do you think you can give some good links to papers or other resources about the mentioned approaches? I would like to look at them after the weekend. As far as I can see, association mining (and the GUHA method in its original form) is not meant to be a predictive method but rather data

Mahout GSoC 2010: Association Mining

2010-04-09 Thread Neal Clark
Hello, I just wanted to introduce myself. I am an MSc Computer Science student at the University of Victoria. My research over the past year has been focused on developing and implementing an Apriori-based frequent item-set mining algorithm for mining large data sets at low support counts.

Re: Mahout GSoC 2010: Association Mining

2010-04-09 Thread Ted Dunning
Neal, I think that this might well be a useful contribution to Mahout, but, if I am not mistaken, I think that the deadline for student proposals for GSoC has just passed. That likely means that making this contribution an official GSoC project is not possible. I am sure that the Mahout

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Shannon Quinn (JIRA)
for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363 URL: https://issues.apache.org/jira/browse/MAHOUT-363 Project: Mahout Issue Type: Task

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Jake Mannix (JIRA)
comments to this JIRA ticket, instead of editing the original ticket itself, we'll be able to more easily follow your thinking. Otherwise, we can't really see what has changed. Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Shannon Quinn (JIRA)
some of the wording; the overall proposal structure wasn't changed. But I will certainly refrain from editing the ticket itself. Are there any other suggestions for making the proposal more viable? Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Ted Dunning (JIRA)
how you get side-tracked after you start. :-) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363 URL: https://issues.apache.org/jira/browse/MAHOUT-363

[GSoC 2010] Proposal to implement SimHash clustering for Mahout

2010-04-07 Thread Cristian Prodan
Hello, I'm posting a draft for my proposal for this year's GSoC. I kindly ask for your feedback on it. I have also posted a JIRA ticket with it: https://issues.apache.org/jira/browse/MAHOUT-365 . Thank you in advance. Cristi.

[GSoC 2010] Requesting feedback on my proposal for implementing Neural Network with backpropagation learning

2010-04-06 Thread Zaid Md Abdul Wahab Sheikh
gradient. In the Reducer class: - There's a single reducer class that will combine all the partial gradients from the Mappers to get the overall batch gradient. - The final error gradient vector is written back to the FileSystem ** I propose to complete all of the following sub-tasks during GSoC
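The mapper/reducer split described in this snippet can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `partial_gradient` and `combine_gradients` are hypothetical, the model is a single linear unit with squared error rather than the full backpropagation network of the proposal, and plain Python lists stand in for Hadoop splits and Mahout vectors.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Sequence[float], float]  # (features, target)


def partial_gradient(weights: Sequence[float], split: Sequence[Example]) -> Vector:
    """Mapper side (hypothetical): squared-error gradient of one linear unit over one split."""
    grad = [0.0] * len(weights)
    for features, target in split:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        for i, x in enumerate(features):
            grad[i] += error * x
    return grad


def combine_gradients(partials: Sequence[Vector]) -> Vector:
    """Reducer side (hypothetical): element-wise sum of the partial gradients."""
    return [sum(column) for column in zip(*partials)]


# Two input splits, as if handled by two mappers.
weights = [0.0, 0.0]
splits = [
    [([1.0, 2.0], 1.0)],
    [([3.0, 4.0], 2.0), ([0.0, 1.0], 0.0)],
]
batch_gradient = combine_gradients([partial_gradient(weights, s) for s in splits])
```

Because the batch gradient is a sum over examples, summing the per-split gradients in a single reducer gives the same vector as one pass over all the data.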

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Shannon Quinn (JIRA)
computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I feel I am extremely adept at picking up new frameworks quickly. References [1] Chakra Chennubhotla and Allan D. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002. Proposal for GSoC 2010

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Robin Anil (JIRA)
code. I believe the k-means you are looking to implement is already there; it will shave 2 weeks off your GSOC :). Reading the code/wiki is a great exercise for you to be more realistic in your proposal. Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Shannon Quinn (JIRA)
, given its ease of implementation. That's just my explanation; if you feel otherwise I'm happy to adjust my proposal :) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
understand this method. Maybe starting from a transaction of shopping cart items? A great demo is a big plus for a GSOC project. Robin On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hello, I would like to apply for Mahout GSoC 2010. My proposal is to implement Association

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
this method. Maybe starting from a transaction of shopping cart items? A great demo is a big plus for a GSOC project. Robin On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hello, I would like to apply for Mahout GSoC 2010. My proposal is to implement Association Mining

[jira] Created: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Shannon Quinn (JIRA)
Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363 URL: https://issues.apache.org/jira/browse/MAHOUT-363 Project: Mahout Issue Type: Task

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Shannon Quinn (JIRA)
would certainly improve the feasibility of the project timeline and allow me to further refine the overall algorithm. I will absolutely adhere to your advice; I'll edit this ticket and my GSoC application. Thank you again! Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Jake Mannix (JIRA)
for a GSoC project. I wish I had the time to help with mentoring this project, in fact. Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363 URL: https

[GSOC] 2010 Timelines

2010-04-03 Thread Grant Ingersoll
http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline

Re: My ideas for GSoC 2010

2010-04-01 Thread Cristian Prodan
Hi, Can anyone please point me to a good data set on which I might try SimHash clustering? Thank you, Cristi On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan prodan.crist...@gmail.com wrote: Hello again, First of all, thank you all for taking time to answer my ideas. Based on your thoughts, I

Re: My ideas for GSoC 2010

2010-04-01 Thread Robin Anil
Why don't you try it on 20 newsgroups? There are about 17-18 unique topics and a couple of overlapping ones. You can easily find issues with the clustering code with that dataset. Once it's done you can try bigger datasets like Wikipedia. Robin On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan

Re: My ideas for GSoC 2010

2010-04-01 Thread Cristian Prodan
Thanks Robin, I will try to have a look at that. Cristi. On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil robin.a...@gmail.com wrote: Why don't you try it on 20 newsgroups? There are about 17-18 unique topics and a couple of overlapping ones. You can easily find issues with the clustering code with that

Application for GSOC 2010

2010-03-31 Thread Tanya Gupta
Hi, I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply? Thanking You Tanya

GSOC 2010

2010-03-31 Thread Tanya Gupta
Hi I would like a detailed project description for MAHOUT-328. Thanking You Tanya Gupta

Re: GSOC 2010

2010-03-31 Thread Robin Anil
Hi Tanya, MAHOUT-328 is just a general stub. There is no detailed project description other than what is given there. The idea is we let you propose to implement a clustering algorithm in Mahout. Start here http://cwiki.apache.org/MAHOUT/gsoc.html. Browse through the Wiki. Look at

Re: Application for GSOC 2010

2010-03-31 Thread Ted Dunning
: Hi, I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply? Thanking You Tanya

Re: Application for GSOC 2010

2010-03-31 Thread Grant Ingersoll
GSOC 2010 project. How do I apply? Thanking You Tanya

Mahout GSoC 2010 proposal: Association Mining

2010-03-28 Thread Lukáš Vlček
Hello, I would like to apply for Mahout GSoC 2010. My proposal is to implement Association Mining algorithm utilizing existing PFPGrowth implementation (http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). As for the Association Mining I would like to implement a very general

Re: My ideas for GSoC 2010

2010-03-22 Thread Ankur C. Goel
Since Sean already answered IDEA-2, I'll reply to IDEA 1. Minhash (and Shingling in general) are very efficient clustering techniques that have traditionally been employed by Search engines for near-duplicate detection of web documents. They are known to be efficient and effective at

My ideas for GSoC 2010

2010-03-19 Thread cristi prodan
Dear Mahout community, My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student pursuing an MSc degree in Computer Science. I started studying machine learning in the past year and during my research I found out about the MapReduce model. Then, I discovered Hadoop and Mahout. I

Re: My ideas for GSoC 2010

2010-03-19 Thread Sean Owen
I think that's a fine project indeed. It sounds even a little ambitious for a GSoC project. Understanding, implementing, and parallelizing this approach is not trivial. If you want to propose it, sure, but scaling it back a little is probably OK too. As always it's best to propose a simple project

Re: My ideas for GSoC 2010

2010-03-19 Thread Ted Dunning
Minhash clustering is important for duplicate detection. You can also do simhash clustering (http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf), which might be simpler to implement. I can imagine an implementation where the map generates the simhash and emits multiple copies keyed on
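For readers unfamiliar with simhash, the fingerprint that the map step would generate can be sketched as below. This is a minimal illustration, not the implementation discussed in the thread: the function names are hypothetical, MD5 is an arbitrary choice of per-token hash, tokens are unweighted, and the keying scheme for the emitted copies is not shown because the snippet is cut off before describing it.

```python
import hashlib
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Unweighted simhash fingerprint; bits must be a multiple of 8, at most 128."""
    counts = [0] * bits
    for token in tokens:
        # Assumed per-token hash: the leading bytes of an MD5 digest.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents that share most of their tokens tend to get fingerprints with a small Hamming distance, which is the property near-duplicate detection relies on.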

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread Robin Anil
to change your current timeline to accurately reflect that. I will post more queries about the design choice later. Robin On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong zhaozhend...@gmail.com wrote: Hi all, The updated proposal for GSoC 2010 is as follows, any comment is welcome. Title/Summary

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread Ted Dunning
all, The updated proposal for GSoC 2010 is as follows, any comment is welcome. Title/Summary: Linear SVM Package (LIBLINEAR) for Mahout Student: Zhen-Dong Zhao Student e-mail: zha...@comp.nus.edu.sg Student Major: Multimedia Information Retrieval / Computer Science Student Degree

Re: Have Mahout applied GSOC 2010?

2010-03-09 Thread zhao zhendong
Dunning wrote: Apache is definitely going to participate. If Mahout gets strong candidates, we will probably get one or more slots. On Mon, Mar 8, 2010 at 10:06 AM, zhao zhendong zhaozhend...@gmail.com wrote: Robin told me Mahout gonna apply GSOC 2010 as a mentor. Can anybody

Re: Have Mahout applied GSOC 2010?

2010-03-09 Thread Grant Ingersoll
On Mar 9, 2010, at 12:27 PM, zhao zhendong wrote: Hi Robin Ted and Grant, Thank you very much. To Grant: One more thing, could you please tell us the link of archives you mentioned before? There's a bunch of 'em, but my personal fav. is http://search.lucidimagination.com ;-) Just

Have Mahout applied GSOC 2010?

2010-03-08 Thread zhao zhendong
Hi Robin told me Mahout gonna apply GSOC 2010 as a mentor. Can anybody tell me the answer? I really appreciate this chance. Thanks, -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National

Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Robin Anil
GSOC 2010 as a mentor. Can anybody tell me the answer? I really appreciate this chance. Thanks, -- - Zhen-Dong Zhao (Maxim) Department of Computer Science School of Computing National University of Singapore

Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Ted Dunning
Apache is definitely going to participate. If Mahout gets strong candidates, we will probably get one or more slots. On Mon, Mar 8, 2010 at 10:06 AM, zhao zhendong zhaozhend...@gmail.com wrote: Robin told me Mahout gonna apply GSOC 2010 as a mentor. Can anybody tell me the answer? I

Re: GSOC 2010 is here

2010-02-02 Thread Isabel Drost
On Mon Robin Anil robin.a...@gmail.com wrote: 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks are taking in GSOC students) I guess one could easily split this one in two: a) Using UIMA (whole pipeline or just the analysers if that is possible) for data pre-processing

Re: GSOC 2010 is here

2010-02-01 Thread Isabel Drost
On Wed Robin Anil robin.a...@gmail.com wrote: Greetings! Fellow GSOC alums, administrators and dear mentors, the next edition is right here. Details are given in the link below. https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f Some

Re: GSOC 2010 is here

2010-02-01 Thread Robin Anil
Some more Wild and Wacky Ideas. They might be out of scope for GSOC, but are nice-to-have features for Mahout. I would like to encourage all of you to put down your ideas here. 1. Data Visualization tool backed with HDFS/HBase for inspecting clusters, Topic models, etc. - It could have many

GSOC 2010 is here

2010-01-26 Thread Robin Anil
Greetings! Fellow GSOC alums, administrators and dear mentors, the next edition is right here. Details are given in the link below. https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f Maybe we could identify key areas in Mahout which we need to