Re: Mahout GSoC 2010: Association Mining

2010-04-13 Thread Neal Clark
I am interested in contributing. The next two weeks will be a little busy, as it is end of term, but I would be more than happy to work over the summer on this project. Perhaps you can also give me some advice on how to accomplish a few tasks. Currently I am using NLineInputFormat to create/config

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-12 Thread Shannon Quinn (JIRA)
ope this helps clarify some points and strengthen the overall feasibility of the proposal. > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) > -- > > Key: MAHOUT-363 >

Re: Mahout GSoC 2010: Association Mining

2010-04-10 Thread Robin Anil
Like Ted said, it's a bit late for a GSOC proposal, but I am excited at the possibility of improving the frequent pattern mining package. Check out the current Parallel FPGrowth implementation in the code; you can find more explanation of its usage on the Mahout wiki. Apriori should be trivially paralleliz

Re: Mahout GSoC 2010: Association Mining

2010-04-09 Thread Ted Dunning
Neal, I think that this might well be a useful contribution to Mahout, but, if I am not mistaken, I think that the deadline for student proposals for GSoC has just passed. That likely means that making this contribution an official GSoC project is not possible. I am sure that the Mahout community

Mahout GSoC 2010: Association Mining

2010-04-09 Thread Neal Clark
Hello, I just wanted to introduce myself. I am an MSc Computer Science student at the University of Victoria. My research over the past year has been focused on developing and implementing an Apriori-based frequent item-set mining algorithm for mining large data sets at low support counts. https:

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
Ted, do you think you can give some good links to papers or other resources about the mentioned approaches? I would like to look at them after the weekend. As far as I can see, association mining (and the GUHA method in its original form) is not meant to be a predictive method but rather data explora

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Ted Dunning
Lukas, The strongest alternative for this kind of application (and the normal choice for large scale applications) is on-line gradient descent learning with an L_1 or L_1 + L_2 regularization. The typical goal is to predict some outcome (click or purchase or signup) from a variety of large vocabu

[jira] Created: (MAHOUT-374) GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342)

2010-04-09 Thread Yinghua Hu (JIRA)
GSOC 2010 Proposal Implement Map/Reduce Enabled Neural Networks (mahout-342) - Key: MAHOUT-374 URL: https://issues.apache.org/jira/browse/MAHOUT-374 Project: Mahout

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
hod well but then again I understood Ted's LLR after some deep reading. Could you put up an interesting example to help us understand this method. Maybe starting from a transaction of shopping cart item ? A great demo is

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Lukáš Vlček
> > > > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček wrote: > > > >> Hello, > >> > >> I would like to apply for Mahout GSoC 2010. My proposal is to implement > >> Association Mining algorithm utilizing existing PFPGrowth imp

Re: [GSOC] 2010 Timelines

2010-04-09 Thread Isabel Drost
Timeline including Apache internal deadlines: http://cwiki.apache.org/confluence/display/COMDEVxSITE/GSoC Mentors, please also click on the ranking link to the ranking explanation [1] for more information on how to rank student proposals. Isabel [1] http://cwiki.apache.org/confluence/display

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Ted Dunning (JIRA)
are just how you get side-tracked after you start. :-) > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) > -- > > Key: MAHOUT-363 > URL: https://issues.apache.or

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Shannon Quinn (JIRA)
nged some of the wording; the overall proposal structure wasn't changed. But I will certainly refrain from editing the ticket itself. Are there any other suggestions for making the proposal more viable? > Proposal for GSoC 2010 (EigenCuts clustering algorith

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Jake Mannix (JIRA)
add comments to this JIRA ticket, instead of editing the original ticket itself, we'll be able to more easily follow your thinking. Otherwise, we can't really see what has changed. > Proposal for GSoC 2010 (EigenCuts clustering alg

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-08 Thread Shannon Quinn (JIRA)
pson. Half-Lives of EigenFlows for Spectral Clustering. NIPS 2002. > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) > -- > > Key: MAHOUT-363 > URL: https://iss

[GSoC 2010] Proposal to implement SimHash clustering for Mahout

2010-04-07 Thread Cristian Prodan
Hello, I'm posting a draft for my proposal for this year's GSoC. I kindly ask for your feedback on it. I have also posted a JIRA ticket with it: https://issues.apache.org/jira/browse/MAHOUT-365 . Thank you in advance. Cristi. -

[GSoC 2010] Requesting feedback on my proposal for implementing Neural Network with backpropagation learning

2010-04-06 Thread Zaid Md Abdul Wahab Sheikh
a partial batch gradient. In the Reducer class: - There's a single reducer class that will combine all the partial gradients from the Mappers to get the overall batch gradient. - The final error gradient vector is written back to the FileSystem ** I propose to complete all of the following s

Re: [GSOC] 2010 Timelines

2010-04-06 Thread Robin Anil
2 days to go till the close of student submissions. A request to mentors: please provide feedback on all the queries on the list so that students can work on tuning their proposals. Robin On Sat, Apr 3, 2010 at 10:50 PM, Grant Ingersoll wrote: > > http://socghop.appspot.com/document/show/gsoc_pr

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
> this method. Maybe starting from a transaction of shopping cart item ? A > great demo is big plus for a GSOC project. > > Robin > > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček wrote: > >> Hello, >> >> I would like to apply for Mahout GSoC 2010. My prop

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
lp us understand this method. Maybe starting from a transaction of shopping cart item ? A great demo is big plus for a GSOC project. Robin On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček wrote: > Hello, > > I would like to apply for Mahout GSoC 2010. My proposal is to implement > Association

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Shannon Quinn (JIRA)
ures, given its ease of implementation. That's just my explanation; if you feel otherwise I'm happy to adjust my proposal :) > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) > -- > >

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Robin Anil (JIRA)
hout code. I believe the k-means you are looking to implement is already there; it will shave 2 weeks off your GSOC :). Reading the code/wiki is a great exercise for you to be more realistic in your proposal. > Proposal for GSoC 2010 (EigenCuts clustering algorithm for

[jira] Updated: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Shannon Quinn (JIRA)
ave limited experience with Apache Mahout and Hadoop, but with an undergraduate computer science degree from Georgia Tech, and after an internship with IBM ExtremeBlue, I feel I am extremely adept at picking up new frameworks quickly. References [1] Chakra Chennubhotla and Allan D. Jepson. Half-Liv

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Jake Mannix (JIRA)
for a GSoC project. I wish I had the time to help with mentoring this project, in fact. > Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) > -- > > Key: MAHOUT-363 >

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Shannon Quinn (JIRA)
Hama would certainly improve the feasibility of the project timeline and allow me to further refine the overall algorithm. I will absolutely adhere to your advice; I'll edit this ticket and my GSoC application. Thank you again! > Proposal for GSoC 2010 (EigenCuts clustering algorith

[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Ted Dunning (JIRA)
ived. One tiny suggestion is to omit Hama from your plan as it would just be a distraction for you. The Hama project is pretty much independent of Mahout and there hasn't been any contribution in the H->M direction. > Proposal for GSoC 2010 (EigenCuts clustering algor

[jira] Created: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-04 Thread Shannon Quinn (JIRA)
Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout) -- Key: MAHOUT-363 URL: https://issues.apache.org/jira/browse/MAHOUT-363 Project: Mahout Issue Type: Task

[GSOC] 2010 Timelines

2010-04-03 Thread Grant Ingersoll
http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline

Re: My ideas for GSoC 2010

2010-04-02 Thread Tanya Gupta
Hi, there are a couple of questions I would like to ask. 1. What type of clustering would you like me to use? Is K-Means good enough? 2. Can you tell me more about the map reduce code that you would have written? Or do I first need to implement that as well using Hadoop? Thanking You Tanya On Th

Re: My ideas for GSoC 2010

2010-04-01 Thread Cristian Prodan
Thanks Robin, I will try to have a look at that. Cristi. On Thu, Apr 1, 2010 at 9:36 AM, Robin Anil wrote: > Why don't you try it on 20 newsgroups. There are about 17-18 unique topics > and a couple of overlapping ones. You can easily find issues with the > clustering code with that dataset. Once it's

Re: My ideas for GSoC 2010

2010-03-31 Thread Robin Anil
Why don't you try it on 20 newsgroups? There are about 17-18 unique topics and a couple of overlapping ones. You can easily find issues with the clustering code with that dataset. Once it's done you can try bigger datasets like Wikipedia. Robin On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan wrote:

Re: My ideas for GSoC 2010

2010-03-31 Thread Cristian Prodan
Hi, Can anyone please point me to a good data set on which I might try SimHash clustering? Thank you, Cristi On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan wrote: > Hello again, > > First of all, thank you all for taking time to answer my ideas. Based on > your thoughts, I have been digging arou

Re: Application for GSOC 2010

2010-03-31 Thread Grant Ingersoll
: > >> Hi >> >> I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply? >> >> Thanking You >> Tanya >>

Re: Application for GSOC 2010

2010-03-31 Thread Ted Dunning
> > I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply? > > Thanking You > Tanya >

Re: GSOC 2010

2010-03-31 Thread Robin Anil
Hi Tanya, MAHOUT-328 is just a general stub. There is no detailed project description other than what is given there. The idea is we let you propose to implement a clustering algorithm in Mahout. Start here http://cwiki.apache.org/MAHOUT/gsoc.html. Browse through the Wiki. Look at what

GSOC 2010

2010-03-31 Thread Tanya Gupta
Hi I would like a detailed project description for MAHOUT-328. Thanking You Tanya Gupta

Application for GSOC 2010

2010-03-31 Thread Tanya Gupta
Hi I want to work under MAHOUT-328 for my GSOC 2010 project. How do I apply? Thanking You Tanya

Mahout GSoC 2010 proposal: Association Mining

2010-03-28 Thread Lukáš Vlček
Hello, I would like to apply for Mahout GSoC 2010. My proposal is to implement an Association Mining algorithm utilizing the existing PFPGrowth implementation ( http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html). As for the Association Mining I would like to implement a very general

Re: My ideas for GSoC 2010

2010-03-23 Thread cristi prodan
Hello again, First of all, thank you all for taking time to answer my ideas. Based on your thoughts, I have been digging around, and the project I would very much like to propose is a MapReduce implementation of the simhash algorithm. Thank you Ted for the great idea! I'm keen on starting impl

Re: My ideas for GSoC 2010

2010-03-22 Thread Ankur C. Goel
Since Sean already answered IDEA-2, I'll reply to IDEA 1. Minhash (and Shingling in general) are very efficient clustering techniques that have traditionally been employed by Search engines for near-duplicate detection of web documents. They are known to be efficient and effective at web-scale.

Re: My ideas for GSoC 2010

2010-03-19 Thread Ted Dunning
Minhash clustering is important for duplicate detection. You can also do simhash clustering, which might be simpler to implement. I can imagine an implementation where the map generates the simhash and emits multiple copies keyed on

Re: My ideas for GSoC 2010

2010-03-19 Thread Sean Owen
I think that's a fine project indeed. It sounds even a little ambitious for a GSoC project. Understanding, implementing, and parallelizing this approach is not trivial. If you want to propose it, sure, but scaling it back a little is probably OK too. As always it's best to propose a simple project

My ideas for GSoC 2010

2010-03-19 Thread cristi prodan
Dear Mahout community, My name is Cristi Prodan, I'm 23 years old and currently a 2nd year student pursuing an MSc degree in Computer Science. I started studying machine learning in the past year and during my research I found out about the MapReduce model. Then, I discovered Hadoop and Mahout. I wa

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread Ted Dunning
> > > > On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong wrote: > > > > > Hi all, > > > The updated proposal for GSoC 2010 is as follows, any comment is > welcome. > > > >>>>>>>>>>>>>>>>>>

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread zhao zhendong
ri, Mar 12, 2010 at 4:18 PM, zhao zhendong wrote: > > > Hi all, > > The updated proposal for GSoC 2010 is as follows, any comment is welcome. > > >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Re: Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread Robin Anil
change your current timeline to accurately reflect that. I will post more queries about the design choice later. Robin On Fri, Mar 12, 2010 at 4:18 PM, zhao zhendong wrote: > Hi all, > The updated proposal for GSoC 2010 is as follows, any comment is w

Updated Proposal (LIBLINEAR on Mahout) for GSoC 2010

2010-03-12 Thread zhao zhendong
Hi all, The updated proposal for GSoC 2010 is as follows, any comment is welcome. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Re: Have Mahout applied GSOC 2010?

2010-03-09 Thread Grant Ingersoll
On Mar 9, 2010, at 12:27 PM, zhao zhendong wrote: > Hi Robin Ted and Grant, > > Thank you very much. > > To Grant: > One more thing, could you please tell us the link of "archives" you > mentioned before? There's a bunch of 'em, but my personal fav. is http://search.lucidimagination.com ;-)

Re: Have Mahout applied GSOC 2010?

2010-03-09 Thread zhao zhendong
nt > > On Mar 8, 2010, at 1:24 PM, Ted Dunning wrote: > > > Apache is definitely going to participate. If Mahout gets strong > > candidates, we would probably get one or more slots. > > > > On Mon, Mar 8, 2010 at 10:06 AM, zhao zhendong wrote: > >

Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Grant Ingersoll
>> Robin told me Mahout gonna apply GSOC 2010 as a mentor. Can anybody tell >> me >> the answer? I really appreciate this chance. >> > > > > -- > Ted Dunning, CTO > DeepDyve

Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Ted Dunning
Apache is definitely going to participate. If Mahout gets strong candidates, we would probably get one or more slots. On Mon, Mar 8, 2010 at 10:06 AM, zhao zhendong wrote: > Robin told me Mahout gonna apply GSOC 2010 as a mentor. Can anybody tell > me > the answer? I really a

Re: Have Mahout applied GSOC 2010?

2010-03-08 Thread Robin Anil
Ian will be able to comment better on the topic of ASF quota etc. Short answer, Yes, ASF will be applying as an organization and Mahouters will be applying as mentors Regards Robin On Mon, Mar 8, 2010 at 11:36 PM, zhao zhendong wrote: > Hi > > Robin told me Mahout gonna apply GSOC

Have Mahout applied GSOC 2010?

2010-03-08 Thread zhao zhendong
Hi Robin told me Mahout gonna apply GSOC 2010 as a mentor. Can anybody tell me the answer? I really appreciate this chance. Thanks, -- - Zhen-Dong Zhao (Maxim)

Re: GSOC 2010 is here

2010-02-02 Thread Isabel Drost
On Mon Robin Anil wrote: > 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks > are taking in GSOC students) I guess one could easily split this one in two: a) Using UIMA (whole pipeline or just the analysers if that is possible) for data pre-processing before Mahout algorithms

Re: GSOC 2010 is here

2010-02-01 Thread Robin Anil
Some more Wild and Wacky Ideas. Might be out of scope for GSOC, but are nice-to-have features for Mahout. I would like to encourage all of you to put down your ideas here. 1. Data Visualization tool backed with HDFS/HBase for inspecting clusters, Topic model etc. - It could have many map/redu

Re: GSOC 2010 is here

2010-02-01 Thread Isabel Drost
On Wed Robin Anil wrote: > Greetings! Fellow GSOC alums, administrators and dear mentors, the > next edition is right here. Details are given in the link below. > > https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f Some additional notes to commit

GSOC 2010 is here

2010-01-26 Thread Robin Anil
Greetings! Fellow GSOC alums, administrators and dear mentors, the next edition is right here. Details are given in the link below. https://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f Maybe we could identify key areas in Mahout which we need to deve