Re: [gsoc] random forests

2009-04-01 Thread deneche abdelhakim
Should be visible now on both the Gsoc app. and the Apache Wiki --- On Wed 1.4.09, Ted Dunning wrote: > From: Ted Dunning > Subject: Re: [gsoc] random forests > To: mahout-dev@lucene.apache.org > Date: Wednesday 1 April 2009, 3:30 > Deneche, > > I don't see your application on the GSOC

Re: [gsoc] Collaborative filtering algorithms

2009-04-01 Thread Atul Kulkarni
Thanks David, that helped. On Wed, Apr 1, 2009 at 1:47 AM, David Hall wrote: > On Tue, Mar 31, 2009 at 11:43 PM, Atul Kulkarni > wrote: > > questions in line. > > > > On Wed, Apr 1, 2009 at 1:27 AM, Ted Dunning > wrote: > > > >> Nobody is working on SVD yet, but one GSOC applicant has said th

Mahout Recommendation for MySQL Data (Network Trending)

2009-04-01 Thread Tim Bass
We are thinking about installing Mahout on an Ubuntu server with a MySQL database full of time series data about network and server events. We want to mine the data and set up a statistical baseline for "normal" and "abnormal" patterns. Is there an algorithm or method in Mahout ready to test

Re: Schema to store graph

2009-04-01 Thread Edward J. Yoon
Let's assume the graph looks like the one below:
1 - 2 - 3
| /
4
We can now represent it as:
  | 1 2 3 4
--+--------
1 | 0 1 0 1
2 | 1 0 1 1
3 | 0 1 0 0
4 | 1 1 0 0
We don't need to store the zeros; HBase is ideal for storing sparse matrices. So it can be simply implemented usi
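The sparse-storage idea in this message can be sketched in a few lines. This is only an illustration under stated assumptions, not Edward's schema: a plain Python dict of sets stands in for an HBase table (row key = vertex, stored qualifiers = neighbours), the helper names `add_edge`, `build_graph` and `to_dense` are hypothetical, and the edge list reads the "/" in the diagram as a link between 2 and 4.

```python
from collections import defaultdict

# Assumed edge list for the small graph in the message:
# 1-2, 2-3, 1-4, plus the diagonal read as 2-4 (an assumption).
EDGES = [(1, 2), (2, 3), (1, 4), (2, 4)]


def add_edge(table, v, w):
    """Record an undirected edge by touching only rows v and w."""
    table[v].add(w)
    table[w].add(v)


def build_graph(edges):
    """Build the sparse table: one row per vertex, only non-zero cells kept."""
    table = defaultdict(set)
    for v, w in edges:
        add_edge(table, v, w)
    return table


def to_dense(table, vertices):
    """Expand the sparse rows back into a full 0/1 adjacency matrix."""
    return [[1 if w in table.get(v, ()) else 0 for w in vertices]
            for v in vertices]


graph = build_graph(EDGES)

if __name__ == "__main__":
    for v, row in zip([1, 2, 3, 4], to_dense(graph, [1, 2, 3, 4])):
        print(v, "|", *row)
```

Only the eight non-zero cells are stored, and adding an edge updates exactly two rows, which is the property that makes a sparse row store attractive for this layout.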

Re: Schema to store graph

2009-04-01 Thread Edward J. Yoon
Plus, the row URLs and anchor family of the webTable mentioned in the BigTable paper have the same structure as above. It's the web-link graph, which is represented as an adjacency matrix. On Wed, Apr 1, 2009 at 5:33 PM, Edward J. Yoon wrote: > Let's assume the graph looks like presented below: > > 1 - 2

Re: [GSOC] Ranking Process

2009-04-01 Thread Richard Tomsett
I'm preparing an application, but haven't submitted yet as I was waiting on confirmation of my student status... as I now know that I'm going to be eligible I'll get my application in soon :) 2009/4/1 Ted Dunning : > I only see two applications for Mahout, one reasonably strong, one much less > so

Re: Schema to store graph

2009-04-01 Thread stack
Edward is referring to https://issues.apache.org/jira/browse/HBASE-867. We need to fix it for 0.20.0 hbase release. St.Ack On Wed, Apr 1, 2009 at 5:02 AM, Edward J. Yoon wrote: > One thing is Hbase 0.19 doesn't work with over 5,000 qualifier of one > column so I couldn't test/benchmark for larg

Re: [GSOC] Ranking Process

2009-04-01 Thread Grant Ingersoll
Hmm, I see several in there, but they aren't all labeled w/ Mahout, so that may be why. I also expanded to see 100 at a time. -Grant On Mar 31, 2009, at 8:43 PM, Ted Dunning wrote: I only see two applications for Mahout, one reasonably strong, one much less so. Are there students out the

Re: [GSOC] Ranking Process

2009-04-01 Thread Grant Ingersoll
The other thing to note, here, is that people should be aware that the ASF is only going to get a certain number of slots from Google (last year, it was somewhere in the 30-40 range, I think), which are distributed across all projects that have expressed an interest in mentoring. While Mah

Re: Schema to store graph

2009-04-01 Thread Amandeep Khurana
Right. Edward, I didn't understand what you were trying to say with this: Anyway, I guess If you store the graph like that, you'll only need update the row 'v/w' to add v to w's/w to v's list of neighbors. Can you explain it please? Thanks Amandeep Amandeep Khurana Computer Science Graduate Stu

Frequent Itemset Mining using MapReduce (interesting paper)

2009-04-01 Thread Lukáš Vlček
Hi, For anybody who might be interested in frequent itemset mining using MapReduce: http://www.haoyuanli.com/publication/recsys08-69.pdf Their application of search query recommendation and related search is interesting. Regards, Lukas -- http://blog.lukas-vlcek.com/

Re: Mahout Recommendation for MySQL Data (Network Trending)

2009-04-01 Thread Ted Dunning
The short answer is no. The slightly longer answer is a highly qualified yes. There is an algorithm very recently in Mahout that does non-parametric clustering of mixture models. If you can define a reasonable form of models for your time series, then this algorithm would plausibly give you a se

Re: [gsoc] Collaborative filtering algorithms

2009-04-01 Thread Ted Dunning
The machinery of SVD is almost always described in terms of least squares matrix approximation without mentioning the probabilistic underpinnings of why least-squares is a good idea. The connection, however, goes all the way back to Gauss' reduction of planetary position observations (this is *why
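The link between least squares and a probabilistic model that this message alludes to can be written out briefly. The notation is an assumption for illustration and does not come from the thread: A is the observed m-by-n matrix, \hat{A} a rank-k approximation, and the noise is taken to be independent Gaussian with a common variance \sigma^2.

```latex
% Assumed noise model: each observed entry is the approximation plus i.i.d. Gaussian noise.
A_{ij} = \hat{A}_{ij} + \varepsilon_{ij},
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

% Negative log-likelihood of the observations under that model.
-\log p(A \mid \hat{A})
  = \frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} \bigl(A_{ij} - \hat{A}_{ij}\bigr)^2
  + \frac{mn}{2} \log\bigl(2\pi\sigma^2\bigr)

% The second term does not depend on \hat{A}, so maximum likelihood is least squares.
\arg\max_{\operatorname{rank}(\hat{A}) \le k} p(A \mid \hat{A})
  = \arg\min_{\operatorname{rank}(\hat{A}) \le k} \lVert A - \hat{A} \rVert_F^2
```

Under these assumptions the rank-k truncated SVD, which minimizes the Frobenius error, is also the maximum-likelihood estimate; a different noise model would lead to a different objective.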

Re: Mahout Recommendation for MySQL Data (Network Trending)

2009-04-01 Thread Jeff Eastman
Actually, the Dirichlet implementation in 0.1 *is* a parallel Hadoop version but it has a packaging issue that needs to be resolved to run it there. See my earlier thread about Dirichlet Example Class Not Found Exception. Your application is exactly the kind of problem that I was investigating

[GSoC] "SimRank Algorithms on Mahout" Proposal draft from Xuan Yang

2009-04-01 Thread Xuan Yang
Hello everyone, This is my proposal draft. Is there something I need to modify? Thanks a lot :) * Title: GSoC2009 XuanYang-Mahout-Algorithms Proposal Abstract: SimRank is a model measuring "similarity" of objects. It is applicable in any dom

Re: [GSOC] Ranking Process

2009-04-01 Thread Ted Dunning
Let me second that. When I am hiring a student without professional experience, it is almost a perfect predictor that if they have done significant work on a significant outside project they will get an interview with me and if not, they won't. Moreover, if I have a candidate at any level who has

Re: gsoc , EM or SVM?

2009-04-01 Thread Grant Ingersoll
Hi Yifan, I think both are good candidates, although AIUI, SVM is a bit harder to parallelize, so maybe it would make sense to focus on EM. Of course, we don't have to be distributed, so you could propose a non-distributed SVM implementation as a first cut and then work on the distribute

Re: Mahout Recommendation for MySQL Data (Network Trending)

2009-04-01 Thread Tim Bass
This is exciting. To clarify, experimentation is perfectly ok. We have a production web server where we collect and store time series data (on nearly 350 metrics) in mysql (on a different server) so we are not looking for production code, but an experimental, collaborative, open effort. I like

Re: gsoc , EM or SVM?

2009-04-01 Thread Ted Dunning
Yifan, EM is a highly non-specific term and covers a huge range of very different algorithms. For example, pLSI, HMM's, and mixture models can all be estimated using EM. What exactly did you mean to address with an EM implementation? On Wed, Apr 1, 2009 at 1:05 PM, Grant Ingersoll wrote: > Hi

Re: Participating in GSOC 2009

2009-04-01 Thread Grant Ingersoll
Those added features all sound good, Robin; I would like to see a proposal from you. I do know Allen Day submitted some HBase work in JIRA, but I believe it needs updating. You might want to check that out. And good luck at Y! Hopefully your Mahout experience was helpful in landing the

[jira] Commented: (MAHOUT-113) CDInfosToolTest.testGatherInfos failure in Mahout examples

2009-04-01 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694715#action_12694715 ] Grant Ingersoll commented on MAHOUT-113: I seem to recall seeing this from time to

Re: Schema to store graph

2009-04-01 Thread Amandeep Khurana
Alright. Got it. Thanks. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Apr 1, 2009 at 1:33 AM, Edward J. Yoon wrote: > Let's assume the graph looks like presented below: > > 1 - 2 - 3 > | / > 4 > > We can now represent as: > > | 1 2 3 4 >

Re: gsoc , EM or SVM?

2009-04-01 Thread Yifan Wang
I will choose Mixture Model for the EM implementation. Yifan 2009/4/1 Ted Dunning : > Yifan, > > EM is a highly non-specific term and covers a huge range of very different > algorithms.  For example, pLSI, HMM's, and mixture models can all be > estimated using EM. > > What exactly did you mean to

Re: [GSoC] "SimRank Algorithms on Mahout" Proposal draft from Xuan Yang

2009-04-01 Thread Robert Burrell Donkin
On Wed, Apr 1, 2009 at 7:12 PM, Xuan Yang wrote: > Hello everyone, > >    This is my proposal draft. BTW remember http://markmail.org/message/rbwp2hf6iipc2ut3 - robert

Re: Mahout Recommendation for MySQL Data (Network Trending)

2009-04-01 Thread Ted Dunning
On Wed, Apr 1, 2009 at 1:08 PM, Tim Bass wrote: > I like the idea of clustering of mixture models and, think with a bit of > effort, it would not be too difficult to create initial first order > behavioral models. > ... > what might be the next steps? > I think the first steps are: a) define an

Re: [GSoC] "SimRank Algorithms on Mahout" Proposal draft from Xuan Yang

2009-04-01 Thread Xuan Yang
Thanks, I have submitted it there. :) 2009/4/2 Robert Burrell Donkin : > On Wed, Apr 1, 2009 at 7:12 PM, Xuan Yang wrote: >> Hello everyone, >> >>    This is my proposal draft. > > BTW remember http://markmail.org/message/rbwp2hf6iipc2ut3 > > - robert > -- Xuan Yang

Re: Mahout Recommendation for MySQL Data (Network Trending)

2009-04-01 Thread Tim Bass
Hi Ted, I am traveling for a few days, starting in a few hours; I'll respond in more detail when I am back. After reading your generous reply, it seems I need to outline the overall data and problem set, rough behavioral model, and other higher level abstractions to help ensure the details fit t