Mahout first application for beginner

2011-04-26 Thread venkat
Hi, I am new to Mahout. I would like to work through an example using Mahout to get started. Please give me detailed steps for building an application with Mahout. Please help me.

Re: Introduction

2011-04-26 Thread Sebastian Schelter
Hi Ray, welcome to the list. On Twitter we talked about your evaluation of Mahout's recommender code. I'd like to go into detail on this to either clear up your doubts or learn from your input. Your evaluation listed a bunch of pros and cons regarding Mahout; can you share them here to start

AUC of Random Forest

2011-04-26 Thread praneet mhatre
Hi All, I tried classifying my dataset using the BuildForest() and TestForest() functions and it worked perfectly fine. But the final output is displayed in terms of standard accuracy. Is there an easy way to also compute the AUC for the Forest built? Thank you, -- Praneet Mhatre
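One way to get an AUC alongside the accuracy figure is to compute it yourself from per-example scores. The sketch below is plain Python, not Mahout's API. It assumes you can obtain a numeric score for each test example (for instance the fraction of trees voting for the positive class); whether TestForest exposes such scores is not stated in this thread. The function name `auc` and the 0/1 label encoding are illustrative choices.

```python
def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one.
    labels are 0/1, scores are any comparable numbers."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied examples share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Mann-Whitney U statistic for the positives, normalised to [0, 1].
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Using ranks rather than sweeping thresholds keeps the computation to a single sort and handles tied scores by giving them half credit.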

Re: Determining Document Cluster Probabilities with LDA

2011-04-26 Thread Lance Norskog
What is a good name for what LDA and SVD do? "Basis concentration"? "Basis isolation"? On 4/26/11, Ted Dunning wrote: > I think you are right. > > On Tue, Apr 26, 2011 at 2:32 PM, Jake Mannix wrote: > >> >> Ted, I think what they are asking is for the output of the gamma matrix >> (i.e. >> the L

Re: Which exact algorithm is used in the Mahout SGD?

2011-04-26 Thread Stanley Xu
Hi Ted, For the data, currently, we dig through the logs for a specific cookie. For example, we will check how many times the user has seen the banner from the advertiser in the last 7 days. We don't have 1000 non-zero values now; I think we only have 100-200 now, but we expect to have 1000 at most I thought

Re: Determining Document Cluster Probabilities with LDA

2011-04-26 Thread Ted Dunning
I think you are right. On Tue, Apr 26, 2011 at 2:32 PM, Jake Mannix wrote: > > Ted, I think what they are asking is for the output of the gamma matrix > (i.e. > the LDA version of the *left* singular vectors, living in > document-by-topic-space, > not topic-by-word space), which is currently not

Re: Determining Document Cluster Probabilities with LDA

2011-04-26 Thread Jake Mannix
On Tue, Apr 26, 2011 at 2:08 PM, Ted Dunning wrote: > > - LDA isn't really clustering. It is more along the lines of SVD as a > dimensionality reduction. It should > be possible to display the internals to find which terms or documents have > the highest components on > a single topic, but combi

Re: Determining Document Cluster Probabilities with LDA

2011-04-26 Thread Ted Dunning
Two things, - use trunk. We are about to release 0.5 and there has been a ton of progress since 0.4 including several important bug fixes. - LDA isn't really clustering. It is more along the lines of SVD as a dimensionality reduction. It should be possible to display the internals to find whic

Determining Document Cluster Probabilities with LDA

2011-04-26 Thread Ian Helmke
I'm looking at using LDA to cluster documents based on topics. I've gotten LDA to work in Mahout 0.4 and I am able to get keywords and topics using the built-in mahout utilities. Is there any simple way to view which documents are assigned to which clusters after performing LDA? This could easily

Re: Cosine distances to Random Vector basis

2011-04-26 Thread Randall McRee
I've done a new, clean, implementation of this (just the knn piece) at my current company which has agreed to allow an open source contribution. Thanks, Randy On Mon, Apr 25, 2011 at 11:09 PM, Ted Dunning wrote: > Available cheaper at my old company. > > > http://www.deepdyve.com/lp/association

Re: best similarity metric for collaborative filtering

2011-04-26 Thread Ted Dunning
On Tue, Apr 26, 2011 at 9:12 AM, Sean Owen wrote: > That reduces to something like the Jaccard / Tanimoto coefficient -- not > precisely since you're dividing by the length of those vectors rather than > the size of their "union", but practically similar. And that's implemented > as TanimotoCoeff

Re: best similarity metric for collaborative filtering

2011-04-26 Thread Sean Owen
That reduces to something like the Jaccard / Tanimoto coefficient -- not precisely since you're dividing by the length of those vectors rather than the size of their "union", but practically similar. And that's implemented as TanimotoCoefficientSimilarity. Perhaps my point is that in Mahout (well
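To make the difference between the two denominators concrete, here is a small sketch in plain Python (not Mahout code). It treats each user as a set of item IDs and compares intersection over union with intersection over the product of vector lengths. Reading "length" as the Euclidean norm of a 0/1 vector is an assumption on my part, and the function names are illustrative.

```python
import math


def tanimoto(a, b):
    """Jaccard / Tanimoto coefficient: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def binary_cosine(a, b):
    """Cosine of two 0/1 vectors: |A and B| / (sqrt(|A|) * sqrt(|B|))."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


if __name__ == "__main__":
    user_1 = {1, 2, 3, 4}
    user_2 = {3, 4, 5, 6}
    print(tanimoto(user_1, user_2), binary_cosine(user_1, user_2))
```

Both measures are 0 for disjoint sets and 1 for identical sets; in between, the cosine variant is never smaller than the Tanimoto value, so the two tend to rank neighbours similarly even though the numbers differ.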

Re: best similarity metric for collaborative filtering

2011-04-26 Thread Ted Dunning
Setting didn't-buy to 0 and getting a valid cosine distance is pretty common in these scenarios. I still prefer what Sean is recommending in terms of LLR for item to item links, but the cosine version does make sense to support, especially for purchase histories. Even better would be to remember
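For readers unfamiliar with LLR, the following is a minimal sketch of the standard log-likelihood ratio (G-squared) statistic on a 2x2 co-occurrence table, written in plain Python. It is not taken from Mahout's source and is not claimed to match its implementation detail for detail. The argument names are illustrative: `k11` counts users who interacted with both items, `k12` and `k21` count users who interacted with only one of them, and `k22` counts users who interacted with neither.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 = 2 * sum over cells of k * ln(k * N / (row_total * col_total))."""
    n = k11 + k12 + k21 + k22
    row_1, row_2 = k11 + k12, k21 + k22
    col_1, col_2 = k11 + k21, k12 + k22
    cells = (
        (k11, row_1, col_1),
        (k12, row_1, col_2),
        (k21, row_2, col_1),
        (k22, row_2, col_2),
    )
    total = 0.0
    for k, row_total, col_total in cells:
        # An empty cell contributes nothing (the limit of k * ln k as k -> 0).
        if k > 0:
            total += k * math.log(k * n / (row_total * col_total))
    return 2.0 * total


if __name__ == "__main__":
    print(llr(10, 10, 10, 10))  # independent items
    print(llr(10, 0, 0, 10))    # items that always occur together
```

The score is zero when the two items co-occur exactly as often as independence would predict and grows as the co-occurrence pattern departs from that, which is why it works on purchase histories without needing a "didn't buy" value.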

Re: Which exact algorithm is used in the Mahout SGD?

2011-04-26 Thread Ted Dunning
On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu wrote: > 1 hour is acceptable, but I guess you misunderstand the data scale I mean > here. The 900M records didn't mean 900M Bytes, but 900M lines of training > set(900M training example.). If every training data has 1000 dimension, it > means 900 mill

Re: Is there any more detailed documentation about the SGD logistic regression example?

2011-04-26 Thread Xiaobo Gu
I am reading the book now, and will refer to you if I have any questions then. Thanks. On Fri, Apr 22, 2011 at 6:16 AM, Ted Dunning wrote: > The trainlogistic command is (as Stanley says) only a simple example. > > You will need to write a program something like TrainNewsGroups for your > modele

Re: Introduction

2011-04-26 Thread Benson Margulies
Maybe he used to be a window? On Tue, Apr 26, 2011 at 1:57 AM, Ted Dunning wrote: > Welcome! > > (like the email name ... as long as you don't toss too much out the window) > > On Mon, Apr 25, 2011 at 8:00 PM, Raymond Richardson > wrote: > >> I represent Simularity.com, an organization which is p

Re: best similarity metric for collaborative filtering

2011-04-26 Thread Steven Bourke
What exactly does 'didn't buy' mean here? Was the user shown the item, or is it just an item they never considered? To find the 'best' metric here you could simply run an offline evaluation across your dataset. But what appears to be the most important thing is what does each representation actually

Re: Convert preference matrix

2011-04-26 Thread Mathieu sgard
Thanks, I'm going to look at that 2011/4/26 Sean Owen > There are Mapper / Reducer pairs in org.apache.mahout.cf.taste.hadoop.item > that would do the conversion on Hadoop. If you want something that's not on > Hadoop, you would have to write your own code, but it's pretty easy. > > On Tue, Apr

Re: Convert preference matrix

2011-04-26 Thread Sean Owen
There are Mapper / Reducer pairs in org.apache.mahout.cf.taste.hadoop.item that would do the conversion on Hadoop. If you want something that's not on Hadoop, you would have to write your own code, but it's pretty easy. On Tue, Apr 26, 2011 at 8:25 AM, Mathieu sgard wrote: > Hello, > > I'm playin

Convert preference matrix

2011-04-26 Thread Mathieu sgard
Hello, I'm experimenting with Mahout to get to know it, and I would like to cluster a sample of customers. I have a preference matrix file (userID, ItemID, score) and I would like to use the clustering functions. How can I convert this file into VectorWritable/SequenceFile? Thanks, Best Regards,
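The reshaping step itself, independent of Hadoop, can be sketched as follows. This is plain Python and only illustrates grouping (userID, ItemID, score) rows into one vector per user; it does not write a SequenceFile or use VectorWritable, which would need Mahout's Java classes. The comma-separated layout and the function names `load_user_vectors` and `to_dense` are assumptions for the sake of the example.

```python
import csv
from collections import defaultdict


def load_user_vectors(lines):
    """Group 'userID,itemID,score' rows into {userID: {itemID: score}}."""
    vectors = defaultdict(dict)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user_id, item_id, score = row[0].strip(), row[1].strip(), float(row[2])
        vectors[user_id][item_id] = score
    return dict(vectors)


def to_dense(vectors):
    """Give every item a fixed column and fill missing scores with 0.0."""
    items = sorted({item for prefs in vectors.values() for item in prefs})
    index = {item: pos for pos, item in enumerate(items)}
    dense = {}
    for user_id, prefs in vectors.items():
        row = [0.0] * len(items)
        for item, score in prefs.items():
            row[index[item]] = score
        dense[user_id] = row
    return items, dense


if __name__ == "__main__":
    sample = ["1,101,5.0", "1,102,3.0", "2,101,2.0", "2,103,4.5"]
    items, dense = to_dense(load_user_vectors(sample))
    print(items)
    print(dense)
```

Filling missing preferences with 0.0 is a modelling choice rather than a requirement; for clustering on sparse data it is usually better to keep the per-user dictionaries and let the distance measure ignore absent items.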

Re: best similarity metric for collaborative filtering

2011-04-26 Thread Sean Owen
I think my comment mostly addressed his comments. Yes, this is the definition of cosine distance, and is implemented. No, it doesn't work over true binary data. There is no "0", only "1" or non-existent. What is the remaining question? On Tue, Apr 26, 2011 at 3:21 AM, Chris Waggoner wrote: > > > I

Re: How to evaluate a recommender with binary ratings?

2011-04-26 Thread Sean Owen
Peter (/Ted), Yes, this is all answered in the framework already. You would never directly use the recommenders intended for data sets with ratings, as most don't make sense when all ratings are 1.0. You would use, for example, GenericBooleanPrefItemBasedRecommender, a variant on GenericItemBasedRe