Mahout Binary Recommender Evaluator

2011-07-25 Thread MT
I'm working on a common dataset that includes the user id, item id, and timestamp (the moment the user bought the item). As there are no preferences, I needed a binary item-based recommender, which I found in Mahout (GenericBooleanPrefItemBasedRecommender and t

Re: Mahout Binary Recommender Evaluator

2011-07-25 Thread Sean Owen
On Mon, Jul 25, 2011 at 10:05 AM, MT wrote: > > > In fact, correct me if I'm wrong, but to me the evaluator will invariably > give us the same value for precision and recall. Since the items are all > rated with the binary 1.0 value, we give the recommender a threshold lower > than 1, thus for each
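The precision-equals-recall point can be seen with a toy computation: when every held-out item counts as "relevant" (all boolean preferences pass the threshold) and the evaluator asks for as many recommendations as there are relevant items, the two metrics share both numerator and denominator. A minimal sketch in plain Java, independent of Mahout's actual evaluator classes:

```java
import java.util.*;

public class PrecisionRecallSketch {
    // precision = hits / |recommended|, recall = hits / |relevant|
    public static double precision(List<Long> recs, Set<Long> relevant) {
        long hits = recs.stream().filter(relevant::contains).count();
        return hits / (double) recs.size();
    }

    public static double recall(List<Long> recs, Set<Long> relevant) {
        long hits = recs.stream().filter(relevant::contains).count();
        return hits / (double) relevant.size();
    }

    public static void main(String[] args) {
        // With boolean data every held-out item is relevant, so if the
        // evaluator requests |relevant| recommendations, the denominators
        // coincide and precision always equals recall.
        Set<Long> relevant = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        List<Long> recs = Arrays.asList(1L, 5L, 3L);   // 3 recs, 2 hits
        System.out.println(precision(recs, relevant)); // 2/3
        System.out.println(recall(recs, relevant));    // 2/3
    }
}
```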

Re: Mahout Binary Recommender Evaluator

2011-07-25 Thread Marko Ciric
Hi, First of all, it's rather easy to implement the evaluator so that it does not remove all the items (which is the case when working with a boolean-preference data set). The easiest implementation would be to use the relevanceThreshold argument as a percentage of the whole user's preference data set. For example i
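The percentage idea can be sketched without touching Mahout at all: hold out a fixed fraction of one user's boolean preferences as the "relevant" test set and train on the rest. The names below are illustrative, not an actual Mahout API:

```java
import java.util.*;

public class PercentHoldout {
    // Split one user's item ids into a training list and a held-out
    // "relevant" test list, where testFraction plays the role that the
    // relevanceThreshold-as-percentage proposal describes.
    public static List<List<Long>> split(List<Long> itemIds, double testFraction, long seed) {
        List<Long> shuffled = new ArrayList<>(itemIds);
        Collections.shuffle(shuffled, new Random(seed));
        int testSize = Math.max(1, (int) Math.round(testFraction * shuffled.size()));
        List<Long> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<Long> train = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<Long> items = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L);
        List<List<Long>> parts = split(items, 0.3, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```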

HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread NightWolf
Hi all, I'm working on a large text classification project and we have our text data (simple messages) stored in HBase. We have two problems: first, we would like to use HBase as the source for Mahout classifiers, namely Bayes and Random Forests. Second, we would like to be able to store the mo

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Robin Anil
We dropped it after pruning the dependencies in Mahout. You can simply bring back the class(from the repository) and use it to connect to HBase in your client code. Robin On Mon, Jul 25, 2011 at 6:23 PM, NightWolf wrote: > Hi all, > > I'm working on a large text classification project and we ha

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Nightie Wolfi
Thanks Robin for your quick response, that's great news. As I understand it, this will allow me to store the generated classifier model in HBase. Are there any examples of its usage? Does anyone know where I can find some test cases (such as the ones in MAHOUT-124

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Robin Anil
You might want to sync back to an earlier version of Mahout which had this and try to run the trainer with --dataSource hbase. This will train over data from HDFS and store the model on HBase. Similarly, you can run the classifier with --dataSource hbase and use the model to classify new instances. Note, w

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Night Wolf
Ok thanks Robin. Just a quick one: any idea what release may have had this, so I can download and use it without having to build? 0.3 seems to be the most recent one which had it..? Just looking at the code quickly, I assume it uses the default base-dir found in core-default.xml

What about a universal input data handling mechanism for Mahout?

2011-07-25 Thread Xiaobo Gu
Hi, Most of the time Mahout algorithms use Vector as the model-training input, but they don't take care of how the instance vectors are generated, so every algorithm has its own unique way, causing the original input file format requirement to be bound to a specific algorithm. That causes a lot of work for the actual u

Re: What about a universal input data handling mechanism for Mahout?

2011-07-25 Thread Fernando Fernández
That would be very nice; actually I haven't tested most of Mahout's algorithms for that reason... 2011/7/25 Xiaobo Gu > Hi, > Most time Mahout algorithms use Vector as the model training input, > but don’t take care of how the instance vectors are generated, then > every algorithm has it’s unique

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Ted Dunning
You should be extracting the hbase code from 0.3 and keeping it in your code. You definitely don't want to stay on 0.3. This is *very* easy and fast to do using the git repository: $ git clone https://github.com/apache/mahout.git Initialized empty Git repository in /Users/tdunning/tmp/mahout/.gi

Re: What about a universal input data handling mechanism for Mahout?

2011-07-25 Thread Ted Dunning
Good idea. Somebody should file a JIRA. My guess is that the best first step would be to have the logistic regression handle the naive Bayes input format. 2011/7/25 Fernando Fernández > That would be very nice, actually I haven't tested most of Mahout > algorithms > for that reason... > > 2011

Re: Mahout Binary Recommender Evaluator

2011-07-25 Thread Ted Dunning
That would allow you to compute AUC which might be useful. AUC is the probability that a relevant (purchased) item is ranked higher than a non-relevant item. On Mon, Jul 25, 2011 at 3:16 AM, Marko Ciric wrote: > The better way to do it is to implement an evaluator which accepts the > collection
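Ted's definition translates almost directly into code. A hedged sketch (plain Java, not the Mahout implementation): score AUC as the fraction of (relevant, non-relevant) pairs where the relevant item's score is higher, counting ties as half:

```java
public class AucSketch {
    // Pairwise AUC estimate: the probability that a randomly chosen
    // relevant (purchased) item outranks a randomly chosen non-relevant one.
    public static double auc(double[] relevant, double[] nonRelevant) {
        double wins = 0.0;
        for (double r : relevant) {
            for (double n : nonRelevant) {
                if (r > n) wins += 1.0;
                else if (r == n) wins += 0.5; // ties count half
            }
        }
        return wins / (relevant.length * (double) nonRelevant.length);
    }

    public static void main(String[] args) {
        double[] purchased = {0.9, 0.7, 0.4};  // recommender scores of held-out purchases
        double[] other = {0.8, 0.3, 0.2, 0.1}; // scores of items the user did not buy
        System.out.println(auc(purchased, other)); // 10 of 12 pairs ranked correctly
    }
}
```

An AUC of 0.5 means the ranking is no better than random; 1.0 means every purchased item outranks every non-purchased one.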

RE: What about a universal input data handling mechanism for Mahout?

2011-07-25 Thread XiaoboGu
Can you show me any material describing the file format requirements of Naïve Bayes, please? > -Original Message- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: Monday, July 25, 2011 11:16 PM > To: user@mahout.apache.org > Cc: d...@mahout.apache.org > Subject: Re: What about a
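For reference, the Bayes trainer of that era consumed plain text files where each line was a label, a tab, then whitespace-separated terms. This is a recollection from the 20newsgroups example, not a verified spec; check the documentation for your Mahout version. A tiny helper to build such lines:

```java
public class BayesInputSketch {
    // Build one training line in the assumed "label<TAB>term term term ..."
    // shape consumed by the old Mahout Bayes trainer.
    public static String trainingLine(String label, String... terms) {
        return label + "\t" + String.join(" ", terms);
    }

    public static void main(String[] args) {
        System.out.println(trainingLine("rec.sport.hockey", "goal", "puck", "ice"));
    }
}
```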

RE: fkmeans or Cluster Dumper not working?

2011-07-25 Thread Jeff Eastman
Sorry, I was traveling over the weekend. I will take a look at your data asap. -Original Message- From: Jeffrey [mailto:mycyber...@yahoo.com] Sent: Sunday, July 24, 2011 3:51 AM To: user@mahout.apache.org Subject: Re: fkmeans or Cluster Dumper not working? Erm, is there any update? is the

Re: Mahout Binary Recommender Evaluator

2011-07-25 Thread Marko Ciric
Is there a plan to include this in Mahout in some future release? On 25 July 2011 17:20, Ted Dunning wrote: > That would allow you to compute AUC which might be useful. AUC is the > probability that a relevant (purchased) item is ranked higher than a > non-relevant item. > > On Mon, Jul 25, 201

AUC

2011-07-25 Thread Marko Ciric
Hi guys, I'm wondering if any resources or tutorials are available (and where) about calculating AUC when working with boolean preferences data models? -- -- Marko Ćirić ciric.ma...@gmail.com

RE: fkmeans or Cluster Dumper not working?

2011-07-25 Thread Jeff Eastman
I'm able to run fuzzyk on your data set with k=10 and k=50 without problems. I also ran it fine with k=100 just to push it a bit harder. Runs took longer as k increased, as expected (39s, 2m50s, 5m57s), as did the clustering (11s, 45s, 1m11s). The cluster dumper is throwing an OOME with your data p

Re: AUC

2011-07-25 Thread Ted Dunning
It isn't quite what you are asking for, but there is org.apache.mahout.math.stats.GroupedOnlineAuc This will do the per user AUC for a limited number (a million should work fine) of users. You may also be interested in org.apache.mahout.math.stats.GlobalOnlineAuc and org.apache.mahout.classifier.
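For readers without those Mahout classes at hand, the idea behind an online AUC estimator can be sketched generically: keep a bounded reservoir of scores for each class and estimate AUC from pairwise comparisons of the retained samples. This is an illustration of the reservoir idea only, not the GlobalOnlineAuc implementation:

```java
import java.util.Random;

public class OnlineAucSketch {
    // Bounded reservoirs of positive/negative scores; memory stays O(reservoirSize)
    // no matter how many observations stream in.
    private final double[] pos;
    private final double[] neg;
    private int nPos;
    private int nNeg;
    private final Random rng;

    public OnlineAucSketch(int reservoirSize, long seed) {
        pos = new double[reservoirSize];
        neg = new double[reservoirSize];
        rng = new Random(seed);
    }

    public void add(boolean isPositive, double score) {
        double[] buf = isPositive ? pos : neg;
        int n = isPositive ? nPos++ : nNeg++;
        if (n < buf.length) {
            buf[n] = score;
        } else {
            // classic reservoir sampling: keep each score with prob |buf|/(n+1)
            int j = rng.nextInt(n + 1);
            if (j < buf.length) buf[j] = score;
        }
    }

    public double auc() {
        int p = Math.min(nPos, pos.length);
        int q = Math.min(nNeg, neg.length);
        double wins = 0.0;
        for (int i = 0; i < p; i++) {
            for (int j = 0; j < q; j++) {
                wins += pos[i] > neg[j] ? 1.0 : (pos[i] == neg[j] ? 0.5 : 0.0);
            }
        }
        return wins / (p * (double) q);
    }

    public static void main(String[] args) {
        OnlineAucSketch est = new OnlineAucSketch(100, 0L);
        for (int i = 0; i < 50; i++) {
            est.add(true, 0.8 + 0.001 * i);  // well-separated positives
            est.add(false, 0.2 - 0.001 * i); // and negatives
        }
        System.out.println(est.auc());
    }
}
```

A per-user ("grouped") variant would simply keep one such estimator per user id and average the resulting AUCs.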

Re: Mahout Binary Recommender Evaluator

2011-07-25 Thread Ted Dunning
Well, we do have numerous ways to compute AUC. I don't think that they are integrated into the recommendation evaluation framework yet. Would you like to take on the application of suitable glue? On Mon, Jul 25, 2011 at 1:00 PM, Marko Ciric wrote: > Is there a plan to include this in Mahout i

Re: HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

2011-07-25 Thread Lance Norskog
This is a JDBC driver for HBase: http://www.hbql.com/examples/jdbc.html There is a JDBC data model in Mahout. HBase now has a high-speed fetching interface that is different from the original one. If you want a direct HBase interface after you do this investigation, you might implement that. Lanc

Re: fkmeans or Cluster Dumper not working?

2011-07-25 Thread Jeffrey
No worries :) > >From: Jeff Eastman >To: "user@mahout.apache.org" ; Jeffrey > >Sent: Tuesday, July 26, 2011 12:30 AM >Subject: RE: fkmeans or Cluster Dumper not working? > >Sorry, I was traveling over the weekend. I will take a look at your data asap. > >-Or

RE: fkmeans or Cluster Dumper not working?

2011-07-25 Thread Jeff Eastman
Also makes sense that fuzzyk centroids would be completely dense, since every point is a member of every cluster. My reducer heaps are 4G. -Original Message- From: Jeff Eastman [mailto:jeast...@narus.com] Sent: Monday, July 25, 2011 2:32 PM To: user@mahout.apache.org; Jeffrey Subject: RE: