The very question at hand is how to label the data as "relevant" and "not
relevant" results. The question exists because this is not given, which is
why I would not call this a supervised problem. That may just be semantics,
but the point I wanted to make concerns the reasons for choosing a random
train/test split …
Sean
I think it is still a supervised learning problem, in that there is a labeled
training data set and an unlabeled test data set.
Learning a ranking doesn't change the basic dichotomy between supervised and
unsupervised. It just changes the desired figure of merit.
There are a variety of common time-based effects which make time splits best in
many practical cases. Having the training data all come from the past emulates
real use better than random splits do.
For one thing, you can have the same user under different names in training and
test. For another thing …
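A minimal sketch of such a time-based split, in plain Java (the Rating shape
and the cutoff choice are assumptions, not Mahout API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical rating event carrying a timestamp.
record Rating(long userID, long itemID, float value, long timestamp) {}

class TimeSplit {
  // Everything strictly before the cutoff trains the model; everything at or
  // after it is held out for testing, so the training data is always "from
  // the past" relative to the test data.
  static List<List<Rating>> split(List<Rating> ratings, long cutoffMillis) {
    List<Rating> train = new ArrayList<>();
    List<Rating> test = new ArrayList<>();
    for (Rating r : ratings) {
      (r.timestamp() < cutoffMillis ? train : test).add(r);
    }
    return List.of(train, test);
  }
}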
Thanks for the replies.
From: Sean Owen
To: Mahout User List
Sent: Saturday, February 16, 2013 11:34 PM
Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
I understand the idea, but this boils down to the current implementation,
plus going back and throwing out some additional training data that is
lower rated -- it's neither in test or training. Anything's possible, but I
do not imagine this is a helpful practice in general.
I'm suggesting the second one. In that way the test user's ratings in
the training set will be composed of both low- and high-rated items, which
prevents the problem pointed out by Ahmet.
On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen wrote:
> If you're suggesting that you hold out only high-rated items,
If you're suggesting that you hold out only high-rated items, and then
sample them, then that's what is done already in the code, except without
the sampling. The sampling doesn't buy anything that I can see.
If you're suggesting holding out a random subset and then throwing away the
held-out items …
What I mean is you can choose ratings randomly and try to recommend
the ones above the threshold, as in the sketch below.
On Sat, Feb 16, 2013 at 10:32 PM, Sean Owen wrote:
> Sure, if you were predicting ratings for one movie given a set of ratings
> for that movie and the ratings for many other movies. That isn't what
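A rough sketch of the protocol being proposed, in plain Java (the names and
data shape are hypothetical): hold out a random sample of one user's ratings,
then count only the held-out items at or above the threshold as relevant.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RandomThenThreshold {
  // Randomly hold out a fraction of one user's (itemID -> rating) entries.
  // Held-out items rated at or above the threshold become the "relevant"
  // set; low-rated held-out items end up in neither training nor test --
  // the detail Sean questions above.
  static Set<Long> holdOut(Map<Long, Float> userRatings,
                           double fraction, float threshold) {
    List<Long> itemIDs = new ArrayList<>(userRatings.keySet());
    Collections.shuffle(itemIDs);
    int n = (int) (itemIDs.size() * fraction);
    Set<Long> relevant = new HashSet<>();
    for (long itemID : itemIDs.subList(0, n)) {
      float rating = userRatings.remove(itemID); // removed from training
      if (rating >= threshold) {
        relevant.add(itemID);                    // counts as relevant
      }
    }
    return relevant;
  }
}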
Hi
I am having difficulties linking my two machines into a Hadoop cluster, so I
am running Mahout jobs on a single machine, and I am running into
java.lang.OutOfMemoryError issues when the input files are big (see outputs
below, one is "Java heap space" and the other is "GC overhead limit
exceeded")
Sure, if you were predicting ratings for one movie given a set of ratings
for that movie and the ratings for many other movies. That isn't what the
recommender problem is. Here, the problem is to list the N movies most likely
to be top-rated. The precision-recall test is, in turn, a test of the top-N
results …
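A small illustration of that top-N test, in plain Java (the method names are
made up):

import java.util.List;
import java.util.Set;

class TopNMetrics {
  // precision@N: what fraction of the N recommendations were relevant?
  static double precisionAtN(List<Long> topN, Set<Long> relevant) {
    long hits = topN.stream().filter(relevant::contains).count();
    return topN.isEmpty() ? 0.0 : (double) hits / topN.size();
  }

  // recall@N: what fraction of all relevant items made it into the top N?
  static double recallAtN(List<Long> topN, Set<Long> relevant) {
    long hits = topN.stream().filter(relevant::contains).count();
    return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
  }
}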
No, rating prediction is clearly a supervised ML problem.
On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen wrote:
> This is a good answer for evaluation of supervised ML, but this is
> unsupervised. Choosing randomly is choosing the 'right answers' randomly,
> and that's plainly problematic.
Look at MAHOUT-833; this patch gives you this functionality.
On Sat, Feb 16, 2013 at 10:55 AM, Claudio Reggiani wrote:
> Hello,
>
> I have a text dataset. Running the "seqdirectory" command on it, I see it's not
> written in MapReduce style (looking at the source code of
> SequenceFilesFromDirectory …
This is a good answer for evaluation of supervised ML, but this is
unsupervised. Choosing randomly is choosing the 'right answers' randomly,
and that's plainly problematic.
On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin wrote:
> I think it is better to choose ratings of the test user in a random fashion.
I think it is better to choose ratings of the test user in a random fashion.
On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen wrote:
> Yes. But: the test sample is small. Using 40% of your data for testing is
> probably too much.
>
> My point is that it may be the least-bad thing to do. What test are you
> proposing instead, and why is it coherent with what you're testing?
Yes. But: the test sample is small. Using 40% of your data for testing is
probably too much.
My point is that it may be the least-bad thing to do. What test are you
proposing instead, and why is it coherent with what you're testing?
On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz wrote:
> But modeling a user only by his/her low ratings can be problematic …
But modeling a user only by his/her low ratings can be problematic, since people
generally are more precise (I believe) in their high ratings.
Another problem is that recommender algorithms in general first mean-normalize
the ratings for each user. Suppose that we have the following ratings of 3 …
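To make the normalization point concrete, a tiny Java example with made-up
ratings: if a user's high-rated items are held out for testing, the mean
computed from the remaining training ratings drops sharply.

class MeanShiftDemo {
  public static void main(String[] args) {
    float[] allRatings   = {5f, 4f, 5f, 1f, 2f}; // hypothetical user
    float[] afterHoldout = {1f, 2f};             // high ratings held out
    System.out.printf("mean over all ratings:      %.2f%n", mean(allRatings));   // 3.40
    System.out.printf("mean over training ratings: %.2f%n", mean(afterHoldout)); // 1.50
  }

  static float mean(float[] xs) {
    float sum = 0f;
    for (float x : xs) sum += x;
    return sum / xs.length;
  }
}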
No, this is not a problem.
Yes, it builds a model for each user, which takes a long time. It's
accurate, but time-consuming. It's meant for small data. You could write
your own test that holds out data for all test users at once, as in the
sketch below. That's what I did when I rewrote a lot of this just because
it was more …
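A sketch of that one-shot hold-out, assuming the Mahout 0.7-era Taste classes
(the threshold rule is simplified):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

class OneShotHoldout {
  // Build a single training DataModel with every user's at-or-above-threshold
  // prefs removed, so one recommender build serves all test users.
  static DataModel trainingModel(DataModel model, float threshold)
      throws TasteException {
    FastByIDMap<PreferenceArray> training = new FastByIDMap<>();
    LongPrimitiveIterator users = model.getUserIDs();
    while (users.hasNext()) {
      long userID = users.nextLong();
      PreferenceArray prefs = model.getPreferencesFromUser(userID);
      int kept = 0;
      for (int i = 0; i < prefs.length(); i++) {
        if (prefs.getValue(i) < threshold) kept++;
      }
      GenericUserPreferenceArray keptPrefs = new GenericUserPreferenceArray(kept);
      int j = 0;
      for (int i = 0; i < prefs.length(); i++) {
        if (prefs.getValue(i) < threshold) {
          keptPrefs.setUserID(j, userID);
          keptPrefs.setItemID(j, prefs.getItemID(i));
          keptPrefs.setValue(j, prefs.getValue(i));
          j++;
        }
      }
      training.put(userID, keptPrefs);
    }
    return new GenericDataModel(training);
  }
}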
But why would this be a problem? As long as it's using HDFS to access
the files, it should be able to fetch the chunks from wherever they
might be in the cluster.
I don't see why it wouldn't work. Let us know if it works!
On Sat, Feb 16, 2013 at 7:38 PM, Claudio Reggiani wrote:
> Yes, thank you
Hi,
I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I
think that there are two important problems here.
According to my understanding, the experimental protocol used in this code is
something like this:
It takes away a certain percentage of users as test users.
For …
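For reference, a sketch of how this evaluator is typically driven, assuming
the Mahout 0.7-era Taste API (the file name and recommender choice are
placeholders):

import java.io.File;

import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

class IRStatsDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = evaluator.evaluate(
        dataModel -> {  // rebuilt per test user, which is the slow part
          PearsonCorrelationSimilarity sim =
              new PearsonCorrelationSimilarity(dataModel);
          return new GenericUserBasedRecommender(
              dataModel, new NearestNUserNeighborhood(10, sim, dataModel), sim);
        },
        null,   // default DataModelBuilder
        model,
        null,   // no rescorer
        10,     // evaluate precision/recall at 10
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
        1.0);   // use 100% of users
    System.out.println("precision@10 = " + stats.getPrecision());
    System.out.println("recall@10    = " + stats.getRecall());
  }
}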
Yes, thank you, Steve. And sorry for my encoded messages.
Claudio
2013/2/16 Steve Chien
> I think he meant that code is reading and converting the files from the
> Input directory as a standalone program. Not a map-reduce program...
>
> On Feb 16, 2013, at 11:22, Dan Filimon
> wrote:
>
> > Hi
I think he meant that code is reading and converting the files from the Input
directory as a standalone program. Not a map-reduce program...
On Feb 16, 2013, at 11:22, Dan Filimon wrote:
> Hi Claudio,
>
> Could you be more specific? What does 'MapReduce style' mean?
> seqdirectory should create sequence files from the documents in a folder …
Let's say the directory has only one big text. Logically it's one file, but
on HDFS the data is actually distributed across the cluster. Suppose now that
the big text can't fit in the memory of any machine in the cluster: does
"seqdirectory" still work?
If not, the only way is to run seqdirectory as a MapReduce job.
Hi Claudio,
Could you be more specific? What does 'MapReduce style' mean?
seqdirectory should create sequence files from the documents in a
folder, where the keys are the document names and the values are the
documents' content.
What do you need it to do?
On Sat, Feb 16, 2013 at 5:55 PM, Claudio Reggiani wrote:
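For the record, a minimal sketch of the key/value layout seqdirectory
produces, using the Hadoop 1.x SequenceFile API (paths and contents are made
up). Note that documents are appended one at a time, so the whole corpus
never has to fit in memory, only one document:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class SeqDirSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("docs.seq"), Text.class, Text.class);
    try {
      // Key: document name; value: document content.
      writer.append(new Text("/docs/a.txt"), new Text("contents of a"));
      writer.append(new Text("/docs/b.txt"), new Text("contents of b"));
    } finally {
      writer.close();
    }
  }
}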