The very question at hand is how to label the data as "relevant" and "not
relevant" results. The question exists because this is not given, which is
why I would not call this a supervised problem. That may just be semantics,
but the point I wanted to make concerns the reasons for choosing a random
train/test split …
Sean
I think it is still a supervised learning problem, in that there is a labeled
training data set and an unlabeled test data set.
Learning a ranking doesn't change the basic dichotomy between supervised and
unsupervised. It just changes the desired figure of merit.
There are a variety of common time-based effects which make time splits best in
many practical cases. Having the training data all come from the past emulates
real use better than random splits do.
For one thing, you can have the same user under different names in training and
test. For another thing …
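A minimal sketch of such a time-based split, in plain Java (the Rating shape
and the cutoff choice are assumptions, not Mahout API):

import java.util.ArrayList;
import java.util.List;

// Hypothetical rating event carrying a timestamp.
record Rating(long userID, long itemID, float value, long timestamp) {}

class TimeSplit {
  // Everything strictly before the cutoff trains the model; everything at or
  // after it is held out for testing, so the training data is always "from
  // the past" relative to the test data.
  static List<List<Rating>> split(List<Rating> ratings, long cutoffMillis) {
    List<Rating> train = new ArrayList<>();
    List<Rating> test = new ArrayList<>();
    for (Rating r : ratings) {
      (r.timestamp() < cutoffMillis ? train : test).add(r);
    }
    return List.of(train, test);
  }
}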
Thanks for the replies.
From: Sean Owen
To: Mahout User List
Sent: Saturday, February 16, 2013 11:34 PM
Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator
I understand the idea, but this boils down to the current implementation,
plus going back and throwing out some additional training data that is
lower rated -- it's neither in test or training. Anything's possible, but I
do not imagine this is a helpful practice in general.
I'm suggesting the second one. In that way the test user's ratings in
the training set will be composed of both low- and high-rated items, which
prevents the problem pointed out by Ahmet.
On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen wrote:
> If you're suggesting that you hold out only high-rated items,
If you're suggesting that you hold out only high-rated items, and then
sample them, then that's what is done already in the code, except without
the sampling. The sampling doesn't buy anything that I can see.
If you're suggesting holding out a random subset and then throwing away the
held-out items …
What I mean is you can choose ratings randomly and try to recommend
the ones above the threshold, as in the sketch below.
On Sat, Feb 16, 2013 at 10:32 PM, Sean Owen wrote:
> Sure, if you were predicting ratings for one movie given a set of ratings
> for that movie and the ratings for many other movies. That isn't what
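A rough sketch of the protocol being proposed, in plain Java (the names and
data shape are hypothetical): hold out a random sample of one user's ratings,
then count only the held-out items at or above the threshold as relevant.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RandomThenThreshold {
  // Randomly hold out a fraction of one user's (itemID -> rating) entries.
  // Held-out items rated at or above the threshold become the "relevant"
  // set; low-rated held-out items end up in neither training nor test --
  // the detail Sean questions above.
  static Set<Long> holdOut(Map<Long, Float> userRatings,
                           double fraction, float threshold) {
    List<Long> itemIDs = new ArrayList<>(userRatings.keySet());
    Collections.shuffle(itemIDs);
    int n = (int) (itemIDs.size() * fraction);
    Set<Long> relevant = new HashSet<>();
    for (long itemID : itemIDs.subList(0, n)) {
      float rating = userRatings.remove(itemID); // removed from training
      if (rating >= threshold) {
        relevant.add(itemID);                    // counts as relevant
      }
    }
    return relevant;
  }
}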
Hi
I am having difficulties linking my two machines into a Hadoop cluster, so I
am running Mahout jobs on a single machine, and I am running into
java.lang.OutOfMemoryError issues when the input files are big (see outputs
below, one is "Java heap space" and the other is "GC overhead limit
exceeded")
Sure, if you were predicting ratings for one movie given a set of ratings
for that movie and the ratings for many other movies. That isn't what the
recommender problem is. Here, the problem is to list the N movies most likely
to be top-rated. The precision-recall test is, in turn, a test of the top-N
results …
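A small illustration of that top-N test, in plain Java (the method names are
made up):

import java.util.List;
import java.util.Set;

class TopNMetrics {
  // precision@N: what fraction of the N recommendations were relevant?
  static double precisionAtN(List<Long> topN, Set<Long> relevant) {
    long hits = topN.stream().filter(relevant::contains).count();
    return topN.isEmpty() ? 0.0 : (double) hits / topN.size();
  }

  // recall@N: what fraction of all relevant items made it into the top N?
  static double recallAtN(List<Long> topN, Set<Long> relevant) {
    long hits = topN.stream().filter(relevant::contains).count();
    return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
  }
}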
No, rating prediction is clearly a supervised ML problem.
On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen wrote:
> This is a good answer for evaluation of supervised ML, but this is
> unsupervised. Choosing randomly is choosing the 'right answers' randomly,
> and that's plainly problematic.
Look at MAHOUT-833; this patch gives you this functionality.
On Sat, Feb 16, 2013 at 10:55 AM, Claudio Reggiani wrote:
> Hello,
>
> I have a text dataset. Running the "seqdirectory" command on it, I see it's not
> written in MapReduce style (looking at the source code of
> SequenceFilesFromDirectory …
This is a good answer for evaluation of supervised ML, but this is
unsupervised. Choosing randomly is choosing the 'right answers' randomly,
and that's plainly problematic.
On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin wrote:
> I think it is better to choose ratings of the test user in a random fashion.
I think it is better to choose ratings of the test user in a random fashion.
On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen wrote:
> Yes. But: the test sample is small. Using 40% of your data for testing is
> probably too much.
>
> My point is that it may be the least-bad thing to do. What test are you
> proposing instead, and why is it coherent with what you're testing?
Yes. But: the test sample is small. Using 40% of your data for testing is
probably too much.
My point is that it may be the least-bad thing to do. What test are you
proposing instead, and why is it coherent with what you're testing?
On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz wrote:
> But modeling a user only by his/her low ratings can be problematic …
But modeling a user only by his/her low ratings can be problematic, since people
generally are more precise (I believe) in their high ratings.
Another problem is that recommender algorithms in general first mean-normalize
the ratings for each user. Suppose that we have the following ratings of 3 …
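To make the normalization point concrete, a tiny Java example with made-up
ratings: if a user's high-rated items are held out for testing, the mean
computed from the remaining training ratings drops sharply.

class MeanShiftDemo {
  public static void main(String[] args) {
    float[] allRatings   = {5f, 4f, 5f, 1f, 2f}; // hypothetical user
    float[] afterHoldout = {1f, 2f};             // high ratings held out
    System.out.printf("mean over all ratings:      %.2f%n", mean(allRatings));   // 3.40
    System.out.printf("mean over training ratings: %.2f%n", mean(afterHoldout)); // 1.50
  }

  static float mean(float[] xs) {
    float sum = 0f;
    for (float x : xs) sum += x;
    return sum / xs.length;
  }
}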
No, this is not a problem.
Yes, it builds a model for each user, which takes a long time. It's
accurate, but time-consuming. It's meant for small data. You could write
your own test that holds out data for all test users at once, as in the
sketch below. That's what I did when I rewrote a lot of this just because
it was more …
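A sketch of that one-shot hold-out, assuming the Mahout 0.7-era Taste classes
(the threshold rule is simplified):

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.PreferenceArray;

class OneShotHoldout {
  // Build a single training DataModel with every user's at-or-above-threshold
  // prefs removed, so one recommender build serves all test users.
  static DataModel trainingModel(DataModel model, float threshold)
      throws TasteException {
    FastByIDMap<PreferenceArray> training = new FastByIDMap<>();
    LongPrimitiveIterator users = model.getUserIDs();
    while (users.hasNext()) {
      long userID = users.nextLong();
      PreferenceArray prefs = model.getPreferencesFromUser(userID);
      int kept = 0;
      for (int i = 0; i < prefs.length(); i++) {
        if (prefs.getValue(i) < threshold) kept++;
      }
      GenericUserPreferenceArray keptPrefs = new GenericUserPreferenceArray(kept);
      int j = 0;
      for (int i = 0; i < prefs.length(); i++) {
        if (prefs.getValue(i) < threshold) {
          keptPrefs.setUserID(j, userID);
          keptPrefs.setItemID(j, prefs.getItemID(i));
          keptPrefs.setValue(j, prefs.getValue(i));
          j++;
        }
      }
      training.put(userID, keptPrefs);
    }
    return new GenericDataModel(training);
  }
}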
But why would this be a problem? As long as it's using HDFS to access
the files, it should be able to fetch the chunks from wherever they
might be in the cluster.
I don't see why it wouldn't work. Let us know if it works!
On Sat, Feb 16, 2013 at 7:38 PM, Claudio Reggiani wrote:
> Yes, thank you
Hi,
I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I
think that there are two important problems here.
According to my understanding, the experimental protocol used in this code is
something like this:
It takes away a certain percentage of users as test users.
For …
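For reference, a sketch of how this evaluator is typically driven, assuming
the Mahout 0.7-era Taste API (the file name and recommender choice are
placeholders):

import java.io.File;

import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

class IRStatsDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = evaluator.evaluate(
        dataModel -> {  // rebuilt per test user, which is the slow part
          PearsonCorrelationSimilarity sim =
              new PearsonCorrelationSimilarity(dataModel);
          return new GenericUserBasedRecommender(
              dataModel, new NearestNUserNeighborhood(10, sim, dataModel), sim);
        },
        null,   // default DataModelBuilder
        model,
        null,   // no rescorer
        10,     // evaluate precision/recall at 10
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
        1.0);   // use 100% of users
    System.out.println("precision@10 = " + stats.getPrecision());
    System.out.println("recall@10    = " + stats.getRecall());
  }
}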
Yes, thank you, Steve. And sorry for my encoded messages.
Claudio
2013/2/16 Steve Chien
> I think he meant that code is reading and converting the files from the
> Input directory as a standalone program. Not a map-reduce program...
>
> On Feb 16, 2013, at 11:22, Dan Filimon
> wrote:
>
> > Hi
I think he meant that code is reading and converting the files from the Input
directory as a standalone program. Not a map-reduce program...
On Feb 16, 2013, at 11:22, Dan Filimon wrote:
> Hi Claudio,
>
> Could you be more specific? What does 'MapReduce style' mean?
> seqdirectory should create sequence files from the documents in a folder …
Let's say the directory has only one big text. Logically it's one file, but
on HDFS the data is actually distributed across the cluster. Suppose now that
the big text can't fit in the memory of any machine in the cluster: does
"seqdirectory" still work?
If not, the only way is to run seqdirectory as a MapReduce job.
Hi Claudio,
Could you be more specific? What does 'MapReduce style' mean?
seqdirectory should create sequence files from the documents in a
folder, where the keys are the document names and the values are the
documents' content.
What do you need it to do?
On Sat, Feb 16, 2013 at 5:55 PM, Claudio Reggiani wrote:
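For the record, a minimal sketch of the key/value layout seqdirectory
produces, using the Hadoop 1.x SequenceFile API (paths and contents are made
up). Note that documents are appended one at a time, so the whole corpus
never has to fit in memory, only one document:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

class SeqDirSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("docs.seq"), Text.class, Text.class);
    try {
      // Key: document name; value: document content.
      writer.append(new Text("/docs/a.txt"), new Text("contents of a"));
      writer.append(new Text("/docs/b.txt"), new Text("contents of b"));
    } finally {
      writer.close();
    }
  }
}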