Hi again,
seeing the answers to this question and to the other one I posted ("adjusted 
cosine similarity for item-based recommender?"), I think I should clarify 
what I'm trying to achieve and why I (believe I should) do things the way I 
do.

I'm taking a class called "Learning from User-Generated Data". Our first 
assignment deals with analysing the results of various types of recommenders 
(I'll go as far as saying "old-school" recommenders, given the content of 
your answers).
We have been introduced to:
 * Memory based:
     - user-based
     - item-based (*with* adjusted cosine similarity!)
     - slope-one
     - graph-based transitivity
 * Model based:
     - preprocessed item/user-based (this is unclear to me, but I haven't 
reached that part of the assignment yet, so I'll look for more information 
before asking questions; I also found an article that lists slope-one among 
the model-based methods, so I guess I need to do more research here)
     - matrix factorization-based (I saw that SVD is available in Mahout; my 
project partner is looking into it right now; see the sketch right after 
this list)
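
For reference, this is the minimal SVD setup my partner is starting from. It 
is an untested sketch against the Mahout 0.9 Taste API; the file name, the 
10 latent features, the lambda of 0.05 and the 20 iterations are 
placeholders I made up, not tuned values:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class SvdSketch {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("training.csv"));
        // Factorize the rating matrix into 10 latent features with ALS-WR;
        // 0.05 is the regularization lambda, 20 the number of iterations.
        Recommender svd =
            new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 20));
        System.out.println(svd.estimatePreference(42L, 17L)); // made-up IDs
      }
    }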

We have a *static* training dataset (800.000 <user,movie,preference> triples) 
and another static dataset for which we have to extract the predicted 
preferences (200.000 <user,movie> tuples) and write them back to a file (i.e. 
recompose the <user,movie,preference> triples). Note that this will never go 
into a production environment; it is merely a university requirement. For the 
same reason, I would prefer not to mix things up too much, and I'd rather 
learn step by step (i.e. focus on Mahout for now, before I dig deeper and 
look at the search-based approach, which uses DB-mahout-solr-spark... maybe a 
bit too much to handle at once with the deadline we were given).

So, if I may get back to my original questions (again, I'm sorry for being 
stubborn, but I'm under specific constraints; I'll really try to understand 
the search-based approach when I have more time) ;)
1. I'm guessing that to implement an adjusted cosine similarity I should 
extend AbstractSimilarity (or maybe even AbstractRecommender?). Is this 
right? (See the first sketch below for what I have in mind.)
2. I still can't believe that it takes more than a few minutes at most to go 
through my 200.000 lines and look up the already calculated preferences. What 
am I doing wrong? :/ Should I store my whole data model in a file (how?) and 
then read through the file? I don't see how that could be faster than just 
reading the exact value I'm searching for... (My current loop looks roughly 
like the second sketch below.)
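
Here is the first sketch: what I have in mind for the adjusted cosine 
similarity, written against the Mahout 0.9 Taste API. I implement the 
ItemSimilarity interface directly rather than extending AbstractSimilarity, 
because adjusted cosine centers every rating on the *user's* mean rating, 
which AbstractSimilarity doesn't do. This is untested; class and variable 
names are mine:

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.mahout.cf.taste.common.Refreshable;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.Preference;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    /** Cosine similarity over ratings centered on each user's mean rating. */
    public class AdjustedCosineSimilarity implements ItemSimilarity {

      private final DataModel model;
      private final Map<Long, Double> userMeans = new HashMap<>();

      public AdjustedCosineSimilarity(DataModel model) throws TasteException {
        this.model = model;
        // Precompute every user's mean rating once, up front.
        LongPrimitiveIterator users = model.getUserIDs();
        while (users.hasNext()) {
          long userID = users.nextLong();
          PreferenceArray prefs = model.getPreferencesFromUser(userID);
          double sum = 0.0;
          for (Preference p : prefs) {
            sum += p.getValue();
          }
          userMeans.put(userID, sum / prefs.length());
        }
      }

      @Override
      public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        // Index item1's ratings by user so co-rating users are found in O(1).
        Map<Long, Float> ratings1 = new HashMap<>();
        for (Preference p : model.getPreferencesForItem(itemID1)) {
          ratings1.put(p.getUserID(), p.getValue());
        }
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (Preference p : model.getPreferencesForItem(itemID2)) {
          Float r1 = ratings1.get(p.getUserID());
          if (r1 == null) {
            continue; // only users who rated both items contribute
          }
          double mean = userMeans.get(p.getUserID());
          double d1 = r1 - mean;
          double d2 = p.getValue() - mean;
          dot += d1 * d2;
          norm1 += d1 * d1;
          norm2 += d2 * d2;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
          return Double.NaN; // no co-raters, or zero variance: undefined
        }
        return dot / Math.sqrt(norm1 * norm2);
      }

      @Override
      public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
          result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
      }

      @Override
      public long[] allSimilarItemIDs(long itemID) {
        // Not needed by GenericItemBasedRecommender's default candidate strategy.
        throw new UnsupportedOperationException();
      }

      @Override
      public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // Our dataset is static, so there is nothing to refresh.
      }
    }

If this is the right direction, it should plug into 
GenericItemBasedRecommender like any of the built-in similarities.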
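
And here is the second sketch: the loop I use to estimate and write back the 
200.000 preferences. I wrap the similarity in Mahout's CachingUserSimilarity 
so pairwise similarities are not recomputed on every call, and I write 
through a BufferedWriter so the file I/O itself shouldn't be the bottleneck. 
The file names and the neighborhood size of 25 are placeholders:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class BatchEstimate {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("training.csv"));
        // Cache pairwise similarities: each pair is computed at most once.
        UserSimilarity similarity =
            new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        try (BufferedReader in = new BufferedReader(new FileReader("queries.csv"));
             BufferedWriter out = new BufferedWriter(new FileWriter("predictions.csv"))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split(",");
            long userID = Long.parseLong(parts[0]);
            long movieID = Long.parseLong(parts[1]);
            // May return NaN when no estimate is possible for the pair.
            float estimate = recommender.estimatePreference(userID, movieID);
            out.write(line + ',' + estimate);
            out.newLine();
          }
        }
      }
    }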

Thanks again for your answers! Regards,

Pier Lorenzo


--------------------------------------------
On Fri, 4/3/15, Ted Dunning <ted.dunn...@gmail.com> wrote:

 Subject: Re: fast performance way of writing preferences to file?
 To: "user@mahout.apache.org" <user@mahout.apache.org>
 Date: Friday, April 3, 2015, 5:52 PM
 
 Are you sure that the problem is writing the results?  It seems to me that
 the real problem is the use of a user-based recommender.
 
 For such a small data set, for instance, a search-based recommender will be
 able to make recommendations in less than a millisecond, with multiple
 recommendations possible in parallel.  This should allow you to do 200,000
 recommendations in a few minutes on a single machine.
 
 With such a small dataset, indicator-based methods may not be the best
 option.  To improve that, try using something larger, such as the million
 song dataset.  See http://labrosa.ee.columbia.edu/millionsong/
 
 Also, using and estimating ratings is not a particularly good thing to be
 doing if you want to build a real recommender.
 
 On Fri, Apr 3, 2015 at 3:26 AM, PierLorenzo Bianchini
 <piell...@yahoo.com.invalid> wrote:
 
 > Hello everyone,
 > I'm new to mahout, to recommender systems and to the mailing list.
 >
 > I'm trying to find a (fast) way to write back preferences to a file. I
 > tried a few methods but I'm sure there must be a better approach. Here's
 > the deal (you can find the same post on stackoverflow[1]).
 > I have a training dataset of 800.000 records from 6000 users rating 3900
 > movies. These are stored in a comma separated file like:
 > userId,movieId,preference. I have another dataset (200.000 records) in
 > the format: userId,movieId. My goal is to use the first dataset as a
 > training set, in order to determine the missing preferences of the
 > second set.
 >
 > So far, I managed to load the training dataset and I generated
 > user-based recommendations. This is pretty smooth and doesn't take too
 > much time. But I'm struggling when it comes to writing back the
 > recommendations.
 >
 > The first method I tried is:
 >
 >  * read a line from the file and get the userId,movieId tuple
 >  * retrieve the calculated preference with estimatePreference(userId,
 >    movieId)
 >  * append the preference to the line and save it in a new file
 > This works, but it's incredibly slow (I added a counter to print every
 > 10.000th iteration: after a couple of minutes it had only printed once.
 > I have 8GB RAM and an i7 core... how long can it take to process
 > 200.000 lines?!)
 >
 > My second choice was:
 >
 >  * create a new FileDataModel with the second dataset
 >  * do something like this: newDataModel.setPreference(userId, movieId,
 >    recommender.estimatePreference(userId, movieId));
 >
 > Here I get several problems:
 >  * at runtime: java.lang.UnsupportedOperationException (as I found out
 >    in [2], FileDataModel actually can't be updated. I don't understand
 >    why the function setPreference exists in the first place...)
 >  * the API of FileDataModel#setPreference states "This method should
 >    also be considered relatively slow."
 >
 > I read around that a solution would be to use delta files, but I
 > couldn't find out what that actually means. Any suggestion on how I
 > could speed up my writing-the-preferences process?
 > Thank you!
 >
 > Pier Lorenzo
 >
 >
 > [1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
 > [2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330
 >
