Hi again,
seeing the answers to this question and the other I had posted ("adjusted 
cosine similarity for item-based recommender?"), I think I should clarify a bit 
what I'm trying to achieve and why I (believe I should) do things the way I'm 

I'm doing a class called "Learning from User-Generated data". Our first 
assignment deals with analysing the results of various types of recommenders. 
I'll go as far as saying "old-school" recommenders, given the content of your 
We have been introduced to:
 * Memory based:
     - user-based
     - item-based (*with* adjusted cosine similarity!)
     - slope-one
     - graph-based transitivity
 * Model based:
     - preprocessed item/user based (? this is unclear to me but I didn't reach 
this part of the assignment so I'll search for information before I ask 
questions; I also found an article where they mentioned slope-one amongst the 
model based; I guess I'll need to do more research on this)
     - matrix factorization-based (I saw that SVD is available in Mahout; my 
project partner is looking into that right now)
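
To make clear what I mean by "adjusted cosine similarity" for the item-based 
case, here is a minimal plain-Java sketch as I understand the measure: each 
rating is centred on the mean of the user who gave it, and the cosine is taken 
over the users who rated both items. It does not use Mahout at all; the class 
name and the nested-map layout (userId -> itemId -> preference) are my own 
assumptions for illustration. In Mahout I would presumably have to wrap 
something like this behind the ItemSimilarity interface.

```java
import java.util.HashMap;
import java.util.Map;

class AdjustedCosine {

    /**
     * Adjusted cosine similarity between two items.
     * ratings: userId -> (itemId -> preference). Hypothetical layout, not a Mahout type.
     * Returns NaN when no user rated both items or when a norm is zero.
     */
    static double similarity(Map<Long, Map<Long, Double>> ratings, long itemA, long itemB) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map<Long, Double> userRatings : ratings.values()) {
            Double a = userRatings.get(itemA);
            Double b = userRatings.get(itemB);
            if (a == null || b == null) {
                continue; // only users who rated both items contribute
            }
            // Mean over all ratings of this user, not only the two items.
            double mean = 0.0;
            for (double r : userRatings.values()) {
                mean += r;
            }
            mean /= userRatings.size();
            double da = a - mean;
            double db = b - mean;
            dot += da * db;
            normA += da * da;
            normB += db * db;
        }
        if (normA == 0.0 || normB == 0.0) {
            return Double.NaN;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> ratings = new HashMap<>();
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(1L, 5.0); u1.put(2L, 3.0); u1.put(3L, 1.0);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(1L, 4.0); u2.put(3L, 2.0);
        ratings.put(100L, u1);
        ratings.put(200L, u2);
        System.out.println(similarity(ratings, 1L, 3L));
    }
}
```

Recomputing each user's mean inside the loop keeps the sketch short; for 
800.000 triples it would probably be better to compute the user means once 
and cache them.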

We have a *static* training dataset (800.000 <user,movie,preference> triples) 
and another static dataset for which we have to extract the predicted 
preferences (200.000 <user,movie> tuples) and write them back to a file (i.e. 
recompose the <user,movie,preference> triples). Note that this will never go in 
a production environment, as it is merely a university requirement. For the 
same reason, I would prefer not to mix up things too much and I'd rather do a 
step-by-step learning (i.e. focus on Mahout for now, before I dig deeper and 
check the search-based approach, which uses DB-mahout-solr-spark... maybe a bit 
too much to handle at once with the deadline we were given).

So if I might get back to my original questions (again, I'm sorry for being 
stubborn but I'm under specific constraints - I'll really try to understand the 
search-based approach when I have more time) ;)
1. I'm guessing that to implement an adjusted cosine similarity I should extend 
AbstractSimilarity (or maybe even AbstractRecommender?). Is this right?
2. I still can't believe that it takes more than a few minutes at most to go 
through my 200.000 lines and find the already calculated preference. What am I 
doing wrong? :/ Should I store my whole datamodel in a file (how?) and then 
read through the file? I don't see how this could be faster than just reading 
the exact value I'm searching for...
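
For reference on question 2, this is roughly the loop I have in mind, reduced 
to a self-contained Java sketch: read each userId,movieId line, ask for the 
estimate, and append the triple through a buffered writer. The Estimator 
interface and the class name are hypothetical stand-ins so the example runs 
without Mahout; in my real code I assume the estimator would simply be 
recommender::estimatePreference.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class PreferenceWriter {

    /** Hypothetical stand-in for a recommender's estimatePreference(userId, movieId). */
    interface Estimator {
        float estimate(long userId, long movieId) throws Exception;
    }

    /** Reads "userId,movieId" lines and writes "userId,movieId,preference" lines. */
    static int writePredictions(Reader in, Writer out, Estimator estimator) throws Exception {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split(",");
                long userId = Long.parseLong(parts[0].trim());
                long movieId = Long.parseLong(parts[1].trim());
                float preference = estimator.estimate(userId, movieId);
                writer.write(userId + "," + movieId + "," + preference);
                writer.write('\n');
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        // Constant estimator used only for the demo.
        int n = writePredictions(new StringReader("1,10\n2,20\n"), out, (u, m) -> 3.5f);
        System.out.println(n + " lines written");
        System.out.print(out);
    }
}
```

With buffered streams the reading and writing of 200.000 lines should be 
cheap on their own, so if a loop like this is still slow, the time is 
presumably being spent inside the estimate call rather than in the file I/O.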

Thanks again for your answers! Regards,

Pier Lorenzo

 Are you sure that the problem is writing the results?  It seems to me that
 the real problem is the use of a user-based
 For such a small data set, for instance, a search-based recommender will be
 able to make recommendations in less than a millisecond with multiple
 recommendations possible in parallel.  This should allow you to do 200,000
 recommendations in a few minutes on a single
 With such a small dataset, indicator-based methods may not be the best
 option.  To improve that, try using something larger such as the million
 song dataset.
 See http://labrosa.ee.columbia.edu/millionsong/
 Also, using and estimating ratings is not a particularly good thing to be
 doing if you want to build a real
 On Fri, Apr 3, 2015 at 3:26 AM, PierLorenzo Bianchini
 > Hello
 > I'm new to mahout, to recommender systems and to the mailing list.
 > I'm trying to find a (fast) way to write back preferences to a file.
 > tried a few methods but I'm sure there must be a better approach.
 > Here's the deal (you can find the same post in
 > I have a training dataset of 800.000 records from 6000 users rating 3900
 > movies. These are stored in a comma separated file like:
 > userId,movieId,preference. I have another dataset (200.000 records) in the
 > format: userId,movieId. My goal is to use the first dataset as a
 > training-set, in order to determine the missing preferences of the second
 > So far, I managed to load the training dataset and I generated
 > recommendations. This is pretty smooth and doesn't take too much time. But
 > I'm struggling when it comes to writing back the recommendations.
 > The first method I tried is:
 >  * read a line from the file and get the userId,movieId tuple.
 >  * retrieve the calculated preference with estimatePreference(userId,
 >  * append the preference to the line and save it in a new file
 > This works, but it's incredibly slow (I added a counter to print every
 > 10.000th iteration: after a couple of minutes it had only printed once. I
 > have 8GB-RAM with an i7-core... how long can it take to process 200.000
 > My second choice was:
 >  * create a new FileDataModel with the second dataset
 >  * do something like this: newDataModel.setPreference(userId, movieId,
 > recommender.estimatePreference(userId,
 > Here I get several problems:
 >  * at runtime: java.lang.UnsupportedOperationException (as I found out
 > [2], FileDataModel actually can't be updated. I don't understand why the
 > function setPreference exists in the first
 >  * The API of FileDataModel#setPreference states "This method should
 > be considered relatively
 > I read around that a solution would be to use delta files, but I
 > find out what that actually means. Any suggestion on how I could speed up
 > my writing-the-preferences process?
 > Thank you!
 > Pier Lorenzo
 > [1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
 > [2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330

