Hi again, after seeing the answers to this question and to the other one I posted ("adjusted cosine similarity for item-based recommender?"), I think I should clarify a bit what I'm trying to achieve and why I (believe I should) do things the way I'm doing them.
I'm taking a class called "Learning from User-Generated Data". Our first assignment deals with analysing the results of various types of recommenders. I'll go as far as saying "old-school" recommenders, given the content of your answers. We have been introduced to:

* Memory based:
  - user-based
  - item-based (*with* adjusted cosine similarity!)
  - slope-one
  - graph-based transitivity
* Model based:
  - preprocessed item/user based (this is unclear to me, but I haven't reached that part of the assignment yet, so I'll search for information before asking questions; I also found an article that listed slope-one among the model-based methods, so I guess I'll need to do more research on this)
  - matrix factorization-based (I saw that SVD is available in Mahout; my project partner is looking into that right now; I've put a small sketch of that route at the bottom of this mail as well)

We have a *static* training dataset (800,000 <user,movie,preference> triples) and another static dataset for which we have to extract the predicted preferences (200,000 <user,movie> tuples) and write them back to a file (i.e. recompose the <user,movie,preference> triples).

Note that this will never go into a production environment; it is merely a university requirement. For the same reason, I would prefer not to mix things up too much, and I'd rather learn step by step (i.e. focus on Mahout for now, before I dig deeper and check the search-based approach, which uses DB + Mahout + Solr + Spark... maybe a bit too much to handle at once with the deadline we were given).

So, if I may get back to my original questions (again, I'm sorry for being stubborn, but I'm under specific constraints; I'll really try to understand the search-based approach when I have more time) ;)

1. I'm guessing that to implement an adjusted cosine similarity I should extend AbstractSimilarity (or maybe even AbstractRecommender?). Is this right? (I've put a rough sketch of what I have in mind at the bottom of this mail, below the quoted thread.)

2. I still can't believe that it takes more than a few minutes at most to go through my 200,000 lines and look up the already-calculated preferences. What am I doing wrong? :/ Should I store my whole data model in a file (how?) and then read through the file? I don't see how that could be faster than just reading the exact value I'm searching for... (The write-back loop I'm using is also sketched at the bottom.)

Thanks again for your answers!

Regards,
Pier Lorenzo

--------------------------------------------
On Fri, 4/3/15, Ted Dunning <ted.dunn...@gmail.com> wrote:

Subject: Re: fast performance way of writing preferences to file?
To: "user@mahout.apache.org" <user@mahout.apache.org>
Date: Friday, April 3, 2015, 5:52 PM

Are you sure that the problem is writing the results?

It seems to me that the real problem is the use of a user-based recommender.

For such a small data set, for instance, a search-based recommender will be able to make recommendations in less than a millisecond, with multiple recommendations possible in parallel. This should allow you to do 200,000 recommendations in a few minutes on a single machine.

With such a small dataset, indicator-based methods may not be the best option. To improve that, try using something larger such as the Million Song Dataset. See http://labrosa.ee.columbia.edu/millionsong/

Also, using and estimating ratings is not a particularly good thing to be doing if you want to build a real recommender.

On Fri, Apr 3, 2015 at 3:26 AM, PierLorenzo Bianchini <piell...@yahoo.com.invalid> wrote:

> Hello everyone,
> I'm new to mahout, to recommender systems and to the mailing list.
>
> I'm trying to find a (fast) way to write back preferences to a file. I
> tried a few methods but I'm sure there must be a better approach.
> Here's the deal (you can find the same post on stackoverflow [1]). I have
> a training dataset of 800,000 records from 6,000 users rating 3,900
> movies. These are stored in a comma-separated file like:
> userId,movieId,preference. I have another dataset (200,000 records) in the
> format: userId,movieId. My goal is to use the first dataset as a training
> set, in order to determine the missing preferences of the second set.
>
> So far, I have managed to load the training dataset and I generated
> user-based recommendations. This is pretty smooth and doesn't take too
> much time. But I'm struggling when it comes to writing back the
> recommendations.
>
> The first method I tried is:
>
> * read a line from the file and get the userId,movieId tuple
> * retrieve the calculated preference with estimatePreference(userId,
>   movieId)
> * append the preference to the line and save it to a new file
>
> This works, but it's incredibly slow (I added a counter to print every
> 10,000th iteration: after a couple of minutes it had only printed once. I
> have 8 GB of RAM and an i7 CPU... how long can it take to process 200,000
> lines?!)
>
> My second choice was:
>
> * create a new FileDataModel with the second dataset
> * do something like this: newDataModel.setPreference(userId, movieId,
>   recommender.estimatePreference(userId, movieId));
>
> Here I get several problems:
> * at runtime: java.lang.UnsupportedOperationException (as I found out in
>   [2], FileDataModel actually can't be updated; I don't understand why
>   the method setPreference exists in the first place...)
> * the API doc of FileDataModel#setPreference states "This method should
>   also be considered relatively slow."
>
> I read around that a solution would be to use delta files, but I couldn't
> find out what that actually means. Any suggestion on how I could speed up
> my writing-the-preferences process?
> Thank you!
>
> Pier Lorenzo
>
> [1] http://stackoverflow.com/questions/29423824/mahout-fast-performance-how-to-write-preferences-to-file
> [2] http://comments.gmane.org/gmane.comp.apache.mahout.user/11330
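
--------------------------------------------
P.S. Here are the sketches I referred to above.

1) Adjusted cosine similarity (question 1). Instead of extending AbstractSimilarity (which, if I read the Taste source correctly, is package-private, so it can only be subclassed from inside org.apache.mahout.cf.taste.impl.similarity), I tried implementing the public ItemSimilarity interface directly. This is untested and the class name is my own invention; only the Taste types are Mahout's. By adjusted cosine I mean: sim(i,j) = sum_u (R_ui - mean_u)(R_uj - mean_u) / (sqrt(sum_u (R_ui - mean_u)^2) * sqrt(sum_u (R_uj - mean_u)^2)), where the sums run over the users u who rated both items and mean_u is u's average rating.

    import java.util.Collection;

    import org.apache.mahout.cf.taste.common.Refreshable;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.Preference;
    import org.apache.mahout.cf.taste.model.PreferenceArray;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class AdjustedCosineSimilarity implements ItemSimilarity {

      private final DataModel model;
      private final FastByIDMap<Double> userMeans; // each user's mean rating

      public AdjustedCosineSimilarity(DataModel model) throws TasteException {
        this.model = model;
        this.userMeans = new FastByIDMap<>(model.getNumUsers());
        // precompute every user's mean rating once, since the dataset is static
        LongPrimitiveIterator users = model.getUserIDs();
        while (users.hasNext()) {
          long userID = users.nextLong();
          PreferenceArray prefs = model.getPreferencesFromUser(userID);
          double sum = 0.0;
          for (Preference p : prefs) {
            sum += p.getValue();
          }
          userMeans.put(userID, sum / prefs.length());
        }
      }

      @Override
      public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        int n = 0;
        // walk the users who rated item1 and keep those who also rated item2
        for (Preference p : model.getPreferencesForItem(itemID1)) {
          long userID = p.getUserID();
          Float other = model.getPreferenceValue(userID, itemID2);
          if (other == null) {
            continue; // this user didn't rate both items
          }
          double mean = userMeans.get(userID);
          double x = p.getValue() - mean;
          double y = other - mean;
          sumXY += x * y;
          sumX2 += x * x;
          sumY2 += y * y;
          n++;
        }
        if (n == 0 || sumX2 == 0.0 || sumY2 == 0.0) {
          return Double.NaN; // no co-rating users, or zero variance
        }
        return sumXY / Math.sqrt(sumX2 * sumY2);
      }

      @Override
      public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
          result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
      }

      @Override
      public long[] allSimilarItemIDs(long itemID) {
        // not needed for estimatePreference(), so I left it out for now
        throw new UnsupportedOperationException();
      }

      @Override
      public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // the assignment's dataset is static, so nothing to refresh here
      }
    }

If that's roughly right, I would then plug it in like this (with the obvious imports; the path is a placeholder):

    DataModel model = new FileDataModel(new File("train.csv"));
    ItemSimilarity similarity = new AdjustedCosineSimilarity(model);
    Recommender recommender = new GenericItemBasedRecommender(model, similarity);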
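2) The write-back loop (question 2), again a sketch with placeholder file names. With buffered streams the I/O itself should be negligible for 200,000 lines, so I now suspect the time is going into estimatePreference() recomputing similarities and neighborhoods on every call; if so, wrapping my UserSimilarity in CachingUserSimilarity (and the UserNeighborhood in CachingUserNeighborhood) would be the first thing for me to try.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public final class WriteBackEstimates {

      // reads "userId,movieId" lines and writes "userId,movieId,estimate"
      public static void writeEstimates(Recommender recommender, String inPath, String outPath)
          throws IOException, TasteException {
        try (BufferedReader in = new BufferedReader(new FileReader(inPath));
             BufferedWriter out = new BufferedWriter(new FileWriter(outPath))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.split(",");
            long userID = Long.parseLong(parts[0].trim());
            long movieID = Long.parseLong(parts[1].trim());
            // may return Float.NaN when no estimate is possible
            float estimate = recommender.estimatePreference(userID, movieID);
            out.write(line + ',' + estimate);
            out.newLine();
          }
        }
      }
    }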
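3) And, for completeness, the matrix factorization route I mentioned in the list above, which my partner is looking at. Untested; the hyperparameters (10 features, lambda 0.05, 20 iterations) are placeholder values I made up, not recommendations.

    import java.io.File;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public final class SvdExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("train.csv")); // placeholder path
        // rank-10 ALS-WR factorization, regularization 0.05, 20 iterations
        Recommender recommender =
            new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 20));
        System.out.println(recommender.estimatePreference(123L, 456L)); // sample IDs
      }
    }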