You can use the Mahout text pipeline, which will give you TF-IDF weighted vectors for each article. There is an example of this in Mahout in Action, where it is used for clustering. Then run the RowSimilarityJob (RSJ) on the vectors instead of clustering them. This will give you a strength of similarity for each article pair. RSJ produces a DRM (distributed row matrix) keyed by article id, so each row lists how similar the other articles are to that row's article. The highest similarities indicate the most similar text content. I've done this before and it works pretty well. There might be something in the new knn (k-nearest-neighbors) framework that is more optimized.
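To make the idea concrete, here is a small in-memory sketch of the two steps (TF-IDF vectors, then pairwise row similarity). It is plain Python, not Mahout's API: the function names `tfidf_vectors`, `cosine`, and `row_similarities` are made up for illustration, tokenization is naive whitespace splitting, and cosine is assumed as the similarity measure.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map article id -> {term: TF-IDF weight}. `docs` maps article id -> text."""
    # Naive whitespace tokenization (an assumption; a real pipeline uses an analyzer).
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many articles each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def row_similarities(vectors):
    """Map article id -> list of (other id, similarity), most similar first."""
    result = {}
    for a, vec_a in vectors.items():
        sims = [(b, cosine(vec_a, vec_b)) for b, vec_b in vectors.items() if b != a]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        result[a] = sims
    return result
```

Calling `row_similarities(tfidf_vectors(docs))` on a dict of article texts gives, per article, the other articles ranked by similarity, which is roughly the shape of the similarity rows described above. This compares every pair, so it only suits small collections.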
Once you have the article similarities, you could combine the articles most similar to the ones the user has read and show some number of those the user hasn't seen yet. Content-based recommenders are good for avoiding the cold start problem because even if the user has no read history you can show articles similar to the one she is looking at. Content-based recs are also good when your inventory changes a lot (new articles appear all the time and go out of favor quickly). You may never generate enough read behavior to use collaborative filtering alone.

BTW you might also look at Solr, where you can use an article as a query against all indexed articles. This will also produce a ranked list of similar articles. Use the articles in the user's read history as queries and combine the lists somehow.

On Aug 29, 2013, at 7:53 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

Hi Michael,

Those are collaborative filtering examples, which would recommend a news article i to a user u based on:

- A weighted average of other users' ratings on i (where the weight is the similarity of the two users' rating histories)
- A weighted average of u's ratings on other items (where the weight is the similarity of the two items' rating histories, that is, which users rated the item and how they rated it)
- A combination of the user and item vectors from the user and item latent factor matrices, which are obtained by decomposing the original rating matrix.

If you expect the system to recommend to a user only news articles whose content is similar to older articles in which the user has shown a positive interest, that is content-based filtering. Also, the example you mentioned (recommending brand new articles) introduces a challenge called the cold-start problem, and content-based filtering can generalize to cold-start articles, too.
A search of the user list for content-based filtering/recommendation may help you (I say this because there were some great discussions on how to achieve this with Mahout, for example with custom similarity measures). If you can't find anything satisfying, we can discuss it further.

Best,
Gokhan

On Thu, Aug 29, 2013 at 4:21 PM, Michael Wechner <michael.wech...@wyona.com> wrote:

> Hi
>
> I am looking for a recommender example for news articles which makes
> suggestions based on a user profile (independent of other users/readers) or,
> more specifically, on the reading history of a user.
> Let's say a specific user likes to read articles about cycling and
> international politics, and the content management system is saving the URL
> history of all the articles which have been read by this specific user.
> When the editorial staff creates new articles/stories, the system
> should make recommendations to this user when she/he gets back online.
> Also, when a new story has been created, the recommender should
> check whether this new story would be a good fit/match for this particular
> user, and the system should send a notification.
>
> I guess developing such a recommender is possible with Mahout, but since I
> am new to Mahout, I would appreciate any pointers to examples which are
> similar to the functionality described above.
>
> I am currently looking at the examples shipped with Mahout
>
> https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
>
> but if I understand correctly, these are based on what other people liked
> and not only on what the person itself liked,
> or do I misunderstand?
>
> Thanks for your help
>
> Michael