I do see a lot of room for improvement -- but it would require
specializing several parts of the code for the "binary" preference
case you are currently looking at. I want to have a crack at that, let
you see how it flies, and then we can figure out how to incorporate it
into the more general code. Right now the code takes no advantage of
the fact that your preferences are binary rather than values on some
scale.
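
To make the "binary" case concrete, here is a minimal sketch of a Tanimoto coefficient computed directly on sets of user IDs, with no preference values involved. The class and method names are hypothetical and this is not the Taste implementation; it only illustrates the kind of specialization that becomes possible when a preference is simply present or absent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of Mahout/Taste.
class BinaryTanimotoSketch {

    /**
     * Tanimoto coefficient between two items, each represented only by the
     * set of users who expressed a (binary) preference for it:
     * |A intersect B| / |A union B|.
     */
    public static double tanimoto(Set<Long> usersOfItemA, Set<Long> usersOfItemB) {
        if (usersOfItemA.isEmpty() && usersOfItemB.isEmpty()) {
            return 0.0; // assumption: no data is treated as no similarity
        }
        // Iterate over the smaller set and probe the larger one.
        Set<Long> smaller = usersOfItemA.size() <= usersOfItemB.size() ? usersOfItemA : usersOfItemB;
        Set<Long> larger = smaller == usersOfItemA ? usersOfItemB : usersOfItemA;
        int intersection = 0;
        for (Long userId : smaller) {
            if (larger.contains(userId)) {
                intersection++;
            }
        }
        // The union size follows from the two set sizes and the intersection.
        int union = usersOfItemA.size() + usersOfItemB.size() - intersection;
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        Set<Long> a = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> b = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println(tanimoto(a, b)); // 2 shared users out of 4 distinct: 0.5
    }
}
```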

I do have a few more tricks up my sleeve - or levers I can add to
trade speed / memory for accuracy. We can talk more about those.

In general I agree: of all the algorithms I know of, I don't
see any scaling, on one machine, for real-time
recommendations, beyond a medium-size data set. I do think
pre-computing is the way to go in general. The code is written for a
real-time model, essentially, but of course you can use that to do
batch computations. I don't think much is traded off to retain this
model as opposed to completely assuming off-line, batch processing.
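
As a rough illustration of using a real-time style recommender for batch work, the sketch below calls a per-user function for every user and stores the results for later lookup. The `Function` passed in is a hypothetical stand-in for whatever recommender is in use, and the in-memory map stands in for a DB or cache; neither is a Taste API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not part of Mahout/Taste.
class PrecomputeSketch {

    /**
     * Runs a per-user ("real-time" style) recommender over all users in batch
     * and returns a userId -> recommended item IDs lookup table.
     */
    public static Map<Long, List<Long>> precompute(Collection<Long> userIds,
                                                   Function<Long, List<Long>> recommendFor) {
        Map<Long, List<Long>> store = new HashMap<>();
        for (Long userId : userIds) {
            store.put(userId, recommendFor.apply(userId));
        }
        return store;
    }

    public static void main(String[] args) {
        // Toy recommender (assumption): recommends the two item IDs after the user ID.
        Function<Long, List<Long>> toy = u -> Arrays.asList(u + 1, u + 2);
        Map<Long, List<Long>> store = precompute(Arrays.asList(1L, 5L, 9L), toy);
        System.out.println(store.get(5L)); // [6, 7]
    }
}
```

At serving time a request then becomes a single map (or DB, or cache) lookup instead of a similarity computation.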

And again you are right that none of the algorithms I know
parallelize neatly. The best you can do is send n machines to process
1/n of the users' recommendations, in batch. But you still have to
deal with fitting the entire model on each of these n machines. (Some
pieces of algorithms parallelize, like the precomputation in
slope-one, but not in general.)
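
The "n machines, 1/n of the users each" split can be sketched as below. This is only an assumed partitioning scheme (hash of the user ID modulo the number of machines), with hypothetical names; each machine would still need the full model loaded, as noted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Mahout/Taste.
class UserShardingSketch {

    /** Maps a user ID to one of numShards machines, in the range [0, numShards). */
    public static int shardFor(long userId, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(Long.hashCode(userId), numShards);
    }

    /** Returns the subset of users that the given machine should process in batch. */
    public static List<Long> usersForShard(List<Long> allUserIds, int shardIndex, int numShards) {
        List<Long> mine = new ArrayList<>();
        for (Long userId : allUserIds) {
            if (shardFor(userId, numShards) == shardIndex) {
                mine.add(userId);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<Long> users = new ArrayList<>();
        for (long id = 0; id < 10; id++) {
            users.add(id);
        }
        for (int shard = 0; shard < 3; shard++) {
            System.out.println("machine " + shard + ": " + usersForShard(users, shard, 3));
        }
    }
}
```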

You can try the tree-based algorithms. I warn you that the original
one is dirt slow. The second one is an attempt to cut corners to
dramatically increase speed. I admit I never really evaluated these
approaches much but they seem sensible. They would not take advantage
of the fact that you have "binary" prefs now though.



On Fri, Oct 31, 2008 at 5:12 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm using TanimotoCoefficientSimilarity.  With or without Rescorer, virtually 
> all time gets spent in TanimotoCoefficientSimilarity.itemCorrelation (see 
> below).
> I have not profiled things yet, but looking at
> TanimotoCoefficientSimilarity.itemCorrelation I don't see much room for
> performance improvement.
>
> So how can this puppy scale?  From what I can tell so far, the only way
> to scale is to really pre-compute recommendations for all users ahead
> of time and simply store them somewhere (e.g. DB, FS, memcached) for a quick
> user->recommendations lookup.  It looks like real-time computation
> is out of the question.  Since CF/Taste sort of requires access to all
> users' data in order to compute recommendations, I don't yet see how
> data could be broken into smaller chunks and processed
> in distributed MapReduce-style... or does anyone see how this could be done? 
> [1]
>
> I looked at Ian's emails again and see that he, too, says there is no 
> real-time aspect in their system, plus it looks like they do aggregation and 
> store aggregation summaries for quick lookup in a DB, but don't really use 
> Taste for recommending items to individual users.
>
> [1]
> But this really brings me back to a thread from the end of August, whose
> key messages are:
>
> http://markmail.org/message/jo66sxyyn2pklsgv
> http://markmail.org/message/cfntfbhshn5qz36n
> http://markmail.org/message/27ijhgs4ghpr6cjv
> http://markmail.org/message/eu3npmt7ggzc2jaq
>
> It sounds like the next things to try are TreeClusteringRecommender and
> TreeClusteringRecommender2...
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> "qtp0-0" prio=10 tid=0x08af3000 nid=0x5b94 runnable [0x6c0c6000..0x6c0c6fc0]
>   java.lang.Thread.State: RUNNABLE
>    at org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity.itemCorrelation(TanimotoCoefficientSimilarity.java:161)
>    at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender.doEstimatePreference(GenericItemBasedRecommender.java:206)
>    at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender.access$400(GenericItemBasedRecommender.java:59)
>    at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender$Estimator.estimate(GenericItemBasedRecommender.java:265)
>    at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender$Estimator.estimate(GenericItemBasedRecommender.java:256)
>    at org.apache.mahout.cf.taste.impl.recommender.TopItems.getTopItems(TopItems.java:54)
>    at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender.recommend(GenericItemBasedRecommender.java:101)
>    at org.apache.mahout.cf.taste.impl.recommender.AbstractRecommender.recommend(AbstractRecommender.java:52)
>    at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender$RecommendationRetriever.get(CachingRecommender.java:170)
>    at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender$RecommendationRetriever.get(CachingRecommender.java:158)
>    at org.apache.mahout.cf.taste.impl.common.Cache.getAndCacheValue(Cache.java:102)
>    at org.apache.mahout.cf.taste.impl.common.Cache.get(Cache.java:76)
>    at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender.recommend(CachingRecommender.java:93)
>
