Quick report that, sadly, TreeClusteringRecommender (TreeClusteringRecommender2,
actually!) is a no-go, too. It's been running for well over an hour on the
same amount of data, and this is where it's been spending its time:
"qtp0-0" prio=10 tid=0x6bada400 nid=0x7551 runnable [0x6bfec000..0x6bfed140]
   java.lang.Thread.State: RUNNABLE
        at org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender2.findClosestClusters(TreeClusteringRecommender2.java:428)
        at org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender2.mergeClosestClusters(TreeClusteringRecommender2.java:331)
        at org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender2.buildClusters(TreeClusteringRecommender2.java:313)
        at org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender2.checkClustersBuilt(TreeClusteringRecommender2.java:230)
        at org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender2.recommend(TreeClusteringRecommender2.java:159)
        at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender.recommend(CachingRecommender.java:110)
This is how I'm using it.
recommender = new TreeClusteringRecommender2(
    model,
    new NearestNeighborClusterSimilarity(new TanimotoCoefficientSimilarity(model)),
    0.5);
recommender = new CachingRecommender(recommender);
I'm not sure, at this point, whether a value closer to 0.0 or to 1.0 for that
last argument (0.5 above) yields faster computation (at the cost of suboptimal
clustering).
So, I'm guessing that TreeClusteringRecommender2 may also not be an option when
working with a non-trivial dataset:
$ # number of distinct users
$ cut -d, -f1 input.txt | sort | uniq | wc -l
899308
$ # number of distinct items
$ cut -d, -f2 input.txt | sort | uniq | wc -l
60302
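For a rough sense of scale: assuming the clustering starts with one cluster per
user and that each merge step compares every pair of remaining clusters (both
are assumptions, I have not checked them against the source), the first pass
alone would look at n(n-1)/2 pairs, which is about 4 x 10^11 for the user count
above. A quick sketch of that arithmetic, with a made-up class name:

```java
// Rough cost estimate only. Assumes one initial cluster per user and an
// all-pairs comparison per merge step; neither is confirmed from the source.
final class ClusterPairEstimate {

    /** Number of unordered pairs among n clusters: n * (n - 1) / 2. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long users = 899_308L; // distinct users counted above
        System.out.println("cluster pairs in the first pass: " + pairCount(users));
    }
}
```

If those assumptions hold, an hour of runtime is not surprising, and every
further merge step would repeat a comparison of similar size.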
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, October 31, 2008 1:12:04 PM
> Subject: Taste: real-time no go - distributed pre-computing?
>
> Hello,
>
> I'm using TanimotoCoefficientSimilarity. With or without Rescorer, virtually
> all time gets spent in TanimotoCoefficientSimilarity.itemCorrelation (see
> below).
> I have not profiled things yet, but looking at
> TanimotoCoefficientSimilarity.itemCorrelation I don't see much room for
> performance improvement.
>
> So how can this puppy scale? From what I can tell so far, the only way
> to scale is to really pre-compute recommendations for all users ahead
> of time and simply store them somewhere (e.g. DB, FS, memcached) for a quick
> user->recommendations lookup. It looks like real-time computation
> is out of the question. Since CF/Taste sort of requires access to all
> users' data in order to compute recommendations, I don't yet see how
> data could be broken into smaller chunks and processed in a distributed,
> MapReduce-style fashion... or does anyone see how this could be done? [1]
>
> I looked at Ian's emails again and see that he, too, says there is no
> real-time aspect in their system, plus it looks like they do aggregation and
> store aggregation summaries for quick lookup in a DB, but don't really use
> Taste for recommending items to individual users.
>
> [1]
> But this really brings me back to a thread from the end of August, whose key
> messages are:
>
> http://markmail.org/message/jo66sxyyn2pklsgv
> http://markmail.org/message/cfntfbhshn5qz36n
> http://markmail.org/message/27ijhgs4ghpr6cjv
> http://markmail.org/message/eu3npmt7ggzc2jaq
>
> It sounds like the next things to try are TreeClusteringRecommender and
> TreeClusteringRecommender2...
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> "qtp0-0" prio=10 tid=0x08af3000 nid=0x5b94 runnable [0x6c0c6000..0x6c0c6fc0]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity.itemCorrelation(TanimotoCoefficientSimilarity.java:161)
>         at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender.doEstimatePreference(GenericItemBasedRecommender.java:206)
>         at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender.access$400(GenericItemBasedRecommender.java:59)
>         at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender$Estimator.estimate(GenericItemBasedRecommender.java:265)
>         at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender$Estimator.estimate(GenericItemBasedRecommender.java:256)
>         at org.apache.mahout.cf.taste.impl.recommender.TopItems.getTopItems(TopItems.java:54)
>         at org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender.recommend(GenericItemBasedRecommender.java:101)
>         at org.apache.mahout.cf.taste.impl.recommender.AbstractRecommender.recommend(AbstractRecommender.java:52)
>         at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender$RecommendationRetriever.get(CachingRecommender.java:170)
>         at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender$RecommendationRetriever.get(CachingRecommender.java:158)
>         at org.apache.mahout.cf.taste.impl.common.Cache.getAndCacheValue(Cache.java:102)
>         at org.apache.mahout.cf.taste.impl.common.Cache.get(Cache.java:76)
>         at org.apache.mahout.cf.taste.impl.recommender.CachingRecommender.recommend(CachingRecommender.java:93)
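
P.S. The pre-compute-and-look-up idea from the quoted message could be sketched
roughly as below. This is only an illustration: TopNSource is a hypothetical
stand-in for whatever recommender does the slow work (it is not the Taste
interface), user and item IDs are plain strings, and an in-memory map stands in
for the DB/FS/memcached store.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the slow recommender; not the Taste interface.
interface TopNSource {
    List<String> recommend(String userId, int howMany);
}

final class PrecomputedRecommendations {
    // In-memory stand-in for an external store (DB, FS, memcached).
    private final Map<String, List<String>> byUser = new HashMap<>();

    /** Offline step: compute recommendations once per user and keep them. */
    static PrecomputedRecommendations build(Iterable<String> userIds,
                                            TopNSource source,
                                            int howMany) {
        PrecomputedRecommendations store = new PrecomputedRecommendations();
        for (String userId : userIds) {
            store.byUser.put(userId, new ArrayList<>(source.recommend(userId, howMany)));
        }
        return store;
    }

    /** Online step: a plain lookup, with no similarity computation. */
    List<String> lookup(String userId) {
        List<String> items = byUser.get(userId);
        return items == null ? Collections.<String>emptyList() : items;
    }
}
```

The offline loop is still as expensive as the recommender behind it. All this
buys is that the request path becomes a map lookup, and users unknown at build
time get an empty list.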