Hello,

Some feedback from my Taste experience. Tanimoto was the bottleneck for me, too. I used the highly sophisticated kill -QUIT pid method to determine that (SIGQUIT makes the JVM dump all thread stacks, so it works as a poor man's profiler). Such kills always caught Taste in the Tanimoto part of the code.
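For context, my benchmark setup was roughly the sketch below. I'm writing it from memory against the current trunk API, so treat the exact class names, constructors, and the long-based user IDs as assumptions:

  import java.io.File;
  import java.util.List;
  import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
  import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
  import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
  import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
  import org.apache.mahout.cf.taste.model.DataModel;
  import org.apache.mahout.cf.taste.recommender.RecommendedItem;
  import org.apache.mahout.cf.taste.similarity.UserSimilarity;

  public class TanimotoBench {
    public static void main(String[] args) throws Exception {
      // userID,itemID[,preference] lines, the format FileDataModel expects
      DataModel model = new FileDataModel(new File("ratings.csv"));
      UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
      GenericUserBasedRecommender recommender = new GenericUserBasedRecommender(
          model,
          new NearestNUserNeighborhood(50, similarity, model),
          similarity);
      // Each recommend() call compares the target user against (a sample of)
      // all other users, which is where the Tanimoto time goes
      List<RecommendedItem> recs = recommender.recommend(12345L, 10);
      System.out.println(recs);
    }
  }

With a setup like this, the neighborhood computation is where the thread dumps kept landing for me, never model loading.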
Do you know, roughly, what that nontrivial amount might be? e.g. 10% or more? Also, does
the "nearly instantaneous" refer to calling Taste with a single recommend request at a
time? I'm asking because I recently did some heavy-duty benchmarking, and things were
definitely not instantaneous when I increased the number of concurrent requests. To make
things fast (e.g. under 100 ms avg.) and run in a reasonable amount of memory, I had to
resort to remove-noise-users-and-items-from-input-and-then-read-the-data-model... which
means users who look like noise to the system (and that's a lot of them, in order to keep
things fast and limit memory usage) will not get recommendations.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Sean Owen <[email protected]>
> To: [email protected]
> Sent: Thursday, April 30, 2009 7:18:28 PM
> Subject: Re: Recommendations from flat data
>
> After digging in this evening I have some answers, I think.
>
> First, can you use the very latest code from Subversion? The DataModel you use has
> actually been removed and rolled into FileDataModel.
>
> This is also because I checked in a change tonight that should cut down peak memory
> usage while constructing a FileDataModel by a nontrivial amount.
>
> I was able to run recommendations over 10M data points in 768M of memory tonight.
>
> It does take some time to parse and build the model. After that, the recommendation is
> nearly instantaneous with any similarity metric. Are you sure Tanimoto was taking a
> longer time -- meaning, did you test over a lot of recommendations?
>
> Either way, there are certainly some params you can tweak to trade a bit of accuracy
> (maybe) for speed. Look at the sampling-rate param on the user neighborhood
> implementation. Set it to something like 10% and it should get much faster -- though of
> course this doesn't change startup overhead.
>
> On Apr 30, 2009 7:52 PM, "Sean Owen" wrote:
>
> Hm, something is off indeed. Tanimoto should be notably faster than a cosine-measure
> correlation -- it's doing a simple, optimized set intersection and union rather than
> iterating over a bunch of preference values. While 5M data points is going to consume
> a reasonable amount of memory, I would not guess it would exhaust a 1GB heap -- it
> should be in the hundreds of megs.
>
> If you can run only the recommender in the JVM, obviously that frees up memory. I
> would probably remove the caching wrapper too if memory is at a premium, but that's
> not your problem. If you are running on a 64-bit machine in 64-bit mode, try 32-bit
> mode (-d32) to reduce the object overhead in the JVM.
>
> From there, you could load the data into a DB instead and use a JDBC-based DataModel,
> since that doesn't load it all into memory. You could also try adapting my
> NetflixDataModel, which reads from data organized in directories on disk.
>
> But no, something just doesn't seem right; your current setup should be OK. I think I
> need to try to replicate this with a similarly sized data set and see what's up.
>
> On Thu, Apr 30, 2009 at 5:48 PM, Paul Loy wrote:
> Hi Sean,
>
> that worked f...
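P.S. For the archives: the sampling-rate tweak Sean mentions above would look roughly like this, reusing the model and similarity from my sketch further up. I'm going from memory on the five-argument constructor, so treat the signature as an assumption:

  // Only examine ~10% of all users when forming the neighborhood, trading
  // some accuracy for speed; minSimilarity is left effectively unbounded.
  // Assumed signature: (n, minSimilarity, similarity, model, samplingRate)
  UserNeighborhood neighborhood = new NearestNUserNeighborhood(
      50, Double.NEGATIVE_INFINITY, similarity, model, 0.1);

As Sean says, this only speeds up the recommend() calls; it does nothing for the startup cost of parsing and building the model.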

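Likewise, the JDBC-based DataModel route would look something like the sketch below. The data-source class, table, and column names here are hypothetical, so check them against the actual MySQLJDBCDataModel constructor:

  import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
  import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
  import org.apache.mahout.cf.taste.model.DataModel;

  // Preferences stay in MySQL instead of the heap, at the cost of per-query
  // latency; the table and column names are made up for illustration.
  MysqlDataSource dataSource = new MysqlDataSource();
  dataSource.setServerName("localhost");
  dataSource.setDatabaseName("recs");
  dataSource.setUser("taste");
  dataSource.setPassword("secret");
  DataModel model = new MySQLJDBCDataModel(
      dataSource, "taste_preferences", "user_id", "item_id", "preference");

That trades heap for query latency, which may or may not help with the concurrent-request numbers I was seeing.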