Hi,

----- Original Message ----
> From: Sean Owen <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, October 23, 2008 1:29:15 PM
> Subject: Re: Trimming Taste input (memory consumption)
>
> On Thu, Oct 23, 2008 at 6:09 PM, Otis Gospodnetic wrote:
> > DataModel model = new FileDataModel(new File("/tmp/input.txt"));
> > recommender = new GenericItemBasedRecommender(model, new TanimotoCoefficientSimilarity(model));
> > recommender = new CachingRecommender(recommender);
>
> OK in this case by far the best speedup I can imagine is by
> specializing the framework for the "boolean" case you have. Right now,
> the framework assumes every pref has a value, but in your case it's
> just a yes/no. The generalized framework does almost nothing to take
> advantage of it. It offers a "BooleanPreference" object which would at
> least save you storing a redundant "1.0" in every pref object. But
> you'd have to make a slightly customized FileDataModel to use it.

OK, BooleanPreference sounds right here, except I'm still not certain I'll really have only boolean preferences (actually, I only have "true" preferences, I never have "boolean" -- I only have 1.0, never 0.0). For instance, since I'm working with news, I can consider just viewing a news item as 1.0, but I can also consider emailing it as 1.5 or 2.0. Or saving it might be translated to a 3.0 preference, for example.

> That's not hard. If you're willing to modify the code, there is far
> more you can do to take advantage of this situation, that is also not
> hard. My only quandary is how to include these in the general
> framework, without just copying and pasting everything. That's the
> only hold up.

Yeah, I already started looking at FileDataModel to see how I'd parse extra data associated with each (user,item,pref) triple and where I'd store it. It looks like the places to change are processLine(...) and buildPreference(). The processList() should really be protected, not private, huh?
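
To make the memory argument for the yes/no case concrete, here is a minimal sketch. It is not Taste code: the class name `BooleanPrefsSketch` and its methods are made up for illustration. It assumes IDs fit in a `long` and keeps each set of "seen" IDs as a sorted `long[]` rather than one preference object (with a redundant 1.0) per data point. The Tanimoto coefficient is then the intersection size divided by the union size of two such arrays.

```java
import java.util.Arrays;

// Hypothetical sketch, not part of Taste: boolean ("seen it") preferences are
// stored as a sorted long[] of IDs instead of one object plus a 1.0 per pref.
final class BooleanPrefsSketch {

    /** Sorts and de-duplicates the raw IDs belonging to one user (or one item). */
    static long[] compact(long[] rawIds) {
        return Arrays.stream(rawIds).sorted().distinct().toArray();
    }

    /**
     * Tanimoto coefficient: size of intersection divided by size of union.
     * Both arrays must be sorted and duplicate-free (see compact()).
     */
    static double tanimoto(long[] a, long[] b) {
        int i = 0;
        int j = 0;
        int common = 0;
        // Walk both sorted arrays in step, counting IDs present in both.
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                common++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        int union = a.length + b.length - common;
        // Two empty sets have no defined similarity.
        return union == 0 ? Double.NaN : (double) common / union;
    }

    public static void main(String[] args) {
        long[] first = compact(new long[] {3, 1, 2, 3});  // becomes {1, 2, 3}
        long[] second = compact(new long[] {4, 2, 3});    // becomes {2, 3, 4}
        System.out.println(tanimoto(first, second));      // prints 0.5
    }
}
```

This only works while every preference really is a plain "yes". Once views, emails and saves get different weights (1.0, 1.5, 3.0 and so on), a value has to be stored per data point again.
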
I see DetailedPreference now, that's kind of what I was thinking... except DetailedPreference is not used anywhere, so I can't see how that extra timestamp could be used, plus I wonder if it can be made generic. But even without a generic preference that can hold various metadata, it seems easy to write a custom Preference impl.

> If you want to go this route, customization, at least just to try it,
> I am most happy to help. You can save a ton of memory for your case.

Just dropping the 1.0 and using BooleanPreference? I'll try it, it's a one-line change.

> > Aha, Rescorer sounds useful. But say your rescoring logic requires additional
> > data, such as item publishing date or item price or whatever. Where does such
> > data enter the system?
> > Does one then have to have custom Item implementation? And then that implies
> > having a custom DataModel?
>
> Yes you must write an implementation of Rescorer, and you do whatever
> you like to access data and write whatever logic you like. No you
> don't need a custom DataModel. Rescorer gets an item and current score
> and you can just twiddle it however you like. You can return NaN to
> drop an item from consideration.

Right, but I think I need more than an item and a score. I need that other data (e.g. that timestamp from DetailedPreference or some other data) to rescore. That means I either have to have it already read and in memory (e.g. via input fed into DetailedPreference at load time), or, for each item that Rescorer considers, I have to go get the data from an external store (e.g. a DB) at run-time, which is probably not a very scalable approach.

So, if I now add extra data to my Taste input, I'll hit memory limits even sooner! :(
Even with a 1 GB heap I'm unable to read 1.2M data points, which for me represents less than one day's worth of data... and I really need to have at least a few days' worth of data in order to benefit from the "historic overlap" of users' item consumption and figure out "people like you".

I wonder how Ian is handling their data volume...

Otis
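
For the rescoring question above, one option is to keep per-item data such as the publishing date in a side table keyed by item ID, rather than on every preference. Such a table grows with the number of items, not with the number of data points. The sketch below is only an illustration under assumptions: `RecencyRescorerSketch` and its `rescore(long, double)` method are hypothetical and do not implement Taste's actual Rescorer interface. Item IDs are assumed to be `long`s and the linear decay is an arbitrary choice. It does follow the convention Sean describes of returning NaN to drop an item.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Taste's Rescorer interface: rescoring by item age
// using a per-item lookup table held in memory.
final class RecencyRescorerSketch {

    private final Map<Long, Long> publishedAtMillis; // item ID -> publish time
    private final long nowMillis;
    private final long maxAgeMillis;

    RecencyRescorerSketch(Map<Long, Long> publishedAtMillis, long nowMillis, long maxAgeMillis) {
        this.publishedAtMillis = publishedAtMillis;
        this.nowMillis = nowMillis;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns the adjusted score, or NaN to drop the item from consideration. */
    double rescore(long itemId, double originalScore) {
        Long published = publishedAtMillis.get(itemId);
        if (published == null) {
            return originalScore; // no metadata: leave the score unchanged
        }
        long age = nowMillis - published;
        if (age > maxAgeMillis) {
            return Double.NaN; // too old: drop the item
        }
        if (age <= 0) {
            return originalScore; // published "now" or in the future: no penalty
        }
        // Linear decay from full score at age 0 down to 0 at maxAgeMillis.
        return originalScore * (1.0 - (double) age / maxAgeMillis);
    }

    public static void main(String[] args) {
        Map<Long, Long> published = new HashMap<>();
        published.put(1L, 1000L); // brand new
        published.put(2L, 950L);  // half of the maximum age
        published.put(3L, 800L);  // older than the maximum age
        RecencyRescorerSketch rescorer = new RecencyRescorerSketch(published, 1000L, 100L);
        System.out.println(rescorer.rescore(1L, 2.0)); // prints 2.0
        System.out.println(rescorer.rescore(2L, 2.0)); // prints 1.0
        System.out.println(rescorer.rescore(3L, 2.0)); // prints NaN
    }
}
```

This does not help with data that really is per preference, such as the time a user viewed an item. That still costs memory per data point.
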
