Hi, I've finally fed Taste some real data (realistic in volume, number of users, and item-preference distribution) and quickly hit the memory limits of my development laptop. :) Now I'm trying to see what, if anything, I can trim from the input set (the user,item,rating triplets) to lower memory consumption. N.B. I don't actually have rating information: my ratings are all "1.0", indicating only that the item has been seen/read/consumed.
I ran this to see the item popularity distribution:

$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less

and quickly saw the expected Zipfian distribution: a big head of a few very popular items and a loooong tail of items that have been seen/read/consumed only a handful of times. So here are my questions:

- Is there any point in keeping and loading very unpopular items (e.g. the ones read only once)? I think keeping them might help a very few people discover very obscure items, so removing them would hurt that small subset of users a bit, but it wouldn't affect the majority. Is this thinking correct?

- I'm dealing with items whose freshness counts: I don't want to recommend items older than N days -- think news stories. Assume I know the age of each item. I could certainly remove old items, since I never want to recommend them, but won't removing them hurt the quality of recommendations, simply because I'd lose the users' "item consumption history"?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
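
For what it's worth, if pruning the long tail turns out to be acceptable, it can be done outside Taste before loading anything. A minimal sketch in the same spirit as the cut/sort/uniq pipeline above, assuming comma-separated user,item,rating triplets; the file names, sample data, and the "seen at least twice" threshold are all illustrative, not anything Taste prescribes:

```shell
# Illustrative sample of user,item,rating triplets.
cat > input.txt <<'EOF'
u1,itemA,1.0
u2,itemA,1.0
u3,itemB,1.0
u1,itemC,1.0
u2,itemC,1.0
EOF

# Two-pass awk over the same file: the first pass counts occurrences of each
# item (field 2); the second pass keeps only lines whose item appeared at
# least twice, dropping the singletons from the long tail.
awk -F, 'NR==FNR { count[$2]++; next } count[$2] >= 2' input.txt input.txt > pruned.txt

cat pruned.txt
```

Here itemB (seen once) is dropped, while both itemA and itemC lines survive. The same two-pass pattern would work with a higher threshold if the singletons alone don't free enough memory.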
