On Oct 22, 2008, at 1:43 PM, Otis Gospodnetic wrote:
Hi,
In hopes of getting some feedback about possible improvements (e.g.
which data to keep, which to trim, how far back to go, etc.), here
are some numbers I'm working with:
# number of unique users: $ cut -d, -f1 input.txt | sort | uniq | wc -l
705180
# number of unique items: $ cut -d, -f2 input.txt | sort | uniq | wc -l
65870
# total number of data points (the "user,item,1.0" triplets): $ wc -l input.txt
1664289 input.txt
Each triplet represents one user->item view. Here is how the views are
distributed across items (cumulative views for the top N items):
top 1-10      98485
top 1-100    118047
top 1-200    119223
top 1-1000   120100
This means the top 10 most popular items account for 98485 views, and
so on. So the top 100 items account for the vast majority of views.
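(If it helps to sanity-check my numbers, a pipeline roughly like this should reproduce the cumulative counts above - just change the 10:)
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | head -10 | awk '{s += $1} END {print s}'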
I'm working with about 1 day's worth of data. I think this is a
problem, because it doesn't give me info about user->item views from
before, and I think that translates to losing some user-user overlap
data to compute better recommendations. Is this correct?
I'm dealing with news, so the least popular items for a given day
seem to really be old news items (they are past their prime, so to
speak).
Because I don't want to recommend old news, I *think* I can chop off
some of the tail at some(?) expense of quality.
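For example, something roughly like this (the threshold of 5 and the file names are made up, just to illustrate the kind of trim I mean):
# items seen at least 5 times
$ cut -d, -f2 input.txt | sort | uniq -c | awk '$1 >= 5 {print $2}' > keep_items.txt
# keep only the triplets whose item made that list
$ awk -F, 'NR==FNR {keep[$1]; next} $2 in keep' keep_items.txt input.txt > trimmed.txt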
Now that I see the distribution of items more clearly, I am also
wondering if feeding the most popular items into the recommendation
engine is really valuable. Items are very popular because lots of
people consumed them. This produces a lot of overlap between users,
which is good, but maybe it's too good for its own good (kind of
like the Harry Potter problem)? I wonder if it would make sense not
to include (and thus not recommend) the most popular items? Hm, that
doesn't sound right, because, of my 705K users, only about 98K have
seen the top 10 items already. But would it make sense to
artificially lower their rating, to put a damper on them?
Thinking out loud, too...
From what I understand of the problem, a damper is a reasonable thing
to do. To some extent CF engines become self-fulfilling, right? That
is, the only items that get recommended are those that everyone
recommends. One way around this is to allow the user to set some
preferences that factor in freshness and/or degrees of randomness or
other factors. Alternatively, maybe there is a preference for how
often you would recommend an item to someone. Maybe after X views or
something, it gets taken out of the list for that user. Not sure...
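Just to sketch the input-side damping (the file names and the log-based formula here are made up, not anything Taste does for you):
# per-item view count, written as item,count
$ cut -d, -f2 input.txt | sort | uniq -c | awk '{print $2","$1}' > item_counts.txt
# replace the flat 1.0 with a value that shrinks as the item gets more popular
$ awk -F, 'NR==FNR {count[$1]=$2; next} {printf "%s,%s,%.3f\n", $1, $2, 1/(1+log(count[$2]))}' item_counts.txt input.txt > damped.txt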
I remember in the early days of Amazon, their recommendations were
always crap, but now they are better, IMO. In the early days, it
seemed like it would recommend things that were almost exact matches.
For instance, I would often look at books X, Y, and Z on a topic and
then choose book Y. The next time in, Amazon would then turn around
and recommend X and Z. Now, I think they do a better job of saying
"You bought Y, so here are books that complement Y" instead of books
that are substitutes for Y. I'm not completely sure how this
translates into what you are doing, but it strikes me that a good
recommendation goes beyond just strict CF. For instance, one idea
might be to also cluster and then recommend items that are in the
cluster and highly rated. Or maybe, those that are on the edge of a
cluster and highly rated, or in a cluster that is fairly close to the
item in question.
BTW, how much memory does the above take? Have you done any profiling?
I'm not sure how it would work exactly, but can you aggregate/fold
results together? Perhaps for older items, you collapse the ratings
into an "average" item, or something.
I'm thinking out loud, so any thoughts and feedback would be
appreciated.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, October 22, 2008 12:52:41 PM
Subject: Trimming Taste input (memory consumption)
Hi,
I've finally fed Taste some real data (in terms of volume, users, and item
preference distribution) and quickly hit the memory limits of my development
laptop. :). Now I'm trying to see what, if anything, I can trim from the
input set (the user,item,rating triplets) to lower the memory consumption.
N.b. I don't actually have rating information - my ratings are all just "1.0"
indicating that the item has been seen/read/consumed.
I ran one of these to see the item popularity distribution:
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less
And quickly saw the expected zipfian distribution. Big head of several very
popular items and a loooong tail of items that have been seen/read/consumed
only a few times.
So here are my questions:
- Is there a point in keeping and loading very unpopular items (e.g. the ones
read only once)? I think keeping those might help very few people discover
very obscure items, so removing them will hurt this small subset of people a
bit, but it will not affect the majority of people. Is this thinking correct?
(A quick way to count such items is sketched after the second question.)
- I'm dealing with items where freshness counts. I don't want to recommend
items older than N days - think news stories. Assume I have the age of each
item. I could certainly then remove old items, as I don't ever want to
recommend them, but if I remove them, won't that hurt the quality of
recommendations, simply because I'll lose users' "item consumption history"?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ