On Oct 22, 2008, at 1:43 PM, Otis Gospodnetic wrote:

Hi,

In hopes of getting some feedback about possible improvements (e.g. which data to keep, which to trim, how far back to go, etc.), here are some numbers I'm working with:

# number of unique users: $ cut -d, -f1 input.txt | sort | uniq | wc -l
705180

# number of unique items: $ cut -d, -f2 input.txt | sort | uniq | wc -l
65870

# total number of data points (the "user,item,1.0" triplets): $ wc -l input.txt
1664289 input.txt

Each triplet represents one user->item view.  Here are the cumulative view counts for the most popular items:
top 10 items:   98485 views
top 100 items:  118047 views
top 200 items:  119223 views
top 1000 items: 120100 views

This means the top 10 most popular items account for 98485 views, and so on. Note how quickly the curve flattens: the cumulative count barely grows between the top 100 (118047 views) and the top 1000 (120100 views), so the top 100 items capture nearly all of the head's views. I'm working with about one day's worth of data. I think this is a problem, because it gives me no information about user->item views from before that day, and I think that translates into losing some of the user-user overlap needed to compute better recommendations. Is this correct?
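For what it's worth, here is a small sketch that would reproduce these cumulative top-N counts directly from input.txt (plain Java; nothing Taste-specific, and the class name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/* Sketch: cumulative view counts for the top-N most viewed items,
   computed from "user,item,1.0" triplets, one per line. */
public class HeadCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                counts.merge(line.split(",")[1], 1, Integer::sum);
            }
        }
        List<Integer> perItem = new ArrayList<>(counts.values());
        perItem.sort(Collections.reverseOrder());  // most viewed items first
        long cumulative = 0;
        for (int rank = 1; rank <= perItem.size(); rank++) {
            cumulative += perItem.get(rank - 1);
            if (rank == 10 || rank == 100 || rank == 200 || rank == 1000) {
                System.out.println("top " + rank + ": " + cumulative + " views");
            }
        }
    }
}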

I'm dealing with news, so the least popular items for a given day really seem to be old news items (they are past their prime, so to speak).

Because I don't want to recommend old news, I *think* I can chop off some of the tail at some(?) expense of quality. Now that I see the distribution of items more clearly, I am also wondering whether feeding the most popular items into the recommendation engine is really valuable. Items are very popular because lots of people consumed them. That produces a lot of overlap between users, which is good, but maybe it's too much of a good thing (kind of like the Harry Potter problem)? I wonder if it would make sense not to include (and thus not recommend) the most popular items? Hm, that doesn't sound right, because only about 98K of my 705K users have already seen the top 10 items. But would it make sense to artificially lower their rating, to put a damper on them?
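Something like an IDF-style damper is what I have in mind. A minimal sketch, assuming I precompute per-item view counts first (none of this is Taste API; all names are made up):

import java.util.Map;

/* Hypothetical popularity damper: scales each implicit 1.0 rating by an
   IDF-like factor, so very popular items contribute less to the
   user-user overlap than obscure ones do. */
public class PopularityDamper {

    /* viewCounts maps itemID -> number of users who viewed the item. */
    public static double dampedRating(long itemID,
                                      Map<Long, Integer> viewCounts,
                                      long totalUsers) {
        int views = viewCounts.getOrDefault(itemID, 1);
        // An item seen by every user gets a weight near 0; an item seen
        // by a single user keeps its full original weight of 1.0.
        return Math.log((double) totalUsers / views) / Math.log(totalUsers);
    }
}

With my numbers, an item seen by a tenth of all 705180 users would keep a weight of roughly ln(10)/ln(705180) ~= 0.17, while a one-view item keeps 1.0. Whether that curve is too aggressive is exactly the kind of thing I'd want feedback on.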

Thinking out loud, too...

From what I understand of the problem, a damper is a reasonable thing to do. To some extent, CF engines become self-fulfilling, right? That is, the only items that get recommended are the ones everyone has already consumed. One way around this is to let the user set preferences that factor in freshness and/or some degree of randomness or other factors. Alternatively, maybe there is a preference for how often you would recommend an item to someone: after X views or so, it gets taken out of the list for that user. Not sure...
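Purely illustrative (the names and the half-life knob are things I'm making up, not anything in Taste), a freshness preference could be an exponential decay on the score, and the per-user cap could be a simple filter:

/* Illustrative freshness/exposure knobs, not Taste API. */
public class RecommendationKnobs {

    /* Decay a recommendation score by item age; an item exactly one
       half-life old keeps 50% of its score. */
    public static double freshnessAdjusted(double score,
                                           double itemAgeHours,
                                           double halfLifeHours) {
        return score * Math.pow(0.5, itemAgeHours / halfLifeHours);
    }

    /* Stop recommending an item to a user once it has already been
       shown to them maxImpressions times. */
    public static boolean filteredForUser(int timesShownToUser, int maxImpressions) {
        return timesShownToUser >= maxImpressions;
    }
}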

I remember in the early days of Amazon, their recommendations were always crap, but now they are better, IMO. In the early days, it would recommend things that were almost exact matches. For instance, I would often look at books X, Y, and Z on a topic and then choose book Y. The next time in, Amazon would turn around and recommend X and Z. Now, I think they do a better job of saying "you bought Y, so here are books that complement Y" instead of recommending books that are substitutes for Y. I'm not completely sure how this translates into what you are doing, but it strikes me that a good recommendation goes beyond strict CF. For instance, one idea might be to also cluster, and then recommend items that are in the same cluster and highly rated. Or maybe those that are on the edge of a cluster and highly rated, or in a cluster that is fairly close to the item in question.
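Roughly what I mean, as a sketch: it assumes an item -> cluster assignment and per-item average ratings already exist from some upstream step, and every name here is made up:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/* Hypothetical complement-style recommender: given an item the user
   chose, suggest the highest-rated other items from its cluster. */
public class ClusterRecommender {
    public static List<Long> recommendFromCluster(
            long chosenItem,
            Map<Long, Integer> itemToCluster,  // from some upstream clustering step
            Map<Long, Double> avgRating,       // assumes an entry for every item
            int howMany) {
        Integer cluster = itemToCluster.get(chosenItem);
        List<Long> candidates = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : itemToCluster.entrySet()) {
            if (e.getValue().equals(cluster) && e.getKey() != chosenItem) {
                candidates.add(e.getKey());
            }
        }
        candidates.sort(Comparator.comparing(avgRating::get).reversed());
        return candidates.subList(0, Math.min(howMany, candidates.size()));
    }
}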

BTW, how much memory does the above take?  Have you done any profiling?

I'm not sure how it would work exactly, but can you aggregate/fold results together? Perhaps for older items, you collapse the ratings into an "average" item, or something.
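As a sketch of what I mean, assuming the set of expired items is known up front (the sentinel ID and all names are invented): collapse each user's ratings on expired items into a single pseudo-item holding their average, so some of the old overlap survives without the old items ever being recommendable.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/* Sketch: collapse a user's ratings for expired items into one pseudo-item. */
public class OldItemFolder {
    static final long OLD_NEWS_ITEM_ID = -1L;  // made-up sentinel ID

    public static Map<Long, Double> fold(Map<Long, Double> userPrefs,
                                         Set<Long> expiredItems) {
        Map<Long, Double> folded = new HashMap<>();
        double sum = 0;
        int n = 0;
        for (Map.Entry<Long, Double> e : userPrefs.entrySet()) {
            if (expiredItems.contains(e.getKey())) {
                sum += e.getValue();
                n++;
            } else {
                folded.put(e.getKey(), e.getValue());
            }
        }
        if (n > 0) {
            folded.put(OLD_NEWS_ITEM_ID, sum / n);  // the "average" pseudo-item
        }
        return folded;
    }
}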

I'm thinking out loud, so any thoughts and feedback would be appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, October 22, 2008 12:52:41 PM
Subject: Trimming Taste input (memory consumption)

Hi,

I've finally fed Taste some real data (in terms of volume, users, and item preference distribution) and quickly hit the memory limits of my development laptop. :)  Now I'm trying to see what, if anything, I can trim from the input set (the user,item,rating triplets) to lower the memory consumption. N.B. I don't actually have rating information; my ratings are all just "1.0", indicating that the item has been seen/read/consumed.

I ran one of these to see the item popularity distribution:
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less

And quickly saw the expected Zipfian distribution: a big head of several very popular items and a loooong tail of items that have been seen/read/consumed only a few times.

So here are my questions:
- Is there a point in keeping and loading very unpopular items (e.g. the ones read only once)? I think keeping those might help a very small number of people discover very obscure items, so removing them will hurt that small subset a bit, but it won't affect the majority. Is this thinking correct? (A sketch of this kind of tail trimming follows after these questions.)

- I'm dealing with items whose freshness counts: I don't want to recommend items older than N days (think news stories). Assume I have the age of each item. I could certainly remove old items, since I never want to recommend them, but if I do, won't that hurt the quality of recommendations, simply because I'll lose users' "item consumption history"?
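To make the first question concrete, here is the kind of tail trimming I have in mind: a minimal two-pass sketch over the raw "user,item,1.0" file (plain Java, nothing Taste-specific, and the minViews threshold is just a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/* Sketch: filter "user,item,1.0" triplets, dropping all triplets for
   items that were seen fewer than minViews times. */
public class TailTrimmer {
    public static void main(String[] args) throws IOException {
        String path = args[0];
        int minViews = Integer.parseInt(args[1]);

        // First pass: count views per item.
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                counts.merge(line.split(",")[1], 1, Integer::sum);
            }
        }

        // Second pass: keep only triplets for sufficiently popular items.
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (counts.get(line.split(",")[1]) >= minViews) {
                    System.out.println(line);
                }
            }
        }
    }
}

Running e.g. "java TailTrimmer input.txt 2 > trimmed.txt" would drop only the items seen once.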

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
