On Oct 22, 2008, at 1:43 PM, Otis Gospodnetic wrote:
Hi,
In hopes of getting some feedback about possible improvements (e.g.
which data to keep, which to trim, how far back to go, etc.), here
are some numbers I'm working with:
# number of unique users: $ cut -d, -f1 input.txt | sort | uniq | wc -l
705180
# number of unique items: $ cut -d, -f2 input.txt | sort | uniq | wc -l
65870
# total number of data points (the "user,item,1.0" triplets): $ wc -l input.txt
1664289 input.txt
Each triplet represents one user->item view. Here is how the views are
distributed across items (cumulative views for the top N items):
top 1-10      98485
top 1-100    118047
top 1-200    119223
top 1-1000   120100
This means the top 10 most popular items account for 98485 views, and
so on. So the top 100 items account for the vast majority of views.
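(If it helps to sanity-check my numbers, a pipeline roughly like this should reproduce the cumulative counts above - just change the 10:)
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | head -10 | awk '{s += $1} END {print s}'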
I'm working with about 1 day's worth of data. I think this is a
problem, because it doesn't give me info about user->item views from
before, and I think that translates to losing some user-user overlap
data to compute better recommendations. Is this correct?
I'm dealing with news, so the least popular items for a given day
seem to really be old news items (they are past their prime, so to
speak).
Because I don't want to recommend old news, I *think* I can chop off
some of the tail at some(?) expense of quality.
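For example, something roughly like this (the threshold of 5 and the file names are made up, just to illustrate the kind of trim I mean):
# items seen at least 5 times
$ cut -d, -f2 input.txt | sort | uniq -c | awk '$1 >= 5 {print $2}' > keep_items.txt
# keep only the triplets whose item made that list
$ awk -F, 'NR==FNR {keep[$1]; next} $2 in keep' keep_items.txt input.txt > trimmed.txt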
Now that I see the distribution of items more clearly, I am also
wondering if feeding the most popular items into the recommendation
engine is really valuable. Items are very popular because lots of
people consumed them. This produces a lot of overlap between users,
which is good, but maybe it's too good for its own good (kind of
like the Harry Potter problem)? I wonder if it would make sense not
to include (and thus not recommend) the most popular items? Hm, that
doesn't sound right, because, of my 705K users, only about 98K have
seen the top 10 items already. But would it make sense to
artificially lower their rating, to put a damper on them?
Thinking out loud, too...
From what I understand of the problem, a damper is a reasonable thing
to do. To some extent CF engines become self-fulfilling, right? That
is, the only items that get recommended are those that everyone
recommends. One way around this is to allow the user to set some
preferences that factor in freshness and/or degrees of randomness or
other factors. Alternatively, maybe there is a preference for how
often you would recommend an item to someone. Maybe after X views or
something, it gets taken out of the list for that user. Not sure...
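Just to sketch the input-side damping (the file names and the log-based formula here are made up, not anything Taste does for you):
# per-item view count, written as item,count
$ cut -d, -f2 input.txt | sort | uniq -c | awk '{print $2","$1}' > item_counts.txt
# replace the flat 1.0 with a value that shrinks as the item gets more popular
$ awk -F, 'NR==FNR {count[$1]=$2; next} {printf "%s,%s,%.3f\n", $1, $2, 1/(1+log(count[$2]))}' item_counts.txt input.txt > damped.txt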
I remember in the early days of Amazon, their recommendations were
always crap, but now they are better, IMO. In the early days, it
seemed like it would recommend things that were almost exact matches.
For instance, I would often look at books X, Y, and Z on a topic and
then choose book Y. The next time in, Amazon would then turn around
and recommend X and Z. Now, I think they do a better job of saying
"You bought Y, so here are books that complement Y" instead of books
that are substitutes for Y. I'm not completely sure how this
translates into what you are doing, but it strikes me that a good
recommendation goes beyond just strict CF. For instance, one idea
might be to also cluster and then recommend items that are in the
cluster and highly rated. Or maybe, those that are on the edge of a
cluster and highly rated, or in a cluster that is fairly close to the
item in question.
BTW, how much memory does the above take? Have you done any profiling?
I'm not sure how it would work exactly, but can you aggregate/fold
results together? Perhaps for older items, you collapse the ratings
into an "average" item, or something.
I'm thinking out loud, so any thoughts and feedback would be
appreciated.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, October 22, 2008 12:52:41 PM
Subject: Trimming Taste input (memory consumption)
Hi,
I've finally fed Taste some real data (in terms of volume, users, and item
preference distribution) and quickly hit the memory limits of my development
laptop. :). Now I'm trying to see what, if anything, I can trim from the
input set (the user,item,rating triplets) to lower the memory consumption.
N.b. I don't actually have rating information - my ratings are all just "1.0"
indicating that the item has been seen/read/consumed.
I ran one of these to see the item popularity distribution:
$ cut -d, -f2 input.txt | sort | uniq -c | sort -rn | less
And quickly saw the expected zipfian distribution. Big head of several very
popular items and a loooong tail of items that have been seen/read/consumed
only a few times.
So here are my questions:
- Is there a point in keeping and loading very unpopular items (e.g. the ones
read only once)? I think keeping those might help very few people discover
very obscure items, so removing them will hurt this small subset of people a
bit, but it will not affect the majority of people. Is this thinking correct?
(A quick way to count such items is sketched after the second question.)
- I'm dealing with items where freshness counts. I don't want to recommend
items older than N days - think news stories. Assume I have the age of each
item. I could certainly then remove old items, as I don't ever want to
recommend them, but if I remove them, won't that hurt the quality of
recommendations, simply because I'll lose users' "item consumption history"?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ