Re: generic latent variable recommender question

2014-01-24 Thread Sebastian Schelter

Case 1 is fine as is.

For Case 2, I would suggest simply experimenting: try different 
similarity measures, like Euclidean distance or cosine, and see what gives 
the best results.


--sebastian

On 01/25/2014 04:08 AM, Koobas wrote:

A generic latent variable recommender question.
I passed the user-item matrix through a low rank approximation,
with either something like ALS or SVD, and now I have the feature
vectors for all users and all items.

Case 1:
I want to recommend items to a user.
I compute a dot product of the user’s feature vector with all feature
vectors of all the items.
I eliminate the ones that the user already has, and find the largest value
among the others, right?

Case 2:
I want to find similar items for an item.
Should I compute dot product of the item’s feature vector against feature
vectors of all the other items?
OR
Should I compute the ANGLE between each pair of feature vectors?
I.e., compute the cosine similarity?
I.e., normalize the vectors before computing the dot products?

If “yes” for case 2, is that something I should also do for case 1?





Re: generic latent variable recommender question

2014-01-24 Thread Ted Dunning
On Fri, Jan 24, 2014 at 7:08 PM, Koobas  wrote:

> I eliminate the ones that the user already has, and find the largest value
> among the others, right?
>

yeah...

Unless you are selling razor blades, in which case you don't eliminate
repeats.

Also, you may want to pass the results through a dithering step and
possibly an anti-flood step.

But those aren't core.  The dithering, in particular, can have a huge
positive effect.


Re: generic latent variable recommender question

2014-01-25 Thread Tevfik Aytekin
Case 1 is fine. In Case 2, I don't think that a dot product (without
normalization) will yield a meaningful similarity measure; cosine
similarity or a Pearson correlation would be better. The situation is
similar to Latent Semantic Indexing, in which documents are represented
by their low rank approximations and similarities between them (that
is, the approximations) are computed using cosine similarity.
There is no need for any normalization in Case 1, since the values
in the feature vectors are formed to approximate the rating values.
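
A minimal numpy sketch of both cases (toy data; the names and shapes are made up for illustration):

```python
import numpy as np

# toy factors from some ALS/SVD run: n_users x k and n_items x k
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # user feature vectors
V = rng.normal(size=(8, 3))   # item feature vectors

def recommend(user_id, seen, n=3):
    # Case 1: raw dot products -- they approximate the rating values directly
    scores = V @ U[user_id]
    scores[list(seen)] = -np.inf   # eliminate items the user already has
    return list(np.argsort(-scores)[:n])

def similar_items(item_id, n=3):
    # Case 2: normalize first, so the dot product becomes cosine similarity
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = Vn @ Vn[item_id]
    sims[item_id] = -np.inf        # exclude the item itself
    return list(np.argsort(-sims)[:n])
```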



Re: generic latent variable recommender question

2014-01-25 Thread Koobas
On Sat, Jan 25, 2014 at 3:51 PM, Tevfik Aytekin wrote:

> Case 1 is fine, in case 2, I don't think that a dot product (without
> normalization) will yield a meaningful distance measure. Cosine
> distance or a Pearson correlation would be better. The situation is
> similar to Latent Semantic Indexing in which documents are represented
> by their low rank approximations and similarities between them (that
> is, approximations) are computed using cosine similarity.
> There is no need to make any normalization in case 1 since the values
> in the feature vectors are formed to approximate the rating values.

That's exactly what I was thinking.
Thanks for your reply.




Re: generic latent variable recommender question

2014-01-25 Thread Tevfik Aytekin
Hi Ted,
Could you explain what you mean by a "dithering step" and an
"anti-flood step"?
By dithering I guess you mean adding some sort of noise in order not
to show the same results every time.
But I have no clue about the anti-flood step.

Tevfik



Re: generic latent variable recommender question

2014-01-25 Thread Ted Dunning
Dithering is commonly done by re-ranking results using a noisy score.  Take
r to be the original rank (starting with 1).  Then compute a score as

  s = log r + N(0,log \epsilon)

and sort by this new score in ascending order.

Items will be shuffled by this method in such a way that the probability
that item 2k will appear before item k is nearly invariant with respect to
k.  Thus, item 3 will appear before item 1 about as often as item 30 will
appear before item 10.  The major effect here is to dredge deep results up
onto the first page (occasionally) so that the recommendation has broader
training data.

You can seed this with time in order to get the appearance of changing
recommendations even when no change in history is recorded.  Moreover, the
time varying seed can be held constant for a short period (a few minutes to
an hour or so) so that you also give the appearance of short-term
stability.  Both of these effects seem to entice users back to a
recommendation page.  Ironically, people seem more willing to return to the
first recommendation page than they are willing to click to the second page.

This addition of random noise obviously makes your best recommendation
results worse. The penalty is worthwhile to the extent that your
recommender learns enough to make results better tomorrow.  This has been
my universal experience for reasonable levels of dithering.
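
In plain Python the whole step is a few lines (here I read log \epsilon as the standard deviation of the noise; the variance reading only changes the scale):

```python
import math
import random

def dither(ranked_items, epsilon=2.0, seed=None):
    # score each item as s = log r + N(0, log epsilon), r = 1-based rank,
    # then re-sort ascending; epsilon = 1 makes the noise vanish entirely
    rng = random.Random(seed)
    noise = math.log(epsilon)
    scored = [(math.log(r) + rng.gauss(0, noise), item)
              for r, item in enumerate(ranked_items, start=1)]
    scored.sort(key=lambda pair: pair[0])
    return [item for _, item in scored]
```

Passing a seed derived from the current time, rounded to the hour, gives the short-term stability mentioned above.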

Anti-flood is quite a bit more heuristic and can be motivated by the idea
that recommenders are recommending individual items but users are being
shown an entire portfolio of items on the first page.  The probability of
making the user happy with any given page of recommendations is not
increased if you show items which are nearly identical, because if they like
one item, they will very, very likely like the others, and if they don't like
one, they likely won't like the others.  On the other hand, if you were to
split the page between two groups of very distinctly different kinds of
items, then if you miss on one group, you don't have a guaranteed miss on the
second group; you have hedged your bets and will have better user
satisfaction.

How you accomplish this is largely a UI question.  You could cluster the
items and show the users 1-2 items from each cluster with an option for
seeing the full cluster.  You can also use a synthetic score approach where
you penalize items that are too similar to items higher in the results
list.  The meaning of "too similar" is typically hand crafted for your domain.
It might be a test for the same author, or the same genre, or whatever you
have handy.
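
A greedy sketch of the synthetic-score idea (the similarity test is a placeholder you would hand craft for your domain):

```python
def anti_flood(ranked_items, too_similar):
    # walk the ranked list, demoting any item that is too similar to an
    # item already kept; demoted items drop to the end of the page
    kept, demoted = [], []
    for item in ranked_items:
        if any(too_similar(item, prior) for prior in kept):
            demoted.append(item)
        else:
            kept.append(item)
    return kept + demoted
```

For books, too_similar might just be a same-author test, e.g. `lambda a, b: a.author == b.author`.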







Re: generic latent variable recommender question

2014-01-25 Thread Pat Ferrel
For anti-flood, and in the vein of “UI”, you can build a recommender that 
recommends categories or genres and then get recommendations weighted or filtered 
by those categories. A simple version of this is to just look at preference 
frequency by category for the current user. This is a lot like what Amazon does 
on their front page.

BTW can you explain your notation? s = log r + N(0,log \epsilon) 
N?, \epsilon? 

Showing my ignorance probably but why stop now?


Re: generic latent variable recommender question

2014-01-25 Thread Ted Dunning
On Sat, Jan 25, 2014 at 4:33 PM, Pat Ferrel  wrote:

> BTW can you explain your notation? s = log r + N(0,log \epsilon)
> N?, \epsilon?
>

r is rank
N is normal distribution
\epsilon is an arbitrary constant that drives the amount of mixing.
Typical values are <= 4.


Re: generic latent variable recommender question

2014-01-25 Thread Suneel Marthi
N(0, log \epsilon) => a normal distribution with mean = 0 and variance = log(epsilon)

Re: generic latent variable recommender question

2014-01-26 Thread Pat Ferrel
Well I’m happy to raise the bar for stupid questions…

thanks

Some more anti-flood ideas. The video browser/recommender demo site for the 
solr-recommender has user preference training data and genre metadata (it would 
be nice to have other things). It makes this set of recommendations for each 
user where there is enough preference data:
1) user preference history based recs (the query is the user's thumbs-up or 
add-to-watchlist history)
2) user preference history based where recent item detail views is the query
3) user preference history based but weighted by all genres preferred. So the 
query is history on training data and preferred genres on item genre metadata. 
This is not the same as filtering by category.
4) user preference history based but weighted by your favorite non-trivial 
genre.
5) user preference history based but weighted by your second most favorite 
genre.
6) metadata alone recs based on using your preferred genres as the query.

#2 and #6 are probably the least successful, and they are all just experiments. 
Many people filter by category/genre; here I chose to weight the query using 
Solr boosting on the category field. This results in fairly different sets of 
recs, but since it’s only me looking at them I’ll reserve judgement on quality.

On an item detail page you get the following sets:
1) item similarity based recs, taken directly from the training item-item 
similarity matrix
2) item similarity based recs from using #1 as the query, subtly different from 
#1 for various reasons
3) #2 weighted by the genres of the video being viewed
4) #2 from the first list--click path based recs
5) user preference history based recs weighted to the genre of the current 
video (similar to #4 in the top list)

Here the eyeball test says #3 & #5 look best, at least to me.

I think I’ll leave dithering out until it goes live because it would seem to 
make the eyeball test easier. I doubt all these experiments will survive.


Re: generic latent variable recommender question

2014-01-26 Thread Ted Dunning
On Sun, Jan 26, 2014 at 9:36 AM, Pat Ferrel  wrote:

> I think I’ll leave dithering out until it goes live because it would seem
> to make the eyeball test easier. I doubt all these experiments will survive.
>

With dithering, if you turn the epsilon parameter to 1 (which makes log(epsilon)
= 0), then no re-ordering is done.

I like knobs that go to 11, but also have an "off" position.
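
To see the off position, note that log 1 = 0, so N(0, log 1) adds nothing and the score reduces to log r:

```python
import math
import random

def dithered_order(n_items, epsilon, seed=0):
    # s = log r + N(0, log epsilon) over ranks 1..n, re-sorted ascending
    rng = random.Random(seed)
    scores = [math.log(r) + rng.gauss(0, math.log(epsilon))
              for r in range(1, n_items + 1)]
    return sorted(range(n_items), key=scores.__getitem__)

# epsilon = 1: zero noise, original order preserved
assert dithered_order(10, epsilon=1.0) == list(range(10))
```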


Re: generic latent variable recommender question

2014-01-26 Thread Tevfik Aytekin
Thanks for the answers. I actually worked on a similar issue,
increasing the diversity of top-N lists
(http://link.springer.com/article/10.1007%2Fs10844-013-0252-9).
Clustering-based approaches produce good results, and they are very
fast compared to some optimization-based techniques. It also turned
out that introducing randomization (such as choosing 20 random items
from among the top 100) might decrease diversity, if the diversity of
the top-N lists is better than the diversity of a set of random items,
which can sometimes be the case.
