RE: Collaborative filtering item-based in mahout - without isolating users

2014-12-11 Thread Gruszowska Natalia
Mario,
I am thinking in terms of correctness. With similarities such as Euclidean
distance, Pearson correlation, or cosine similarity, results are better if we
consider only common users (users who rated both compared items). This
assumption lets us find similar items for unpopular items; otherwise we
recommend only very popular items. For my data that is unacceptable.

"But if you take, for example, the cosine similarity, you shouldn't throw away
the data." You should: it results in dimension reduction, and that is good.
Everything is still in the same space, but for each pair the space is reduced.
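
The two readings of cosine similarity discussed here can be compared on the
data of Example 1. This is only a minimal sketch: the function names are made
up for illustration, and neither function is Mahout code.

```python
from math import sqrt

# Ratings from Example 1 of the original post: {item: {user: rating}}.
ratings = {
    11: {3: 1.0},
    12: {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0},
}

def cosine(u, v):
    """Plain cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else float("nan")

def cosine_common_users(a, b):
    """Cosine restricted to users who rated both items (reduced space)."""
    common = sorted(set(a) & set(b))
    return cosine([a[u] for u in common], [b[u] for u in common])

def cosine_all_users(a, b):
    """Cosine over every user, counting a missing rating as 0."""
    users = sorted(set(a) | set(b))
    return cosine([a.get(u, 0.0) for u in users],
                  [b.get(u, 0.0) for u in users])

common = cosine_common_users(ratings[11], ratings[12])  # only user 3 counts
full = cosine_all_users(ratings[11], ratings[12])       # users 1..4 count
print(common, full)
```

On this data the reduced-space value is 1 and the full-space value is 0.5.
Note that with a single common user the reduced-space cosine is 1 for any
pair of positive ratings, so it says little on its own.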

My question is: why did whoever wrote this code ignore such an important
assumption? Was it by accident, or for some important reason such as
efficiency or computational complexity?


Natalia


-Original Message-
From: mario.al...@gmail.com [mailto:mario.al...@gmail.com] 
Sent: Wednesday, December 10, 2014 7:05 PM
To: user@mahout.apache.org
Subject: Re: Collaborative filtering item-based in mahout - without isolating 
users

Hi Natalia

Regarding example 1, if you think in terms of the likelihood that the two
products have been bought together because they are similar (as opposed to by
chance), the similarity is undefined. As everyone buys 12, of course the
person who bought 11 also bought 12, right?

This holds if you compute the similarity through a co-occurrence matrix (and
log-likelihood ratio).

But you say: "In the theory, similarity between two items should be calculated
only for users who ranked both items."

I guess you mean: users [1,2,4] don't know about item 11, therefore they do
not contribute to building the similarity between the two items. User [3], on
the contrary, does, and gives the same rating to the two products, therefore
the similarity is 1.

But if you take, for example, the cosine similarity, you shouldn't throw away
the data. Here, you build a space with four dimensions: the ratings of four
users. You can't say product 11 is in another space with respect to users
1, 2 and 4 because it hasn't been rated by those users. They are all there.
They are dimensions, like in physics. Therefore you must use this information
too. Items are in the user space... all of them.

Even intuitively, items 11 and 12 are not similar at all: one has been bought
by every customer, the other by just one customer. How could you tell the next
customer who buys 12 (everyone does...) that she would really like 11...?

Mario


On Wed, Dec 10, 2014 at 4:40 PM, Gruszowska Natalia  
natalia.gruszow...@grupaonet.pl wrote:

 Hi All,

 Mahout implements a method for item-based collaborative filtering called
 itemsimilarity, which returns the similarity between each pair of items.
 In the theory, similarity between two items should be calculated only
 for users who ranked both items. During testing I realized that Mahout
 works differently.
 Two examples follow.

 Example 1. Items are 11-12.
 In the example below, the similarity between items 11 and 12 should be
 equal to 1, but the Mahout output is 0.36. It looks like Mahout treats
 null as 0.
 Similarity between items:
 101 102 0.36602540378443865

 Matrix with preferences (rows are users, columns are items, - is no rating):
 user  11   12
 1     -    1
 2     -    1
 3     1    1
 4     -    1

 Example 2. Items are 101-103.
 The similarity between items 101 and 102 should be calculated using only
 the ratings of users 4 and 5, and the same for items 101 and 103
 (according to the theory). Here (101,103) comes out as more similar than
 (101,102), and it shouldn't.
 Similarity between items:
 101 102 0.2612038749637414
 101 103 0.4340578302732228
 102 103 0.2600070276638468

 Matrix with preferences (rows are users, columns are items, - is no rating):
 user  101  102  103
 1     -    1    0.1
 2     -    1    0.1
 3     -    1    0.1
 4     1    1    0.1
 5     1    1    0.1
 6     -    1    0.1
 7     -    1    0.1
 8     -    1    0.1
 9     -    1    0.1
 10    -    1    0.1


 Both examples were run without any additional parameters.
 Is this problem solved somewhere, somehow? Any ideas? Why is null
 treated as 0?
 Source: http://files.grouplens.org/papers/www10_sarwar.pdf



 Kind regards,
 Natalia Gruszowska
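
The four similarities reported in the two examples can be reproduced by hand.
The sketch below is an observation from the arithmetic, not a statement about
Mahout's source: it assumes a similarity of the form 1 / (1 + d), where d is
the Euclidean distance between item vectors with every missing rating counted
as 0, and it reads the last row of the Example 2 matrix as user 10. Whether
itemsimilarity computes exactly this by default is not confirmed in the
thread.

```python
from math import sqrt

# Preferences from the two examples: {item: {user: rating}}.
# A user missing from an item's dict gave no rating for that item.
example1 = {11: {3: 1.0}, 12: {u: 1.0 for u in range(1, 5)}}
example2 = {
    101: {4: 1.0, 5: 1.0},
    102: {u: 1.0 for u in range(1, 11)},
    103: {u: 0.1 for u in range(1, 11)},
}

def euclidean_similarity_zero_filled(a, b):
    """1 / (1 + Euclidean distance), with a missing rating counted as 0."""
    users = set(a) | set(b)
    d = sqrt(sum((a.get(u, 0.0) - b.get(u, 0.0)) ** 2 for u in users))
    return 1.0 / (1.0 + d)

sim_11_12 = euclidean_similarity_zero_filled(example1[11], example1[12])
sim_101_102 = euclidean_similarity_zero_filled(example2[101], example2[102])
sim_101_103 = euclidean_similarity_zero_filled(example2[101], example2[103])
sim_102_103 = euclidean_similarity_zero_filled(example2[102], example2[103])
print(sim_11_12, sim_101_102, sim_101_103, sim_102_103)
```

Under these assumptions the values agree with the reported 0.366, 0.261,
0.434 and 0.260, which supports the reading that null is treated as 0. It
also explains why (101,103) scores higher than (101,102): the eight users who
did not rate 101 each add 0.1 squared to the distance for 103, but 1 squared
for 102.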





Re: Collaborative filtering item-based in mahout - without isolating users

2014-12-11 Thread mario . alemi
"otherwise we recommend only very popular items"

This is why you have the log-likelihood ratio, right?
m


RE: Collaborative filtering item-based in mahout - without isolating users

2014-12-11 Thread Gruszowska Natalia
To be honest, I haven't seen the code for this similarity (have you?). But as
I see it, it ignores the other side, this time popular items, and in addition
it looks like it ignores the value of the rating: it has only 1 or 0.

N.


Re: Collaborative filtering item-based in mahout - without isolating users

2014-12-11 Thread Pat Ferrel
With LLR, ratings are ignored. It is only interested in whether there was an
interaction between the user and the item. LLR calculates its own weights based
on a probabilistic measure of cooccurrence importance. Cooccurrences are all it
looks at, so 0 is ignored: it does not indicate a negative preference, it means
any preference is undefined or nonexistent. In fact those implied 0s in a
particular user’s history are exactly where recommendations will come from,
since we don’t want to recommend something the user already knows about.
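
For readers who want to see the calculation, here is a minimal sketch of the
standard log-likelihood ratio (G²) statistic on a 2x2 cooccurrence table. It
is not Mahout's code; the function name and the k11..k22 argument naming are
assumptions made for this illustration. k11 counts users who interacted with
both items, k12 and k21 users who interacted with only one of them, and k22
users who interacted with neither.

```python
from math import log

def llr(k11, k12, k21, k22):
    """G^2 statistic: 2 * sum(k * ln(k * N / (row_total * col_total)))."""
    cells = ((k11, k12), (k21, k22))
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    n = rows[0] + rows[1]
    g = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # empty cells contribute nothing to the sum
                g += k * log(k * n / (rows[i] * cols[j]))
    return 2.0 * g

# Example 1 from this thread: one user has both items, three have only
# item 12, and nobody lacks item 12.
everyone_buys_12 = llr(1, 0, 3, 0)

# A made-up table where the two items always occur together or not at all.
strong = llr(10, 0, 0, 10)
print(everyone_buys_12, strong)
```

For Example 1 the score is 0: since every user has item 12, seeing it
alongside item 11 carries no information. The made-up second table scores
about 27.7.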

The root of your question is a bit hard to explain since it requires a 
knowledge of cooccurrence recommenders and the LLR calculation itself. So you 
can read these for more explanation:
A short ebook here that talks about LLR: 
https://www.mapr.com/practical-machine-learning
a blog post here: 
http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
wikipedia: http://en.wikipedia.org/wiki/Likelihood_function

The spark version “spark-itemsimilarity” can take in multiple actions/events, 
calculate a cross-cooccurrence with the primary action to determine the 
strength of correlation, and use the secondary data to improve recs. This is a 
better way to handle thumbs up/thumbs down or other user actions in a 
recommender since it automatically determines correlation strength, not relying 
on user or developer supplied weights.
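
As a rough illustration of the counting step only, the sketch below tallies
cooccurrence within one action and cross-cooccurrence between a primary and a
secondary action. The interaction logs and the function name are
hypothetical, and this is not the spark-itemsimilarity API; the job
additionally scores these counts (with LLR) to decide which ones matter.

```python
from collections import Counter
from itertools import product

# Hypothetical interaction logs: {user: set of items}.
purchases = {"u1": {"A", "B"}, "u2": {"A"}, "u3": {"B", "C"}}  # primary
views = {"u1": {"A", "C"}, "u2": {"C"}, "u3": {"B"}}           # secondary

def cross_cooccurrence(primary, secondary):
    """Count users with a primary action on item p and a secondary on s."""
    counts = Counter()
    for user, primary_items in primary.items():
        for p, s in product(primary_items, secondary.get(user, ())):
            counts[(p, s)] += 1
    return counts

cooc = cross_cooccurrence(purchases, purchases)  # ordinary cooccurrence
cross = cross_cooccurrence(purchases, views)     # purchase vs. view
print(cooc[("A", "B")], cross[("A", "C")])
```

Here two users bought A and viewed C, so that pair gets a count of 2 without
anyone having to assign a weight to "view" by hand.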

Ratings are often problematic: people rate on different scales at different
times on different subjects. Many algorithms have been proposed to deal with
this, but most new research deals with optimizing the ranking order of
recommendations, which is usually more important in the application.


Re: Collaborative filtering item-based in mahout - without isolating users

2014-12-11 Thread Ted Dunning
Natalia,

It sounds like you are starting from the assumption that users are giving
ratings.

This can happen, but in production recommendation settings, ratings are
typically a very low-value input because the meaning of a rating is very
complex and because so few users actually rate anything unless forced into
unnatural acts.

Instead, you typically wind up using other kinds of actions.  If you do use
ratings, it is often better to ignore the value of the rating and use the
mere fact of the rating.  It is also common to assume that all users
*could* have interacted with any item even if they didn't.  This assumption
is suspect, but it is better than assuming that lack of interaction really
means lack of opportunity.

Adjusting your assumptions to fit these leads, I think, to the approach
used by Mahout.
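
A small sketch of what these two assumptions mean for the data in Example 2
of the original post: the rating values are dropped, only the fact of a
rating is kept, and every one of the ten users is counted as someone who
could have interacted with any item. The function name and the k11..k22
naming are assumptions for illustration, not Mahout code.

```python
def contingency(users_a, users_b, all_users):
    """2x2 interaction counts for items a and b over a fixed user population."""
    a, b = set(users_a), set(users_b)
    k11 = len(a & b)                   # interacted with both items
    k12 = len(a - b)                   # only with item a
    k21 = len(b - a)                   # only with item b
    k22 = len(set(all_users) - a - b)  # with neither item
    return k11, k12, k21, k22

all_users = range(1, 11)
# Who rated each item in Example 2, ignoring the rating values.
raters = {101: {4, 5}, 102: set(range(1, 11)), 103: set(range(1, 11))}

table_101_102 = contingency(raters[101], raters[102], all_users)
table_101_103 = contingency(raters[101], raters[103], all_users)
print(table_101_102, table_101_103)
```

Both tables come out as (2, 0, 8, 0): once only the fact of a rating is used,
items 102 and 103 look identical from the point of view of item 101, so the
difference between a rating of 1 and a rating of 0.1 no longer plays a role.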




Re: Collaborative filtering item-based in mahout - without isolating users

2014-12-10 Thread mario . alemi
Hi Natalia

Regarding example 1, if you think in terms of the likelihood that the two
products have been bought together because they are similar (as opposed to by
chance), the similarity is undefined. As everyone buys 12, of course the
person who bought 11 also bought 12, right?

This holds if you compute the similarity through a co-occurrence matrix (and
log-likelihood ratio).

But you say: "In the theory, similarity between two items should be calculated
only for users who ranked both items."

I guess you mean: users [1,2,4] don't know about item 11, therefore they do
not contribute to building the similarity between the two items. User [3], on
the contrary, does, and gives the same rating to the two products, therefore
the similarity is 1.

But if you take, for example, the cosine similarity, you shouldn't throw away
the data. Here, you build a space with four dimensions: the ratings of four
users. You can't say product 11 is in another space with respect to users
1, 2 and 4 because it hasn't been rated by those users. They are all there.
They are dimensions, like in physics. Therefore you must use this information
too. Items are in the user space... all of them.

Even intuitively, items 11 and 12 are not similar at all: one has been bought
by every customer, the other by just one customer. How could you tell the next
customer who buys 12 (everyone does...) that she would really like 11...?

Mario

