[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-12-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3536


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-12-30 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-168112394
  
I'm going to close this pull request. If this is still relevant and you are 
interested in pushing it forward, please open a new pull request. Thanks!





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-07-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-121073727
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-05-24 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-105026856
  
Let's continue the validation discussion on 
https://github.com/apache/spark/pull/6213. That PR introduces batch gemm-based 
similarity computation in MatrixFactorizationModel using a kernel abstraction. Do 
we need the online version that Steven added as well, or can it be extracted from 
the batch results? My focus was more on speeding up batch computation...





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-05-05 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-99098372
  
@MLnick yes, that's what I did... I have to convince users why they should use 
factor vectors :-) For user->item recommendation, convincing is easy: show the 
ranking improvement through ALS.

@srowen without a validation strategy, someone might propose running a 
different algorithm (KMeans on the raw feature space followed by an 
(item->cluster) join (cluster->items)) and claim that their item->item results 
are better... How do we know whether the ALS-based flow or the KMeans-based 
flow produces better results? NNALS can be thought of as soft k-means as well, 
so these flows are very similar.

I am focused on implicit feedback here because only then can we run either 
KMeans or similarity on the raw feature space... With explicit feedback, I agree 
that cosine similarity is not valid in the original feature space. But in most 
practical datasets, we are dealing with implicit feedback.





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-05-05 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-98982486
  
I have not benchmarked these since neither is a "correct" answer to 
benchmark against the other. The cosine similarity isn't really that valid in 
the original feature space. It might still be interesting to know how different 
the answers are but they're probably going to be fairly different on purpose.





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-05-04 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-98969251
  
Not sure I follow completely - do you mean you compared cosine sim between 
raw (i.e. "rating") item vectors, and cosine sim computed from item factor 
vectors? I would imagine they would be quite different...

I always just use factor vectors.





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-05-02 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-98425139
  
@MLnick @srowen I did an experiment where I computed the brute-force top-K 
similar items using cosine distance on the raw item vectors and compared them 
with the brute-force top-K similar items computed from item factors using cosine 
distance after running implicit factorization... The intersection is only 42%. 
This is in line with the Google Correlate paper, where they have to do an 
additional reorder step in the real feature space to increase the recall 
(intersection)... Did you guys also see similar results for item->item 
validation?
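
The overlap measurement described above can be sketched roughly as follows. This is only an illustration in plain Scala collections: `TopKOverlap`, `topK` and `overlap` are hypothetical names, the maps stand in for raw item vectors and learned item factors, and how the 42% figure was aggregated across items is not stated in the thread.

```scala
// Illustrative only: plain Scala maps stand in for raw item vectors and item factors.
object TopKOverlap {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Assumes neither vector is all zeros (otherwise the result is NaN).
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  /** Brute-force top-k neighbours of `item` by cosine similarity, excluding `item`. */
  def topK(vectors: Map[Int, Array[Double]], item: Int, k: Int): Set[Int] = {
    val target = vectors(item)
    vectors.toSeq
      .collect { case (id, v) if id != item => (id, cosine(target, v)) }
      .sortBy { case (_, score) => -score }
      .take(k)
      .map { case (id, _) => id }
      .toSet
  }

  /** Fraction of the raw-space top-k that also appears in the factor-space top-k. */
  def overlap(raw: Map[Int, Array[Double]],
              factors: Map[Int, Array[Double]],
              item: Int,
              k: Int): Double = {
    val fromRaw = topK(raw, item, k)
    val fromFactors = topK(factors, item, k)
    if (fromRaw.isEmpty) 0.0
    else fromRaw.intersect(fromFactors).size.toDouble / fromRaw.size
  }
}
```

Averaging `overlap` over a sample of items would give a single recall-style number comparable to the one quoted, under the assumption that this is how the comparison was made.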






[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-96770048
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-93861608
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-03-19 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-83755168
  
Agreed... Dense BLAS will be a common optimization for the item->item, 
user->user and user->item APIs...





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-03-19 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-83753751
  
Agree, although this is no worse than the existing implementation for 
recommendation (it reuses it even).





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-03-19 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-83753525
  
@srowen we have to be a bit careful since dense BLAS has to be used... I 
have an internal version with dot and it needs to be faster. Also, one at a 
time is not a good idea... There has to be a block dense matrix * block dense 
matrix operation... That way we can reuse native dgemm...
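
A rough sketch of the blocked formulation, to make the idea concrete. `Dense`, `multiplyTransposed` and `toBlocks` are hypothetical helpers rather than MLlib classes, and the inner loop is written in plain Scala only so the example is self-contained; it marks the place where a native dgemm call would go.

```scala
// Illustrative only: scores a block of factor vectors against another block in one
// matrix product instead of one dot product at a time.
object BlockedScores {
  /** Row-major dense matrix; a hypothetical helper type, not an MLlib class. */
  final case class Dense(rows: Int, cols: Int, data: Array[Double]) {
    require(data.length == rows * cols, "data must hold rows * cols values")
    def apply(i: Int, j: Int): Double = data(i * cols + j)
  }

  /** C = A * B^T as a plain loop; this is the call a native dgemm would replace. */
  def multiplyTransposed(a: Dense, b: Dense): Dense = {
    require(a.cols == b.cols, "both blocks must share the factor rank")
    val out = new Array[Double](a.rows * b.rows)
    var i = 0
    while (i < a.rows) {
      var j = 0
      while (j < b.rows) {
        var sum = 0.0
        var k = 0
        while (k < a.cols) { sum += a(i, k) * b(j, k); k += 1 }
        out(i * b.rows + j) = sum
        j += 1
      }
      i += 1
    }
    Dense(a.rows, b.rows, out)
  }

  /** Packs factor vectors into blocks of at most `blockSize` rows. */
  def toBlocks(vectors: Seq[Array[Double]], blockSize: Int): Vector[Dense] =
    vectors.grouped(blockSize).map { group =>
      Dense(group.length, group.head.length, group.iterator.flatMap(_.iterator).toArray)
    }.toVector
}
```

Entry (i, j) of the result is the dot product of row i of the first block with row j of the second, so a top-K selection per row can run on the whole score block at once.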





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2015-03-19 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-83749073
  
I kind of like this functionality; @sbourke are you in a position to 
continue with this? I think it needs a few typo fixes and commentary about its 
function. This is really about computing similar users and items rather than 
recommending items to users or users to items. I also think it has to include 
dividing by the norms to be cosine similarity.





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-10 Thread debasish83
Github user debasish83 commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-66503238
  
Can we discuss it more on the JIRA ? I updated it with my comments.





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-09 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-66331985
  
I'd agree that cosine similarity is preferred. Can't really think of a case 
where I've *not* used cosine sim for a similar items or similar users 
computation. Of course, it could potentially be added as an option (whether to 
use cosine sim, the default, or plain dot product).





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-03 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-65393291
  
I wasn't necessarily suggesting changing the similarity metric although I
ended up using cosine too. Note you can skip normalizing by the target
item's norm.

I suppose my point is that the recommendation computation does not use a
dot product because it is performing a similarity computation. Those
vectors are not even in the same space. So I wouldn't reuse that logic on
the grounds that it is reusing a similarity computation.
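
The note about skipping the target item's norm can be checked with a small sketch. The target norm is the same positive constant for every candidate, so dividing by it cannot change the ordering. `SkipTargetNorm` and its methods are hypothetical names, and all vectors are assumed to be non-zero.

```scala
// Illustrative only: ranking by full cosine and by dot / ||candidate|| gives the same order.
object SkipTargetNorm {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  /** Candidate ids ordered by full cosine similarity to `target`, best first. */
  def rankByCosine(target: Array[Double], candidates: Seq[(Int, Array[Double])]): Seq[Int] =
    candidates
      .sortBy { case (_, c) => -dot(target, c) / (norm(target) * norm(c)) }
      .map { case (id, _) => id }

  /** Same ranking, but only the candidate's norm is divided out. */
  def rankWithoutTargetNorm(target: Array[Double], candidates: Seq[(Int, Array[Double])]): Seq[Int] =
    candidates
      .sortBy { case (_, c) => -dot(target, c) / norm(c) }
      .map { case (id, _) => id }
}
```

The scores from the second method are no longer in [-1, 1], so this shortcut only fits cases where the ranking matters and the score is treated as opaque.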





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-03 Thread sbourke
Github user sbourke commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-65391077
  

Re: Explaining similarity metric :+1: I'll do that. 

Re: Cosine - no biggie to add. I used dot product because 1) Taking the 
logic that CF is finding "similar" items based on the latent space for a user 
when recommending products and 2) Using dot product would reduce the new code 
added to MatrixFactorizationModel ( I don't want to create clutter :)) So :+1: 
will change to cosine

Re: Popularity, I'll look into that as well then.  







[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-01 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-65042436
  
I think it's essential to explain (even in internal comments, or this PR) 
what the similarity metric is. It's just ranking by dot product, which makes it 
something like cosine similarity. The differences are that it isn't in [-1,1], 
and the result doesn't normalize away the length of the feature vectors. This 
tends to favor popular items, or mean that somewhat less similar items may rank 
higher because they're popular. I had traditionally viewed that as a negative, 
and preferred the more standard cosine similarity, but it's certainly up for 
debate.
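
A small worked example of the difference, assuming (as the comment suggests) that popular items tend to end up with longer factor vectors. The object and value names are made up for illustration.

```scala
// Illustrative only: a long vector can win on dot product while losing on cosine.
object DotVersusCosine {
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Assumes neither vector is all zeros (otherwise the result is NaN).
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  val query: Array[Double] = Array(1.0, 0.0)
  val aligned: Array[Double] = Array(1.0, 0.0) // same direction as the query, unit length
  val popular: Array[Double] = Array(3.0, 3.0) // 45 degrees off the query, but much longer
}
```

Here the dot product scores `popular` at 3.0 against 1.0 for `aligned`, while cosine similarity scores `aligned` at 1.0 and `popular` at roughly 0.71, so the two metrics rank the candidates in opposite order.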





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3536#issuecomment-65042076
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-01 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3536#discussion_r21078944
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -95,6 +95,35 @@ class MatrixFactorizationModel(
   }
 
   /**
+   * Recommends similar products
+   *
+   * @param user the user to find similar users for
+   * @param num how many products to return. The number returned may be less than this.
+   * @return [[Rating]] objects, each of which contains the given user ID, a user ID, and a
+   *  "score" in the rating field. Each represents one recommended user, and they are sorted
+   *  by score, decreasing. The first returned is the one predicted to be most similar
+   *  user to the specified user ID. The score is an opaque value that indicates how strongly
+   *  recommended the user is.
+   */
+  def recommendSimilariUsers(user: Int, num: Int): Array[Rating] =
--- End diff --

Typo: `Similari`, also below.





[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...

2014-12-01 Thread sbourke
GitHub user sbourke opened a pull request:

https://github.com/apache/spark/pull/3536

[MLLIB][SPARK-4675] Find similar products and similar users in 
MatrixFactorizationModel

Using the latent feature space that is learnt in MatrixFactorizationModel, 
I have added 2 new functions to find similar products and similar users. A user 
of the API can for example pass a product ID, and get the closest products 
based on the feature space.
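
As a rough idea of what such a lookup involves, here is a minimal sketch over an in-memory map of product factors. It is not the code in this pull request: `SimilarProductsSketch`, `Factors` and `similarProducts` are hypothetical names, ranking is by plain dot product as in the initial version of the patch, and the query product is excluded from its own results.

```scala
// Illustrative only: a plain Map stands in for the model's product feature vectors.
object SimilarProductsSketch {
  type Factors = Map[Int, Array[Double]]

  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same rank")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  /** Top `num` products ranked by dot product with `product`, excluding `product` itself. */
  def similarProducts(factors: Factors, product: Int, num: Int): Seq[(Int, Double)] = {
    val target = factors(product)
    factors.toSeq
      .collect { case (id, vec) if id != product => (id, dot(target, vec)) }
      .sortBy { case (_, score) => -score }
      .take(num)
  }
}
```

A call such as `SimilarProductsSketch.similarProducts(factors, 1, 10)` returns up to ten `(productId, score)` pairs, best first; fewer are returned when the model holds fewer than eleven products.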


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sbourke/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3536.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3536


commit 956ca1b86aacb22fabd52740ce0c6fef5524bae8
Author: Senior Stefano El Bour-que 
Date:   2014-11-28T08:40:40Z

added functionality to find similar users and similar products

commit 12e6b6b3a2cbfa1baa29449396e7e85bed1dec56
Author: Steven Bourke 
Date:   2014-11-30T23:22:46Z

added unit test to make sure id isnt teh same



