[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82458/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #82458 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82458/testReport)**
 for PR 18748 at commit 
[`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #82458 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82458/testReport)**
 for PR 18748 at commit 
[`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
Jenkins retest this please


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82456/
Test FAILed.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #82456 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82456/testReport)**
 for PR 18748 at commit 
[`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-10-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #82456 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82456/testReport)**
 for PR 18748 at commit 
[`526675d`](https://github.com/apache/spark/commit/526675d009a0f800d62e0e0334e87fef15bdd86c).


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81825/
Test PASSed.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #81825 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81825/testReport)**
 for PR 18748 at commit 
[`8ed91ab`](https://github.com/apache/spark/commit/8ed91ab283ccaa0b47ebe8467acc186aeca20c54).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #81825 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81825/testReport)**
 for PR 18748 at commit 
[`8ed91ab`](https://github.com/apache/spark/commit/8ed91ab283ccaa0b47ebe8467acc186aeca20c54).


---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-04 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-09-04 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
Any further comments on this? @srowen @mpjlu @jkbradley?


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-21 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
@srowen not really sure about which of `Set` vs the `Dataset` would be more 
common. I'm inclined to stick with `Dataset` to keep the API consistent.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-21 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
Ah ok - that clears things up. Yes that `predict` method is very 
inefficient relative to the `recommendForAll` setup.




---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-20 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Thanks @MLnick. I have double-checked my test.
Since there is no recommendForUserSubset, my previous test used MLLIB 
MatrixFactorizationModel::predict(RDD(Int, Int)), which predicts the ratings of 
many users for many products. The performance of this function is low compared 
with recommendForAll. 
This PR calls recommendForAll with a subset of the users, so I agree with your 
test results. Thanks. 


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-18 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
Ok, so I ran a larger-scale test on a cluster (3x workers, each with 48 
cores / 100GB allocated RAM with 1 executor).

On the same `movielens-latest` dataset (~250,000 users and ~33,000 movies), 
using a **30% sample** of user ids:

```
scala> // all users
scala> spark.time { model.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 25104 ms

scala> // user sample
scala> spark.time { model.recommendForUserSubset(userSample, k).foreach(_ => Unit) }
Time taken: 8963 ms

scala> 8963 / 25104.0
res16: Double = 0.35703473550031867
```

On a much larger dataset - Amazon books ratings data (8 million users, 2.3 
million items) also using a **30% user sample**:

```
scala> // all users
scala> spark.time { model.recommendForAllUsers(k).foreach(_ => Unit) }
Time taken: 32985936 ms
=> 9.16 hours

scala> // user sample
scala> spark.time { model.recommendForUserSubset(userSample, k).foreach(_ => Unit) }
Time taken: 8164421 ms
=> 2.26 hours

scala> 8164421 / 32985936.0
res7: Double = 0.24751218216151272
```

So the subset run takes a reasonably consistent *25-35%* of the full time for a 
*30%* user sample (I found broadly similar results with a 70% user sample, 
taking about 60% of the recommend-for-all time).

@mpjlu could you double check your results? What I find is consistent with 
my expectations that computing for a subset should take time roughly 
proportional to the ratio of the ids in the subset to the total. It appears to 
me the extra  `distinct` and `join` don't have too much impact on overall 
runtime.

However your results are very different so we should understand why.
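
To make the comparison explicit, here is a minimal sketch that recomputes the subset-to-full ratios from the timings quoted above. The object name, helper names, and the `amazon-books` label are illustrative only; the millisecond figures are copied from the REPL output in this comment.

```scala
object SubsetTimingRatios {
  // (dataset label, all-users time in ms, 30%-subset time in ms),
  // copied from the spark.time output quoted in this comment.
  val runs: Seq[(String, Long, Long)] = Seq(
    ("movielens-latest", 25104L, 8963L),
    ("amazon-books", 32985936L, 8164421L)
  )

  // Fraction of the full recommend-for-all time spent on the subset run.
  def ratio(allMs: Long, subsetMs: Long): Double = subsetMs.toDouble / allMs

  // Convert milliseconds to hours.
  def hours(ms: Long): Double = ms / 3600000.0

  def main(args: Array[String]): Unit =
    runs.foreach { case (name, allMs, subsetMs) =>
      println(f"$name%-18s subset/all = ${ratio(allMs, subsetMs)}%.3f " +
        f"(${hours(allMs)}%.2f h vs ${hours(subsetMs)}%.2f h)")
    }
}
```

If runtime were exactly proportional to the number of ids, both ratios would be 0.30; the values here fall on either side of that.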


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-01 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Thanks.
This is my test setting:
3 workers, each: 40 cores, 196G memory,  1 executor.
Data Size: user 480,000, item 17,000


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-01 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
I don't get similar results to you (granted I have just tested locally). 

```
scala> spark.time { userRecsAll.foreach(_ => Unit) }
Time taken: 122422 ms

scala> spark.time { userRecsPart.foreach(_ => Unit) }
Time taken: 50228 ms
```

Here, `userRecsPart` is a 30% sample, and the time is ~40% of the 
`recommendForAllUsers` time. I will try some larger-scale tests. It could be 
that the `join` and `distinct` cause the underperformance. 

However, those operations greatly increase the number of partitions in the 
computation when `spark.sql.shuffle.partitions` is left at its default. 
Setting this to, say, `8` (the number of threads I have locally), I get 

```
scala> spark.time { userRecsPart.foreach(_ => Unit) }
Time taken: 37362 ms
```

So, about 30% of the full time for the 30% sample.
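
For reference, a minimal sketch of how that setting could be changed in `spark-shell`, where `spark` is the predefined `SparkSession`. The value `8` is just the local thread count mentioned above, not a recommendation, and the advice to rebuild the DataFrame afterwards is a cautious assumption rather than something verified in this thread.

```scala
// Run inside spark-shell, where `spark` is the predefined SparkSession.
// Lower the number of partitions used for shuffles (joins, distinct, ...).
spark.conf.set("spark.sql.shuffle.partitions", "8")

// Read the value back to confirm it was applied.
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Assumption: to be safe, build the recommendations DataFrame after
// changing the setting, so that its plan is created under the new value.
```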


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-31 Thread mpjlu
Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/18748
  
Did you test the performance of this? I tested the performance of MLLIB 
recommendForUserSubset some days ago, and it was not good. Suppose 
recommendForAll takes 35s; recommending for 1/3 of the users this way may need 90s. 
It may be faster to use recommendForAll and then select the 1/3 of users. But when 
recommending for tens or hundreds of users, this is faster than recommendForAll. So 
should we add some comments in the code about when it is better to use 
recommendForUserSubset? 
Thanks.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79998/
Test PASSed.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Merged build finished. Test PASSed.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #79998 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79998/testReport)**
 for PR 18748 at commit 
[`4bd91f1`](https://github.com/apache/spark/commit/4bd91f12e1b15657d92fea6d7b91dae2e6e68c29).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/18748
  
Seems reasonable; would it be more or less common/natural for someone to 
specify the users as a simple set, rather than a Dataset? not sure.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Merged build finished. Test PASSed.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18748
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79997/
Test PASSed.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #79997 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79997/testReport)**
 for PR 18748 at commit 
[`5a8c421`](https://github.com/apache/spark/commit/5a8c4216ce636dea3ba67baa9b169db7486f37f2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #79998 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79998/testReport)**
 for PR 18748 at commit 
[`4bd91f1`](https://github.com/apache/spark/commit/4bd91f12e1b15657d92fea6d7b91dae2e6e68c29).


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18748
  
**[Test build #79997 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79997/testReport)**
 for PR 18748 at commit 
[`5a8c421`](https://github.com/apache/spark/commit/5a8c4216ce636dea3ba67baa9b169db7486f37f2).


---



[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-27 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/18748
  
**Note 1** This implementation must perform a `distinct` on the input data 
frame id column to guarantee correct results, since otherwise multiple "copies" 
of the same recommendations would be generated for duplicate ids, and the 
resulting recommendations would contain duplicates. This could alternatively be 
left to the user to handle, assuming that the input data frame contains no 
duplicates. But for now I've opted for the safest option even if it introduces 
this inefficiency.

**Note 2** This does not support `coldStartStrategy`. Therefore no 
recommendations will be returned for ids in the input dataframe that are not 
contained in the model (this is analogous to `coldStartStrategy=drop` for 
`transform`). I believe this makes most sense, since supporting something like 
the `na` option would be a bit involved and not add that much value. However it 
could be done (but would need to return `null` rows in the `recommendation` 
column for these cases). Later, when other cold start strategies might be 
supported (e.g. average factor vectors), this method could return 
recommendations even for ids that are not contained in the model.
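
The semantics described in the two notes can be illustrated with a small plain-Scala sketch. This is not the PR's implementation (which works on Spark DataFrames); the object, the `Factors` alias, `recommendForSubset`, and the toy factor values are all hypothetical stand-ins used only to show the de-duplication and the drop of unknown ids.

```scala
object SubsetRecommendSketch {
  // Hypothetical stand-in for a model's factor table: id -> factor vector.
  type Factors = Map[Int, Array[Double]]

  // Predicted score is the dot product of user and item factors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Top-k items for each requested user id.
  def recommendForSubset(
      requested: Seq[Int],
      userFactors: Factors,
      itemFactors: Factors,
      k: Int): Map[Int, Seq[(Int, Double)]] =
    requested.distinct // Note 1: collapse duplicate ids first
      // Note 2: ids missing from the model are silently dropped
      .flatMap(id => userFactors.get(id).map(id -> _))
      .map { case (userId, uf) =>
        val scored = itemFactors.toSeq.map { case (itemId, f) => (itemId, dot(uf, f)) }
        userId -> scored.sortBy { case (_, score) => -score }.take(k)
      }
      .toMap

  // Toy data, invented for illustration.
  val userFactors: Factors = Map(1 -> Array(1.0, 0.0), 2 -> Array(0.0, 1.0))
  val itemFactors: Factors =
    Map(10 -> Array(2.0, 0.0), 20 -> Array(0.0, 3.0), 30 -> Array(1.0, 1.0))

  // User 1 is requested twice and user 99 is not in the model.
  val recs: Map[Int, Seq[(Int, Double)]] =
    recommendForSubset(Seq(1, 1, 2, 99), userFactors, itemFactors, k = 2)
}
```

In this toy run the result has one entry each for users 1 and 2 and none for user 99, which mirrors the `coldStartStrategy=drop` analogy above.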

cc @srowen @jkbradley @yanboliang @mpjlu @sethah 


---