[jira] [Updated] (SPARK-33487) Let ML ALS recommend for BOTH subsets - users and items

2020-12-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-33487:
-
Priority: Minor  (was: Major)

> Let ML ALS recommend for BOTH subsets - users and items
> ---
>
> Key: SPARK-33487
> URL: https://issues.apache.org/jira/browse/SPARK-33487
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.1
>Reporter: Rose Aysina
>Priority: Minor
>
> Currently, ALS in Spark ML supports the following methods for getting recommendations:
>  * {{recommendForAllUsers(numItems: Int): DataFrame}}
>  * {{recommendForAllItems(numUsers: Int): DataFrame}}
>  * {{recommendForUserSubset(dataset: Dataset[_], numItems: Int): DataFrame}}
>  * {{recommendForItemSubset(dataset: Dataset[_], numUsers: Int): DataFrame}}
>  
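> As a usage sketch (hedged; assumes a fitted {{ALSModel}} named {{model}} and a 
> DataFrame {{users}} holding the ids of the users of interest in the model's 
> user column):
> {code:scala}
> // Top 10 items for every user seen during training.
> val allUsersTop10 = model.recommendForAllUsers(10)
> 
> // Top 10 items only for the users present in `users`.
> val subsetTop10 = model.recommendForUserSubset(users, 10)
> {code}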
> *Feature request:* add a method that recommends a subset of items for a subset 
> of users, i.e. both the users and the items are drawn from provided subsets. 
> *Why it is important:* in real-time recommender systems you usually make 
> predictions only for the currently active users (hence the user subset), and 
> you cannot simply recommend every item you have - only those that satisfy some 
> business filters (hence the item subset). 
> *For example:* consider a real-time news recommender system. Prediction is done 
> for a small subset of users (say, the visitors from the last minute), but it is 
> not allowed to recommend old news, news unrelated to the user's country, etc., 
> so at each prediction we have a whitelist of items.
> That is why it would be extremely useful to control *BOTH* which users to make 
> recommendations for *AND* which items to include in those recommendations. 
> *Related issues:* -SPARK-20679-, but there the subset applies to either users 
> *OR* items, never both. 
> *What we do now:* additional filtering after the {{recommendForUserSubset}} 
> call, but this has a significant cost: we must retrieve recommendations for 
> all items, i.e. *{{numItems = # of all available items}}*, then filter, and 
> only then select the top-k among them.
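> The current workaround can be sketched like this (hedged; assumes a fitted 
> {{ALSModel}} {{model}} trained with {{userCol = "userId"}} and 
> {{itemCol = "itemId"}}, a DataFrame {{users}} of current users, a DataFrame 
> {{allowedItems}} with a single {{itemId}} column of whitelisted ids, 
> {{totalItems}} = the number of all items, and {{k}} = the desired number of 
> recommendations per user):
> {code:scala}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> 
> // Expensive part: recommendations over ALL items for the user subset.
> val full = model.recommendForUserSubset(users, totalItems)
> 
> // Flatten the per-user recommendation arrays, keep only whitelisted items,
> // then re-select the top-k per user.
> val topK = full
>   .select(col("userId"), explode(col("recommendations")).as("rec"))
>   .select(col("userId"), col("rec.itemId").as("itemId"), col("rec.rating").as("rating"))
>   .join(allowedItems, "itemId")
>   .withColumn("rank",
>     row_number().over(Window.partitionBy("userId").orderBy(desc("rating"))))
>   .filter(col("rank") <= k)
>   .drop("rank")
> {code}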
> *Why it is bad:* the subset of items allowed at a given moment is usually much 
> smaller than the number of distinct items in the original data (in my real 
> dataset it is 220k items in total vs. 500 allowed). 
> *Design:* I am sorry - I am not familiar with Spark internals, so I offer a 
> solution based only on plain intuition :) 
> {code:scala}
> def recommendForUserItemSubsets(
>     userDataset: Dataset[_],
>     itemDataset: Dataset[_],
>     numItems: Int): DataFrame = {
>   // Restrict the factor matrices to the requested user and item subsets,
>   // then run the usual top-k recommendation over the reduced candidate set.
>   val userFactorSubset = getSourceFactorSubset(userDataset, userFactors, $(userCol))
>   val itemFactorSubset = getSourceFactorSubset(itemDataset, itemFactors, $(itemCol))
>   recommendForAll(userFactorSubset, itemFactorSubset, $(userCol), $(itemCol),
>     numItems, $(blockSize))
> }
> {code}
>  
> I would be glad to receive feedback: is this a reasonable request, and are 
> there more efficient workarounds? 
>  
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33487) Let ML ALS recommend for BOTH subsets - users and items

2020-11-19 Thread Rose Aysina (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rose Aysina updated SPARK-33487:


[jira] [Updated] (SPARK-33487) Let ML ALS recommend for BOTH subsets - users and items

2020-11-19 Thread Rose Aysina (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rose Aysina updated SPARK-33487:

Summary: Let ML ALS recommend for BOTH subsets - users and items  (was: Let 
ML ALS recommend for BOTH subsets - users nd items)
